Repost: The difference between reduceByKey and groupByKey

Original link: https://www.cnblogs.com/0xcafedaddy/p/7625358.html

First, let's look at the source code of reduceByKey and groupByKey in PairRDDFunctions.scala.

/**
 * Merge the values for each key using an associative reduce function. This will also perform
 * the merging locally on each mapper before sending results to a reducer, similarly to a
 * "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
 * parallelism level.
 */
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = {
  reduceByKey(defaultPartitioner(self), func)
}

/**
 * Group the values for each key in the RDD into a single sequence. Allows controlling the
 * partitioning of the resulting key-value pair RDD by passing a Partitioner.
 * The ordering of elements within each group is not guaranteed, and may even differ
 * each time the resulting RDD is evaluated.
 *
 * Note: This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
 * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
 *
 * Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
 */
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKey[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
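The roles of createCombiner, mergeValue, and mergeCombiners may be easier to see on ordinary collections. The following is only a rough sketch in plain Scala, not Spark's implementation: the object CombineSketch, the helpers combinePartition and localCombineByKey, and the sample data are all hypothetical, and Vector stands in for CompactBuffer. It combines inside each partition first and then merges across partitions, which is the step that groupByKey switches off with mapSideCombine = false.

```scala
import scala.collection.mutable

object CombineSketch {
  // Hypothetical helper: fold one partition's records into one combiner per key.
  def combinePartition[K, V, C](
      records: Seq[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): Map[K, C] = {
    val acc = mutable.LinkedHashMap.empty[K, C]
    for ((k, v) <- records) {
      // First value for a key creates the combiner; later values are merged in.
      val next: C = acc.get(k) match {
        case Some(c) => mergeValue(c, v)
        case None    => createCombiner(v)
      }
      acc.update(k, next)
    }
    acc.toMap
  }

  // Hypothetical helper: merge the per-partition combiners key by key.
  def localCombineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] =
    partitions
      .map(p => combinePartition(p, createCombiner, mergeValue))
      .foldLeft(Map.empty[K, C]) { (merged, part) =>
        part.foldLeft(merged) { case (m, (k, c)) =>
          m.updated(k, m.get(k).map(existing => mergeCombiners(existing, c)).getOrElse(c))
        }
      }

  // Sample data (made up): two partitions of (key, value) pairs.
  val partitions: Seq[Seq[(String, Int)]] =
    Seq(Seq("a" -> 1, "b" -> 2, "a" -> 3), Seq("a" -> 4, "b" -> 5))

  // Grouping: the combiner is a buffer holding every value for the key.
  val grouped: Map[String, Vector[Int]] =
    localCombineByKey[String, Int, Vector[Int]](
      partitions, v => Vector(v), (buf, v) => buf :+ v, (c1, c2) => c1 ++ c2)

  // Summing: the combiner is a single running total per key.
  val summed: Map[String, Int] =
    localCombineByKey[String, Int, Int](partitions, v => v, _ + _, _ + _)
}
```

With grouping, the combiner grows with every value, so combining early does not shrink what has to be moved; with summing, each key collapses to a single number per partition.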

From the source code we can see the following:

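reduceByKey: before results are sent to the reducers, reduceByKey merges the values locally on each mapper, much like a combiner in MapReduce. After this map-side reduce, each partition sends at most one record per key, so the amount of data is usually much smaller, less is transferred over the network, and the reduce side can compute its results sooner.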

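groupByKey: groupByKey collects the values for each key in the RDD into a single sequence (an Iterable). Because map-side combine is disabled, the grouping happens on the reduce side, so every key-value pair has to be sent over the network, which is wasteful when only an aggregate is needed. In addition, all values for a key must fit in memory, so a key with very many values can cause an OutOfMemoryError.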
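The difference in shuffled data can be illustrated without a cluster. The sketch below is plain Scala rather than Spark code, and it only counts records under simplifying assumptions: the object ShuffleSketch, its field names, and the two sample "mapper" partitions are made up for illustration.

```scala
object ShuffleSketch {
  // Hypothetical input: two "mapper" partitions of (word, 1) pairs.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq("spark" -> 1, "spark" -> 1, "hadoop" -> 1, "spark" -> 1),
    Seq("hadoop" -> 1, "hadoop" -> 1, "spark" -> 1)
  )

  // Sum the values of each key in a flat list of pairs.
  private def sumByKey(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  // reduceByKey-style: merge per key inside each partition first,
  // so at most one record per key leaves each partition.
  val combinedBeforeShuffle: Seq[Seq[(String, Int)]] =
    partitions.map(p => sumByKey(p).toSeq)

  // Records that would cross the network in each style.
  val shuffledByReduce: Int = combinedBeforeShuffle.map(_.size).sum
  // groupByKey-style: no map-side combine, every record is shuffled as-is.
  val shuffledByGroup: Int = partitions.map(_.size).sum

  // Reduce side: both paths end with the same per-key totals.
  val viaReduce: Map[String, Int] = sumByKey(combinedBeforeShuffle.flatten)
  val viaGroup: Map[String, Int] = sumByKey(partitions.flatten)
}
```

In this toy input the combined path moves 4 records instead of 7 while both paths produce the same totals; the gap grows as keys repeat more often within a partition.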

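This comparison shows that reduceByKey is the recommended choice for reduce-style aggregations over large amounts of data. It is faster, and it also avoids the out-of-memory problems that groupByKey can cause.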
