先来看一下在PairRDDFunctions.scala文件中reduceByKey和groupByKey的源码

/**
* Merge the values for each key using an associative reduce function. This will also perform
* the merging locally on each mapper before sending results to a reducer, similarly to a
* "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
* parallelism level.
*/
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = {
reduceByKey(defaultPartitioner(self), func)
} /**
* Group the values for each key in the RDD into a single sequence. Allows controlling the
* partitioning of the resulting key-value pair RDD by passing a Partitioner.
* The ordering of elements within each group is not guaranteed, and may even differ
* each time the resulting RDD is evaluated.
*
* Note: This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
* or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
*
* Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any
* key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
*/
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
// groupByKey shouldn't use map side combine because map side combine does not
// reduce the amount of data shuffled and requires all map side data be inserted
// into a hash table, leading to more objects in the old gen.
val createCombiner = (v: V) => CompactBuffer(v)
val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
val bufs = combineByKey[CompactBuffer[V]](
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine=false)
bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}

通过源码可以发现:

reduceByKey:reduceByKey会在结果发送至reducer之前会对每个mapper在本地进行merge,有点类似于在MapReduce中的combiner。这样做的好处在于,在map端进行一次reduce之后,数据量会大幅度减小,从而减小传输,保证reduce端能够更快的进行结果计算。

groupByKey:groupByKey会对每一个RDD中的value值进行聚合形成一个序列(Iterator),此操作发生在reduce端,所以势必会将所有的数据通过网络进行传输,造成不必要的浪费。同时如果数据量十分大,可能还会造成OutOfMemoryError。

通过以上对比可以发现在进行大量数据的reduce操作时候建议使用reduceByKey。不仅可以提高速度,还是可以防止使用groupByKey造成的内存溢出问题。

reduceByKey和groupByKey的区别的更多相关文章

  1. 转载-reduceByKey和groupByKey的区别

    原文链接-https://www.cnblogs.com/0xcafedaddy/p/7625358.html 先来看一下在PairRDDFunctions.scala文件中reduceByKey和g ...

  2. spark:reducebykey与groupbykey的区别

    从源码看: reduceBykey与groupbykey: 都调用函数combineByKeyWithClassTag[V]((v: V) => v, func, func, partition ...

  3. reduceByKey和groupByKey区别与用法

    在spark中,我们知道一切的操作都是基于RDD的.在使用中,RDD有一种非常特殊也是非常实用的format——pair RDD,即RDD的每一行是(key, value)的格式.这种格式很像Pyth ...

  4. 【spark】常用转换操作:reduceByKey和groupByKey

    1.reduceByKey(func) 功能: 使用 func 函数合并具有相同键的值. 示例: val list = List("hadoop","spark" ...

  5. spark RDD,reduceByKey vs groupByKey

    Spark中有两个类似的api,分别是reduceByKey和groupByKey.这两个的功能类似,但底层实现却有些不同,那么为什么要这样设计呢?我们来从源码的角度分析一下. 先看两者的调用顺序(都 ...

  6. 【Spark算子】:reduceByKey、groupByKey和combineByKey

    在spark中,reduceByKey.groupByKey和combineByKey这三种算子用的较多,结合使用过程中的体会简单总结: 我的代码实践:https://github.com/wwcom ...

  7. spark新能优化之reduceBykey和groupBykey的使用

    val counts = pairs.reduceByKey(_ + _) val counts = pairs.groupByKey().map(wordCounts => (wordCoun ...

  8. scala flatmap、reduceByKey、groupByKey

    1.test.txt文件中存放 asd sd fd gf g dkf dfd dfml dlf dff gfl pkdfp dlofkp // 创建一个Scala版本的Spark Context va ...

  9. 32、reduceByKey和groupByKey对比

    一.groupByKey 1.图解 val counts = pairs.groupByKey().map(wordCounts => (wordCounts._1, wordCounts._2 ...

随机推荐

  1. Codeforces 359D Pair of Numbers | 二分+ST表+gcd

    题面: 给一个序列,求最长的合法区间,合法被定义为这个序列的gcd=区间最小值 输出最长合法区间个数,r-l长度 接下来输出每个合法区间的左端点 题解: 由于区间gcd满足单调性,所以我们可以二分区间 ...

  2. 2017-7-18-每日博客-关于Linux基本命令CnetOS7系统基本操作命令.doc

    1.root/下 cat  anaconda-ks.cfg 确定是否装base软件组 yum groupinstall base  安装base组ifconfig 命令就可以使用了或者使用ip add ...

  3. POJ 2676 数独+dfs深搜

    数独 #include "cstdio" #include "cstring" #include "cstdlib" #include &q ...

  4. foj Problem 2107 Hua Rong Dao

    Problem 2107 Hua Rong Dao Accept: 503    Submit: 1054Time Limit: 1000 mSec    Memory Limit : 32768 K ...

  5. python常用20库

    python核心库和统计 简述 1. Requests.最着名的http库由kenneth reitz编写.这是每个python开发人员必备的. 2. Scrapy.如果您参与webscraping, ...

  6. Linux内核实践之tasklet机制【转】

    转自:http://blog.csdn.net/bullbat/article/details/7423321 版权声明:本文为博主原创文章,未经博主允许不得转载. 作者:bullbat 源代码分析与 ...

  7. Selenium2+python自动化10-登录案例【转载】

    前言 前面几篇都是讲一些基础的定位方法,没具体的案例,小伙伴看起来比较枯燥,有不少小伙伴给小编提建议以后多出一些具体的案例.本篇就是拿部落论坛作为测试项目,写一个简单的登录测试脚本. 在写登录脚本的时 ...

  8. leetcode-Min Cost Climbing Stairs

    题目: On a staircase, the i-th step has some non-negative cost cost[i] assigned (0 indexed). Once you ...

  9. Centos的APK解包打包签名

    http://www.v5b7.com/other/apk.html vi /etc/profile PATH=/usr/local/mysql/bin:/usr/local/mysql/lib:/u ...

  10. HDU 17新生赛 身份证验证【模拟】

    身份证验证 Time Limit: 2000/1000 MS (Java/Others)    Memory Limit: 32768/32768 K (Java/Others)Total Submi ...