spark RDD，reduceByKey vs groupByKey

Spark中有两个类似的api，分别是reduceByKey和groupByKey。这两个的功能类似，但底层实现却有些不同，那么为什么要这样设计呢？我们来从源码的角度分析一下。

先看两者的调用顺序（都是使用默认的Partitioner，即defaultPartitioner）

所用spark版本：spark2.1.0

先看reduceByKey

Step1

  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {

    reduceByKey(defaultPartitioner(self), func)

  }

Setp2

  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {

    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)

  }

Setp3

def combineByKeyWithClassTag[C](

      createCombiner: V => C,

      mergeValue: (C, V) => C,

      mergeCombiners: (C, C) => C,

      partitioner: Partitioner,

      mapSideCombine: Boolean = true,

      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {

    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0

    if (keyClass.isArray) {

      if (mapSideCombine) {

        throw new SparkException("Cannot use map-side combining with array keys.")

      }

      if (partitioner.isInstanceOf[HashPartitioner]) {

        throw new SparkException("HashPartitioner cannot partition array keys.")

      }

    }

    val aggregator = new Aggregator[K, V, C](

      self.context.clean(createCombiner),

      self.context.clean(mergeValue),

      self.context.clean(mergeCombiners))

    if (self.partitioner == Some(partitioner)) {

      self.mapPartitions(iter => {

        val context = TaskContext.get()

        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))

      }, preservesPartitioning = true)

    } else {

      new ShuffledRDD[K, V, C](self, partitioner)

        .setSerializer(serializer)

        .setAggregator(aggregator)

        .setMapSideCombine(mapSideCombine)

    }

  }

姑且不去看方法里面的细节，我们会只要知道最后调用的是combineByKeyWithClassTag这个方法。这个方法有两个参数我们来重点看一下，

def combineByKeyWithClassTag[C](

      createCombiner: V => C,

      mergeValue: (C, V) => C,

      mergeCombiners: (C, C) => C,

      partitioner: Partitioner,

      mapSideCombine: Boolean = true,

      serializer: Serializer = null)

首先是partitioner参数，这个即是RDD的分区设置。除了默认的defaultPartitioner，Spark还提供了RangePartitioner和HashPartitioner外，此外用户也可以自定义partitioner。通过源码可以发现如果是HashPartitioner的话，那么是会抛出一个错误的。

然后是mapSideCombine参数，这个参数正是reduceByKey和groupByKey最大不同的地方，它决定是是否会先在节点上进行一次Combine操作，下面会有更具体的例子来介绍。

然后是groupByKey

Step1

  def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {

    groupByKey(defaultPartitioner(self))

  }

Step2

  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {

    // groupByKey shouldn't use map side combine because map side combine does not

    // reduce the amount of data shuffled and requires all map side data be inserted

    // into a hash table, leading to more objects in the old gen.

    val createCombiner = (v: V) => CompactBuffer(v)

    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v

    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2

    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](

      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)

    bufs.asInstanceOf[RDD[(K, Iterable[V])]]

  }

Setp3

def combineByKeyWithClassTag[C](

      createCombiner: V => C,

      mergeValue: (C, V) => C,

      mergeCombiners: (C, C) => C,

      partitioner: Partitioner,

      mapSideCombine: Boolean = true,

      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {

    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0

    if (keyClass.isArray) {

      if (mapSideCombine) {

        throw new SparkException("Cannot use map-side combining with array keys.")

      }

      if (partitioner.isInstanceOf[HashPartitioner]) {

        throw new SparkException("HashPartitioner cannot partition array keys.")

      }

    }

    val aggregator = new Aggregator[K, V, C](

      self.context.clean(createCombiner),

      self.context.clean(mergeValue),

      self.context.clean(mergeCombiners))

    if (self.partitioner == Some(partitioner)) {

      self.mapPartitions(iter => {

        val context = TaskContext.get()

        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))

      }, preservesPartitioning = true)

    } else {

      new ShuffledRDD[K, V, C](self, partitioner)

        .setSerializer(serializer)

        .setAggregator(aggregator)

        .setMapSideCombine(mapSideCombine)

    }

  }

结合上面reduceByKey的调用链，可以发现最终其实都是调用combineByKeyWithClassTag这个方法的，但调用的参数不同。

reduceByKey的调用

combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)

groupByKey的调用

combineByKeyWithClassTag[CompactBuffer[V]](

      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)

正是两者不同的调用方式导致了两个方法的差别，我们分别来看

reduceByKey的泛型参数直接是[V]，而groupByKey的泛型参数是[CompactBuffer[V]]。这直接导致了reduceByKey和groupByKey的返回值不同，前者是RDD[(K, V)]，而后者是RDD[(K, Iterable[V])]
然后就是mapSideCombine=false了，这个mapSideCombine参数的默认是true的。这个值有什么用呢，上面也说了，这个参数的作用是控制要不要在map端进行初步合并（Combine）。可以看看下面具体的例子。

从功能上来说，可以发现ReduceByKey其实就是会在每个节点先进行一次合并的操作，而groupByKey没有。

这么来看ReduceByKey的性能会比groupByKey好很多，因为有些工作在节点已经处理了。那么groupByKey为什么存在，它的应用场景是什么呢？我也不清楚，如果观看这篇文章的读者知道的话不妨在评论里说出来吧。非常感谢！