和一般RDD最大的不同就是有两个泛型参数, [K, V]表示pair的概念
关键的function是, combineByKey, 所有pair相关操作的抽象

combine是这样的操作, Turns an RDD[(K, V)] into a result of type RDD[(K, C)]
其中C有可能只是简单类型, 但经常是seq, 比如(Int, Int) to (Int, Seq[Int])

下面来看看combineByKey的参数,
首先需要用户自定义一些操作,
createCombiner: V => C, C不存在的情况下, 比如通过V创建seq C
mergeValue: (C, V) => C, 当C已经存在的情况下, 需要merge, 比如把item V加到seq C中, 或者叠加 
mergeCombiners: (C, C) => C,  合并两个C
partitioner: Partitioner, Shuffle时需要的Partitioner
mapSideCombine: Boolean = true, 为了减小传输量, 很多combine可以在map端先做, 比如叠加, 可以先在一个partition中把所有相同的key的value叠加, 再shuffle
serializerClass: String = null, 传输需要序列化, 用户可以自定义序列化类

/**
* Extra functions available on RDDs of (key, value) pairs through an implicit conversion.
* Import `org.apache.spark.SparkContext._` at the top of your program to use these functions.
*/
class PairRDDFunctions[K: ClassManifest, V: ClassManifest](self: RDD[(K, V)])
extends Logging
with SparkHadoopMapReduceUtil
with Serializable { /**
* Generic function to combine the elements for each key using a custom set of aggregation
* functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
* Note that V and C can be different -- for example, one might group an RDD of type
* (Int, Int) into an RDD of type (Int, Seq[Int]). Users provide three functions:
*
* - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
* - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
* - `mergeCombiners`, to combine two C's into a single one.
*
* In addition, users can control the partitioning of the output RDD, and whether to perform
* map-side aggregation (if a mapper can produce multiple items with the same key).
*/
def combineByKey[C](createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializerClass: String = null): RDD[(K, C)] = {
    val aggregator = new Aggregator[K, V, C](createCombiner, mergeValue, mergeCombiners) //1.Aggregator
    //RDD本身的partitioner和传入的partitioner相等时, 即不需要重新shuffle, 直接map即可
if (self.partitioner == Some(partitioner)) {
self.mapPartitions(aggregator.combineValuesByKey, preservesPartitioning = true) //2. mapPartitions, map端直接调用combineValuesByKey
} else if (mapSideCombine) { //如果需要mapSideCombine
val combined = self.mapPartitions(aggregator.combineValuesByKey, preservesPartitioning = true) //先在partition内部做mapSideCombine
val partitioned = new ShuffledRDD[K, C, (K, C)](combined, partitioner).setSerializer(serializerClass) //3. ShuffledRDD, 进行shuffle
partitioned.mapPartitions(aggregator.combineCombinersByKey, preservesPartitioning = true) //Shuffle完后, 在reduce端再做一次combine, 使用combineCombinersByKey
} else {
// Don't apply map-side combiner.和上面的区别就是不做mapSideCombine
// A sanity check to make sure mergeCombiners is not defined.
assert(mergeCombiners == null)
val values = new ShuffledRDD[K, V, (K, V)](self, partitioner).setSerializer(serializerClass)
values.mapPartitions(aggregator.combineValuesByKey, preservesPartitioning = true)
}
}
}

1. Aggregator

在combineByKey中, 首先创建Aggregator, 其实在Aggregator中封装了两个函数,
combineValuesByKey, 用于处理将V加入到C的case, 比如加入一个item到一个seq里面, 用于map端
combineCombinersByKey, 用于处理两个C合并, 比如两个seq合并, 用于reduce端

case class Aggregator[K, V, C] (
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C) { def combineValuesByKey(iter: Iterator[_ <: Product2[K, V]]) : Iterator[(K, C)] = {
val combiners = new JHashMap[K, C]
for (kv <- iter) {
val oldC = combiners.get(kv._1)
if (oldC == null) {
combiners.put(kv._1, createCombiner(kv._2))
} else {
combiners.put(kv._1, mergeValue(oldC, kv._2))
}
}
combiners.iterator
} def combineCombinersByKey(iter: Iterator[(K, C)]) : Iterator[(K, C)] = {
val combiners = new JHashMap[K, C]
iter.foreach { case(k, c) =>
val oldC = combiners.get(k)
if (oldC == null) {
combiners.put(k, c)
} else {
combiners.put(k, mergeCombiners(oldC, c))
}
}
combiners.iterator
}
}

2. mapPartitions

mapPartitions其实就是使用MapPartitionsRDD
做的事情就是对当前partition执行map函数f, Iterator[T] => Iterator[U]
比如, 执行combineValuesByKey: Iterator[_ <: Product2[K, V]] to Iterator[(K, C)]

  /**
* Return a new RDD by applying a function to each partition of this RDD.
*/
def mapPartitions[U: ClassManifest](f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] =
new MapPartitionsRDD(this, sc.clean(f), preservesPartitioning)
 

private[spark]
class MapPartitionsRDD[U: ClassManifest, T: ClassManifest](
prev: RDD[T],
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false)
extends RDD[U](prev) { override val partitioner =
if (preservesPartitioning) firstParent[T].partitioner else None override def getPartitions: Array[Partition] = firstParent[T].partitions override def compute(split: Partition, context: TaskContext) =
f(firstParent[T].iterator(split, context)) // 对于map,就是调用f

3. ShuffledRDD

Shuffle实际上是由系统的shuffleFetcher完成的, Spark的抽象封装非常的好
所以在这里看不到Shuffle具体是怎么样做的, 这个需要分析到shuffleFetcher时候才能看到 
因为每个shuffle是有一个全局的shuffleid的
所以在compute里面, 你只是看到用BlockStoreShuffleFetcher根据shuffleid和partitionid直接fetch到shuffle过后的数据

/**
* The resulting RDD from a shuffle (e.g. repartitioning of data).
* @param prev the parent RDD.
* @param part the partitioner used to partition the RDD
* @tparam K the key class.
* @tparam V the value class.
*/
class ShuffledRDD[K, V, P <: Product2[K, V] : ClassManifest](
@transient var prev: RDD[P],
part: Partitioner)
extends RDD[P](prev.context, Nil) {
  override val partitioner = Some(part)
//ShuffleRDD会进行repartition, 所以从Partitioner中取出新的part数目
  //并用Array.tabulate动态创建相应个数的ShuffledRDDPartition
override def getPartitions: Array[Partition] = {
Array.tabulate[Partition](part.numPartitions)(i => new ShuffledRDDPartition(i))
} override def compute(split: Partition, context: TaskContext): Iterator[P] = {
val shuffledId = dependencies.head.asInstanceOf[ShuffleDependency[K, V]].shuffleId
SparkEnv.get.shuffleFetcher.fetch[P](shuffledId, split.index, context.taskMetrics,
SparkEnv.get.serializerManager.get(serializerClass))
}
}

ShuffledRDDPartition没啥区别, 一样只是记录id

private[spark] class ShuffledRDDPartition(val idx: Int) extends Partition {
override val index = idx
override def hashCode(): Int = idx
}

下面再来看一下, 如果使用combineByKey来实现其他的操作的,

group

group是比较典型的例子, (Int, Int) to (Int, Seq[Int])
由于groupByKey不使用map side combine, 因为这样也无法减少传输空间, 所以不需要实现mergeCombiners

  /**
* Group the values for each key in the RDD into a single sequence. Allows controlling the
* partitioning of the resulting key-value pair RDD by passing a Partitioner.
*/
def groupByKey(partitioner: Partitioner): RDD[(K, Seq[V])] = {
// groupByKey shouldn't use map side combine because map side combine does not
// reduce the amount of data shuffled and requires all map side data be inserted
// into a hash table, leading to more objects in the old gen.
def createCombiner(v: V) = ArrayBuffer(v) //创建seq
def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v //添加item到seq
val bufs = combineByKey[ArrayBuffer[V]](
createCombiner _, mergeValue _, null, partitioner, mapSideCombine=false)
bufs.asInstanceOf[RDD[(K, Seq[V])]]
}

reduce

reduce是更简单的一种情况, 只是两个值合并成一个值, (Int, Int V) to (Int, Int C), 比如叠加
所以createCombiner很简单, 就是直接返回v
而mergeValue和mergeCombiners逻辑是相同的, 没有区别

  /**
* Merge the values for each key using an associative reduce function. This will also perform
* the merging locally on each mapper before sending results to a reducer, similarly to a
* "combiner" in MapReduce.
*/
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = {
combineByKey[V]((v: V) => v, func, func, partitioner)
}

Spark源码分析 -- PairRDD的更多相关文章

  1. Spark源码分析 – 汇总索引

    http://jerryshao.me/categories.html#architecture-ref http://blog.csdn.net/pelick/article/details/172 ...

  2. Spark源码分析 – Shuffle

    参考详细探究Spark的shuffle实现, 写的很清楚, 当前设计的来龙去脉 Hadoop Hadoop的思路是, 在mapper端每次当memory buffer中的数据快满的时候, 先将memo ...

  3. Spark源码分析 – Dependency

    Dependency 依赖, 用于表示RDD之间的因果关系, 一个dependency表示一个parent rdd, 所以在RDD中使用Seq[Dependency[_]]来表示所有的依赖关系 Dep ...

  4. Spark源码分析(三)-TaskScheduler创建

    原创文章,转载请注明: 转载自http://www.cnblogs.com/tovin/p/3879151.html 在SparkContext创建过程中会调用createTaskScheduler函 ...

  5. Spark源码分析环境搭建

    原创文章,转载请注明: 转载自http://www.cnblogs.com/tovin/p/3868718.html 本文主要分享一下如何构建Spark源码分析环境.以前主要使用eclipse来阅读源 ...

  6. Spark源码分析之Spark Shell(下)

    继上次的Spark-shell脚本源码分析,还剩下后面半段.由于上次涉及了不少shell的基本内容,因此就把trap和stty放在这篇来讲述. 上篇回顾:Spark源码分析之Spark Shell(上 ...

  7. Spark源码分析之Spark-submit和Spark-class

    有了前面spark-shell的经验,看这两个脚本就容易多啦.前面总结的Spark-shell的分析可以参考: Spark源码分析之Spark Shell(上) Spark源码分析之Spark She ...

  8. 【转】Spark源码分析之-deploy模块

    原文地址:http://jerryshao.me/architecture/2013/04/30/Spark%E6%BA%90%E7%A0%81%E5%88%86%E6%9E%90%E4%B9%8B- ...

  9. Spark源码分析:多种部署方式之间的区别与联系(转)

    原文链接:Spark源码分析:多种部署方式之间的区别与联系(1) 从官方的文档我们可以知道,Spark的部署方式有很多种:local.Standalone.Mesos.YARN.....不同部署方式的 ...

随机推荐

  1. Django--缓存、信号、序列化

    一 Django的缓存机制 1.1 缓存介绍 1.缓存的简介 在动态网站中,用户所有的请求,服务器都会去数据库中进行相应的增,删,查,改,渲染模板,执行业务逻辑,最后生成用户看到的页面. 当一个网站的 ...

  2. jquery插件-table转Json数据插件

    使 用开源插件Table-to-json: 官方地址:http://lightswitch05.github.io/table-to-json/ 功能说明:将js对象table转换成javascrip ...

  3. linux udhcpc 后无法自动设置网卡ip

    arm 主板用 udhcpc 获取租赁的空闲的ip后,并没有直接设置在网卡上. 查了一下相关原因,是因为虽然已经获取了ip, 但是并没有通过脚本去设置这个IP. 在 busybox 里面有相关的脚本要 ...

  4. /proc/interrupts 和 /proc/stat 查看中断的情况

    在/proc文件系统下,又两个文件提供了中断的信息. /proc/interrupts 文件中列出当前系统使用的中断的情况,所以某个中断处理没有安装,是不会显示的.哪怕之前安装过,被卸载了. 从左到右 ...

  5. MongoDB-Elasticsearch 实时数据导入

    时间  2017-09-18 栏目 MongoDB 原文   http://blog.csdn.net/liangxw1/article/details/78019356 5 ways to sync ...

  6. 使用OpenFace进行人脸识别(2)

    http://blog.csdn.net/u011531010/article/details/52270023 http://www.vccoo.com/v/2ed520 第一步 在 openfac ...

  7. erlang-sunface的博客地址

    erlang-sunface的博客地址: http://blog.csdn.net/abv123456789/article/category/2206185

  8. WPF 在TextBox失去焦点时检测数据,出错重新获得焦点解决办法

    WPF 在TextBox失去焦点时检测数据,出错重新获得焦点解决办法 在WPF的TextBox的LostFocus事件中直接使用Focus()方法会出现死循环的问题 正确的使用方式有2中方法: 方法一 ...

  9. 常用的jQuery前端技巧收集

    调试时巧用console.log(),这比用alert()方便多了. jquery易错点:元素拼接的时候,元素还未添加到DOM,就用该预添加元素操作. ajax动态获取的数据,还没有装载html元素, ...

  10. 神经网络Batch Normalization——学习笔记

    训练神经网络的过程,就是在求未知参数(权重).让网络搭建起来,得到理想的结果. 分类-监督学习. 反向传播求权重:每一层在算偏导数.局部梯度,链式法则. 激活函数: sigmoid仅中间段趋势良好 对 ...