Transformation operators that process Key-Value data can be roughly divided into three categories: one-to-one mapping between input and output partitions, aggregation, and join operations.

One-to-one mapping between input and output partitions

mapValues

mapValues: applies a map function to the Value of each (Key, Value) pair without touching the Key.



Each box represents an RDD partition. The function a => a + 2 adds 2 only to the value 1 of the element (V1, 1), so the result is 3.

Source code:

  /**
   * Pass each value in the key-value pair RDD through a map function without changing the keys;
   * this also retains the original RDD's partitioning.
   */
  def mapValues[U](f: V => U): RDD[(K, U)] = {
    val cleanF = self.context.clean(f)
    new MapPartitionsRDD[(K, U), (K, V)](self,
      (context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },
      preservesPartitioning = true)
  }
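
A minimal usage sketch (assuming a spark-shell session where `sc` is available; the sample data is made up for illustration). Because the keys never change, the parent RDD's partitioning is preserved (preservesPartitioning = true above):

  // Hypothetical example: add 2 to every value, keys are untouched.
  val pairs = sc.parallelize(Seq(("V1", 1), ("V2", 1), ("V3", 1)))
  val bumped = pairs.mapValues(a => a + 2)
  // bumped.collect() => Array((V1,3), (V2,3), (V3,3))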

Aggregation over a single RDD or two RDDs

(1) combineByKey

combineByKey aggregates over a single RDD. It can, for example, turn an RDD whose elements are (Int, Int) into an RDD whose elements are (Int, Seq[Int]).

The parameters of the combineByKey operator are described below:

  • createCombiner: V => C, creates a C when none exists yet for a key, e.g. builds a Seq C from a single V.
  • mergeValue: (C, V) => C, merges a V into an existing C, e.g. appends item V to the Seq C, or accumulates it into a running sum.
  • mergeCombiners: (C, C) => C, merges two Cs into one.
  • partitioner: Partitioner, the partitioning strategy used when shuffling.
  • mapSideCombine: Boolean = true, to reduce the amount of data transferred, much of the combining can be done on the map side first; for example, values with the same key can be summed within each partition before the shuffle.
  • serializer: Serializer = null, data must be serialized for transfer; users can supply their own serializer.



Each box represents an RDD partition. combineByKey merges the elements (V1, 2) and (V1, 1) into (V1, Seq(2, 1)).

Source code:

  /**
   * Generic function to combine the elements for each key using a custom set of aggregation
   * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
   * Note that V and C can be different -- for example, one might group an RDD of type
   * (Int, Int) into an RDD of type (Int, Seq[Int]). Users provide three functions:
   *
   * - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
   * - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
   * - `mergeCombiners`, to combine two C's into a single one.
   *
   * In addition, users can control the partitioning of the output RDD, and whether to perform
   * map-side aggregation (if a mapper can produce multiple items with the same key).
   */
  def combineByKey[C](createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null): RDD[(K, C)] = {
    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {
      if (mapSideCombine) {
        throw new SparkException("Cannot use map-side combining with array keys.")
      }
      if (partitioner.isInstanceOf[HashPartitioner]) {
        throw new SparkException("Default partitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](
      self.context.clean(createCombiner),
      self.context.clean(mergeValue),
      self.context.clean(mergeCombiners))
    if (self.partitioner == Some(partitioner)) {
      self.mapPartitions(iter => {
        val context = TaskContext.get()
        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {
      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }

  /**
   * Simplified version of combineByKey that hash-partitions the output RDD.
   */
  def combineByKey[C](createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      numPartitions: Int): RDD[(K, C)] = {
    combineByKey(createCombiner, mergeValue, mergeCombiners, new HashPartitioner(numPartitions))
  }
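
A hedged usage sketch (assuming a spark-shell with `sc`; the data below is made up) that collects all values of a key into a Seq, matching the (V1, 2), (V1, 1) => (V1, Seq(2, 1)) example above:

  // Hypothetical example: build a Seq[Int] per key with combineByKey.
  val pairs = sc.parallelize(Seq(("V1", 2), ("V1", 1), ("V2", 5)))
  val grouped = pairs.combineByKey(
    (v: Int) => Seq(v),                       // createCombiner: first value seen for a key
    (c: Seq[Int], v: Int) => c :+ v,          // mergeValue: append a value to the partial Seq
    (c1: Seq[Int], c2: Seq[Int]) => c1 ++ c2  // mergeCombiners: merge partial Seqs across partitions
  )
  // grouped.collect() might yield Array((V1,Seq(2, 1)), (V2,Seq(5)))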

(2) reduceByKey

reduceByKey is a simpler special case: two values are merged into one, so createCombiner is trivial (it just returns v), and mergeValue and mergeCombiners share exactly the same logic.



Each box represents an RDD partition. A user-defined function (A, B) => (A + B) adds the values of the elements (V1, 2) and (V1, 1), which share the same key, producing (V1, 3).

Source code:

  /**
   * Merge the values for each key using an associative reduce function. This will also perform
   * the merging locally on each mapper before sending results to a reducer, similarly to a
   * "combiner" in MapReduce.
   */
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = {
    combineByKey[V]((v: V) => v, func, func, partitioner)
  }

  /**
   * Merge the values for each key using an associative reduce function. This will also perform
   * the merging locally on each mapper before sending results to a reducer, similarly to a
   * "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
   */
  def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = {
    reduceByKey(new HashPartitioner(numPartitions), func)
  }

  /**
   * Merge the values for each key using an associative reduce function. This will also perform
   * the merging locally on each mapper before sending results to a reducer, similarly to a
   * "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
   * parallelism level.
   */
  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = {
    reduceByKey(defaultPartitioner(self), func)
  }
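
A minimal usage sketch (assuming a spark-shell with `sc`; sample data is made up):

  // Hypothetical example: sum values per key, matching (V1,2) + (V1,1) => (V1,3).
  val pairs = sc.parallelize(Seq(("V1", 2), ("V1", 1), ("V2", 4)))
  val sums = pairs.reduceByKey((a, b) => a + b)
  // sums.collect() => Array((V1,3), (V2,4))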

(3) partitionBy

partitionBy repartitions an RDD by key.

If the RDD's existing partitioner equals the given partitioner, no repartitioning is done; otherwise a new ShuffledRDD is created according to the given partitioner.



Each box represents an RDD partition. Under the new partitioning strategy, the V1 and V2 elements that originally lived in different partitions end up in the same partition.

Source code:

  /**
   * Return a copy of the RDD partitioned using the specified partitioner.
   */
  def partitionBy(partitioner: Partitioner): RDD[(K, V)] = {
    if (keyClass.isArray && partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("Default partitioner cannot partition array keys.")
    }
    if (self.partitioner == Some(partitioner)) {
      self
    } else {
      new ShuffledRDD[K, V, V](self, partitioner)
    }
  }
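
A minimal usage sketch (assuming a spark-shell with `sc`; the data and partition counts are made up):

  // Hypothetical example: force all elements with the same key into the same partition.
  import org.apache.spark.HashPartitioner
  val pairs = sc.parallelize(Seq(("V1", 1), ("V2", 2), ("V1", 3)), 4)
  val repartitioned = pairs.partitionBy(new HashPartitioner(2))
  // repartitioned now has 2 partitions; per the source above, calling partitionBy
  // again with an equal partitioner returns the same RDD without another shuffle.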

(4) cogroup

cogroup co-partitions two RDDs.

For the key-value elements of the two RDDs, the elements with the same key in each RDD are grouped into their own collection, and the result pairs each key with both collections: (K, (Iterable[V], Iterable[W])). In other words, the value for each key is a tuple of the two collections of values that share that key, one from each RDD.



The large boxes represent RDDs, and the small boxes inside them represent partitions. The elements (U1, 1) and (U1, 2) from RDD1 and the element (U1, 2) from RDD2 are merged into (U1, ((1, 2), (2))).

Source code:

  /**
   * For each key k in `this` or `other1` or `other2` or `other3`,
   * return a resulting RDD that contains a tuple with the list of values
   * for that key in `this`, `other1`, `other2` and `other3`.
   */
  def cogroup[W1, W2, W3](other1: RDD[(K, W1)],
      other2: RDD[(K, W2)],
      other3: RDD[(K, W3)],
      partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = {
    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
      throw new SparkException("Default partitioner cannot partition array keys.")
    }
    val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), partitioner)
    cg.mapValues { case Array(vs, w1s, w2s, w3s) =>
      (vs.asInstanceOf[Iterable[V]],
        w1s.asInstanceOf[Iterable[W1]],
        w2s.asInstanceOf[Iterable[W2]],
        w3s.asInstanceOf[Iterable[W3]])
    }
  }

  /**
   * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
   * list of values for that key in `this` as well as `other`.
   */
  def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W]))] = {
    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
      throw new SparkException("Default partitioner cannot partition array keys.")
    }
    val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
    cg.mapValues { case Array(vs, w1s) =>
      (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
    }
  }

  /**
   * For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a
   * tuple with the list of values for that key in `this`, `other1` and `other2`.
   */
  def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = {
    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
      throw new SparkException("Default partitioner cannot partition array keys.")
    }
    val cg = new CoGroupedRDD[K](Seq(self, other1, other2), partitioner)
    cg.mapValues { case Array(vs, w1s, w2s) =>
      (vs.asInstanceOf[Iterable[V]],
        w1s.asInstanceOf[Iterable[W1]],
        w2s.asInstanceOf[Iterable[W2]])
    }
  }

  /**
   * For each key k in `this` or `other1` or `other2` or `other3`,
   * return a resulting RDD that contains a tuple with the list of values
   * for that key in `this`, `other1`, `other2` and `other3`.
   */
  def cogroup[W1, W2, W3](other1: RDD[(K, W1)], other2: RDD[(K, W2)], other3: RDD[(K, W3)])
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = {
    cogroup(other1, other2, other3, defaultPartitioner(self, other1, other2, other3))
  }

  /**
   * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
   * list of values for that key in `this` as well as `other`.
   */
  def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))] = {
    cogroup(other, defaultPartitioner(self, other))
  }

  /**
   * For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a
   * tuple with the list of values for that key in `this`, `other1` and `other2`.
   */
  def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)])
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = {
    cogroup(other1, other2, defaultPartitioner(self, other1, other2))
  }

  /**
   * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
   * list of values for that key in `this` as well as `other`.
   */
  def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))] = {
    cogroup(other, new HashPartitioner(numPartitions))
  }

  /**
   * For each key k in `this` or `other1` or `other2`, return a resulting RDD that contains a
   * tuple with the list of values for that key in `this`, `other1` and `other2`.
   */
  def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int)
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = {
    cogroup(other1, other2, new HashPartitioner(numPartitions))
  }

  /**
   * For each key k in `this` or `other1` or `other2` or `other3`,
   * return a resulting RDD that contains a tuple with the list of values
   * for that key in `this`, `other1`, `other2` and `other3`.
   */
  def cogroup[W1, W2, W3](other1: RDD[(K, W1)],
      other2: RDD[(K, W2)],
      other3: RDD[(K, W3)],
      numPartitions: Int)
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = {
    cogroup(other1, other2, other3, new HashPartitioner(numPartitions))
  }
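
A minimal usage sketch (assuming a spark-shell with `sc`; the data is made up, and the exact Iterable implementation in the output is a Spark internal detail):

  // Hypothetical example: co-group two pair RDDs by key.
  val rdd1 = sc.parallelize(Seq(("U1", 1), ("U1", 2), ("U2", 3)))
  val rdd2 = sc.parallelize(Seq(("U1", 2), ("U3", 4)))
  val grouped = rdd1.cogroup(rdd2)
  // grouped.collect() might yield (order may vary):
  //   (U1, (CompactBuffer(1, 2), CompactBuffer(2)))
  //   (U2, (CompactBuffer(3),    CompactBuffer()))
  //   (U3, (CompactBuffer(),     CompactBuffer(4)))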

Join

(1) join

join applies cogroup to the two RDDs being joined. On the resulting RDD, it takes the Cartesian product of the values under each key and flattens the result, so that all value pairs for a key are produced, finally returning an RDD[(K, (V, W))].

In essence, join first co-partitions the two RDDs with cogroup, then uses flatMapValues to expand the merged data.



Diagram of joining two RDDs: the large boxes represent RDDs and the small boxes represent their partitions.

For elements that share the same key (for example V1), the join produces (V1, (1, 1)) and (V1, (1, 2)).

Source code:

  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
   */
  def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }
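
A minimal usage sketch (assuming a spark-shell with `sc`; sample data is made up):

  // Hypothetical example: inner join keeps only keys present in both RDDs.
  val left  = sc.parallelize(Seq(("V1", 1), ("V2", 7)))
  val right = sc.parallelize(Seq(("V1", 1), ("V1", 2), ("V3", 9)))
  val joined = left.join(right)
  // joined.collect() => Array((V1,(1,1)), (V1,(1,2)))  -- V2 and V3 have no match and are dropped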

(2) leftOuterJoin and rightOuterJoin

leftOuterJoin (left outer join) and rightOuterJoin (right outer join) behave like join, except that they first check whether one side has any elements for a key: if that side is empty, the missing values are filled with None; otherwise the two sides are joined and the result is returned.

Source code:

  /**
   * Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the
   * resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the
   * pair (k, (v, None)) if no elements in `other` have key k. Uses the given Partitioner to
   * partition the output RDD.
   */
  def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))] = {
    this.cogroup(other, partitioner).flatMapValues { pair =>
      if (pair._2.isEmpty) {
        pair._1.iterator.map(v => (v, None))
      } else {
        for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
      }
    }
  }

  /**
   * Perform a right outer join of `this` and `other`. For each element (k, w) in `other`, the
   * resulting RDD will either contain all pairs (k, (Some(v), w)) for v in `this`, or the
   * pair (k, (None, w)) if no elements in `this` have key k. Uses the given Partitioner to
   * partition the output RDD.
   */
  def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
      : RDD[(K, (Option[V], W))] = {
    this.cogroup(other, partitioner).flatMapValues { pair =>
      if (pair._1.isEmpty) {
        pair._2.iterator.map(w => (None, w))
      } else {
        for (v <- pair._1.iterator; w <- pair._2.iterator) yield (Some(v), w)
      }
    }
  }
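
A minimal usage sketch (assuming a spark-shell with `sc`; sample data is made up, and collect() output order may vary):

  // Hypothetical example: outer joins pad the missing side with None.
  val left  = sc.parallelize(Seq(("V1", 1), ("V2", 7)))
  val right = sc.parallelize(Seq(("V1", 2), ("V3", 9)))
  left.leftOuterJoin(right).collect()
  // => Array((V1,(1,Some(2))), (V2,(7,None)))
  left.rightOuterJoin(right).collect()
  // => Array((V1,(Some(1),2)), (V3,(None,9)))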

Please credit the author, Jason Ding, and cite the source when reposting.

GitCafe blog (http://jasonding1354.gitcafe.io/)

Github blog (http://jasonding1354.github.io/)

CSDN blog (http://blog.csdn.net/jasonding1354)

Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)

Search "jasonding1354" on Google to reach my blog homepage.
