I. RDD Overview
     1. What is an RDD
          RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitioned collection whose elements can be computed in parallel. RDDs have the characteristics of a dataflow model: automatic fault tolerance, locality-aware scheduling and scalability. RDDs allow users to explicitly cache a working set in memory across multiple queries so that later queries can reuse it, which greatly speeds up queries.
     2. RDD properties
     (1) A list of partitions, the basic units that make up the dataset. Each partition is processed by one task, so the number of partitions determines the granularity of parallelism. Users can specify the number of partitions when creating an RDD; if it is not specified, a default value is used — the number of CPU cores allocated to the program.
     (2) A function for computing each partition. Spark computes RDDs partition by partition, and every RDD implements a compute function for this purpose. compute composes the iterators, so intermediate results do not have to be stored.
     (3) Dependencies on other RDDs. Every transformation of an RDD produces a new RDD, so RDDs form pipeline-like dependencies on one another. When some partitions are lost, Spark recomputes only the lost partitions through these dependencies instead of recomputing every partition of the RDD.
     (4) A Partitioner, i.e. the RDD's partitioning function. Spark currently provides two kinds of partitioner: the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs have a Partitioner; for non key-value RDDs the Partitioner is None. The Partitioner determines the number of partitions of the RDD itself as well as the number of partitions of the parent RDD's shuffle output.
     (5) A list storing the preferred locations of each Partition. For an HDFS file, this list holds the locations of the blocks each Partition lives on. Following the principle that "moving computation is cheaper than moving data", Spark tries to schedule each task onto the node that stores the data blocks it needs to process.
     3. Creating an RDD
          (1) From an existing Scala collection.
                         val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8))
          (2) From a dataset in an external storage system: the local file system or any storage supported by Hadoop, such as HDFS, Cassandra, HBase, etc.
                         val rdd2 = sc.textFile("hdfs://hadoop141:8020/words.txt")
          (3) Check the RDD's number of partitions. By default it is the number of CPU cores allocated to the program; it can also be specified at creation time.
                         rdd1.partitions.length
                     Specifying the number of partitions at creation time:
                         val rdd1 = sc.parallelize(Array(1,2,3,4),3)
II. RDD programming API --- two kinds of operators
     1. Transformations
          All transformations on RDDs are lazily evaluated; they do not compute their results right away. Instead, they only remember the transformations applied to the base dataset (for example a file). The transformations actually run only when an action requires a result to be returned to the Driver. This design lets Spark run more efficiently.
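          To make the laziness concrete, here is a minimal sketch, assuming a spark-shell session where sc is available (the HDFS path is the one used earlier in this article):

val lines = sc.textFile("hdfs://hadoop141:8020/words.txt")   // nothing is read yet, only the lineage is recorded
val lengths = lines.map(_.length)                             // still no job is submitted
val total = lengths.count()                                   // count is an action: a job actually runs here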
     2. Common Transformation operations:
          (1) map(func): returns a new RDD formed by passing every input element through the function func.
          (2) filter(func): returns a new RDD consisting of the input elements for which func returns true.
          (3) sortBy(func, [ascending], [numTasks]): returns a new RDD sorted by the value computed by func for each element. (ascending defaults to true, i.e. ascending order; pass false for descending order.)
val rdd1 = sc.parallelize(List(,,,,,,,,,))
val rdd2 = sc.parallelize(List(,,,,,,,,,)).map(_*).sortBy(x=>x,true)
val rdd3 = rdd2.filter(_>)
val rdd2 = sc.parallelize(List(,,,,,,,,,)).map(_*).sortBy(x=>x+"",true)
val rdd2 = sc.parallelize(List(,,,,,,,,,)).map(_*).sortBy(x=>x.toString,true)
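Because the literal values in the snippet above were lost, here is a small self-contained sketch of map/filter/sortBy with made-up numbers, assuming a spark-shell session:

val rdd1 = sc.parallelize(List(5, 6, 4, 7, 3, 8, 2, 9, 1, 10))
val rdd2 = rdd1.map(_ * 2).sortBy(x => x, true)   // double every element, then sort ascending
val rdd3 = rdd2.filter(_ > 10)                    // keep only elements greater than 10
rdd3.collect                                      // Array(12, 14, 16, 18, 20)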
          (4)flatMap(func):类似于map,但是每一个输入元素可以被映射为0或多个输出元素(所以func应该返回一个序列,而不是单一元素)。类似于先map,然后再flatten。
val rdd4 = sc.parallelize(Array("a b c", "d e f", "h i j"))
rdd4.flatMap(_.split(' ')).collect
------------------------------------------------------------------
val rdd5 = sc.parallelize(List(List("a b c", "a b b"),List("e f g", "a f g"), List("h i j", "a a b")))
rdd5.flatMap(_.flatMap(_.split(" "))).collect
          (5) union: returns the union of two RDDs; note that the element types must match.
          (6) intersection: returns the intersection of two RDDs.
          (7) distinct: removes duplicate elements.
val rdd6 = sc.parallelize(List(,,,))
val rdd7 = sc.parallelize(List(,,,))
val rdd8 = rdd6.union(rdd7)
rdd8.distinct.sortBy(x=>x).collect
--------------------------------------------
val rdd9 = rdd6.intersection(rdd7)
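Since the numeric literals above were lost, a runnable sketch with made-up values:

val rdd6 = sc.parallelize(List(5, 6, 4, 3))
val rdd7 = sc.parallelize(List(1, 2, 3, 4))
val rdd8 = rdd6.union(rdd7)                 // 5,6,4,3,1,2,3,4 -- duplicates are kept
rdd8.distinct.sortBy(x => x).collect        // Array(1, 2, 3, 4, 5, 6)
val rdd9 = rdd6.intersection(rdd7)
rdd9.collect                                // Array(4, 3) -- order may vary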
          (8) join, leftOuterJoin, rightOuterJoin
val rdd1 = sc.parallelize(List(("tom", ), ("jerry", ), ("kitty", )))
val rdd2 = sc.parallelize(List(("jerry", ), ("tom", ), ("shuke", )))
--------------------------------------------------------------------------
val rdd3 = rdd1.join(rdd2).collect
rdd3: Array[(String, (Int, Int))] = Array((tom,(,)), (jerry,(,)))
---------------------------------------------------------------------------
val rdd3 = rdd1.leftOuterJoin(rdd2).collect
rdd3: Array[(String, (Int, Option[Int]))] = Array((tom,(,Some())), (jerry,(,Some())), (kitty,(,None)))
---------------------------------------------------------------------------
val rdd3 = rdd1.rightOuterJoin(rdd2).collect
rdd3: Array[(String, (Option[Int], Int))] = Array((tom,(Some(),)), (jerry,(Some(),)), (shuke,(None,)))
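The tuple values above were lost; the following sketch with made-up numbers shows the shape of the three join results:

val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
rdd1.join(rdd2).collect           // inner join: Array((tom,(1,1)), (jerry,(3,2)))
rdd1.leftOuterJoin(rdd2).collect  // keeps kitty: (kitty,(2,None))
rdd1.rightOuterJoin(rdd2).collect // keeps shuke: (shuke,(None,2))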
          (9) groupByKey([numTasks]): called on an RDD of (K,V) pairs, returns an RDD of (K, Iterable[V]) ---- it only works on RDDs whose elements are key-value tuples.
val rdd1 = sc.parallelize(List(("tom", ), ("jerry", ), ("kitty", )))
val rdd2 = sc.parallelize(List(("jerry", ), ("tom", ), ("shuke", )))
val rdd3 = rdd1 union rdd2
val rdd4 = rdd3.groupByKey.collect
rdd4: Array[(String, Iterable[Int])] = Array((tom,CompactBuffer(, )), (shuke,CompactBuffer()), (kitty,CompactBuffer()), (jerry,CompactBuffer(, )))
-----------------------------------------------------------------------------------
val rdd5 = rdd4.map(x=>(x._1,x._2.sum))
rdd5: Array[(String, Int)] = Array((tom,), (shuke,), (kitty,), (jerry,))
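A sketch of groupByKey with made-up counts (assuming spark-shell), summing each group afterwards with map:

val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
val rdd3 = rdd1 union rdd2
val rdd4 = rdd3.groupByKey.collect    // e.g. (tom,CompactBuffer(1, 1)), (jerry,CompactBuffer(3, 2)), ...
rdd4.map(x => (x._1, x._2.sum))       // e.g. (tom,2), (jerry,5), (kitty,2), (shuke,2)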
                groupBy: takes a one-argument function; the function's return value is used as the key, and it returns a new RDD[(K, Iterable[T])] whose value is an iterable of all the input elements that map to the same key.
                              The source code:
/**
* Return an RDD of grouped items. Each group consists of a key and a sequence of elements
* mapping to that key. The ordering of elements within each group is not guaranteed, and
* may even differ each time the resulting RDD is evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*/
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
  groupBy[K](f, defaultPartitioner(this))
}

                A concrete code example:
scala> val rdd1=sc.parallelize(List(("a",1,2),("b",1,1),("a",4,5)))
rdd1: org.apache.spark.rdd.RDD[(String, Int, Int)] = ParallelCollectionRDD[47] at parallelize at <console>:24
 
scala> rdd1.groupBy(_._1).collect
res18: Array[(String, Iterable[(String, Int, Int)])] = Array((a,CompactBuffer((a,1,2), (a,4,5))), (b,CompactBuffer((b,1,1))))
          (10) reduceByKey(func, [numTasks]): called on an RDD of (K,V) pairs, returns an RDD of (K,V) pairs in which the values of each key are aggregated with the given reduce function. As with groupByKey, the number of reduce tasks can be set with an optional second argument.
val rdd1 = sc.parallelize(List(("tom", ), ("jerry", ), ("kitty", )))
val rdd2 = sc.parallelize(List(("jerry", ), ("tom", ), ("shuke", )))
val rdd3 = rdd1 union rdd2
val rdd6 = rdd3.reduceByKey(_+_).collect
rdd6: Array[(String, Int)] = Array((tom,), (shuke,), (kitty,), (jerry,))
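A runnable sketch with made-up counts (the original values were lost); reduceByKey combines values for each key within partitions first, so it is usually preferred over groupByKey followed by a sum:

val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
val rdd3 = rdd1 union rdd2
rdd3.reduceByKey(_ + _).collect   // Array((tom,2), (shuke,2), (kitty,2), (jerry,5)) -- order may vary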
          (11) cogroup(otherDataset, [numTasks]): called on RDDs of type (K,V) and (K,W), returns an RDD of type (K, (Iterable[V], Iterable[W])).
val rdd1 = sc.parallelize(List(("tom", ), ("tom", ), ("jerry", ), ("kitty", )))
val rdd2 = sc.parallelize(List(("jerry", ), ("tom", ), ("shuke", )))
val rdd3 = rdd1.cogroup(rdd2).collect
rdd3: Array[(String, (Iterable[Int], Iterable[Int]))] = Array((tom,(CompactBuffer(, ),CompactBuffer())), (jerry,(CompactBuffer(),CompactBuffer())), (shuke,(CompactBuffer(),CompactBuffer())), (kitty,(CompactBuffer(),CompactBuffer())))
----------------------------------------------------------------------------------------
val rdd4 = rdd3.map(x=>(x._1,x._2._1.sum+x._2._2.sum))
rdd4: Array[(String, Int)] = Array((tom,), (jerry,), (shuke,), (kitty,))
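A sketch with made-up values showing cogroup followed by a per-key sum:

val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
val rdd3 = rdd1.cogroup(rdd2).collect
// e.g. (tom,(CompactBuffer(1, 2),CompactBuffer(1))), (shuke,(CompactBuffer(),CompactBuffer(2))), ...
rdd3.map(x => (x._1, x._2._1.sum + x._2._2.sum))   // e.g. (tom,4), (jerry,5), (shuke,2), (kitty,2)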
          (12) cartesian(otherDataset): Cartesian product
val rdd1 = sc.parallelize(List("tom", "jerry"))
val rdd2 = sc.parallelize(List("tom", "kitty", "shuke"))
val rdd3 = rdd1.cartesian(rdd2).collect
rdd3: Array[(String, String)] = Array((tom,tom), (tom,kitty), (tom,shuke), (jerry,tom), (jerry,kitty), (jerry,shuke))
     2. Actions
          Once an action is triggered, a job is actually submitted and executed, and the result is returned to the Driver (or written to external storage) instead of producing another lazily evaluated RDD.
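A few of the most common actions, sketched in spark-shell with made-up values (the output path is hypothetical):

val rdd = sc.parallelize(List(1, 2, 3, 4, 5), 2)
rdd.collect                    // Array(1, 2, 3, 4, 5) -- pulls every element back to the Driver
rdd.count                      // 5
rdd.reduce(_ + _)              // 15
rdd.first                      // 1
rdd.take(3)                    // Array(1, 2, 3)
rdd.top(2)                     // Array(5, 4)
rdd.saveAsTextFile("hdfs://hadoop141:8020/out")   // writes one file per partition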
 
III. RDD programming ---- advanced API
     1.
          mapPartitions: operates on one partition at a time. The function passed in receives an Iterator and must return an Iterator. Source code:
/**
* Return a new RDD by applying a function to each partition of this RDD.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
preservesPartitioning)
}
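A usage sketch with made-up values: the function is called once per partition and receives that partition's whole iterator, which is handy when some setup (such as opening a connection) should happen once per partition rather than once per element.

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
val sums = rdd1.mapPartitions(iter => Iterator(iter.sum))   // sum each partition independently
sums.collect   // Array(6, 15): partition 0 holds 1,2,3 and partition 1 holds 4,5,6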
          mapPartitionsWithIndex: operates on each partition and exposes both the partition index and the values inside that partition. It is a Transformation.
          (1) Source code:
/**
* Return a new RDD by applying a function to each partition of this RDD, while tracking the index
* of the original partition.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
preservesPartitioning indicates whether the returned RDD keeps a partitioner. It may only be set to true when the RDD is a key-value RDD and the function does not modify the keys. Non key-value RDDs generally have no partitioner, and once the keys of a K-V RDD are modified the elements no longer satisfy the partitioner's placement; in those cases it must be false, meaning the returned RDD has not been partitioned by a partitioner.
*/
def mapPartitionsWithIndex[U: ClassTag](-------a function must be passed in
f: (Int, Iterator[T]) => Iterator[U],------the function takes two parameters
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
preservesPartitioning)
}
     (2) Code example:
(1) First define a function that matches the signature mapPartitionsWithIndex expects
scala> val func = (index : Int,iter : Iterator[Int]) => {
| iter.toList.map(x=>"[PartID:" + index + ",val:" + x + "]").iterator
| }
func: (Int, Iterator[Int]) => Iterator[String] = <function2>
(2) Create an RDD with 2 partitions
scala> val rdd1 = sc.parallelize(List(,,,,,,,,),)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[] at parallelize at <console>:
(3) Call the method, passing in the custom function
scala> rdd1.mapPartitionsWithIndex(func).collect
res0: Array[String] = Array([PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:])
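Since the element values above were lost, a complete sketch with made-up data (9 elements, 2 partitions):

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 2)
val func = (index: Int, iter: Iterator[Int]) => {
  iter.toList.map(x => "[PartID:" + index + ",val:" + x + "]").iterator
}
rdd1.mapPartitionsWithIndex(func).collect
// Array([PartID:0,val:1], [PartID:0,val:2], [PartID:0,val:3], [PartID:0,val:4],
//       [PartID:1,val:5], [PartID:1,val:6], [PartID:1,val:7], [PartID:1,val:8], [PartID:1,val:9])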
     2. aggregate: an aggregation operation; it is an Action
          (1) Source code
/**
* Aggregate the elements of each partition, and then the results for all the partitions, using
* given combine functions and a neutral "zero value". This function can return a different result
* type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U
* and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
* allowed to modify and return their first argument instead of creating a new U to avoid memory
* allocation.
Aggregates the elements of the RDD. A zero value must be provided (elements are accumulated, so an initial value for the accumulation is needed). Elements are first aggregated within each partition using seqOp (folding elements of type T into a per-partition result of type U), then the partition results are combined using combOp, and that final result is returned.
*
* @param zeroValue the initial value for the accumulated result of each partition for the
* `seqOp` operator, and also the initial value for the combine results from
* different partitions for the `combOp` operator - this will typically be the
* neutral element (e.g. `Nil` for list concatenation or `0` for summation)
* @param seqOp an operator used to accumulate results within a partition
* @param combOp an associative operator used to combine results from different partitions
The first parameter is the initial value; then come two functions, each taking two arguments and returning one value: the first function aggregates the elements within each partition, and the second combines the per-partition results.
*/
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope {
// Clone the zero value since we will also be serializing it as part of tasks
var jobResult = Utils.clone(zeroValue, sc.env.serializer.newInstance())
val cleanSeqOp = sc.clean(seqOp)
val cleanCombOp = sc.clean(combOp)
val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)
sc.runJob(this, aggregatePartition, mergeResult)
jobResult
}
          (2) Code examples:
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[] at parallelize at <console>:
//Here the two partitions are summed separately, and the two partition sums are then added together to give the final result
scala> rdd1.aggregate(0)(_+_,_+_)
res0: Int = 45
//Each partition first finds its maximum; the per-partition maxima are then added together to give the final result
scala> rdd1.aggregate(0)(math.max(_,_),_+_)
res1: Int = 13
//Note that the initial value takes part in every step of the computation. In the call below, partition 1 holds 1,2,3,4 and the initial value is 5, so its maximum is 5; partition 2 holds 5,6,7,8,9, so with the initial value 5 its maximum is 9. The partition results are then added, and the initial value participates again: 5+(5+9)=19
scala> rdd1.aggregate(5)(math.max(_,_),_+_)
res0: Int = 19
-----------------------------------------------------------------------------------------------
scala> val rdd2 = sc.parallelize(List("a","b","c","d","e","f"),)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
//这里需要注意,由于每个分区计算是并行计算,所以计算出的结果有先后顺序,所以结果会出现两种情况:如下
scala> rdd2.aggregate("")(_+_,_+_)
res0: String = defabc scala> rdd2.aggregate("")(_+_,_+_)
res2: String = abcdef
//这里的例子更能说明上面提到的初始值参与计算的问题,我们可以看到初始值=号参与了三次计算
scala> rdd2.aggregate("=")(_+_,_+_)
res0: String = ==def=abc
--------------------------------------------------------------------------------------
scala> val rdd3 = sc.parallelize(List("","","",""),)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd3.aggregate("")((x,y)=>math.max(x.length,y.length).toString,_+_)
res1: String =
scala> rdd3.aggregate("")((x,y)=>math.max(x.length,y.length).toString,_+_)
res3: String =
-------------------------------------------------------------------------------------------
scala> val rdd4 = sc.parallelize(List("12","23","345",""),2)
rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
//Note: including the initial value, the first partition's elements are "","12","23"; comparing them pairwise, the final minimum length is 1. The second partition's elements are "","345",""; comparing them pairwise, the final minimum length is 0
scala> rdd4.aggregate("")((x,y)=>math.min(x.length,y.length).toString,_+_)
res4: String = 10
scala> rdd4.aggregate("")((x,y)=>math.min(x.length,y.length).toString,_+_)
res9: String = 01
------------------------------------------------------------------------------------
//Note the difference from the previous example: the elements of this RDD are in a different order, which changes the result
scala> val rdd5 = sc.parallelize(List("","","",""),)
rdd5: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd5.aggregate("")((x,y)=>math.min(x.length,y.length).toString,(x,y)=>x+y)
res1: String =
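To restate the two rules with a clean, runnable sketch (made-up data, 2 partitions): the zero value takes part in every seqOp and once more in combOp, and the order in which the partition results are combined is not deterministic.

val rdd = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)         // partitions: (1,2,3) and (4,5,6)
rdd.aggregate(0)(_ + _, _ + _)                               // 21
rdd.aggregate(10)(_ + _, _ + _)                              // 51: 10 is added in each partition and once more when combining
rdd.aggregate(0)(math.max(_, _), _ + _)                      // 9 = 3 + 6
sc.parallelize(List("a", "b", "c", "d"), 2).aggregate("")(_ + _, _ + _)   // "abcd" or "cdab"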
     3. aggregateByKey: aggregates the values of each key
//Define the RDD
scala> val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[] at parallelize at <console>:
//A custom function to pass to mapPartitionsWithIndex
scala> val func=(index:Int,iter:Iterator[(String, Int)])=>{
| iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
| }
func: (Int, Iterator[(String, Int)]) => Iterator[String] = <function2>
//Check how the data is partitioned
scala> pairRDD.mapPartitionsWithIndex(func).collect
res2: Array[String] = Array([partID:, val: (cat,)], [partID:, val: (cat,)], [partID:, val: (mouse,)], [partID:, val: (cat,)], [partID:, val: (dog,)], [partID:, val: (mouse,)])
//Note: the difference between an initial value of 0 and other initial values
scala> pairRDD.aggregateByKey()(_+_,_+_).collect
res4: Array[(String, Int)] = Array((dog,), (cat,), (mouse,))
scala> pairRDD.aggregateByKey()(_+_,_+_).collect
res5: Array[(String, Int)] = Array((dog,), (cat,), (mouse,))
//The difference between the next three calls: the first is easy to understand -- since the initial value is 0, each partition outputs the largest value of each animal, and those maxima are then summed
scala> pairRDD.aggregateByKey(0)(math.max(_,_),_+_).collect
res6: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
//The next two have a non-zero initial value, so the initial value has to be taken into account. Here the first partition holds ("cat",2), ("cat",5), ("mouse",4) and the initial value is 10; within each animal the values are compared pairwise with the initial value included, so the first partition outputs ("cat",10), ("mouse",10). The second partition does the same and outputs (dog,12), (cat,12), (mouse,10). Summing the partition results gives (dog,12), (cat,22), (mouse,20). Note that the initial value does not take part again when the per-partition results are combined
scala> pairRDD.aggregateByKey(10)(math.max(_,_),_+_).collect
res7: Array[(String, Int)] = Array((dog,12), (cat,22), (mouse,20))
//This one is similar to the case above
scala> pairRDD.aggregateByKey()(math.max(_,_),_+_).collect
res8: Array[(String, Int)] = Array((dog,), (cat,), (mouse,))
     4. coalesce: returns a new RDD
          Repartitions the elements of the RDD.
          When the number of partitions is reduced moderately, e.g. 1000 -> 100, Spark simply treats every 10 of the old partitions as one new partition; the parallelism becomes 100 and no data shuffle is triggered.
          When the number of partitions is reduced drastically, e.g. 1000 -> 1, the computation would run with a parallelism of 1. To avoid this loss of parallelism, shuffle can be set to true; the work before and after the shuffle then runs in different stages, with parallelism 1000 and 1 respectively.
          When the number of partitions is increased, a shuffle is unavoidable, so shuffle must be set to true. A sketch of these cases follows below.
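A sketch of the three cases above, with made-up partition counts, assuming spark-shell:

val rdd = sc.parallelize(1 to 1000, 100)
rdd.coalesce(10).partitions.length          // 10, no shuffle: 10 old partitions are merged into each new one
rdd.coalesce(10, true).partitions.length    // 10, but with a shuffle (an extra stage)
rdd.coalesce(200).partitions.length         // still 100: growing the partition count without shuffle = true has no effect
rdd.repartition(200).partitions.length      // 200: repartition is coalesce(200, shuffle = true)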
          
          partitionBy: partitions the RDD with the partitioner passed in. The argument is a partitioner instance -- either a custom partitioner or the default HashPartitioner. Source code:
/**
* Return a copy of the RDD partitioned using the specified partitioner.
*/
def partitionBy(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
if (keyClass.isArray && partitioner.isInstanceOf[HashPartitioner]) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
if (self.partitioner == Some(partitioner)) {
self
} else {
new ShuffledRDD[K, V, V](self, partitioner)
}
}
repartition: returns a new RDD
Repartitions the RDD into the specified number of partitions; a shuffle always takes place.
When the target number of partitions is smaller than the current one, consider using coalesce instead so the shuffle can be avoided.
scala> val rdd1 = sc.parallelize(Array(,,,,,,,),)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd2 = rdd1.repartition()
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[] at repartition at <console>:
scala> rdd2.partitions.length
res0: Int =
scala> val rdd3 = rdd2.coalesce(,true)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[] at coalesce at <console>:
scala> rdd3.partitions.length
res1: Int =
     5. collectAsMap: turns the RDD into a Map (note: the RDD's elements must be key-value tuples)
scala> val rdd1 = sc.parallelize(List(("a", ), ("b", ),("c", ),("d", ),("e", )))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd1.collectAsMap
res3: scala.collection.Map[String,Int] = Map(e -> , b -> , d -> , a -> , c -> )
     6. combineByKey: a general aggregate-by-key function; reduceByKey is implemented on top of combineByKey
          (1) Source code
/**
* Generic function to combine the elements for each key using a custom set of aggregation
* functions. This method is here for backward compatibility. It does not provide combiner
* classtag information to the shuffle.
*
* @see [[combineByKeyWithClassTag]]
*/
def combineByKey[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null): RDD[(K, C)] = self.withScope {
combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
partitioner, mapSideCombine, serializer)(null)
}
/**
* Simplified version of combineByKeyWithClassTag that hash-partitions the output RDD.
* This method is here for backward compatibility. It does not provide combiner
* classtag information to the shuffle.
*
* @see [[combineByKeyWithClassTag]]
*/
def combineByKey[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
numPartitions: Int): RDD[(K, C)] = self.withScope {
combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, numPartitions)(null)
}
(2) Parameters:
The first parameter, createCombiner: V => C, creates the combiner: for each key it takes the first value and turns it into the type C you want to combine into.
The second parameter, mergeValue: (C, V) => C, merges another value into the combiner within a partition (the local computation).
The third parameter, mergeCombiners: (C, C) => C, merges the local results from different partitions.
(3) Code example
//First declare two RDDs, then use zip to combine them into one, rdd6
scala> val rdd4 = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), )
rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd5 = sc.parallelize(List(,,,,,,,,), )
rdd5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd6 = rdd5.zip(rdd4)
rdd6: org.apache.spark.rdd.RDD[(Int, String)] = ZippedPartitionsRDD2[] at zip at <console>:
scala> rdd6.collect
res6: Array[(Int, String)] = Array((,dog), (,cat), (,gnu), (,salmon), (,rabbit), (,turkey), (,wolf), (,bear), (,bee))
//We want to group and merge the values by key, collecting all values with the same key into a List
//The first argument, List(_), takes the first value of each key and puts it into a List
//The second argument, (x:List[String],y:String)=>x :+ y, is the local computation: it appends a value to the List
//The third argument, (m:List[String],n:List[String])=>m++n, combines the local results
scala> val rdd7 = rdd6.combineByKey(List(_),(x:List[String],y:String)=>x :+ y,(m:List[String],n:List[String])=>m++n)
rdd7: org.apache.spark.rdd.RDD[(Int, List[String])] = ShuffledRDD[] at combineByKey at <console>:
scala> rdd7.collect
res7: Array[(Int, List[String])] = Array((,List(dog, cat, turkey)), (,List(wolf, bear, bee, salmon, rabbit, gnu)))
//The first argument can also be written in other ways, for example the two variants below
scala> val rdd7 = rdd6.combineByKey(_::List(),(x:List[String],y:String)=>x :+ y,(m:List[String],n:List[String])=>m++n).collect
rdd7: Array[(Int, List[String])] = Array((,List(turkey, dog, cat)), (,List(wolf, bear, bee, gnu, salmon, rabbit)))
scala> val rdd7 = rdd6.combineByKey(_::Nil,(x:List[String],y:String)=>x :+ y,(m:List[String],n:List[String])=>m++n).collect
rdd7: Array[(Int, List[String])] = Array((,List(turkey, dog, cat)), (,List(wolf, bear, bee, gnu, salmon, rabbit)))
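Computing a per-key average is a common way to see why all three functions are needed; this is a sketch with made-up scores, not taken from the original text:

val scores = sc.parallelize(List(("tom", 88), ("tom", 95), ("jerry", 91), ("kitty", 78), ("jerry", 89)), 2)
val sumCount = scores.combineByKey(
  (v: Int) => (v, 1),                                          // createCombiner: first value of a key -> (sum, count)
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue: fold another value into the local combiner
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners: merge combiners from different partitions
)
sumCount.mapValues(x => x._1.toDouble / x._2).collect          // e.g. (tom,91.5), (jerry,90.0), (kitty,78.0)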
     7. countByKey, countByValue: count occurrences by key or by value
scala> val rdd1 = sc.parallelize(List(("a", ), ("b", ), ("b", ), ("c", ), ("c", )))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd1.countByKey
res8: scala.collection.Map[String,Long] = Map(a -> , b -> , c -> )
scala> rdd1.countByValue
res9: scala.collection.Map[(String, Int),Long] = Map((c,) -> , (a,) -> , (b,) -> , (c,) -> )
     8. filterByRange: on a (K,V) RDD with ordered keys, keeps only the elements whose key falls within the given range
scala> val rdd1 = sc.parallelize(List(("e", ), ("c", ), ("d", ), ("c", ), ("a", ),("b",)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[] at parallelize at <console>:
//Note: the range passed in is inclusive at both ends
scala> val rdd2 = rdd1.filterByRange("b","d")
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[] at filterByRange at <console>:
scala> rdd2.collect
res10: Array[(String, Int)] = Array((c,), (d,), (c,), (b,))
     9. flatMapValues: processes the values, similar to flatMap; every value it produces is paired with its key
scala> val rdd3 = sc.parallelize(List(("a", "1 2"), ("b", "3 4")))
rdd3: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd4 = rdd3.flatMapValues(_.split(" "))
rdd4: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[] at flatMapValues at <console>:
scala> rdd4.collect
res11: Array[(String, String)] = Array((a,1), (a,2), (b,3), (b,4))
mapValues: leaves the key unchanged and applies the function only to the value of each key-value pair, similar to map. Note the difference from flatMapValues above: it does not split the key-value pair apart, it just transforms each value with the given function;
scala> val rdd3 = sc.parallelize(List(("a",(,)),("b",(,))))
rdd3: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd3.mapValues(x=>x._1 + x._2).collect
res34: Array[(String, Int)] = Array((a,), (b,))
------------------------------------------------------------------------
With flatMapValues the result is as follows: the value is split completely apart and every piece is paired with the key
scala> rdd3.flatMapValues(x=>x + "").collect
res36: Array[(String, Char)] = Array((a,(), (a,), (a,,), (a,), (a,)), (b,(), (b,), (b,,), (b,), (b,)))
     10. foldByKey: groups by key and folds the values of each group
scala> val rdd1 = sc.parallelize(List("dog", "wolf", "cat", "bear"), )
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd2 = rdd1.map(x=>(x.length,x))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[] at map at <console>:
scala> rdd2.collect
res12: Array[(Int, String)] = Array((3,dog), (4,wolf), (3,cat), (4,bear))
-----------------------------------------------------------------------------
scala> val rdd3 = rdd2.foldByKey("")(_+_)
rdd3: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[] at foldByKey at <console>:
scala> rdd3.collect
res13: Array[(Int, String)] = Array((4,bearwolf), (3,dogcat))
scala> val rdd3 = rdd2.foldByKey(" ")(_+_).collect
rdd3: Array[(Int, String)] = Array((4," bear wolf"), (3," dog cat"))
-----------------------------------------------------------------------------
//Computing a word count
val rdd = sc.textFile("hdfs://node-1.itcast.cn:9000/wc").flatMap(_.split(" ")).map((_, 1))
rdd.foldByKey(0)(_+_)
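The same word count, sketched on an in-memory collection so it runs without HDFS (the sentences are made up):

val lines = sc.parallelize(List("hello spark", "hello scala", "spark is fast"))
val counts = lines.flatMap(_.split(" ")).map((_, 1)).foldByKey(0)(_ + _)
counts.collect   // e.g. Array((hello,2), (spark,2), (scala,1), (is,1), (fast,1))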
     11. keyBy: uses the result of the given function as the key, producing a new RDD
scala> val rdd1 = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), )
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd2 = rdd1.keyBy(_.length)
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[] at keyBy at <console>:
scala> rdd2.collect
res14: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))
     12. keys, values: extract the keys or the values of the RDD into a new RDD
scala> val rdd1 = sc.parallelize(List(("e", ), ("c", ), ("d", ), ("c", ), ("a", )))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd1.keys.collect
res16: Array[String] = Array(e, c, d, c, a)
scala> rdd1.values.collect
res17: Array[Int] = Array(, , , , )
