I. RDD Overview
     1. What is an RDD
          RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitioned collection whose elements can be computed in parallel. RDDs have the characteristics of a dataflow model: automatic fault tolerance, locality-aware scheduling and scalability. RDDs allow users to explicitly cache a working set in memory across multiple queries so that later queries can reuse it, which greatly speeds up queries.
     2. RDD properties
     (1) A list of partitions, the basic units that make up the dataset. Each partition is processed by one task, so the number of partitions determines the granularity of parallelism. Users can specify the number of partitions when creating an RDD; if it is not specified, a default value is used — the number of CPU cores allocated to the program.
     (2) A function for computing each partition. Spark computes RDDs partition by partition, and every RDD implements a compute function for this purpose. compute composes the iterators, so intermediate results do not have to be stored.
     (3) Dependencies on other RDDs. Every transformation of an RDD produces a new RDD, so RDDs form pipeline-like dependencies on one another. When some partitions are lost, Spark recomputes only the lost partitions through these dependencies instead of recomputing every partition of the RDD.
     (4) A Partitioner, i.e. the RDD's partitioning function. Spark currently provides two kinds of partitioner: the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs have a Partitioner; for non key-value RDDs the Partitioner is None. The Partitioner determines the number of partitions of the RDD itself as well as the number of partitions of the parent RDD's shuffle output.
     (5) A list storing the preferred locations of each Partition. For an HDFS file, this list holds the locations of the blocks each Partition lives on. Following the principle that "moving computation is cheaper than moving data", Spark tries to schedule each task onto the node that stores the data blocks it needs to process.
     3. Creating an RDD
          (1) From an existing Scala collection.
                         val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8))
          (2) From a dataset in an external storage system: the local file system or any storage supported by Hadoop, such as HDFS, Cassandra, HBase, etc.
                         val rdd2 = sc.textFile("hdfs://hadoop141:8020/words.txt")
          (3) Check the RDD's number of partitions. By default it is the number of CPU cores allocated to the program; it can also be specified at creation time.
                         rdd1.partitions.length
                     Specifying the number of partitions at creation time:
                         val rdd1 = sc.parallelize(Array(1,2,3,4),3)
II. RDD programming API --- two kinds of operators
     1. Transformations
          All transformations on RDDs are lazily evaluated; they do not compute their results right away. Instead, they only remember the transformations applied to the base dataset (for example a file). The transformations actually run only when an action requires a result to be returned to the Driver. This design lets Spark run more efficiently.
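          To make the laziness concrete, here is a minimal sketch, assuming a spark-shell session where sc is available (the HDFS path is the one used earlier in this article):

val lines = sc.textFile("hdfs://hadoop141:8020/words.txt")   // nothing is read yet, only the lineage is recorded
val lengths = lines.map(_.length)                             // still no job is submitted
val total = lengths.count()                                   // count is an action: a job actually runs here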
     2. Common Transformation operations:
          (1) map(func): returns a new RDD formed by passing every input element through the function func.
          (2) filter(func): returns a new RDD consisting of the input elements for which func returns true.
          (3) sortBy(func, [ascending], [numTasks]): returns a new RDD sorted by the value computed by func for each element. (ascending defaults to true, i.e. ascending order; pass false for descending order.)
val rdd1 = sc.parallelize(List(,,,,,,,,,))
val rdd2 = sc.parallelize(List(,,,,,,,,,)).map(_*).sortBy(x=>x,true)
val rdd3 = rdd2.filter(_>)
val rdd2 = sc.parallelize(List(,,,,,,,,,)).map(_*).sortBy(x=>x+"",true)
val rdd2 = sc.parallelize(List(,,,,,,,,,)).map(_*).sortBy(x=>x.toString,true)
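Because the literal values in the snippet above were lost, here is a small self-contained sketch of map/filter/sortBy with made-up numbers, assuming a spark-shell session:

val rdd1 = sc.parallelize(List(5, 6, 4, 7, 3, 8, 2, 9, 1, 10))
val rdd2 = rdd1.map(_ * 2).sortBy(x => x, true)   // double every element, then sort ascending
val rdd3 = rdd2.filter(_ > 10)                    // keep only elements greater than 10
rdd3.collect                                      // Array(12, 14, 16, 18, 20)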
          (4)flatMap(func):类似于map,但是每一个输入元素可以被映射为0或多个输出元素(所以func应该返回一个序列,而不是单一元素)。类似于先map,然后再flatten。
val rdd4 = sc.parallelize(Array("a b c", "d e f", "h i j"))
rdd4.flatMap(_.split(' ')).collect
------------------------------------------------------------------
val rdd5 = sc.parallelize(List(List("a b c", "a b b"),List("e f g", "a f g"), List("h i j", "a a b")))
rdd5.flatMap(_.flatMap(_.split(" "))).collect
          (5) union: returns the union of two RDDs; note that the element types must match.
          (6) intersection: returns the intersection of two RDDs.
          (7) distinct: removes duplicate elements.
val rdd6 = sc.parallelize(List(,,,))
val rdd7 = sc.parallelize(List(,,,))
val rdd8 = rdd6.union(rdd7)
rdd8.distinct.sortBy(x=>x).collect
--------------------------------------------
val rdd9 = rdd6.intersection(rdd7)
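Since the numeric literals above were lost, a runnable sketch with made-up values:

val rdd6 = sc.parallelize(List(5, 6, 4, 3))
val rdd7 = sc.parallelize(List(1, 2, 3, 4))
val rdd8 = rdd6.union(rdd7)                 // 5,6,4,3,1,2,3,4 -- duplicates are kept
rdd8.distinct.sortBy(x => x).collect        // Array(1, 2, 3, 4, 5, 6)
val rdd9 = rdd6.intersection(rdd7)
rdd9.collect                                // Array(4, 3) -- order may vary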
          (8) join, leftOuterJoin, rightOuterJoin
val rdd1 = sc.parallelize(List(("tom", ), ("jerry", ), ("kitty", )))
val rdd2 = sc.parallelize(List(("jerry", ), ("tom", ), ("shuke", )))
--------------------------------------------------------------------------
val rdd3 = rdd1.join(rdd2).collect
rdd3: Array[(String, (Int, Int))] = Array((tom,(,)), (jerry,(,)))
---------------------------------------------------------------------------
val rdd3 = rdd1.leftOuterJoin(rdd2).collect
rdd3: Array[(String, (Int, Option[Int]))] = Array((tom,(,Some())), (jerry,(,Some())), (kitty,(,None)))
---------------------------------------------------------------------------
val rdd3 = rdd1.rightOuterJoin(rdd2).collect
rdd3: Array[(String, (Option[Int], Int))] = Array((tom,(Some(),)), (jerry,(Some(),)), (shuke,(None,)))
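The tuple values above were lost; the following sketch with made-up numbers shows the shape of the three join results:

val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
rdd1.join(rdd2).collect           // inner join: Array((tom,(1,1)), (jerry,(3,2)))
rdd1.leftOuterJoin(rdd2).collect  // keeps kitty: (kitty,(2,None))
rdd1.rightOuterJoin(rdd2).collect // keeps shuke: (shuke,(None,2))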
          (9) groupByKey([numTasks]): called on an RDD of (K,V) pairs, returns an RDD of (K, Iterable[V]) ---- it only works on RDDs whose elements are key-value tuples.
val rdd1 = sc.parallelize(List(("tom", ), ("jerry", ), ("kitty", )))
val rdd2 = sc.parallelize(List(("jerry", ), ("tom", ), ("shuke", )))
val rdd3 = rdd1 union rdd2
val rdd4 = rdd3.groupByKey.collect
rdd4: Array[(String, Iterable[Int])] = Array((tom,CompactBuffer(, )), (shuke,CompactBuffer()), (kitty,CompactBuffer()), (jerry,CompactBuffer(, )))
-----------------------------------------------------------------------------------
val rdd5 = rdd4.map(x=>(x._1,x._2.sum))
rdd5: Array[(String, Int)] = Array((tom,), (shuke,), (kitty,), (jerry,))
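A sketch of groupByKey with made-up counts (assuming spark-shell), summing each group afterwards with map:

val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
val rdd3 = rdd1 union rdd2
val rdd4 = rdd3.groupByKey.collect    // e.g. (tom,CompactBuffer(1, 1)), (jerry,CompactBuffer(3, 2)), ...
rdd4.map(x => (x._1, x._2.sum))       // e.g. (tom,2), (jerry,5), (kitty,2), (shuke,2)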
                groupBy: takes a one-argument function; the function's return value is used as the key, and it returns a new RDD[(K, Iterable[T])] whose value is an iterable of all the input elements that map to the same key.
                              The source code:
/**
* Return an RDD of grouped items. Each group consists of a key and a sequence of elements
* mapping to that key. The ordering of elements within each group is not guaranteed, and
* may even differ each time the resulting RDD is evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*/
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {
  groupBy[K](f, defaultPartitioner(this))
}

                A concrete code example:
scala> val rdd1=sc.parallelize(List(("a",1,2),("b",1,1),("a",4,5)))
rdd1: org.apache.spark.rdd.RDD[(String, Int, Int)] = ParallelCollectionRDD[47] at parallelize at <console>:24
 
scala> rdd1.groupBy(_._1).collect
res18: Array[(String, Iterable[(String, Int, Int)])] = Array((a,CompactBuffer((a,1,2), (a,4,5))), (b,CompactBuffer((b,1,1))))
          (10) reduceByKey(func, [numTasks]): called on an RDD of (K,V) pairs, returns an RDD of (K,V) pairs in which the values of each key are aggregated with the given reduce function. As with groupByKey, the number of reduce tasks can be set with an optional second argument.
val rdd1 = sc.parallelize(List(("tom", ), ("jerry", ), ("kitty", )))
val rdd2 = sc.parallelize(List(("jerry", ), ("tom", ), ("shuke", )))
val rdd3 = rdd1 union rdd2
val rdd6 = rdd3.reduceByKey(_+_).collect
rdd6: Array[(String, Int)] = Array((tom,), (shuke,), (kitty,), (jerry,))
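A runnable sketch with made-up counts (the original values were lost); reduceByKey combines values for each key within partitions first, so it is usually preferred over groupByKey followed by a sum:

val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
val rdd3 = rdd1 union rdd2
rdd3.reduceByKey(_ + _).collect   // Array((tom,2), (shuke,2), (kitty,2), (jerry,5)) -- order may vary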
          (11) cogroup(otherDataset, [numTasks]): called on RDDs of type (K,V) and (K,W), returns an RDD of type (K, (Iterable[V], Iterable[W])).
val rdd1 = sc.parallelize(List(("tom", ), ("tom", ), ("jerry", ), ("kitty", )))
val rdd2 = sc.parallelize(List(("jerry", ), ("tom", ), ("shuke", )))
val rdd3 = rdd1.cogroup(rdd2).collect
rdd3: Array[(String, (Iterable[Int], Iterable[Int]))] = Array((tom,(CompactBuffer(, ),CompactBuffer())), (jerry,(CompactBuffer(),CompactBuffer())), (shuke,(CompactBuffer(),CompactBuffer())), (kitty,(CompactBuffer(),CompactBuffer())))
----------------------------------------------------------------------------------------
val rdd4 = rdd3.map(x=>(x._1,x._2._1.sum+x._2._2.sum))
rdd4: Array[(String, Int)] = Array((tom,), (jerry,), (shuke,), (kitty,))
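A sketch with made-up values showing cogroup followed by a per-key sum:

val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
val rdd3 = rdd1.cogroup(rdd2).collect
// e.g. (tom,(CompactBuffer(1, 2),CompactBuffer(1))), (shuke,(CompactBuffer(),CompactBuffer(2))), ...
rdd3.map(x => (x._1, x._2._1.sum + x._2._2.sum))   // e.g. (tom,4), (jerry,5), (shuke,2), (kitty,2)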
          (12) cartesian(otherDataset): Cartesian product
val rdd1 = sc.parallelize(List("tom", "jerry"))
val rdd2 = sc.parallelize(List("tom", "kitty", "shuke"))
val rdd3 = rdd1.cartesian(rdd2).collect
rdd3: Array[(String, String)] = Array((tom,tom), (tom,kitty), (tom,shuke), (jerry,tom), (jerry,kitty), (jerry,shuke))
     2. Actions
          Once an action is triggered, a job is actually submitted and executed, and the result is returned to the Driver (or written to external storage) instead of producing another lazily evaluated RDD.
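A few of the most common actions, sketched in spark-shell with made-up values (the output path is hypothetical):

val rdd = sc.parallelize(List(1, 2, 3, 4, 5), 2)
rdd.collect                    // Array(1, 2, 3, 4, 5) -- pulls every element back to the Driver
rdd.count                      // 5
rdd.reduce(_ + _)              // 15
rdd.first                      // 1
rdd.take(3)                    // Array(1, 2, 3)
rdd.top(2)                     // Array(5, 4)
rdd.saveAsTextFile("hdfs://hadoop141:8020/out")   // writes one file per partition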
 
III. RDD programming ---- advanced API
     1.
          mapPartitions: operates on one partition at a time. The function passed in receives an Iterator and must return an Iterator. Source code:
/**
* Return a new RDD by applying a function to each partition of this RDD.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
preservesPartitioning)
}
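A usage sketch with made-up values: the function is called once per partition and receives that partition's whole iterator, which is handy when some setup (such as opening a connection) should happen once per partition rather than once per element.

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)
val sums = rdd1.mapPartitions(iter => Iterator(iter.sum))   // sum each partition independently
sums.collect   // Array(6, 15): partition 0 holds 1,2,3 and partition 1 holds 4,5,6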
          mapPartitionsWithIndex: operates on each partition and exposes both the partition index and the values inside that partition. It is a Transformation.
          (1) Source code:
/**
* Return a new RDD by applying a function to each partition of this RDD, while tracking the index
* of the original partition.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
preservesPartitioning indicates whether the returned RDD keeps a partitioner. It may only be set to true when the RDD is a key-value RDD and the function does not modify the keys. Non key-value RDDs generally have no partitioner, and once the keys of a K-V RDD are modified the elements no longer satisfy the partitioner's placement; in those cases it must be false, meaning the returned RDD has not been partitioned by a partitioner.
*/
def mapPartitionsWithIndex[U: ClassTag](-------a function must be passed in
f: (Int, Iterator[T]) => Iterator[U],------the function takes two parameters
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
preservesPartitioning)
}
     (2) Code example:
(1) First define a function that matches the signature mapPartitionsWithIndex expects
scala> val func = (index : Int,iter : Iterator[Int]) => {
| iter.toList.map(x=>"[PartID:" + index + ",val:" + x + "]").iterator
| }
func: (Int, Iterator[Int]) => Iterator[String] = <function2>
(2) Create an RDD with 2 partitions
scala> val rdd1 = sc.parallelize(List(,,,,,,,,),)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[] at parallelize at <console>:
(3) Call the method, passing in the custom function
scala> rdd1.mapPartitionsWithIndex(func).collect
res0: Array[String] = Array([PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:], [PartID:,val:])
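Since the element values above were lost, a complete sketch with made-up data (9 elements, 2 partitions):

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 2)
val func = (index: Int, iter: Iterator[Int]) => {
  iter.toList.map(x => "[PartID:" + index + ",val:" + x + "]").iterator
}
rdd1.mapPartitionsWithIndex(func).collect
// Array([PartID:0,val:1], [PartID:0,val:2], [PartID:0,val:3], [PartID:0,val:4],
//       [PartID:1,val:5], [PartID:1,val:6], [PartID:1,val:7], [PartID:1,val:8], [PartID:1,val:9])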
     2. aggregate: an aggregation operation; it is an Action
          (1) Source code
/**
* Aggregate the elements of each partition, and then the results for all the partitions, using
* given combine functions and a neutral "zero value". This function can return a different result
* type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U
* and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are
* allowed to modify and return their first argument instead of creating a new U to avoid memory
* allocation.
Aggregates the elements of the RDD. A zero value must be provided (elements are accumulated, so an initial value for the accumulation is needed). Elements are first aggregated within each partition using seqOp (folding elements of type T into a per-partition result of type U), then the partition results are combined using combOp, and that final result is returned.
*
* @param zeroValue the initial value for the accumulated result of each partition for the
* `seqOp` operator, and also the initial value for the combine results from
* different partitions for the `combOp` operator - this will typically be the
* neutral element (e.g. `Nil` for list concatenation or `0` for summation)
* @param seqOp an operator used to accumulate results within a partition
* @param combOp an associative operator used to combine results from different partitions
The first parameter is the initial value; then come two functions, each taking two arguments and returning one value: the first function aggregates the elements within each partition, and the second combines the per-partition results.
*/
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope {
// Clone the zero value since we will also be serializing it as part of tasks
var jobResult = Utils.clone(zeroValue, sc.env.serializer.newInstance())
val cleanSeqOp = sc.clean(seqOp)
val cleanCombOp = sc.clean(combOp)
val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)
sc.runJob(this, aggregatePartition, mergeResult)
jobResult
}
          (2) Code examples:
scala> val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[] at parallelize at <console>:
//Here the two partitions are summed separately, and the two partition sums are then added together to give the final result
scala> rdd1.aggregate(0)(_+_,_+_)
res0: Int = 45
//Each partition first finds its maximum; the per-partition maxima are then added together to give the final result
scala> rdd1.aggregate(0)(math.max(_,_),_+_)
res1: Int = 13
//Note that the initial value takes part in every step of the computation. In the call below, partition 1 holds 1,2,3,4 and the initial value is 5, so its maximum is 5; partition 2 holds 5,6,7,8,9, so with the initial value 5 its maximum is 9. The partition results are then added, and the initial value participates again: 5+(5+9)=19
scala> rdd1.aggregate(5)(math.max(_,_),_+_)
res0: Int = 19
-----------------------------------------------------------------------------------------------
scala> val rdd2 = sc.parallelize(List("a","b","c","d","e","f"),)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
//这里需要注意,由于每个分区计算是并行计算,所以计算出的结果有先后顺序,所以结果会出现两种情况:如下
scala> rdd2.aggregate("")(_+_,_+_)
res0: String = defabc scala> rdd2.aggregate("")(_+_,_+_)
res2: String = abcdef
//这里的例子更能说明上面提到的初始值参与计算的问题,我们可以看到初始值=号参与了三次计算
scala> rdd2.aggregate("=")(_+_,_+_)
res0: String = ==def=abc
--------------------------------------------------------------------------------------
scala> val rdd3 = sc.parallelize(List("","","",""),)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd3.aggregate("")((x,y)=>math.max(x.length,y.length).toString,_+_)
res1: String =
scala> rdd3.aggregate("")((x,y)=>math.max(x.length,y.length).toString,_+_)
res3: String =
-------------------------------------------------------------------------------------------
scala> val rdd4 = sc.parallelize(List("12","23","345",""),2)
rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
//Note: including the initial value, the first partition's elements are "","12","23"; comparing them pairwise, the final minimum length is 1. The second partition's elements are "","345",""; comparing them pairwise, the final minimum length is 0
scala> rdd4.aggregate("")((x,y)=>math.min(x.length,y.length).toString,_+_)
res4: String = 10
scala> rdd4.aggregate("")((x,y)=>math.min(x.length,y.length).toString,_+_)
res9: String = 01
------------------------------------------------------------------------------------
//Note the difference from the previous example: the elements of this RDD are in a different order, which changes the result
scala> val rdd5 = sc.parallelize(List("","","",""),)
rdd5: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd5.aggregate("")((x,y)=>math.min(x.length,y.length).toString,(x,y)=>x+y)
res1: String =
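To restate the two rules with a clean, runnable sketch (made-up data, 2 partitions): the zero value takes part in every seqOp and once more in combOp, and the order in which the partition results are combined is not deterministic.

val rdd = sc.parallelize(List(1, 2, 3, 4, 5, 6), 2)         // partitions: (1,2,3) and (4,5,6)
rdd.aggregate(0)(_ + _, _ + _)                               // 21
rdd.aggregate(10)(_ + _, _ + _)                              // 51: 10 is added in each partition and once more when combining
rdd.aggregate(0)(math.max(_, _), _ + _)                      // 9 = 3 + 6
sc.parallelize(List("a", "b", "c", "d"), 2).aggregate("")(_ + _, _ + _)   // "abcd" or "cdab"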
     3. aggregateByKey: aggregates the values of each key
//Define the RDD
scala> val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)
pairRDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[] at parallelize at <console>:
//A custom function to pass to mapPartitionsWithIndex
scala> val func=(index:Int,iter:Iterator[(String, Int)])=>{
| iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
| }
func: (Int, Iterator[(String, Int)]) => Iterator[String] = <function2>
//Check how the data is partitioned
scala> pairRDD.mapPartitionsWithIndex(func).collect
res2: Array[String] = Array([partID:, val: (cat,)], [partID:, val: (cat,)], [partID:, val: (mouse,)], [partID:, val: (cat,)], [partID:, val: (dog,)], [partID:, val: (mouse,)])
//Note: the difference between an initial value of 0 and other initial values
scala> pairRDD.aggregateByKey()(_+_,_+_).collect
res4: Array[(String, Int)] = Array((dog,), (cat,), (mouse,))
scala> pairRDD.aggregateByKey()(_+_,_+_).collect
res5: Array[(String, Int)] = Array((dog,), (cat,), (mouse,))
//The difference between the next three calls: the first is easy to understand -- since the initial value is 0, each partition outputs the largest value of each animal, and those maxima are then summed
scala> pairRDD.aggregateByKey(0)(math.max(_,_),_+_).collect
res6: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))
//The next two have a non-zero initial value, so the initial value has to be taken into account. Here the first partition holds ("cat",2), ("cat",5), ("mouse",4) and the initial value is 10; within each animal the values are compared pairwise with the initial value included, so the first partition outputs ("cat",10), ("mouse",10). The second partition does the same and outputs (dog,12), (cat,12), (mouse,10). Summing the partition results gives (dog,12), (cat,22), (mouse,20). Note that the initial value does not take part again when the per-partition results are combined
scala> pairRDD.aggregateByKey(10)(math.max(_,_),_+_).collect
res7: Array[(String, Int)] = Array((dog,12), (cat,22), (mouse,20))
//This one is similar to the case above
scala> pairRDD.aggregateByKey()(math.max(_,_),_+_).collect
res8: Array[(String, Int)] = Array((dog,), (cat,), (mouse,))
     4. coalesce: returns a new RDD
          Repartitions the elements of the RDD.
          When the number of partitions is reduced moderately, e.g. 1000 -> 100, Spark simply treats every 10 of the old partitions as one new partition; the parallelism becomes 100 and no data shuffle is triggered.
          When the number of partitions is reduced drastically, e.g. 1000 -> 1, the computation would run with a parallelism of 1. To avoid this loss of parallelism, shuffle can be set to true; the work before and after the shuffle then runs in different stages, with parallelism 1000 and 1 respectively.
          When the number of partitions is increased, a shuffle is unavoidable, so shuffle must be set to true. A sketch of these cases follows below.
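A sketch of the three cases above, with made-up partition counts, assuming spark-shell:

val rdd = sc.parallelize(1 to 1000, 100)
rdd.coalesce(10).partitions.length          // 10, no shuffle: 10 old partitions are merged into each new one
rdd.coalesce(10, true).partitions.length    // 10, but with a shuffle (an extra stage)
rdd.coalesce(200).partitions.length         // still 100: growing the partition count without shuffle = true has no effect
rdd.repartition(200).partitions.length      // 200: repartition is coalesce(200, shuffle = true)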
          
          partitionBy: partitions the RDD with the partitioner passed in. The argument is a partitioner instance -- either a custom partitioner or the default HashPartitioner. Source code:
/**
* Return a copy of the RDD partitioned using the specified partitioner.
*/
def partitionBy(partitioner: Partitioner): RDD[(K, V)] = self.withScope {
if (keyClass.isArray && partitioner.isInstanceOf[HashPartitioner]) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
if (self.partitioner == Some(partitioner)) {
self
} else {
new ShuffledRDD[K, V, V](self, partitioner)
}
}
repartition: returns a new RDD
Repartitions the RDD into the specified number of partitions; a shuffle always takes place.
When the target number of partitions is smaller than the current one, consider using coalesce instead so the shuffle can be avoided.
scala> val rdd1 = sc.parallelize(Array(,,,,,,,),)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd2 = rdd1.repartition()
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[] at repartition at <console>:
scala> rdd2.partitions.length
res0: Int =
scala> val rdd3 = rdd2.coalesce(,true)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[] at coalesce at <console>:
scala> rdd3.partitions.length
res1: Int =
     5. collectAsMap: turns the RDD into a Map (note: the RDD's elements must be key-value tuples)
scala> val rdd1 = sc.parallelize(List(("a", ), ("b", ),("c", ),("d", ),("e", )))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd1.collectAsMap
res3: scala.collection.Map[String,Int] = Map(e -> , b -> , d -> , a -> , c -> )
     6. combineByKey: a general aggregate-by-key function; reduceByKey is implemented on top of combineByKey
          (1) Source code
/**
* Generic function to combine the elements for each key using a custom set of aggregation
* functions. This method is here for backward compatibility. It does not provide combiner
* classtag information to the shuffle.
*
* @see [[combineByKeyWithClassTag]]
*/
def combineByKey[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null): RDD[(K, C)] = self.withScope {
combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
partitioner, mapSideCombine, serializer)(null)
}
/**
* Simplified version of combineByKeyWithClassTag that hash-partitions the output RDD.
* This method is here for backward compatibility. It does not provide combiner
* classtag information to the shuffle.
*
* @see [[combineByKeyWithClassTag]]
*/
def combineByKey[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
numPartitions: Int): RDD[(K, C)] = self.withScope {
combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, numPartitions)(null)
}
(2) Parameters:
The first parameter, createCombiner: V => C, creates the combiner: for each key it takes the first value and turns it into the type C you want to combine into.
The second parameter, mergeValue: (C, V) => C, merges another value into the combiner within a partition (the local computation).
The third parameter, mergeCombiners: (C, C) => C, merges the local results from different partitions.
(3) Code example
//First declare two RDDs, then use zip to combine them into one, rdd6
scala> val rdd4 = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), )
rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd5 = sc.parallelize(List(,,,,,,,,), )
rdd5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd6 = rdd5.zip(rdd4)
rdd6: org.apache.spark.rdd.RDD[(Int, String)] = ZippedPartitionsRDD2[] at zip at <console>:
scala> rdd6.collect
res6: Array[(Int, String)] = Array((,dog), (,cat), (,gnu), (,salmon), (,rabbit), (,turkey), (,wolf), (,bear), (,bee))
//We want to group and merge the values by key, collecting all values with the same key into a List
//The first argument, List(_), takes the first value of each key and puts it into a List
//The second argument, (x:List[String],y:String)=>x :+ y, is the local computation: it appends a value to the List
//The third argument, (m:List[String],n:List[String])=>m++n, combines the local results
scala> val rdd7 = rdd6.combineByKey(List(_),(x:List[String],y:String)=>x :+ y,(m:List[String],n:List[String])=>m++n)
rdd7: org.apache.spark.rdd.RDD[(Int, List[String])] = ShuffledRDD[] at combineByKey at <console>:
scala> rdd7.collect
res7: Array[(Int, List[String])] = Array((,List(dog, cat, turkey)), (,List(wolf, bear, bee, salmon, rabbit, gnu)))
//The first argument can also be written in other ways, for example the two variants below
scala> val rdd7 = rdd6.combineByKey(_::List(),(x:List[String],y:String)=>x :+ y,(m:List[String],n:List[String])=>m++n).collect
rdd7: Array[(Int, List[String])] = Array((,List(turkey, dog, cat)), (,List(wolf, bear, bee, gnu, salmon, rabbit)))
scala> val rdd7 = rdd6.combineByKey(_::Nil,(x:List[String],y:String)=>x :+ y,(m:List[String],n:List[String])=>m++n).collect
rdd7: Array[(Int, List[String])] = Array((,List(turkey, dog, cat)), (,List(wolf, bear, bee, gnu, salmon, rabbit)))
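Computing a per-key average is a common way to see why all three functions are needed; this is a sketch with made-up scores, not taken from the original text:

val scores = sc.parallelize(List(("tom", 88), ("tom", 95), ("jerry", 91), ("kitty", 78), ("jerry", 89)), 2)
val sumCount = scores.combineByKey(
  (v: Int) => (v, 1),                                          // createCombiner: first value of a key -> (sum, count)
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue: fold another value into the local combiner
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners: merge combiners from different partitions
)
sumCount.mapValues(x => x._1.toDouble / x._2).collect          // e.g. (tom,91.5), (jerry,90.0), (kitty,78.0)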
     7. countByKey, countByValue: count occurrences by key or by value
scala> val rdd1 = sc.parallelize(List(("a", ), ("b", ), ("b", ), ("c", ), ("c", )))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd1.countByKey
res8: scala.collection.Map[String,Long] = Map(a -> , b -> , c -> )
scala> rdd1.countByValue
res9: scala.collection.Map[(String, Int),Long] = Map((c,) -> , (a,) -> , (b,) -> , (c,) -> )
     8. filterByRange: on a (K,V) RDD with ordered keys, keeps only the elements whose key falls within the given range
scala> val rdd1 = sc.parallelize(List(("e", ), ("c", ), ("d", ), ("c", ), ("a", ),("b",)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[] at parallelize at <console>:
//Note: the range passed in is inclusive at both ends
scala> val rdd2 = rdd1.filterByRange("b","d")
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[] at filterByRange at <console>:
scala> rdd2.collect
res10: Array[(String, Int)] = Array((c,), (d,), (c,), (b,))
     9. flatMapValues: processes the values, similar to flatMap; every value it produces is paired with its key
scala> val rdd3 = sc.parallelize(List(("a", "1 2"), ("b", "3 4")))
rdd3: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd4 = rdd3.flatMapValues(_.split(" "))
rdd4: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[] at flatMapValues at <console>:
scala> rdd4.collect
res11: Array[(String, String)] = Array((a,1), (a,2), (b,3), (b,4))
mapValues: leaves the key unchanged and applies the function only to the value of each key-value pair, similar to map. Note the difference from flatMapValues above: it does not split the key-value pair apart, it just transforms each value with the given function;
scala> val rdd3 = sc.parallelize(List(("a",(,)),("b",(,))))
rdd3: org.apache.spark.rdd.RDD[(String, (Int, Int))] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd3.mapValues(x=>x._1 + x._2).collect
res34: Array[(String, Int)] = Array((a,), (b,))
------------------------------------------------------------------------
With flatMapValues the result is as follows: the value is split completely apart and every piece is paired with the key
scala> rdd3.flatMapValues(x=>x + "").collect
res36: Array[(String, Char)] = Array((a,(), (a,), (a,,), (a,), (a,)), (b,(), (b,), (b,,), (b,), (b,)))
     10. foldByKey: groups by key and folds the values of each group
scala> val rdd1 = sc.parallelize(List("dog", "wolf", "cat", "bear"), )
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd2 = rdd1.map(x=>(x.length,x))
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[] at map at <console>:
scala> rdd2.collect
res12: Array[(Int, String)] = Array((3,dog), (4,wolf), (3,cat), (4,bear))
-----------------------------------------------------------------------------
scala> val rdd3 = rdd2.foldByKey("")(_+_)
rdd3: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[] at foldByKey at <console>:
scala> rdd3.collect
res13: Array[(Int, String)] = Array((4,bearwolf), (3,dogcat))
scala> val rdd3 = rdd2.foldByKey(" ")(_+_).collect
rdd3: Array[(Int, String)] = Array((4," bear wolf"), (3," dog cat"))
-----------------------------------------------------------------------------
//Computing a word count
val rdd = sc.textFile("hdfs://node-1.itcast.cn:9000/wc").flatMap(_.split(" ")).map((_, 1))
rdd.foldByKey(0)(_+_)
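The same word count, sketched on an in-memory collection so it runs without HDFS (the sentences are made up):

val lines = sc.parallelize(List("hello spark", "hello scala", "spark is fast"))
val counts = lines.flatMap(_.split(" ")).map((_, 1)).foldByKey(0)(_ + _)
counts.collect   // e.g. Array((hello,2), (spark,2), (scala,1), (is,1), (fast,1))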
     11. keyBy: uses the result of the given function as the key, producing a new RDD
scala> val rdd1 = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), )
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[] at parallelize at <console>:
scala> val rdd2 = rdd1.keyBy(_.length)
rdd2: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[] at keyBy at <console>:
scala> rdd2.collect
res14: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))
     12. keys, values: extract the keys or the values of the RDD into a new RDD
scala> val rdd1 = sc.parallelize(List(("e", ), ("c", ), ("d", ), ("c", ), ("a", )))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[] at parallelize at <console>:
scala> rdd1.keys.collect
res16: Array[String] = Array(e, c, d, c, a)
scala> rdd1.values.collect
res17: Array[Int] = Array(, , , , )
