map(func)

/**
* Return a new RDD by applying a function to all elements of this RDD.
*/
def map[U: ClassTag](f: T => U): RDD[U] 
map(func) Return a new distributed dataset formed by passing each element of the source through a function func.

Maps every element of the source RDD through func to produce a new element, forming a new RDD.

Example:

The second argument to sc.parallelize specifies the number of partitions of the RDD.

val rdd = sc.parallelize(1 to 9,2)
val rdd1=rdd.map(x=>x+1)
scala> val rdd = sc.parallelize(1 to 9,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd.take(20)
res3: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)

scala> val rdd1=rdd.map(x=>x+1)
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[4] at map at <console>:26

scala> rdd1.take(20)
res5: Array[Int] = Array(2, 3, 4, 5, 6, 7, 8, 9, 10)

filter(func)

/**
* Return a new RDD containing only the elements that satisfy a predicate.
*/

def filter(f: T => Boolean): RDD[T]

filter(func) Return a new dataset formed by selecting those elements of the source on which func returns true. 

A new RDD is formed from the elements of the source RDD for which func returns true.

val rdd = sc.parallelize(1 to 9,2)
val rdd1 = rdd.filter(_>=5)
scala> val rdd = sc.parallelize(1 to 9,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24

scala> val rdd1 = rdd.filter(_>=5)
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[8] at filter at <console>:26

scala> rdd1.take(10)
res13: Array[Int] = Array(5, 6, 7, 8, 9)

flatMap(func)

/**
* Return a new RDD by first applying a function to all elements of this
* RDD, and then flattening the results.
*/
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]
flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item). 

Similar to map, but each element can be mapped to zero or more elements (func should return a Seq rather than a single element); in effect it performs a map and then flattens the results.

val rdd = sc.parallelize(1 to 3,2)
val rdd1 = rdd.flatMap( _ to 5)
scala> val rdd = sc.parallelize(1 to 3,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24

scala> val rdd1 = rdd.flatMap( _ to 5)
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[10] at flatMap at <console>:26

scala> rdd1.take(100)
res14: Array[Int] = Array(1, 2, 3, 4, 5, 2, 3, 4, 5, 3, 4, 5)
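
As an extra illustration (not part of the original example), a common flatMap use case is splitting lines of text into words; the input strings below are made up for the sketch:

// Hypothetical input lines, used only to illustrate the map-then-flatten behaviour.
val lines = sc.parallelize(Seq("spark rdd transformation", "map and flatMap"), 2)

// Each line yields an Array[String]; flatMap flattens the arrays into one RDD of words.
val words = lines.flatMap(line => line.split(" "))
words.collect   // expected: Array(spark, rdd, transformation, map, and, flatMap)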

mapPartitions(func)

/**
* Return a new RDD by applying a function to each partition of this RDD.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U]

mapPartitions(func) Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

Similar to map, except that the mapping function receives an iterator over each partition of the RDD rather than individual elements. If the mapping needs to create expensive objects repeatedly, mapPartitions can be considerably more efficient than map, because such objects can be created once per partition; see the sketch after the example below.

Count the number of elements in each partition:

def countPartitionEle(it : Iterator[Int]) = {
  var result = List[Int]()
  var i = 0
  while(it.hasNext){
    i += 1
    it.next
  }
  // :: prepends the element i to the list (i must be wrapped in parentheses), then build an iterator
  result.::(i).iterator
}

val rdd = sc.parallelize(1 to 10, 3)
val rdd1 = rdd.mapPartitions(countPartitionEle(_))
rdd1.take(10)
scala> def countPartitionEle(it : Iterator[Int]) = {
     |   var result = List[Int]()
     |   var i = 0
     |   while(it.hasNext){
     |     i += 1
     |     it.next
     |   }
     |   result.::(i).iterator
     | }
countPartitionEle: (it: Iterator[Int])Iterator[Int]

scala> val rdd = sc.parallelize(1 to 10, 3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> val rdd1 = rdd.mapPartitions(countPartitionEle(_))
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at mapPartitions at <console>:30

scala> rdd1.take(10)
res8: Array[Int] = Array(3, 3, 4)
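
To make the efficiency point above concrete, here is a minimal sketch that builds one expensive object per partition instead of per element. ExpensiveFormatter is a made-up placeholder, not a Spark or library class:

// Made-up stand-in for something costly to construct (a parser, a connection, ...).
class ExpensiveFormatter extends Serializable {
  def format(x: Int): String = "value=" + x
}

val rdd = sc.parallelize(1 to 10, 3)

// With map the formatter would have to be created for every element;
// with mapPartitions it is created once per partition and reused for the whole iterator.
val formatted = rdd.mapPartitions { it =>
  val fmt = new ExpensiveFormatter()
  it.map(fmt.format)
}
formatted.take(3)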

mapPartitionsWithIndex(func)

/**
* Return a new RDD by applying a function to each partition of this RDD, while tracking the index
* of the original partition.
*
* `preservesPartitioning` indicates whether the input function preserves the partitioner, which
* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.
*/
def mapPartitionsWithIndex[U: ClassTag](
    f: (Int, Iterator[T]) => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U]

mapPartitionsWithIndex(func) Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

Similar to mapPartitions, it also operates on each partition, but func takes two arguments: the first is the partition index and the second is the iterator over that partition's elements.

def func(index :Int, it : Iterator[Int]) = {
  var result = List[String]()
  var i = ""
  while(it.hasNext){
    i += it.next + ","
  }
  result.::(i.dropRight(1) + " at partition "+index+".").iterator
}

val rdd = sc.parallelize(1 to 10, 3)
val rdd1 = rdd.mapPartitionsWithIndex((x,it) => func(x,it))
rdd1.take(3)
scala> def func(index :Int, it : Iterator[Int]) = {
     |   var result = List[String]()
     |   var i = ""
     |   while(it.hasNext){
     |     i += it.next + ","
     |   }
     |   result.::(i.dropRight(1) + " at partition "+index+".").iterator
     | }
func: (index: Int, it: Iterator[Int])Iterator[String]

scala> val rdd = sc.parallelize(1 to 10, 3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> val rdd1 = rdd.mapPartitionsWithIndex((x,it) => func(x,it))
rdd1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at mapPartitionsWithIndex at <console>:28

scala> rdd1.take(3)
res2: Array[String] = Array(1,2,3 at partition 0., 4,5,6 at partition 1., 7,8,9,10 at partition 2.)

sample(withReplacement, fraction, seed)

/**
* Return a sampled subset of this RDD.
*
* @param withReplacement can elements be sampled multiple times (replaced when sampled out)
* @param fraction expected size of the sample as a fraction of this RDD's size
* without replacement: probability that each element is chosen; fraction must be [0, 1]
* with replacement: expected number of times each element is chosen; fraction must be >= 0
* @param seed seed for the random number generator
*/
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T]

sample(withReplacement, fraction, seed) Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.

Samples the source RDD. withReplacement indicates whether elements are sampled with replacement, fraction is the expected fraction of the RDD to sample (not an exact count), and seed seeds the random number generator.

scala> val rdd = sc.parallelize(1 to 10, 3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> val rdd1 = rdd.sample(true,0.5,0)
rdd1: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[4] at sample at <console>:26

scala> val rdd2 = rdd.sample(false,0.5,0)
rdd2: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[5] at sample at <console>:26

scala> rdd1.collect
res3: Array[Int] = Array(2)

scala> rdd2.collect
res4: Array[Int] = Array(1, 2, 4, 5, 6, 9)

scala> val rdd1 = rdd.sample(true,0.5,1)
rdd1: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[7] at sample at <console>:26

scala> rdd1.collect
res6: Array[Int] = Array(1, 3, 7, 7, 8, 8, 9, 10)

union(otherDataset)

/**
* Return the union of this RDD and another one. Any identical elements will appear multiple
* times (use `.distinct()` to eliminate them).
*/
def union(other: RDD[T]): RDD[T]

union(otherDataset) Return a new dataset that contains the union of the elements in the source dataset and the argument. 

Merges two RDDs into one; identical elements may appear multiple times, and distinct can be used to remove duplicates.

val rdd1 = sc.parallelize(1 to 5,2)
val rdd2 = sc.parallelize(1 to 5,3)
val rdd3 = sc.parallelize(2 to 8,3)
val rdd = rdd1.union(rdd2).union(rdd3)
rdd.collect
rdd.distinct.collect
scala> val rdd1 = sc.parallelize(1 to 5,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(1 to 5,3)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24

scala> val rdd3 = sc.parallelize(2 to 8,3)
rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24

scala> val rdd = rdd1.union(rdd2).union(rdd3)
rdd: org.apache.spark.rdd.RDD[Int] = UnionRDD[12] at union at <console>:30

scala> rdd.collect
res7: Array[Int] = Array(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 7, 8)

scala> rdd.distinct.collect
res8: Array[Int] = Array(8, 1, 2, 3, 4, 5, 6, 7)

intersection(otherDataset)

/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did.
*
* Note that this method performs a shuffle internally.
*/
def intersection(other: RDD[T]): RDD[T]

intersection(otherDataset) Return a new RDD that contains the intersection of elements in the source dataset and the argument. 

The elements common to both RDDs form a new RDD.

val rdd1 = sc.parallelize(1 to 5,2)
val rdd2 = sc.parallelize(4 to 8,3)
val rdd = rdd1.intersection(rdd2)
rdd.collect
scala> val rdd1 = sc.parallelize(1 to 5,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(4 to 8,3)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:24

scala> val rdd = rdd1.intersection(rdd2)
rdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[23] at intersection at <console>:28

scala> rdd.collect
res9: Array[Int] = Array(4, 5)

scala> rdd.partitions.length
  res10: Int = 3

/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did. Performs a hash partition across the cluster
*
* Note that this method performs a shuffle internally.
*
* @param numPartitions How many partitions to use in the resulting RDD
*/
def intersection(other: RDD[T], numPartitions: Int): RDD[T]

Same as intersection(other: RDD[T]), except that numPartitions specifies the number of partitions of the resulting RDD.

scala> val rdd = rdd1.intersection(rdd2,1)
rdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[41] at intersection at <console>:28

scala> rdd.partitions.length
res12: Int = 1

/**
* Return the intersection of this RDD and another one. The output will not contain any duplicate
* elements, even if the input RDDs did.
*
* Note that this method performs a shuffle internally.
*
* @param partitioner Partitioner to use for the resulting RDD
*/
def intersection(
    other: RDD[T],
    partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

Custom partitioning

A custom partitioner class must extend Partitioner; numPartitions sets the number of partitions and getPartition returns the partition index for a given key.

class MyPartitioner(numParts:Int) extends org.apache.spark.Partitioner{
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int = {
    key.toString.toInt%numPartitions
  }
}

val rdd1 = sc.parallelize(1 to 15,2)
val rdd2 = sc.parallelize(5 to 25,2)
val rdd = rdd1.intersection(rdd2,new MyPartitioner(5))
rdd.collect
rdd.partitions.length

def func(index :Int, it : Iterator[Int]) = {
  var result = List[String]()
  var i = ""
  while(it.hasNext){
    i += it.next + ","
  }
  result.::(i.dropRight(1) + " at partition "+index+".").iterator
}

val rdd3 = rdd.mapPartitionsWithIndex((x,it) => func(x,it))
rdd3.collect
scala> val rdd1 = sc.parallelize(1 to 15,2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[54] at parallelize at <console>:27

scala> val rdd2 = sc.parallelize(5 to 25,2)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[55] at parallelize at <console>:27

scala> val rdd = rdd1.intersection(rdd2,new MyPartitioner(5))
rdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[61] at intersection at <console>:32

scala> rdd.collect
res25: Array[Int] = Array(15, 10, 5, 11, 6, 7, 12, 13, 8, 14, 9)

scala> rdd.partitions.length
res26: Int = 5

scala> def func(index :Int, it : Iterator[Int]) = {
     |   var result = List[String]()
     |   var i = ""
     |   while(it.hasNext){
     |     i += it.next + ","
     |   }
     |   result.::(i.dropRight(1) + " at partition "+index+".").iterator
     | }
func: (index: Int, it: Iterator[Int])Iterator[String]

scala> val rdd3 = rdd.mapPartitionsWithIndex((x,it) => func(x,it))
rdd3: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[62] at mapPartitionsWithIndex at <console>:36

scala> rdd3.collect
res27: Array[String] = Array(15,10,5 at partition 0., 11,6 at partition 1., 7,12 at partition 2., 13,8 at partition 3., 14,9 at partition 4.)

distinct([numTasks])

distinct([numTasks]) Return a new dataset that contains the distinct elements of the source dataset.

Returns a new RDD containing only the distinct elements of the source RDD.

/**
* Return a new RDD containing the distinct elements in this RDD.
*/
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

numPartitions specifies the number of partitions of the resulting RDD.

val a = Array(1,1,1,2,2,3,4,5)
val rdd = sc.parallelize(a,2)
rdd.collect
val rdd1 = rdd.distinct(1)
rdd1.collect
rdd1.partitions.length
scala> val a = Array(1,1,1,2,2,3,4,5)
a: Array[Int] = Array(1, 1, 1, 2, 2, 3, 4, 5)

scala> val rdd = sc.parallelize(a,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[43] at parallelize at <console>:26

scala> rdd.collect
res13: Array[Int] = Array(1, 1, 1, 2, 2, 3, 4, 5)

scala> val rdd1 = rdd.distinct(1)
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[46] at distinct at <console>:28

scala> rdd1.collect
res14: Array[Int] = Array(4, 1, 3, 5, 2)

scala> rdd1.partitions.length
res15: Int = 1

/**
* Return a new RDD containing the distinct elements in this RDD.
*/
def distinct(): RDD[T]

Same as distinct(numPartitions: Int), except that the number of partitions of the resulting RDD is taken from the parent RDD.

scala> val rdd1 = rdd.distinct()
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[49] at distinct at <console>:28

scala> rdd1.partitions.length
res16: Int = 2

scala> rdd1.collect
res17: Array[Int] = Array(4, 2, 1, 3, 5)

keyBy(func)

/**
* Creates tuples of the elements in this RDD by applying `f`.
*/
def keyBy[K](f: T => K): RDD[(K, T)]

Creates a key-value pair for each element of the RDD, using func to compute the key.

val rdd = sc.parallelize(1 to 9 ,2)
val rdd1 = rdd.keyBy(_%3)
rdd1.collect
scala> val rdd = sc.parallelize(1 to 9 ,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val rdd1 = rdd.keyBy(_%3)
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[1] at keyBy at <console>:26

scala> rdd1.collect
res0: Array[(Int, Int)] = Array((1,1), (2,2), (0,3), (1,4), (2,5), (0,6), (1,7), (2,8), (0,9))

groupByKey([numTasks])

/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with the existing partitioner/parallelism level. The ordering of elements
* within each group is not guaranteed, and may even differ each time the resulting RDD is
* evaluated.
*
* Note: This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
* or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
*/
def groupByKey(): RDD[(K, Iterable[V])]
val rdd = sc.parallelize(1 to 9 ,2)
val rdd1 = rdd.keyBy(_%3)
val rdd2 = rdd1.groupByKey()
rdd2.collect
scala> val rdd = sc.parallelize(1 to 9 ,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24

scala> val rdd1 = rdd.keyBy(_%3)
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[3] at keyBy at <console>:26

scala> val rdd2 = rdd1.groupByKey()
rdd2: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[4] at groupByKey at <console>:28

scala> rdd2.collect
res1: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(3, 6, 9)), (2,CompactBuffer(2, 5, 8)), (1,CompactBuffer(1, 4, 7)))

/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with into `numPartitions` partitions. The ordering of elements within
* each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
*
* Note: This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
* or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
*
* Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any
* key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
*/
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]

Same as groupByKey(), but with the number of partitions specified.
scala> val rdd = sc.parallelize(1 to 9 ,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> val rdd1 = rdd.keyBy(_%3)
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[6] at keyBy at <console>:26

scala> val rdd2 = rdd1.groupByKey(3)
rdd2: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[7] at groupByKey at <console>:28

scala> rdd2.partitions.length
res2: Int = 3

/**
* Group the values for each key in the RDD into a single sequence. Allows controlling the
* partitioning of the resulting key-value pair RDD by passing a Partitioner.
* The ordering of elements within each group is not guaranteed, and may even differ
* each time the resulting RDD is evaluated.
*
* Note: This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
* or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
*
* Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any
* key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
*/
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
class MyPartitioner(numParts:Int) extends org.apache.spark.Partitioner{
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int = {
    key.toString.toInt%numPartitions
  }
}

val rdd = sc.parallelize(1 to 9 ,2)
val rdd1 = rdd.keyBy(_%3)
rdd1.collect
val rdd2 = rdd1.groupByKey(new MyPartitioner(2))
rdd2.collect

scala> class MyPartitioner(numParts:Int) extends org.apache.spark.Partitioner{
     |   override def numPartitions: Int = numParts
     |   override def getPartition(key: Any): Int = {
     |     key.toString.toInt%numPartitions
     |   }
     | }
defined class MyPartitioner

scala> val rdd = sc.parallelize(1 to 9 ,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24

scala> val rdd1 = rdd.keyBy(_%3)
rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[9] at keyBy at <console>:26

scala> val rdd2 = rdd1.groupByKey(new MyPartitioner(2))
rdd2: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[10] at groupByKey at <console>:29

scala> rdd2.collect
res3: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(3, 6, 9)), (2,CompactBuffer(2, 5, 8)), (1,CompactBuffer(1, 4, 7)))

scala> rdd1.collect
res4: Array[(Int, Int)] = Array((1,1), (2,2), (0,3), (1,4), (2,5), (0,6), (1,7), (2,8), (0,9))

scala> rdd2.partitions.length
res5: Int = 2
 

groupBy(func)

/**
* Return an RDD of grouped items. Each group consists of a key and a sequence of elements
* mapping to that key. The ordering of elements within each group is not guaranteed, and
* may even differ each time the resulting RDD is evaluated.
*
* Note: This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
* or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
*/
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]

def func(x:Int) = {x%3}
val rdd = sc.parallelize(1 to 10,2)
val rdd1 = rdd.groupBy(func(_))
rdd1.collect
scala> def func(x:Int) = {x%3}
func: (x: Int)Int

scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[64] at parallelize at <console>:27

scala> val rdd1 = rdd.groupBy(func(_))
rdd1: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[66] at groupBy at <console>:31

scala> rdd1.collect
res29: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(3, 6, 9)), (2,CompactBuffer(2, 5, 8)), (1,CompactBuffer(1, 4, 7, 10)))

/**
* Return an RDD of grouped elements. Each group consists of a key and a sequence of elements
* mapping to that key. The ordering of elements within each group is not guaranteed, and
* may even differ each time the resulting RDD is evaluated.
*
* Note: This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
* or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
*/
def groupBy[K](
    f: T => K,
    numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])]

Same as groupBy[K](f: T => K), but with the number of partitions specified.

def func(x:Int) = {x%3}
val rdd = sc.parallelize(1 to 10,2)
val rdd1 = rdd.groupBy(func(_),3)
rdd1.collect
scala> def func(x:Int) = {x%3}
func: (x: Int)Int

scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:24

scala> val rdd1 = rdd.groupBy(func(_),3)
rdd1: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[13] at groupBy at <console>:28

scala> rdd1.partitions.length
res6: Int = 3

scala> rdd1.collect
res7: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(3, 6, 9)), (1,CompactBuffer(1, 4, 7, 10)), (2,CompactBuffer(2, 5, 8)))

/**
* Return an RDD of grouped items. Each group consists of a key and a sequence of elements
* mapping to that key. The ordering of elements within each group is not guaranteed, and
* may even differ each time the resulting RDD is evaluated.
*
* Note: This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
* or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
*/
def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
: RDD[(K, Iterable[T])]
class MyPartitioner(numParts:Int) extends org.apache.spark.Partitioner{
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int = {
    key.toString.toInt%numPartitions
  }
}

def func(x:Int) = {x%3}
val rdd = sc.parallelize(1 to 10,2)
val rdd1 = rdd.groupBy(func(_),new MyPartitioner(3))
rdd1.collect
rdd1.partitions.length
scala> class MyPartitioner(numParts:Int) extends org.apache.spark.Partitioner{
     |   override def numPartitions: Int = numParts
     |   override def getPartition(key: Any): Int = {
     |     key.toString.toInt%numPartitions
     |   }
     | }
defined class MyPartitioner

scala> def func(x:Int) = {x%3}
func: (x: Int)Int

scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:24

scala> val rdd1 = rdd.groupBy(func(_),new MyPartitioner(3))
rdd1: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[16] at groupBy at <console>:29

scala> rdd1.collect
res8: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(3, 6, 9)), (1,CompactBuffer(1, 4, 7, 10)), (2,CompactBuffer(2, 5, 8)))

scala> rdd1.partitions.length
res9: Int = 3

reduceByKey(func, [numTasks])

reduceByKey(func, [numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument. 

Aggregates all the values of each key using func.

/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
* parallelism level.
*/
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
val words = Array("one", "two", "two", "three", "three", "three")
val rdd = sc.parallelize(words).map(word => (word, 1))
val rdd1 = rdd.reduceByKey(_ + _)
rdd1.collect
scala> val words = Array("one", "two", "two", "three", "three", "three")
words: Array[String] = Array(one, two, two, three, three, three)

scala> val rdd = sc.parallelize(words).map(word => (word, 1))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[18] at map at <console>:26

scala> val rdd1 = rdd.reduceByKey(_ + _)
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[19] at reduceByKey at <console>:28

scala> rdd1.collect
res10: Array[(String, Int)] = Array((two,2), (one,1), (three,3))

/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
*/
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

Same as above, but with the number of partitions specified.
scala> val rdd2 = rdd.reduceByKey(_ + _,3)
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[20] at reduceByKey at <console>:28

scala> rdd2.collect
res11: Array[(String, Int)] = Array((two,2), (one,1), (three,3))

scala> rdd2.partitions.length
res12: Int = 3

/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce.
*/
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

Same as above, but using a custom partitioner (note that the partitioner is the first argument); a minimal sketch is given below.
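
The original gives no example for this variant, so the following is only an illustrative sketch. It assumes the MyPartitioner class defined in the intersection section above is still in scope and uses integer keys produced by keyBy (MyPartitioner calls toInt on its keys, so string keys would fail).

val rdd = sc.parallelize(1 to 9, 2).keyBy(_ % 3)
// reduceByKey(partitioner, func): the custom partitioner comes first, then the reduce function.
val rdd1 = rdd.reduceByKey(new MyPartitioner(3), _ + _)
rdd1.collect              // expected sums per key: (0,18), (1,12), (2,15), in some order
rdd1.partitions.length    // expected: 3, the value passed to MyPartitioner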

