I. Introduction to Spark RDD Operators

    An RDD (Resilient Distributed Dataset) is a distributed dataset and Spark's most fundamental data abstraction.
It is similar to a Scala collection: an immutable, partitionable collection whose elements can be computed in parallel. RDD characteristics:
It has the characteristics of a dataflow model
Automatic fault tolerance
Locality-aware scheduling
Scalability
RDDs let users cache a working set in memory while running multiple queries; reusing the working set greatly speeds up later queries. RDD operators fall into two types:
1) Transformation
2) Action
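
A minimal sketch of the in-memory reuse mentioned above, assuming a spark-shell with `sc` in scope and the /root/words.txt file used in the examples below:

    // Mark the working set to be kept in memory; later actions reuse it instead of re-reading the file.
    val lines = sc.textFile("/root/words.txt")
    val words = lines.flatMap(_.split(" ")).cache()
    println(words.count())              // the first action computes the RDD and populates the cache
    println(words.distinct().count())   // subsequent actions read from the cache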

II. Creating RDDs

    RDD operators fall into two types:
1) Transformation (lazy: evaluation is deferred)
2) Action (triggers a job)

    Example:

    scala> sc.textFile("/root/words.txt")
    res8: org.apache.spark.rdd.RDD[String] = /root/words.txt MapPartitionsRDD[28] at textFile at <console>:25
    scala> sc.textFile("/root/words.txt").flatMap(_.split(" "))
    res9: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[31] at flatMap at <console>:25
    scala> res8
    res10: org.apache.spark.rdd.RDD[String] = /root/words.txt MapPartitionsRDD[28] at textFile at <console>:25
    scala> res8.flatMap(_.split(" "))
    res11: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[32] at flatMap at <console>:27
    scala> res8.flatMap(_.split(" ")).map((_,1))
    res12: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[34] at map at <console>:27
    scala> res8.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
    res13: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[37] at reduceByKey at <console>:27
    scala> res13.collect
    res14: Array[(String, Int)] = Array((is,1), (love,2), (capital,1), (Beijing,2), (China,2), (I,2), (of,1), (the,1))
    scala> val rdd1 = sc.textFile("/root/words.txt")
    rdd1: org.apache.spark.rdd.RDD[String] = /root/words.txt MapPartitionsRDD[39] at textFile at <console>:24
    scala> rdd1.count
    res15: Long = 3
    scala> val list = List(1,3,5,7)
    list: List[Int] = List(1, 3, 5, 7)
    scala> list.map(_ * 100)
    res16: List[Int] = List(100, 300, 500, 700)
    scala> val rdd1 = sc.parallelize(list)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[40] at parallelize at <console>:26
    scala> rdd1.map(_ * 100)
    res17: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[41] at map at <console>:29
    scala> res17.collect
    res18: Array[Int] = Array(100, 300, 500, 700)
    scala> val rdd2 = sc.makeRDD(list)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[42] at makeRDD at <console>:26
    scala> val rdd1 = sc.parallelize(List(1,2,3,4,5),3)
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[43] at parallelize at <console>:24
    scala> val rdd2 = rdd1.map(_ * 1000)
    rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[44] at map at <console>:26
    scala> rdd2.collect
    res19: Array[Int] = Array(1000, 2000, 3000, 4000, 5000)

III. Common Transformations

    Characteristics of a Transformation:
1) It produces a new RDD
2) It is lazy: evaluation is deferred until an action runs
3) It does not store the actual data; it only records the transformation lineage (see the sketch below)
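
A small hedged sketch of points 2) and 3), again assuming a spark-shell with `sc` and /root/words.txt: the chained calls return new RDDs immediately without reading any data, and the recorded lineage can be inspected with toDebugString.

    // Nothing is computed here: each call only records a new RDD and its parent.
    val pairs = sc.textFile("/root/words.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
    println(pairs.toDebugString)   // prints the lineage (map <- flatMap <- textFile); no job has run yet
    println(pairs.count())         // only this action triggers actual computation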

1. map(func)
2. flatMap(func)
3. sortBy (not shown in the transcript below; see the sketch that follows it)
4. reduceByKey

    scala> sc.textFile("/root/words.txt")
    res8: org.apache.spark.rdd.RDD[String] = /root/words.txt MapPartitionsRDD[28] at textFile at <console>:25
    scala> res8.flatMap(_.split(" "))
    res11: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[32] at flatMap at <console>:27
    scala> res8.flatMap(_.split(" ")).map((_,1))
    res12: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[34] at map at <console>:27
    scala> res8.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
    res13: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[37] at reduceByKey at <console>:27
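
sortBy from the list above does not appear in the transcript; a hedged sketch on the same word-count result might look like this (sorting by count in descending order is an illustrative choice, not taken from the original):

    // sortBy(keyFunc, ascending) is itself a transformation and stays lazy until an action runs.
    val wordCounts = sc.textFile("/root/words.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    val byCountDesc = wordCounts.sortBy(_._2, ascending = false)
    byCountDesc.collect().foreach(println)   // e.g. (love,2), (Beijing,2), ... given the data in res14 above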

5. filter: keep the elements that satisfy a predicate

    scala> val rdd1 = sc.parallelize(List(1,2,3,4,5))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[48] at parallelize at <console>:24
    scala> rdd1.filter(_ % 2 == 0)
    res24: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[49] at filter at <console>:27
    scala> res24.collect
    res25: Array[Int] = Array(2, 4)

6. union: union of two RDDs (duplicates are kept)

    scala> val rdd2 = sc.parallelize(List(1,2,3,4,5,6,7))
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[50] at parallelize at <console>:24
    scala> rdd1.union(rdd2)
    res26: org.apache.spark.rdd.RDD[Int] = UnionRDD[51] at union at <console>:29
    scala> res26.collect
    res27: Array[Int] = Array(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7)
    scala> rdd1 union rdd2
    res28: org.apache.spark.rdd.RDD[Int] = UnionRDD[52] at union at <console>:29
    scala> res28.collect
    res29: Array[Int] = Array(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7)

7. groupByKey: group values by key

    scala> val rdd3 = sc.parallelize(List(("Tom",18),("John",16),("Tom",20),("Mary",17),("John",23)))
    rdd3: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[53] at parallelize at <console>:24
    scala> rdd3.groupByKey
    res30: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[54] at groupByKey at <console>:27
    scala> res30.collect
    res31: Array[(String, Iterable[Int])] = Array((John,CompactBuffer(16, 23)), (Tom,CompactBuffer(18, 20)), (Mary,CompactBuffer(17)))

8. intersection: intersection of two RDDs

    scala> val rdd1 = sc.parallelize(List(1,2,3,4,5))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[55] at parallelize at <console>:24
    scala> val rdd2 = sc.parallelize(List(1,2,3,4,5,6,7))
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[56] at parallelize at <console>:24
    scala> rdd1.intersection(rdd2)
    res33: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[62] at intersection at <console>:29
    scala> res33.collect
    res34: Array[Int] = Array(3, 4, 1, 5, 2)

9. join: inner join by key

    scala> val rdd1 = sc.parallelize(List(("Tom",18),("John",16),("Mary",21)))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[65] at parallelize at <console>:24
    scala> val rdd2 = sc.parallelize(List(("Tom",28),("John",26),("Cat",17)))
    rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[66] at parallelize at <console>:24
    scala> rdd1.join(rdd2)
    res39: org.apache.spark.rdd.RDD[(String, (Int, Int))] = MapPartitionsRDD[69] at join at <console>:29
    scala> res39.collect
    res40: Array[(String, (Int, Int))] = Array((John,(16,26)), (Tom,(18,28)))

10. leftOuterJoin: left outer join

All keys from the left RDD are kept; the right-hand value is Some(v) when the key matches and None when it does not.

    scala> val rdd1 = sc.parallelize(List(("Tom",18),("John",16),("Mary",21)))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[65] at parallelize at <console>:24
    scala> val rdd2 = sc.parallelize(List(("Tom",28),("John",26),("Cat",17)))
    rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[66] at parallelize at <console>:24
    scala> rdd1.leftOuterJoin(rdd2)
    res41: org.apache.spark.rdd.RDD[(String, (Int, Option[Int]))] = MapPartitionsRDD[72] at leftOuterJoin at <console>:29
    scala> res41.collect
    res42: Array[(String, (Int, Option[Int]))] = Array((John,(16,Some(26))), (Tom,(18,Some(28))), (Mary,(21,None)))

11. rightOuterJoin: right outer join

    scala> val rdd1 = sc.parallelize(List(("Tom",18),("John",16),("Mary",21)))
    rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[65] at parallelize at <console>:24
    scala> val rdd2 = sc.parallelize(List(("Tom",28),("John",26),("Cat",17)))
    rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[66] at parallelize at <console>:24
    scala> rdd1.rightOuterJoin(rdd2).collect
    res43: Array[(String, (Option[Int], Int))] = Array((John,(Some(16),26)), (Tom,(Some(18),28)), (Cat,(None,17)))

12. cartesian: Cartesian product

    scala> val rdd1 = sc.parallelize(List("Tom","Mary"))
    rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[76] at parallelize at <console>:24
    scala> val rdd2 = sc.parallelize(List("John","Joe"))
    rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[77] at parallelize at <console>:24
    scala> rdd1.cartesian(rdd2)
    res45: org.apache.spark.rdd.RDD[(String, String)] = CartesianRDD[78] at cartesian at <console>:29
    scala> res45.collect
    res46: Array[(String, String)] = Array((Tom,John), (Tom,Joe), (Mary,John), (Mary,Joe))

IV. Common Actions

1. collect: bring the results back to the driver as an array

    scala> val rdd1 = sc.parallelize(List(1,2,3,4))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
    scala> rdd1.collect
    res0: Array[Int] = Array(1, 2, 3, 4)

2. saveAsTextFile(path): save the RDD as text files

An aside on how input files are split. Suppose three files of 5 B, 5 B and 600 B are read with three splits requested. The goal split size is (5 + 5 + 600) / 3 = 610 / 3 ≈ 203 B, and each file is split on its own:
the first 5 B file: one split
the second 5 B file: one split
the 600 B file: splits of 203 B, 203 B and the remaining 194 B (the leftover is within the 1.1x slop factor, so it is not split again)
giving five splits in total.
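
To see how this plays out, the partition count of a textFile RDD can be checked directly; the directory path below is hypothetical, and the requested value is only a minimum, so the actual split count follows the rules sketched above:

    // The second argument (minPartitions) feeds the goal-size calculation; it is a lower bound, not an exact count.
    val files = sc.textFile("/root/data", 3)
    println(files.partitions.length)   // the number of splits actually produced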

    scala> val rdd1 = sc.parallelize(List(1,2,3,4))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
    scala> rdd1.saveAsTextFile("/root/RDD1")
    scala> rdd1.partitions.length   // check the number of partitions
    res3: Int = 4

3. count: count the elements

    scala> val rdd1 = sc.parallelize(List(1,2,3,4))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
    scala> rdd1.count
    res2: Long = 4

4. reduce: aggregate the elements

    scala> val rdd2 = sc.parallelize(List(1,2,3,4),2)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24
    scala> rdd2.partitions.length
    res4: Int = 2
    scala> rdd2.reduce(_+_)
    res5: Int = 10

5. countByKey(): count occurrences per key

    scala> sc.parallelize(List(("Tom",18),("Tom",28),("John",14),("Mary",16)))
    res9: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[4] at parallelize at <console>:25
    scala> res9.count
    res10: Long = 4
    scala> res9.countByKey()
    res11: scala.collection.Map[String,Long] = Map(Tom -> 2, Mary -> 1, John -> 1)
    scala> res9.reduceByKey(_+_)
    res12: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[7] at reduceByKey at <console>:27
    scala> res9.reduceByKey(_+_).collect
    res13: Array[(String, Int)] = Array((Tom,46), (Mary,16), (John,14))

6. take(n): take the first n elements

    scala> val rdd1 = sc.parallelize(List(1,2,3,4))
    rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
    scala> rdd1.take(2)
    res15: Array[Int] = Array(1, 2)
    scala> val rdd3 = sc.parallelize(List(3,2,8,1,7))
    rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24
    scala> rdd3.take(2)
    res17: Array[Int] = Array(3, 2)

7. first: return the first element of the RDD

    scala> val rdd3 = sc.parallelize(List(3,2,8,1,7))
    rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24
    scala> rdd3.first
    res18: Int = 3

8. takeOrdered(n): take n elements in ascending order (the default ordering)

    scala> val rdd3 = sc.parallelize(List(3,2,8,1,7))
    rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24
    scala> rdd3.takeOrdered(2)
    res19: Array[Int] = Array(1, 2)

9. top(n): take n elements in descending order

    scala> val rdd3 = sc.parallelize(List(3,2,8,1,7))
    rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24
    scala> rdd3.top(2)
    res20: Array[Int] = Array(8, 7)

V. Advanced Spark Operators

1. mapPartitionsWithIndex(func)
Set the number of partitions and inspect which elements each partition holds. It takes a function over the partition index and the partition's iterator, for example:
val func = (index:Int,iter:Iterator[(Int)]) => {iter.toList.map(x => "partID:" + index + "," + "datas:" + x + "]").iterator}

    scala> val rdd3 = sc.parallelize(List(1,2,3,4,5,6,7),2)
    rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24
    scala> val func = (index:Int,iter:Iterator[(Int)]) => {iter.toList.map(x => "partID:" + index + "," + "datas:" + x + "]").iterator}
    func: (Int, Iterator[Int]) => Iterator[String] = <function2>
    scala> rdd3.mapPartitionsWithIndex(func)
    res21: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at mapPartitionsWithIndex at <console>:29
    scala> rdd3.mapPartitionsWithIndex(func).collect
    res22: Array[String] = Array(partID:0,datas:1], partID:0,datas:2], partID:0,datas:3], partID:1,datas:4], partID:1,datas:5], partID:1,datas:6], partID:1,datas:7])

2. aggregate
Aggregate: combine within each partition first (seqOp), then combine the partial results globally (combOp). For comparison, the transcript also uses max and min to take the largest and smallest element.

    scala> val rdd3 = sc.parallelize(List(1,2,3,4,5,6,7),2)
    rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:24
    scala> rdd3.aggregate(0)(_+_,_+_)
    res23: Int = 28
    scala> rdd3.max
    res24: Int = 7
    scala> rdd3.min
    res25: Int = 1
    scala> val rdd3 = sc.parallelize(List(1,2,3,4,5,6,7),2)
    rdd3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at parallelize at <console>:24
    scala> rdd3.aggregate(0)(math.max(_,_),_+_)
    res29: Int = 10
    scala> rdd3.aggregate(10)(math.max(_,_),_+_)
    res31: Int = 30
    // 1+2+3+20 + 4+5+6+7+20 + 20 = 88
    scala> rdd3.aggregate(20)(_+_,_+_)
    res32: Int = 88
    scala> val rdd4 = sc.parallelize(List("a","b","c","d","e"),2)
    rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[17] at parallelize at <console>:24
    scala> rdd4.aggregate("|")(_+_,_+_)
    res33: String = ||ab|cde
    scala> rdd4.aggregate("|")(_+_,_+_)
    res34: String = ||cde|ab
    scala> val rdd5 = sc.parallelize(List("12","23","234","3456"),2)
    rdd5: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:24
    scala> rdd5.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y) => x+y)
    res35: String = 24
    scala> rdd5.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y) => x+y)
    res36: String = 42
    scala> rdd5.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y) => x+y)
    res37: String = 24
    scala> rdd5.aggregate("")((x,y) => math.max(x.length,y.length).toString,(x,y) => x+y)
    res38: String = 42
    scala> val rdd6 = sc.parallelize(List("12","23","345",""),2)
    rdd6: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[19] at parallelize at <console>:24
    scala> rdd6.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
    res41: String = 01
    scala> rdd6.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
    res42: String = 10
    scala> rdd6.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
    res43: String = 01
    scala> rdd6.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
    res48: String = 10
    scala> val rdd7 = sc.parallelize(List("12","23","","456"),2)
    rdd7: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24
    scala> rdd7.aggregate("")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
    res1: String = 11
    scala> ("").length
    res2: Int = 0
    scala> 0.length
    <console>:24: error: value length is not a member of Int
           0.length
             ^
    scala> 0.toString.length
    res5: Int = 1
    scala> rdd7.aggregate("0")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
    res6: String = 011
    scala> rdd7.aggregate("0")((x,y) => math.min(x.length,y.length).toString,(x,y) => x+y)
    res7: String = 011
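
A hedged walk-through of res33 and res34 above: the zero value is applied once per partition by the seqOp and once more by the combOp, and the order in which partition results are combined is not deterministic.

    // rdd4 splits into two partitions: ("a","b") and ("c","d","e").
    // seqOp per partition:  "|" + "a" + "b" = "|ab"   and   "|" + "c" + "d" + "e" = "|cde"
    // combOp on the driver: "|" + "|ab" + "|cde" = "||ab|cde"  (or "||cde|ab" if the results arrive in the other order)
    val rdd4 = sc.parallelize(List("a", "b", "c", "d", "e"), 2)
    println(rdd4.aggregate("|")(_ + _, _ + _))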

3. aggregateByKey
Aggregate values by key: first within each partition, then globally across partitions.

    scala> val rdd8 = sc.parallelize(List(("cat",3),("cat",8),("mouse",6),("dog",8)))
    rdd8: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:24
    scala> def func(index:Int,iter:Iterator[(String,Int)]):Iterator[String] = {iter.toList.map(x => "partID:" + index + "," + "values:" + x + "]").iterator}
    func: (index: Int, iter: Iterator[(String, Int)])Iterator[String]
    scala> rdd8.mapPartitionsWithIndex(func).collect
    res34: Array[String] = Array(partID:0,values:(cat,3)], partID:1,values:(cat,8)], partID:2,values:(mouse,6)], partID:3,values:(dog,8)])
    scala> rdd8.aggregateByKey(0)(_+_,_+_).collect
    res35: Array[(String, Int)] = Array((dog,8), (mouse,6), (cat,11))
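
A hedged sketch, not part of the original transcript, of why the two functions are separate: the per-partition function and the cross-partition function can differ, for example taking the maximum per key inside each partition and then summing those maxima.

    // With two explicit partitions this typically splits into (("cat",3),("cat",8)) and (("mouse",6),("dog",8)).
    // Per-partition max gives (cat,8), (mouse,6), (dog,8); summing across partitions leaves them unchanged here.
    val pets = sc.parallelize(List(("cat", 3), ("cat", 8), ("mouse", 6), ("dog", 8)), 2)
    val maxThenSum = pets.aggregateByKey(0)(math.max(_, _), _ + _)
    maxThenSum.collect().foreach(println)   // e.g. (dog,8), (cat,8), (mouse,6)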

4. combineByKey
Both aggregateByKey and reduceByKey are built on combineByKey underneath. It is the lowest-level of these aggregations: accumulate locally within each partition first, then merge the partial results globally.

    scala> val rdd1 = sc.textFile("hdfs://192.168.146.111:9000/words.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).collect
    rdd1: Array[(String, Int)] = Array((haha,1), (heihei,1), (hello,3), (Beijing,1), (world,1), (China,1))
    scala> val rdd2 = sc.textFile("hdfs://192.168.146.111:9000/words.txt").flatMap(_.split("\t")).map((_,1)).combineByKey(x => x,(m:Int,n:Int) => (m+n),(a:Int,b:Int) => (a+b)).collect
    rdd2: Array[(String, Int)] = Array((haha,1), (heihei,1), (hello,3), (Beijing,1), (world,1), (China,1))
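
Because the word count above is indistinguishable from reduceByKey, here is a hedged sketch (not from the original) where the three functions do different work: computing the average value per key by carrying a (sum, count) pair as the combiner.

    val scores = sc.parallelize(List(("Tom", 18), ("Tom", 28), ("John", 14), ("Mary", 16)))
    val sumCount = scores.combineByKey(
      (v: Int) => (v, 1),                                          // createCombiner: first value seen in a partition
      (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue: fold another local value in
      (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners: merge across partitions
    )
    val avg = sumCount.mapValues { case (sum, cnt) => sum.toDouble / cnt }
    avg.collect().foreach(println)   // e.g. (Tom,23.0), (John,14.0), (Mary,16.0)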

5. coalesce
coalesce(4, true) takes the target number of partitions (4 here) and a flag saying whether to shuffle.

repartition is implemented on top of coalesce with the shuffle flag already set to true.

    scala> val rdd2 = sc.parallelize(List(1,2,3,4,5,6,7,8,9),2)
    rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:24
    scala> rdd2.partitions.length
    res42: Int = 2
    scala> val rdd3 = rdd2.coalesce(4,true)
    rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[19] at coalesce at <console>:26
    scala> rdd3.partitions.length
    res43: Int = 4
    scala> val rdd4 = rdd3.repartition(5)
    rdd4: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[23] at repartition at <console>:28
    scala> rdd4.partitions.length
    res44: Int = 5

6. filterByRange
Keep only the key-value pairs whose keys fall within the given range (both bounds inclusive).

    scala> val rdd6 = sc.parallelize(List(("a",3),("b",2),("d",5),("e",8)))
    rdd6: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[24] at parallelize at <console>:24
    scala> rdd6.filterByRange("b","d").collect
    res45: Array[(String, Int)] = Array((b,2), (d,5))
    scala> rdd6.filterByRange("b","e").collect
    res46: Array[(String, Int)] = Array((b,2), (d,5), (e,8))

7. flatMapValues
Split each value into pieces and flatten the result, pairing every piece with the original key.

    scala> val rdd7 = sc.parallelize(List(("a","3 6"),("b","2 5"),("d","5 8")))
    rdd7: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[27] at parallelize at <console>:24
    scala> rdd7.flatMapValues(_.split(" ")).collect
    res47: Array[(String, String)] = Array((a,3), (a,6), (b,2), (b,5), (d,5), (d,8))

8. foldByKey
Example requirement: concatenate the strings that share the same key.

    scala> val rdd8 = sc.parallelize(List("Tom","John","Mary","Joe"),2)
    rdd8: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[31] at parallelize at <console>:24
    scala> val rdd9 = rdd8.map(x => (x.length,x))
    rdd9: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[32] at map at <console>:26
    scala> rdd9.collect
    res48: Array[(Int, String)] = Array((3,Tom), (4,John), (4,Mary), (3,Joe))
    scala> rdd9.foldByKey("")(_+_).collect
    res49: Array[(Int, String)] = Array((4,JohnMary), (3,JoeTom))

9. foreach
Iterate over the elements, applying a function to each one.

import org.apache.spark.{SparkConf, SparkContext}

object ForeachDemo {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ForeachDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // create the RDD with 3 partitions
    val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5), 3)
    // print each element
    rdd1.foreach(println(_))
    sc.stop()
  }
}

Result: the numbers 1 through 5 are printed, though not necessarily in order, since the partitions are processed in parallel.

10. keyBy
Build key-value pairs by deriving a key from each element; the keys and values methods then return just the keys or just the values.

    scala> val rdd2 = sc.parallelize(List("Tom","John","Jack"),3)
    rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[37] at parallelize at <console>:24
    scala> val rdd3 = rdd2.keyBy(_.length)
    rdd3: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[38] at keyBy at <console>:26
    scala> rdd3.collect
    res60: Array[(Int, String)] = Array((3,Tom), (4,John), (4,Jack))
    scala> rdd3.keys.collect
    res61: Array[Int] = Array(3, 4, 4)
    scala> rdd3.values.collect
    res62: Array[String] = Array(Tom, John, Jack)

VI. RDD Parallelization Flow
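
As a minimal, hedged illustration of the flow, using only operators already shown above: a local collection is parallelized into a chosen number of partitions, and the resulting distribution can then be inspected per partition.

    // Distribute a local collection across 3 partitions and see which elements landed where.
    val data = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
    println(data.partitions.length)   // 3
    val withPartId = data.mapPartitionsWithIndex { (idx, iter) =>
      iter.map(x => s"partID:$idx, value:$x")
    }
    withPartId.collect().foreach(println)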
