Spark算子总结及案例
spark算子大致上可分三大类算子:
1、Value数据类型的Transformation算子,这种变换不触发提交作业,针对处理的数据项是Value型的数据。
2、Key-Value数据类型的Transformation算子,这种变换不触发提交作业,针对处理的数据项是Key-Value型的数据。
3、Action算子,这类算子会触发SparkContext提交作业。
一、Value型Transformation算子
1)map
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), )
val b = a.map(_.length)
val c = a.zip(b)
c.collect
res0: Array[(String, Int)] = Array((dog,), (salmon,), (salmon,), (rat,), (elephant,))

2)flatMap
val a = sc.parallelize( to , )
a.flatMap( to _).collect
res47: Array[Int] = Array(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ) sc.parallelize(List(, , ), ).flatMap(x => List(x, x, x)).collect
res85: Array[Int] = Array(, , , , , , , , )

3)mapPartiions
val x = sc.parallelize( to , )
x.flatMap(List.fill(scala.util.Random.nextInt())(_)).collect res1: Array[Int] = Array(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , )

4)glom(形成一个Array数组)
val a = sc.parallelize( to , )
a.glom.collect
res8: Array[Array[Int]] = Array(Array(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ), Array(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ), Array(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ))

5)union
val a = sc.parallelize( to , )
val b = sc.parallelize( to , )
(a ++ b).collect
res0: Array[Int] = Array(, , , , , )

6)cartesian(笛卡尔操作)
val x = sc.parallelize(List(,,,,))
val y = sc.parallelize(List(,,,,))
x.cartesian(y).collect
res0: Array[(Int, Int)] = Array((,), (,), (,), (,), (,), (,), (,), (,), (,), (,), (,), (,), (,), (,), (,), (,), (,), (,), (,), (,), (,), (,), (,), (,), (,))

7)groupBy(生成相应的key,相同的放在一起)
val a = sc.parallelize( to , )
a.groupBy(x => { if (x % == ) "even" else "odd" }).collect
res42: Array[(String, Seq[Int])] = Array((even,ArrayBuffer(, , , )), (odd,ArrayBuffer(, , , , )))

8)filter
val a = sc.parallelize( to , )
val b = a.filter(_ % == )
b.collect
res3: Array[Int] = Array(, , , , )

9)distinct(去重)
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), )
c.distinct.collect
res6: Array[String] = Array(Dog, Gnu, Cat, Rat)

10)subtract(去掉含有重复的项)
val a = sc.parallelize( to , )
val b = sc.parallelize( to , )
val c = a.subtract(b)
c.collect
res3: Array[Int] = Array(, , , , , )

11)sample
val a = sc.parallelize( to , )
a.sample(false, 0.1, ).count
res24: Long =

12)takesample
val x = sc.parallelize( to , )
x.takeSample(true, , )
res3: Array[Int] = Array(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , )

13)cache、persist
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), )
c.getStorageLevel
res0: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, )
c.cache
c.getStorageLevel
res2: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, )
二、Key-Value型Transformation算子
1)mapValues
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), )
val b = a.map(x => (x.length, x))
b.mapValues("x" + _ + "x").collect
res5: Array[(Int, String)] = Array((,xdogx), (,xtigerx), (,xlionx), (,xcatx), (,xpantherx), (,xeaglex))

2)combineByKey
val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), )
val b = sc.parallelize(List(,,,,,,,,), )
val c = b.zip(a)
val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)
d.collect
res16: Array[(Int, List[String])] = Array((,List(cat, dog, turkey)), (,List(gnu, rabbit, salmon, bee, bear, wolf)))

3)reduceByKey
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), )
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res86: Array[(Int, String)] = Array((,dogcatowlgnuant))
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), )
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res87: Array[(Int, String)] = Array((,lion), (,dogcat), (,panther), (,tigereagle))

4)partitionBy
(对RDD进行分区操作)

5)cogroup
val a = sc.parallelize(List(, , , ), )
val b = a.map((_, "b"))
val c = a.map((_, "c"))
b.cogroup(c).collect
res7: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
(,(ArrayBuffer(b),ArrayBuffer(c))),
(,(ArrayBuffer(b),ArrayBuffer(c))),
(,(ArrayBuffer(b, b),ArrayBuffer(c, c)))
)

6)join
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), )
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), )
val d = c.keyBy(_.length)
b.join(d).collect
res0: Array[(Int, (String, String))] = Array((,(salmon,salmon)), (,(salmon,rabbit)), (,(salmon,turkey)), (,(salmon,salmon)), (,(salmon,rabbit)), (,(salmon,turkey)), (,(dog,dog)), (,(dog,cat)), (,(dog,gnu)), (,(dog,bee)), (,(rat,dog)), (,(rat,cat)), (,(rat,gnu)), (,(rat,bee)))

7)leftOutJoin
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), )
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), )
val d = c.keyBy(_.length)
b.leftOuterJoin(d).collect
res1: Array[(Int, (String, Option[String]))] = Array((,(salmon,Some(salmon))), (,(salmon,Some(rabbit))), (,(salmon,Some(turkey))), (,(salmon,Some(salmon))), (,(salmon,Some(rabbit))), (,(salmon,Some(turkey))), (,(dog,Some(dog))), (,(dog,Some(cat))), (,(dog,Some(gnu))), (,(dog,Some(bee))), (,(rat,Some(dog))), (,(rat,Some(cat))), (,(rat,Some(gnu))), (,(rat,Some(bee))), (,(elephant,None)))
8)rightOutJoin
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), )
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), )
val d = c.keyBy(_.length)
b.rightOuterJoin(d).collect
res2: Array[(Int, (Option[String], String))] = Array((,(Some(salmon),salmon)), (,(Some(salmon),rabbit)), (,(Some(salmon),turkey)), (,(Some(salmon),salmon)), (,(Some(salmon),rabbit)), (,(Some(salmon),turkey)), (,(Some(dog),dog)), (,(Some(dog),cat)), (,(Some(dog),gnu)), (,(Some(dog),bee)), (,(Some(rat),dog)), (,(Some(rat),cat)), (,(Some(rat),gnu)), (,(Some(rat),bee)), (,(None,wolf)), (,(None,bear)))
三、Actions算子
1)foreach
val c = sc.parallelize(List("cat", "dog", "tiger", "lion", "gnu", "crocodile", "ant", "whale", "dolphin", "spider"), )
c.foreach(x => println(x + "s are yummy"))
lions are yummy
gnus are yummy
crocodiles are yummy
ants are yummy
whales are yummy
dolphins are yummy
spiders are yummy

2)saveAsTextFile
val a = sc.parallelize( to , )
a.saveAsTextFile("mydata_a")
// :: INFO FileOutputCommitter: Saved output of task 'attempt_201404032111_0000_m_000002_71' to file:/home/cloudera/Documents/spark-0.9.-incubating-bin-cdh4/bin/mydata_a

3)saveAsObjectFile
val x = sc.parallelize( to , )
x.saveAsObjectFile("objFile")
val y = sc.objectFile[Int]("objFile")
y.collect
res52: Array[Int] = Array[Int] = Array(, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , )

4)collect
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), )
c.collect
res29: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat)

5)collectAsMap
val a = sc.parallelize(List(, , , ), )
val b = a.zip(a)
b.collectAsMap
res1: scala.collection.Map[Int,Int] = Map( -> , -> , -> )

6)reduceByKeyLocally
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), )
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res86: Array[(Int, String)] = Array((,dogcatowlgnuant))
7)lookup
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), )
val b = a.map(x => (x.length, x))
b.lookup()
res0: Seq[String] = WrappedArray(tiger, eagle)

8)count
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), )
c.count
res2: Long =

9)top
val c = sc.parallelize(Array(, , , , , ), )
c.top()
res28: Array[Int] = Array(, )
10)reduce
val a = sc.parallelize( to , )
a.reduce(_ + _)
res41: Int =
11)fold
val a = sc.parallelize(List(,,), )
a.fold()(_ + _)
res59: Int =

12)aggregate
val z = sc.parallelize(List(,,,,,), ) // lets first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
} z.mapPartitionsWithIndex(myfunc).collect
res28: Array[String] = Array([partID:, val: ], [partID:, val: ], [partID:, val: ], [partID:, val: ], [partID:, val: ], [partID:, val: ]) z.aggregate()(math.max(_, _), _ + _)
res40: Int =
参考:http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
Spark算子总结及案例的更多相关文章
- Spark算子总结(带案例)
Spark算子总结(带案例) spark算子大致上可分三大类算子: 1.Value数据类型的Transformation算子,这种变换不触发提交作业,针对处理的数据项是Value型的数据. 2.Key ...
- Spark Streaming 进阶与案例实战
Spark Streaming 进阶与案例实战 1.带状态的算子: UpdateStateByKey 2.实战:计算到目前位置累积出现的单词个数写入到MySql中 1.create table CRE ...
- (转)Spark 算子系列文章
http://lxw1234.com/archives/2015/07/363.htm Spark算子:RDD基本转换操作(1)–map.flagMap.distinct Spark算子:RDD创建操 ...
- 《图解Spark:核心技术与案例实战》作者经验谈
1,看您有维护博客,还利用业余时间著书,在技术输出.自我提升以及本职工作的时间利用上您有没有什么心得和大家分享?(也可以包含一些您写书的小故事.)回答:在工作之余能够写博客.著书主要对技术的坚持和热爱 ...
- UserView--第二种方式(避免第一种方式Set饱和),基于Spark算子的java代码实现
UserView--第二种方式(避免第一种方式Set饱和),基于Spark算子的java代码实现 测试数据 java代码 package com.hzf.spark.study; import ...
- UserView--第一种方式set去重,基于Spark算子的java代码实现
UserView--第一种方式set去重,基于Spark算子的java代码实现 测试数据 java代码 package com.hzf.spark.study; import java.util.Ha ...
- 【原创 Hadoop&Spark 动手实践 6】Spark 编程实例与案例演示
[原创 Hadoop&Spark 动手实践 6]Spark 编程实例与案例演示 Spark 编程实例和简易电影分析系统的编写 目标: 1. 掌握理论:了解Spark编程的理论基础 2. 搭建 ...
- Scala进阶之路-Spark底层通信小案例
Scala进阶之路-Spark底层通信小案例 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任. 一.Spark Master和worker通信过程简介 1>.Worker会向ma ...
- spark算子之DataFrame和DataSet
前言 传统的RDD相对于mapreduce和storm提供了丰富强大的算子.在spark慢慢步入DataFrame到DataSet的今天,在算子的类型基本不变的情况下,这两个数据集提供了更为强大的的功 ...
随机推荐
- 原创:LoadTest系列之Local.testtings之Web Test
在录制脚本后,需要对脚本进行回放,而在Loal.testting的Web Test设置中,可以设置脚本运行的次数,以及是否考虑think time等,如下图所示: 1. 双击Solution Item ...
- 有关sqlitedrop数据库重建比delete方式来清空数据库更加有效率
今天浏览stackoverflow 发现一个有趣的问题: which was more preferable as performance wise and without error cause t ...
- USACO Section 1.2 Palindromic Squares 解题报告
题目 题目描述 输入一个基数B,现在要从1到300之间找出一些符合要求的数字N.如果N的平方转换成B进制数之后是一个回文串,那么N就符合要求.我们将N转换成B进制数输出,然后再将N的平方转换成B进制数 ...
- 伸展树 Splay 模板
学习Splay的时候参考了很多不同的资料,然而参考资料太杂的后果就是模板调出来一直都有问题,尤其是最后发现网上找的各种资料均有不同程度的错误. 好在啃了几天之后终于算是啃下来了. Splay也算是平衡 ...
- 【HighCharts系列教程】三、图表属性——chart
一.chart属性说明 Chart是HighCharts图表中主要属性,包括了图表区域的颜色.线条.高度.宽度.对齐.图表类型等诸多属性,也是HighCharts图表中必须配置的属性之一. 配置cha ...
- w3school之CSS学习笔记
由于web自动化测试中,会用到比较复杂的定位方式:CSS定位,这种定位方式比较简洁,定位速度较快,所以继续学习前端的CSS知识,总结下学习笔记,以便后续查看.同时,也希望能帮助到大家. 学习网址:ht ...
- EasyUi之datagird解读
1.其json格式需要为: JSON Code 1234567891011121314151617181920212223 { , "rows": [ ...
- h2database. 官方文档
http://www.h2database.com/html/advanced.html http://www.h2database.com/html/tutorial.html#csv http:/ ...
- [iOS Animation]-CALayer 缓冲
缓冲 生活和艺术一样,最美的永远是曲线. -- 爱德华布尔沃 - 利顿 在第九章“图层时间”中,我们讨论了动画时间和CAMediaTiming协议.现在我们来看一下另一个和时间相关的机制--所谓的缓冲 ...
- CI框架学习——检查用户名与密码是否合法(二)
检查用户名与密码是否合法 步骤一.编写用户登录的窗体见下面内容 步骤二.编写数据库校验方法 $username = $_POST["username"]; # ...
