Spark common tips summary, part 2
The zip operation
def zip[U](other: org.apache.spark.rdd.RDD[U])(implicit evidence$10: scala.reflect.ClassTag[U]): org.apache.spark.rdd.RDD[(String, U)]
scala> val rdd1=sc.makeRDD(Array("apple","pear","grape","egg","elephant"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[23] at makeRDD at <console>:24
scala> val rdd2=sc.makeRDD(List(20,5,8,6,3))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24] at makeRDD at <console>:24
scala> rdd1.zip(rdd2).collect
res35: Array[(String, Int)] = Array((apple,20), (pear,5), (grape,8), (egg,6), (elephant,3))
scala> val rdd3=rdd1 zip rdd2
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = ZippedPartitionsRDD2[27] at zip at <console>:28
scala> rdd3.collect
res36: Array[(String, Int)] = Array((apple,20), (pear,5), (grape,8), (egg,6), (elephant,3))
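zip pairs the i-th element of one RDD with the i-th element of the other. Both RDDs must have the same number of partitions and the same number of elements in each partition, otherwise the job fails at runtime. A minimal sketch (assuming a running spark-shell with sc available):

// Build both RDDs with the same partition count so the elements line up one-to-one.
val names  = sc.makeRDD(Array("apple", "pear", "grape"), 2)
val prices = sc.makeRDD(Array(20, 5, 8), 2)
names.zip(prices).collect    // Array((apple,20), (pear,5), (grape,8))
// Zipping RDDs whose partitions hold different numbers of elements fails at runtime.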
-------------------------
def combineByKey[C](createCombiner: Int => C,mergeValue: (C, Int) => C,mergeCombiners: (C, C) => C): org.apache.spark.rdd.RDD[(String, C)]
def combineByKey[C](createCombiner: Int => C,mergeValue: (C, Int) => C,mergeCombiners: (C, C) => C,numPartitions: Int): org.apache.spark.rdd.RDD[(String, C)]
def combineByKey[C](createCombiner: Int => C,mergeValue: (C, Int) => C,mergeCombiners: (C, C) => C,partitioner: org.apache.spark.Partitioner,mapSideCombine: Boolean,serializer: org.apache.spark.serializer.Serializer): org.apache.spark.rdd.RDD[(String, C)]
def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    numPartitions: Int): RDD[(K, C)] = self.withScope {
  combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, numPartitions)(null)
}
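combineByKey builds one combiner value per key: createCombiner turns the first value seen for a key into a combiner, mergeValue folds each further value of that key into the combiner within a partition, and mergeCombiners merges the combiners of the same key across partitions. Note that in the transcript below rdd3 has evidently been redefined as a (String, Int) pair RDD, so its values differ from the zip result above. A classic use of combineByKey is a per-key average; the following is a minimal sketch (not from the transcript):

// Per-key average: the combiner is a (sum, count) pair.
val scores = sc.makeRDD(List(("a", 90), ("b", 80), ("a", 70), ("b", 60)))
val avg = scores.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner: first value for a key
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold in another value
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge across partitions
).mapValues { case (sum, cnt) => sum.toDouble / cnt }
avg.collect    // Array((a,80.0), (b,70.0))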
scala> rdd3.collect
res53: Array[(String, Int)] = Array((apple,2), (pear,1), (grape,2), (egg,1), (elephant,1))
scala> val rdd4=rdd3.combineByKey(List(_),(x:List[Int],v:Int)=>x:+v,(m:List[Int],n:List[Int])=>m++n)
rdd4: org.apache.spark.rdd.RDD[(String, List[Int])] = ShuffledRDD[35] at combineByKey at <console>:30
scala> rdd4.collect
res51: Array[(String, List[Int])] = Array((egg,List(1)), (elephant,List(1)), (pear,List(1)), (apple,List(2)), (grape,List(2)))
scala> val rdd4=rdd3.map(x=>(x._2,x._1))
rdd4: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[33] at map at <console>:30
scala> val rdd5=rdd4.combineByKey(List(_),(x:List[String],v:String)=>x:+v,(m:List[String],n:List[String])=>m++n)
rdd5: org.apache.spark.rdd.RDD[(Int, List[String])] = ShuffledRDD[37] at combineByKey at <console>:32
scala> rdd5.collect
res52: Array[(Int, List[String])] = Array((1,List(pear, egg, elephant)), (2,List(apple, grape)))
--------------------
scala> val rdd1=sc.makeRDD(Array("apple","apple","pear","egg","hellokitty","egg","apple"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4] at makeRDD at <console>:24
scala> rdd1.countByValue
res1: scala.collection.Map[String,Long] = Map(hellokitty -> 1, egg -> 2, pear -> 1, apple -> 3)
scala> val map1=rdd1.countByValue
map1: scala.collection.Map[String,Long] = Map(hellokitty -> 1, egg -> 2, pear -> 1, apple -> 3)
scala> val rdd2=sc.makeRDD(map1.toList)
rdd2: org.apache.spark.rdd.RDD[(String, Long)] = ParallelCollectionRDD[21] at makeRDD at <console>:28
scala> rdd2.collect
res5: Array[(String, Long)] = Array((hellokitty,1), (egg,2), (pear,1), (apple,3))
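countByValue is an action: it returns a plain Scala Map to the driver, so it is only appropriate when the number of distinct values is small. To reuse the counts in further RDD operations, either parallelize the map again (as above) or broadcast it; a sketch:

// Broadcast the driver-side counts and look them up inside a transformation.
val freq = sc.broadcast(rdd1.countByValue())
rdd1.map(x => (x, freq.value(x))).collect    // each element paired with its total count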
-------------------
scala> val rdd1=sc.makeRDD(Array("apple","apple","pear","egg","hellokitty","egg","apple"))
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[28] at makeRDD at <console>:24
scala> val rdd2=rdd1.map(x=>(x,1))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[29] at map at <console>:26
scala> rdd2.collect
res33: Array[(String, Int)] = Array((apple,1), (apple,1), (pear,1), (egg,1), (hellokitty,1), (egg,1), (apple,1))
scala> rdd2.partitions.size
res34: Int = 4
scala> rdd2.reduceByKey(_+_).collect
res36: Array[(String, Int)] = Array((hellokitty,1), (egg,2), (pear,1), (apple,3))
scala> rdd2.reduceByKey(_+_,2).partitions.size // the shuffle repartitions the result into 2 partitions
res37: Int = 2
-------------------------------
A shuffle can repartition the data, and you can specify the target number of partitions.
Shuffle operations are expensive: data has to be written to disk and transferred over the network, and sometimes sorted as well. Common transformations that require a shuffle include repartition, join, cogroup, and any *By or *ByKey transformation.
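A small sketch of how these operations change the partition count (coalesce, by contrast, can shrink the number of partitions through a narrow dependency and avoid a full shuffle):

val data = sc.makeRDD(1 to 100, 4)
data.partitions.size                  // 4
data.repartition(2).partitions.size   // 2, via a full shuffle
data.coalesce(2).partitions.size      // 2, without a full shuffle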
--------------------------------------
scala> val rdd2=rdd1.map(x=>(x,1))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[29] at map at <console>:26
scala> rdd2.collect
res39: Array[(String, Int)] = Array((apple,1), (apple,1), (pear,1), (egg,1), (hellokitty,1), (egg,1), (apple,1))
scala> rdd2.combineByKey(x=>x,(c:Int,n:Int)=>c+n,(c1:Int,c2:Int)=>c1+c2).collect
res41: Array[(String, Int)] = Array((hellokitty,1), (egg,2), (pear,1), (apple,3))
scala> rdd1.countByValue()
res42: scala.collection.Map[String,Long] = Map(hellokitty -> 1, egg -> 2, pear -> 1, apple -> 3)
scala> rdd2.reduceByKey(_+_).collect
res44: Array[(String, Int)] = Array((hellokitty,1), (egg,2), (pear,1), (apple,3))
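For completeness, a fourth equivalent (not in the transcript above) is aggregateByKey, which takes an explicit zero value for the per-key accumulator:

// Same word count with aggregateByKey: seqOp adds within a partition, combOp merges partitions.
rdd2.aggregateByKey(0)(_ + _, _ + _).collect    // same counts as reduceByKey(_+_) above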
-------------------------------
scala> val rdd3=rdd1.map(x=>(1,x))
rdd3: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[40] at map at <console>:26
scala> rdd3.collect
res45: Array[(Int, String)] = Array((1,apple), (1,apple), (1,pear), (1,egg), (1,hellokitty), (1,egg), (1,apple))
scala> rdd3.combineByKey(x=>List(x),(c:List[String],y:String)=>c:+y,(c1:List[String],c2:List[String])=>c1++c2).collect
res49: Array[(Int, List[String])] = Array((1,List(apple, apple, pear, egg, hellokitty, egg, apple)))
---------------------------------------------
scala> val rdd00=sc.makeRDD(List(("a",1),("b",1),("a",3),("ba",3),("b",1),("g",10)),2)
rdd00: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[44] at makeRDD at <console>:24
scala> val rdd3=rdd00.map(x=>(x._2,x._1))
rdd3: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[45] at map at <console>:26
scala> rdd3.collect
res51: Array[(Int, String)] = Array((1,a), (1,b), (3,a), (3,ba), (1,b), (10,g))
scala> rdd3.groupByKey().collect
res53: Array[(Int, Iterable[String])] = Array((10,CompactBuffer(g)), (1,CompactBuffer(a, b, b)), (3,CompactBuffer(a, ba)))
scala> rdd3.combineByKey(x=>List(x),(c:List[String],y:String)=>c:+y,(c1:List[String],c2:List[String])=>c1++c2).collect
res54: Array[(Int, List[String])] = Array((10,List(g)), (1,List(a, b, b)), (3,List(a, ba)))
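groupByKey ships every value across the network before grouping, so when the final result is an aggregate, reduceByKey/aggregateByKey/combineByKey are usually cheaper because they combine map-side first. If plain Lists are wanted rather than CompactBuffers, mapValues over the grouped result does the same job as the combineByKey version; a sketch:

rdd3.groupByKey().mapValues(_.toList).collect
// Array((10,List(g)), (1,List(a, b, b)), (3,List(a, ba)))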
-----------------------
distinct(numPartitions: Int) removes duplicates and repartitions at the same time.
scala> val bb=sc.makeRDD(Array(1,1,2,1,8,6,8,4,5,4),2)
bb: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[81] at makeRDD at <console>:25
scala> bb.distinct(1).partitions.size
res61: Int = 1
scala> bb.distinct(3).partitions.size
res62: Int = 3
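Under the hood, distinct is essentially a map/reduceByKey pipeline, which is why it can take a partition count; a rough equivalent (a sketch, not the exact Spark source):

val dedup = bb.map(x => (x, null)).reduceByKey((x, _) => x, 3).map(_._1)
dedup.partitions.size   // 3
dedup.collect           // the distinct values of bb, spread over 3 partitions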
----------------------
def randomSplit(weights: Array[Double],seed: Long): Array[org.apache.spark.rdd.RDD[Int]]
randomSplit splits one RDD into several RDDs according to the weights array: the higher a weight, the larger the share of elements that end up in the corresponding split. The weights should sum to 1; if they do not, Spark normalizes them.
scala> val split=aa.randomSplit(Array(0.1,0.2,0.3,0.4))
split: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[165] at randomSplit at <console>:27, MapPartitionsRDD[166] at randomSplit at <console>:27, MapPartitionsRDD[167] at randomSplit at <console>:27, MapPartitionsRDD[168] at randomSplit at <console>:27)
scala> split(0).count
res94: Long = 11
scala> split(1).count
res95: Long = 19
scala> split(2).count
res96: Long = 34
scala> split(3).count
res97: Long = 36
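For a reproducible split, pass weights that sum to 1 and a fixed seed; a sketch (aa is the RDD used in the transcript above, whose definition is not shown):

val Array(part1, part2) = aa.randomSplit(Array(0.7, 0.3), seed = 42L)
part1.count + part2.count   // equals aa.count: every element lands in exactly one split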
-----------------------------------------------------
def glom(): org.apache.spark.rdd.RDD[Array[Int]]
glom puts the elements of each partition into an array, producing one array per partition (so the resulting RDD has as many elements as there are partitions).
scala> val bb=sc.makeRDD(1 to 10,3)
bb: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[203] at makeRDD at <console>:25
scala> bb.glom().collect
res127: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
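glom is also handy for checking how evenly the data is spread across partitions; a sketch:

bb.glom().map(_.length).collect   // Array(3, 3, 4): one element count per partition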