Spark RDD Actions: Simple Examples (Part 2)
foreach(f: T => Unit)
Applies the function f to every element of the RDD; f has no return value.
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: T => Unit): Unit
scala> val rdd = sc.parallelize(1 to 9, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd.foreach(x=>{println(x)})
1
2
3
4
5
6
7
8
9
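Note that foreach runs on the executors, so in cluster mode the println output appears in the executor logs rather than on the driver console. A minimal sketch (assuming Spark 2.x, where SparkContext.longAccumulator is available, and the same sc as above) of using an accumulator to bring a side effect back to the driver:
// Count the even elements with a LongAccumulator; foreach itself returns Unit.
val evens = sc.longAccumulator("evens")
sc.parallelize(1 to 9, 2).foreach(x => if (x % 2 == 0) evens.add(1))
println(evens.value)   // 4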
foreachPartition(f: Iterator[T] => Unit)
Applies the function f once to each partition of the RDD; f receives an iterator over that partition's elements.
/**
* Applies a function f to each partition of this RDD.
*/
def foreachPartition(f: Iterator[T] => Unit): Unit
scala> val rdd = sc.parallelize(1 to 9, 2)
scala> rdd.foreachPartition(x=>{
| while(x.hasNext){
| println(x.next)
| }
| println("===========")
| }
| )
1
2
3
4
===========
5
6
7
8
9
===========
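foreachPartition is typically chosen over foreach when a per-partition resource, such as a database or HTTP connection, should be opened once per partition instead of once per element. A minimal sketch, where ExternalStore and its connect/write/close methods are hypothetical placeholders for whatever client the job actually uses:
rdd.foreachPartition { it =>
  // ExternalStore is a hypothetical client; open it once per partition.
  val client = ExternalStore.connect()
  try {
    it.foreach(record => client.write(record))   // write each element in this partition
  } finally {
    client.close()                               // always release the connection
  }
}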
getCheckpointFile
Returns the directory to which this RDD was checkpointed (not defined if the RDD is checkpointed locally).
/**
* Gets the name of the directory to which this RDD was checkpointed.
* This is not defined if the RDD is checkpointed locally.
*/
def getCheckpointFile: Option[String]
scala> val rdd = sc.parallelize(1 to 9,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24

scala> rdd.checkpoint

/* Querying getCheckpointFile immediately after checkpoint returns None,
   which shows that checkpoint is lazy. */
scala> rdd.getCheckpointFile
res6: Option[String] = None

scala> rdd.count
res7: Long = 9

scala> rdd.getCheckpointFile
res8: Option[String] = Some(file:/home/check/ca771099-b1bf-46c8-9404-68b4ace7feeb/rdd-1)
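For checkpoint to write anything at all, a checkpoint directory must first be registered on the SparkContext; the transcript above assumes this was already done for the /home/check directory. A minimal sketch:
sc.setCheckpointDir("/home/check")        // must be set before calling checkpoint
val rdd = sc.parallelize(1 to 9, 2)
rdd.checkpoint()                          // lazy: nothing is written yet
rdd.count()                               // the first action materializes the checkpoint
println(rdd.getCheckpointFile)            // Some(file:/home/check/<uuid>/rdd-<id>)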
getNumPartitions
Returns the number of partitions of the RDD.
/**
* Returns the number of partitions of this RDD.
*/
@Since("1.6.0")
final def getNumPartitions: Int = partitions.length
scala> val rdd = sc.parallelize(1 to 9,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd.getNumPartitions
res9: Int = 2
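getNumPartitions is convenient for verifying the effect of repartitioning transformations. A minimal sketch:
val rdd = sc.parallelize(1 to 9, 2)
println(rdd.getNumPartitions)                  // 2
println(rdd.repartition(4).getNumPartitions)   // 4, full shuffle
println(rdd.coalesce(1).getNumPartitions)      // 1, merges partitions without a shuffle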
getStorageLevel
Returns the current storage level of the RDD.
/** Get the RDD's current storage level, or StorageLevel.NONE if none is set. */
def getStorageLevel: StorageLevel = storageLevel
scala> val rdd = sc.parallelize(1 to 9,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd.getStorageLevel
res10: org.apache.spark.storage.StorageLevel = StorageLevel(1 replicas)

scala> rdd.cache
res11: rdd.type = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd.getStorageLevel
res12: org.apache.spark.storage.StorageLevel = StorageLevel(memory, deserialized, 1 replicas)
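cache is simply persist with the default MEMORY_ONLY level; other levels can be requested explicitly through persist. A minimal sketch:
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 9, 2)
rdd.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions to disk when memory is full
println(rdd.getStorageLevel)                // now reports the memory-and-disk level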
isCheckpointed
Returns whether this RDD has been checkpointed and materialized, either reliably or locally.
/**
* Return whether this RDD is checkpointed and materialized, either reliably or locally.
*/
def isCheckpointed: Boolean
scala> val rdd = sc.parallelize(1 to 9,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd.isCheckpointed
res13: Boolean = false

scala> rdd.checkpoint

scala> rdd.isCheckpointed
res15: Boolean = false

scala> rdd.count
res16: Long = 9

scala> rdd.isCheckpointed
res17: Boolean = true
isEmpty()
Returns whether the RDD is empty; if the RDD's element type is Nothing or Null, the call throws an exception.
/**
* @note due to complications in the internal implementation, this method will raise an
 * exception if called on an RDD of `Nothing` or `Null`. This may come up in practice
* because, for example, the type of `parallelize(Seq())` is `RDD[Nothing]`.
* (`parallelize(Seq())` should be avoided anyway in favor of `parallelize(Seq[T]())`.)
* @return true if and only if the RDD contains no elements at all. Note that an RDD
* may be empty even when it has at least 1 partition.
*/
def isEmpty(): Boolean
scala> val rdd = sc.parallelize(Seq())
rdd: org.apache.spark.rdd.RDD[Nothing] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> rdd.isEmpty
org.apache.spark.SparkDriverExecutionException: Execution error
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1187)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1656)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1305)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.take(RDD.scala:1279)
at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1413)
at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1413)
at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1413)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1412)
... 48 elided
Caused by: java.lang.ArrayStoreException: [Ljava.lang.Object;
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:90)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1884)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1884)
at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:59)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1656)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

scala> val rdd = sc.parallelize(Seq(1 to 9))
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Range.Inclusive] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> rdd.isEmpty
res19: Boolean = false
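As the scaladoc above recommends, the exception can be avoided by giving the empty Seq an explicit element type, or by using sc.emptyRDD. A minimal sketch:
val empty = sc.parallelize(Seq[Int]())   // RDD[Int], not RDD[Nothing]
println(empty.isEmpty)                   // true, no exception

val alsoEmpty = sc.emptyRDD[Int]
println(alsoEmpty.isEmpty)               // true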
max()
Returns the maximum element of the RDD according to the implicit Ordering[T].
/**
* Returns the max of this RDD as defined by the implicit Ordering[T].
* @return the maximum element of the RDD
* */
def max()(implicit ord: Ordering[T]): T
min()
Returns the minimum element of the RDD according to the implicit Ordering[T].
/**
* Returns the min of this RDD as defined by the implicit Ordering[T].
* @return the minimum element of the RDD
* */
def min()(implicit ord: Ordering[T]): T
scala> val rdd = sc.parallelize(1 to 9)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24

scala> rdd.max
res21: Int = 9

scala> rdd.min
res22: Int = 1
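Because both methods take an implicit Ordering, a custom ordering can also be supplied explicitly, for example to compare strings by length. A minimal sketch:
val words = sc.parallelize(Seq("spark", "rdd", "action"))
val byLength = Ordering.by[String, Int](_.length)
println(words.max()(byLength))   // "action", the longest string
println(words.min()(byLength))   // "rdd", the shortest string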
reduce(f: (T, T) => T)
Aggregates all elements of the RDD using the given binary operator, which must be commutative and associative.
/**
* Reduces the elements of this RDD using the specified commutative and
* associative binary operator.
*/
def reduce(f: (T, T) => T): T
scala> val rdd = sc.parallelize(1 to 9)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> def func(x:Int, y:Int):Int={
| if(x >= y){
| x
| }else{
| y}
| }
func: (x: Int, y: Int)Int

scala> rdd.reduce(func(_,_))
res23: Int = 9

scala> rdd.reduce((x,y)=>{
| if(x>=y){
| x
| }else{
| y
| }
| }
| )
res24: Int = 9
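The func above simply re-implements max; in practice reduce is usually written with short operator functions, and the operator must be commutative and associative because partial results from different partitions are combined in arbitrary order. A minimal sketch:
val rdd = sc.parallelize(1 to 9)
println(rdd.reduce(_ + _))      // 45, the sum of 1..9
println(rdd.reduce(_ max _))    // 9, equivalent to the examples above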
saveAsObjectFile(path: String)
Saves the RDD to the given directory as a SequenceFile of serialized objects.
/**
* Save this RDD as a SequenceFile of serialized objects.
*/
def saveAsObjectFile(path: String): Unit
scala> val rdd = sc.parallelize(1 to 9)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> rdd.saveAsObjectFile("/home/check/object")

[root@localhost ~]# ls /home/check/object/
part-00000 _SUCCESS
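The saved object file can be loaded back into an RDD with SparkContext.objectFile. A minimal sketch:
val restored = sc.objectFile[Int]("/home/check/object")
println(restored.count())   // 9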
saveAsTextFile(path: String)
Saves the RDD as a text file, using the string representation of each element.
/**
* Save this RDD as a text file, using string representations of elements.
*/
def saveAsTextFile(path: String): Unit
scala> val rdd = sc.parallelize(1 to 9)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> rdd.saveAsTextFile("/home/check/text")
[root@localhost ~]# ls /home/check/text/part-00000
/home/check/text/part-00000
[root@localhost ~]# more /home/check/text/part-00000
1
2
3
4
5
6
7
8
9
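The text output can be read back with SparkContext.textFile; the elements come back as strings and must be parsed again. A minimal sketch:
val lines = sc.textFile("/home/check/text")
val numbers = lines.map(_.toInt)
println(numbers.sum())   // 45.0 (sum on a numeric RDD returns a Double)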
take(num: Int)
Returns the first num elements of the RDD.
/**
 * Take the first num elements of the RDD. It works by first scanning one partition, and using the
* results from that partition to estimate the number of additional partitions needed to satisfy
* the limit.
*
* @note this method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*
* @note due to complications in the internal implementation, this method will raise
* an exception if called on an RDD of `Nothing` or `Null`.
*/
def take(num: Int): Array[T]
scala> val rdd = sc.parallelize(1 to 9)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at parallelize at <console>:24

scala> rdd.take(5)
res28: Array[Int] = Array(1, 2, 3, 4, 5)
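Like collect, take loads its result into driver memory, so num should stay small; first() is essentially take(1) returning the single element. A minimal sketch:
val rdd = sc.parallelize(1 to 9)
println(rdd.first())                 // 1, equivalent to rdd.take(1).head
println(rdd.take(3).mkString(","))   // 1,2,3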
takeOrdered(num: Int)
Returns the first num elements of the RDD in ascending order.
scala> val rdd = sc.parallelize(List(2,6,3,1,5,9))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:24

scala> rdd.takeOrdered(3)
res30: Array[Int] = Array(1, 2, 3)
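takeOrdered also accepts an implicit Ordering, so passing a reversed ordering gives the same result as top. A minimal sketch:
val rdd = sc.parallelize(List(2, 6, 3, 1, 5, 9))
println(rdd.takeOrdered(3).mkString(","))                        // 1,2,3 (ascending)
println(rdd.takeOrdered(3)(Ordering[Int].reverse).mkString(",")) // 9,6,5 (same as top(3))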
takeSample(withReplacement: Boolean, num: Int, seed: Long)
Returns a fixed-size sampled subset of the RDD, either with or without replacement; the optional seed makes the sample reproducible.
def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T]
scala> val rdd = sc.parallelize(List(2,6,3,1,5,9))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[18] at parallelize at <console>:24

scala> rdd.takeSample(true,6,8)
res34: Array[Int] = Array(5, 2, 2, 5, 3, 2)

scala> rdd.takeSample(false,6,8)
res35: Array[Int] = Array(9, 3, 2, 6, 1, 5)
top(num: Int)
Returns the top num elements of the RDD in descending order.
/*
* @param num k, the number of top elements to return
* @param ord the implicit ordering for T
* @return an array of top elements
*/
def top(num: Int)(implicit ord: Ordering[T]): Array[T]
scala> val rdd = sc.parallelize(List(2,6,3,1,5,9))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[18] at parallelize at <console>:24
scala> rdd.top(3)
res37: Array[Int] = Array(9, 6, 5)