Spark RDD Action: Simple Examples (Part 2)
foreach(f: T => Unit)
Applies the function f to every element of the RDD; f returns no value.
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: T => Unit): Unit
scala> val rdd = sc.parallelize(1 to 9, 2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd.foreach(x=>{println(x)})
[Stage 0:> (0 + 0) / 2]
1
2
3
4
5
6
7
8
9
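The transcript above prints in order because it runs in a local shell. On a cluster, println inside foreach writes to each executor's stdout, and the order across partitions is not guaranteed. A minimal sketch of getting a value back to the driver with an accumulator follows; the `local[2]` master and the names `sum` and `total` are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local two-thread master. In spark-shell, getOrCreate returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("foreach-sketch"))

val rdd = sc.parallelize(1 to 9, 2)

// foreach runs on the executors, so an accumulator carries the result back to the driver.
val sum = sc.longAccumulator("sum")
rdd.foreach(x => sum.add(x))

val total: Long = sum.value // 1 + 2 + ... + 9
println(total)
```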
foreachPartition(f: Iterator[T] => Unit)
Applies the function f once to each partition, passing it an iterator over that partition's elements.
/**
* Applies a function f to each partition of this RDD.
*/
def foreachPartition(f: Iterator[T] => Unit): Unit
scala> val rdd = sc.parallelize(1 to 9, 2)
scala> rdd.foreachPartition(x=>{
| while(x.hasNext){
| println(x.next)
| }
| println("===========")
| }
| )
1
2
3
4
===========
5
6
7
8
9
===========
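Because f is called once per partition, foreachPartition is commonly used when some setup (for example opening a connection) should happen per partition rather than per element. The sketch below only counts calls and elements with accumulators; the names `calls` and `elements` and the `local[2]` master are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local two-thread master. In spark-shell, getOrCreate returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("foreachPartition-sketch"))

val rdd = sc.parallelize(1 to 9, 2)

val calls = sc.longAccumulator("calls")       // how many times f was invoked
val elements = sc.longAccumulator("elements") // how many elements f saw in total

rdd.foreachPartition { iter =>
  // Per-partition setup would go here; it runs once for each partition.
  calls.add(1)
  iter.foreach(_ => elements.add(1))
}

val partitionCalls: Long = calls.value
val elementCount: Long = elements.value
println(s"f ran $partitionCalls times over $elementCount elements")
```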
getCheckpointFile
Returns the directory to which the RDD was checkpointed.
/**
* Gets the name of the directory to which this RDD was checkpointed.
* This is not defined if the RDD is checkpointed locally.
*/
def getCheckpointFile: Option[String]
scala> val rdd = sc.parallelize(1 to 9,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> rdd.checkpoint
/*
Querying right after checkpoint returns None, which shows that checkpoint is lazy.
*/
scala> rdd.getCheckpointFile
res6: Option[String] = None
scala> rdd.count
res7: Long = 9
scala> rdd.getCheckpointFile
res8: Option[String] = Some(file:/home/check/ca771099-b1bf-46c8-9404-68b4ace7feeb/rdd-1)
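The transcript does not show it, but a reliable checkpoint needs a checkpoint directory to be set on the SparkContext first. A minimal sketch of the whole sequence, assuming a throwaway local temp directory (on a cluster this would normally be an HDFS path) and a `local[2]` master:

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local two-thread master. In spark-shell, getOrCreate returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("checkpoint-sketch"))

// Assumption: a temporary local directory stands in for the real checkpoint location.
sc.setCheckpointDir(Files.createTempDirectory("spark-ckpt").toString)

val rdd = sc.parallelize(1 to 9, 2)
rdd.checkpoint()                   // only marks the RDD; nothing is written yet
val before = rdd.getCheckpointFile // still None
rdd.count()                        // the first action triggers the checkpoint
val after = rdd.getCheckpointFile  // now Some(...) under the checkpoint directory
println(s"before=$before after=$after")
```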
getNumPartitions
Returns the number of partitions.
/**
* Returns the number of partitions of this RDD.
*/
@Since("1.6.0")
final def getNumPartitions: Int = partitions.length
scala> val rdd = sc.parallelize(1 to 9,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd.getNumPartitions
res9: Int = 2
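getNumPartitions is a convenient way to check the effect of transformations that change partitioning. A short sketch, assuming a `local[2]` master; `wider` and `narrower` are illustrative names:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local two-thread master. In spark-shell, getOrCreate returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("numPartitions-sketch"))

val rdd = sc.parallelize(1 to 9, 2)
val wider = rdd.repartition(4)  // shuffles into 4 partitions
val narrower = rdd.coalesce(1)  // merges down to 1 partition without a shuffle

println((rdd.getNumPartitions, wider.getNumPartitions, narrower.getNumPartitions))
```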
getStorageLevel
Returns the RDD's current storage level.
/** Get the RDD's current storage level, or StorageLevel.NONE if none is set. */
def getStorageLevel: StorageLevel = storageLevel
scala> val rdd = sc.parallelize(1 to 9,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd.getStorageLevel
res10: org.apache.spark.storage.StorageLevel = StorageLevel(1 replicas)
scala> rdd.cache
res11: rdd.type = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd.getStorageLevel
res12: org.apache.spark.storage.StorageLevel = StorageLevel(memory, deserialized, 1 replicas)
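The same check works for levels other than the one cache() sets. The sketch below, which assumes a `local[2]` master, compares the returned level against the StorageLevel constants before persisting, after persist, and after unpersist.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumption: a local two-thread master. In spark-shell, getOrCreate returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("storageLevel-sketch"))

val rdd = sc.parallelize(1 to 9, 2)
val initial = rdd.getStorageLevel          // StorageLevel.NONE until a level is set

rdd.persist(StorageLevel.MEMORY_AND_DISK)  // explicit level instead of cache()'s MEMORY_ONLY
val persisted = rdd.getStorageLevel

rdd.unpersist()                            // resets the level to NONE
val cleared = rdd.getStorageLevel
println(s"$initial -> $persisted -> $cleared")
```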
isCheckpointed
Returns whether the RDD has been checkpointed and materialized.
/**
* Return whether this RDD is checkpointed and materialized, either reliably or locally.
*/
def isCheckpointed: Boolean
scala> val rdd = sc.parallelize(1 to 9,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd.isCheckpointed
res13: Boolean = false
scala> rdd.checkpoint
scala> rdd.isCheckpointed
res15: Boolean = false
scala> rdd.count
res16: Long = 9
scala> rdd.isCheckpointed
res17: Boolean = true
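The Scaladoc mentions that an RDD can be checkpointed "either reliably or locally". A small sketch of the local variant, assuming a `local[2]` master: after localCheckpoint and an action, isCheckpointed is expected to be true while getCheckpointFile stays None, since no checkpoint directory is involved.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local two-thread master. In spark-shell, getOrCreate returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("localCheckpoint-sketch"))

val rdd = sc.parallelize(1 to 9, 2)
rdd.localCheckpoint()               // lazy, like checkpoint; uses executor storage, not a directory
val beforeAction = rdd.isCheckpointed
rdd.count()                         // the first action materializes the local checkpoint
val afterAction = rdd.isCheckpointed
val file = rdd.getCheckpointFile    // not defined for local checkpoints
println(s"$beforeAction -> $afterAction, file=$file")
```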
isEmpty()
Returns whether the RDD is empty. Throws an exception if called on an RDD of Nothing or Null.
/**
* @note due to complications in the internal implementation, this method will raise an
* exception if called on an RDD of `Nothing` or `Null`. This may come up in practice
* because, for example, the type of `parallelize(Seq())` is `RDD[Nothing]`.
* (`parallelize(Seq())` should be avoided anyway in favor of `parallelize(Seq[T]())`.)
* @return true if and only if the RDD contains no elements at all. Note that an RDD
* may be empty even when it has at least 1 partition.
*/
def isEmpty(): Boolean
scala> val rdd = sc.parallelize(Seq())
rdd: org.apache.spark.rdd.RDD[Nothing] = ParallelCollectionRDD[5] at parallelize at <console>:24
scala> rdd.isEmpty
org.apache.spark.SparkDriverExecutionException: Execution error
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1187)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1656)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1305)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.take(RDD.scala:1279)
at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1413)
at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1413)
at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1413)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1412)
... 48 elided
Caused by: java.lang.ArrayStoreException: [Ljava.lang.Object;
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:90)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1884)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1884)
at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:59)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1656)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
scala> val rdd = sc.parallelize(Seq(1 to 9))
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Range.Inclusive] = ParallelCollectionRDD[6] at parallelize at <console>:24
scala> rdd.isEmpty
res19: Boolean = false
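As the Scaladoc note suggests, giving the empty sequence an element type avoids the exception. A minimal sketch, assuming a `local[2]` master; the value names are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local two-thread master. In spark-shell, getOrCreate returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("isEmpty-sketch"))

val typedEmpty = sc.parallelize(Seq[Int]())           // RDD[Int], not RDD[Nothing]
val filteredOut = sc.parallelize(1 to 9).filter(_ > 100) // has partitions but no elements
val emptyRdd = sc.emptyRDD[Int]                       // no partitions at all
val nonEmpty = sc.parallelize(1 to 9)

println(Seq(typedEmpty, filteredOut, emptyRdd, nonEmpty).map(_.isEmpty()))
```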
max()
/**
* Returns the max of this RDD as defined by the implicit Ordering[T].
* @return the maximum element of the RDD
* */
def max()(implicit ord: Ordering[T]): T
min()
/**
* Returns the min of this RDD as defined by the implicit Ordering[T].
* @return the minimum element of the RDD
* */
def min()(implicit ord: Ordering[T]): T
scala> val rdd = sc.parallelize(1 to 9)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> rdd.max
res21: Int = 9
scala> rdd.min
res22: Int = 1
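Since both methods take an implicit Ordering[T], a different ordering can be passed explicitly. The sketch below compares strings by length; the sample words, the `byLength` name, and the `local[2]` master are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local two-thread master. In spark-shell, getOrCreate returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("maxmin-sketch"))

val words = sc.parallelize(Seq("spark", "rdd", "action"))

// Explicit ordering: compare strings by their length instead of lexicographically.
val byLength = Ordering.by[String, Int](_.length)

val longest = words.max()(byLength)  // "action" (6 characters)
val shortest = words.min()(byLength) // "rdd" (3 characters)
val lexMax = words.max()             // default String ordering: "spark"
println((longest, shortest, lexMax))
```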
reduce(f: (T, T) => T)
Aggregates all elements of the RDD using a binary operator.
/**
* Reduces the elements of this RDD using the specified commutative and
* associative binary operator.
*/
def reduce(f: (T, T) => T): T
scala> val rdd = sc.parallelize(1 to 9)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> def func(x:Int, y:Int):Int={
| if(x >= y){
| x
| }else{
| y}
| }
func: (x: Int, y: Int)Int
scala> rdd.reduce(func(_,_))
res23: Int = 9
scala> rdd.reduce((x,y)=>{
| if(x>=y){
| x
| }else{
| y
| }
| }
| )
res24: Int = 9
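The Scaladoc requires the operator to be commutative and associative because each partition is reduced separately and the partial results are then merged in no fixed order. A sketch that contrasts safe operators with subtraction follows; the `local[2]` master and the value names are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local two-thread master. In spark-shell, getOrCreate returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("reduce-sketch"))

val rdd = sc.parallelize(1 to 9)

// Addition and max are commutative and associative, so partitioning does not matter.
val sum = rdd.reduce(_ + _)
val largest = rdd.reduce((a, b) => math.max(a, b))

// Subtraction is neither. With a single partition it is a plain left fold: 1 - 2 - ... - 9.
val onePartition = sc.parallelize(1 to 9, 1).reduce(_ - _)
// With several partitions the result depends on how elements are split and merged,
// so it is computed here only for comparison and should not be relied on.
val threePartitions = sc.parallelize(1 to 9, 3).reduce(_ - _)
println((sum, largest, onePartition, threePartitions))
```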
saveAsObjectFile(path: String)
Saves the RDD as files under the specified directory.
/**
* Save this RDD as a SequenceFile of serialized objects.
*/
def saveAsObjectFile(path: String): Unit
scala> val rdd = sc.parallelize(1 to 9)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> rdd.saveAsObjectFile("/home/check/object")
[root@localhost ~]# ls /home/check/object/
part-00000 _SUCCESS
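Files written this way can be read back with SparkContext.objectFile. A round-trip sketch, assuming a `local[2]` master with the local filesystem as the default and a fresh temp path (the output directory must not already exist):

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local two-thread master. In spark-shell, getOrCreate returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("objectFile-sketch"))

// Assumption: a not-yet-existing directory inside a temp folder stands in for /home/check/object.
val out = Files.createTempDirectory("spark-obj").resolve("object").toString

sc.parallelize(1 to 9).saveAsObjectFile(out)

// The element type must be supplied when reading the serialized objects back.
val restored = sc.objectFile[Int](out).collect().sorted
println(restored.mkString(","))
```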
saveAsTextFile(path: String)
Saves the RDD as text files.
/**
* Save this RDD as a text file, using string representations of elements.
*/
def saveAsTextFile(path: String): Unit
scala> val rdd = sc.parallelize(1 to 9)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24
scala> rdd.saveAsTextFile("/home/check/text")
[root@localhost ~]# ls /home/check/text/part-00000
/home/check/text/part-00000
[root@localhost ~]# more /home/check/text/part-00000
1
2
3
4
5
6
7
8
9
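saveAsTextFile writes one part file per partition, and the directory can be read back with textFile. A sketch under the same assumptions as before (a `local[2]` master, the local filesystem as the default, and a fresh temp path):

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local two-thread master. In spark-shell, getOrCreate returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("textFile-sketch"))

// Assumption: a not-yet-existing directory inside a temp folder stands in for /home/check/text.
val out = Files.createTempDirectory("spark-text").resolve("text").toString

sc.parallelize(1 to 9, 2).saveAsTextFile(out)

// Two partitions produce part-00000 and part-00001 (plus _SUCCESS and hidden checksum files).
val partFiles = new File(out).listFiles().count(_.getName.startsWith("part-"))

// Elements were written via toString, so they come back as strings and need parsing.
val numbers = sc.textFile(out).map(_.toInt).collect().sorted
println(s"$partFiles part files: ${numbers.mkString(",")}")
```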
take(num: Int)
Returns the first num elements.
/**
* Take the first num elements of the RDD. It works by first scanning one partition, and use the
* results from that partition to estimate the number of additional partitions needed to satisfy
* the limit.
*
* @note this method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*
* @note due to complications in the internal implementation, this method will raise
* an exception if called on an RDD of `Nothing` or `Null`.
*/
def take(num: Int): Array[T]
scala> val rdd = sc.parallelize(1 to 9)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at parallelize at <console>:24
scala> rdd.take(5)
res28: Array[Int] = Array(1, 2, 3, 4, 5)
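Two edge cases are worth keeping in mind: asking for more elements than exist returns everything, and asking for zero returns an empty array. A brief sketch, assuming a `local[2]` master:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local two-thread master. In spark-shell, getOrCreate returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("take-sketch"))

val rdd = sc.parallelize(1 to 9)

val firstFive = rdd.take(5)  // elements in partition order
val tooMany = rdd.take(20)   // only 9 elements exist, so all 9 are returned
val none = rdd.take(0)       // empty array
println((firstFive.length, tooMany.length, none.length))
```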
takeOrdered(num: Int)
Returns the first num elements after sorting in ascending order.
scala> val rdd = sc.parallelize(List(2,6,3,1,5,9))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:24
scala> rdd.takeOrdered(3)
res30: Array[Int] = Array(1, 2, 3)
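takeOrdered also accepts an explicit Ordering in a second parameter list. As a sketch (assuming a `local[2]` master), reversing the default ordering yields the largest elements first:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local two-thread master. In spark-shell, getOrCreate returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("takeOrdered-sketch"))

val rdd = sc.parallelize(List(2, 6, 3, 1, 5, 9))

val smallest = rdd.takeOrdered(3)                       // default ascending ordering
val largest = rdd.takeOrdered(3)(Ordering[Int].reverse) // reversed ordering: largest first
println(smallest.mkString(",") + " | " + largest.mkString(","))
```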
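takeSample(withReplacement: Boolean, num: Int, seed: Long)
Returns a fixed-size random sample of the RDD as an array.
def takeSample(
withReplacement: Boolean,
num: Int,
seed: Long = Utils.random.nextLong): Array[T]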
scala> val rdd = sc.parallelize(List(2,6,3,1,5,9))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[18] at parallelize at <console>:24
scala> rdd.takeSample(true,6,8)
res34: Array[Int] = Array(5, 2, 2, 5, 3, 2)
scala> rdd.takeSample(false,6,8)
res35: Array[Int] = Array(9, 3, 2, 6, 1, 5)
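The two transcripts differ in whether an element may be drawn more than once. The sketch below, assuming a `local[2]` master, requests more elements than the RDD holds to make the difference visible: with replacement the sample has exactly num elements, without replacement it is capped at the RDD's size.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local two-thread master. In spark-shell, getOrCreate returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("takeSample-sketch"))

val rdd = sc.parallelize(List(2, 6, 3, 1, 5, 9))

// With replacement: elements can repeat, so 10 can be drawn from 6.
val withRepl = rdd.takeSample(withReplacement = true, num = 10, seed = 8L)

// Without replacement: at most one copy of each element, so only 6 come back (shuffled).
val withoutRepl = rdd.takeSample(withReplacement = false, num = 10, seed = 8L)
println(s"${withRepl.length} vs ${withoutRepl.length}")
```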
top(num: Int)
Returns the top num elements in descending order.
/*
* @param num k, the number of top elements to return
* @param ord the implicit ordering for T
* @return an array of top elements
*/
def top(num: Int)(implicit ord: Ordering[T]): Array[T]
scala> val rdd = sc.parallelize(List(2,6,3,1,5,9))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[18] at parallelize at <console>:24
scala> rdd.top(3)
res37: Array[Int] = Array(9, 6, 5)
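Like max and takeOrdered, top takes an implicit Ordering[T], so it can rank elements by a derived key. A sketch with hypothetical (name, score) pairs ranked by score, assuming a `local[2]` master:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local two-thread master. In spark-shell, getOrCreate returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("top-sketch"))

// Hypothetical sample data: (name, score) pairs.
val scores = sc.parallelize(Seq(("a", 3), ("b", 7), ("c", 5)))

// Rank by the second tuple field; top returns the largest elements under this ordering.
val byScore = Ordering.by[(String, Int), Int](_._2)
val best = scores.top(2)(byScore)
println(best.mkString(","))
```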