Spark RDD Action 简单用例(一)

collectAsMap(): Map[K, V]

返回key-value对，key是唯一的，如果rdd元素中同一个key对应多个value，则只会保留一个。
/**
 * Return the key-value pairs in this RDD to the master as a Map.
 *
 * Warning: this doesn't return a multimap (so if you have multiple values to the same key, only
 *          one value per key is preserved in the map returned)
 *
 * @note this method should only be used if the resulting data is expected to be small, as
 * all the data is loaded into the driver's memory.
 */
def collectAsMap(): Map[K, V]

scala> val rdd = sc.parallelize(List(("A",1),("A",2),("A",3),("B",1),("B",2),("C",3)),3)

rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.collectAsMap

res0: scala.collection.Map[String,Int] = Map(A -> 3, C -> 3, B -> 2)

countByKey(): Map[K, Long]

计算有多少个不同的key.
/**
 * Count the number of elements for each key, collecting the results to a local Map.
 *
 * Note that this method should only be used if the resulting map is expected to be small, as
 * the whole thing is loaded into the driver's memory.
 * To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
 * returns an RDD[T, Long] instead of a map.
 */
def countByKey(): Map[K, Long] = self.withScope {
  self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}

scala> val rdd = sc.parallelize(List((1,1),(1,2),(1,3),(2,1),(2,2),(2,3)),3)

rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[5] at parallelize at <console>:24

scala> rdd.countByKey

res5: scala.collection.Map[Int,Long] = Map(1 -> 3, 2 -> 3)

countByValue()

计算不同的value个数，该函数首先通过map将每个元素转成(value,null)的key-value（value为null）对，
然后调用countByKey进行统计。

/**
 * Return the count of each unique value in this RDD as a local map of (value, count) pairs.
 *
 * Note that this method should only be used if the resulting map is expected to be small, as
 * the whole thing is loaded into the driver's memory.
 * To handle very large results, consider using rdd.map(x =&gt; (x, 1L)).reduceByKey(_ + _), which
 * returns an RDD[T, Long] instead of a map.
 */
def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = withScope {
  map(value => (value, null)).countByKey()
}

scala> val rdd = sc.parallelize(List(1,2,3,4,5,4,4,3,2,1))

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[18] at parallelize at <console>:24

scala> rdd.countByValue

res12: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 2, 2 -> 2, 3 -> 2, 4 -> 3)

lookup(key: K)

根据key值搜索所有的value.
/**
 * Return the list of values in the RDD for key `key`. This operation is done efficiently if the
 * RDD has a known partitioner by only searching the partition that the key maps to.
 */
def lookup(key: K): Seq[V]

scala> val rdd = sc.parallelize(List(("A",1),("A",2),("A",3),("B",1),("B",2),("C",3)),3)

rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd.lookup("A")

res2: Seq[Int] = WrappedArray(1, 2, 3)

checkpoint()

将RDD数据根据设置的checkpoint目录保存至硬盘中。

/**
 * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
 * directory set with `SparkContext#setCheckpointDir` and all references to its parent
 * RDDs will be removed. This function must be called before any job has been
 * executed on this RDD. It is strongly recommended that this RDD is persisted in
 * memory, otherwise saving it on a file will require recomputation.
 */
def checkpoint(): Unit

/*通过linux命令创建/home/check目录后，设置checkpoint directory*/

scala> sc.setCheckpointDir("/home/check")

scala> val rdd = sc.parallelize(List(("A",1),("A",2),("A",3),("B",1),("B",2),("C",3)),3)

rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[6] at parallelize at <console>:24

/*

*执行下面的代码会在/home/check目录下创建一个空的目录/home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49

*/

scala> rdd.checkpoint

/*

执行count后会在上述目录下创建一个rdd目录，rdd目录下是数据文件

*/

scala> rdd.count

res5: Long = 6

[root@localhost ~]# ll -a /home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49/

total

drwxr-xr-x.  root root  Sep   : .

drwxr-xr-x.  root root  Sep   : ..

[root@localhost ~]# ll -a /home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49/

total

drwxr-xr-x.  root root  Sep   : .

drwxr-xr-x.  root root  Sep   : ..

drwxr-xr-x.  root root  Sep   : rdd-

[root@localhost ~]# ll -a /home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49/rdd-/

total

drwxr-xr-x.  root root  Sep   : .

drwxr-xr-x.  root root  Sep   : ..

-rw-r--r--.  root root   Sep   : part-

-rw-r--r--.  root root    Sep   : .part-.crc

-rw-r--r--.  root root   Sep   : part-

-rw-r--r--.  root root    Sep   : .part-.crc

-rw-r--r--.  root root   Sep   : part-

-rw-r--r--.  root root    Sep   : .part-.crc

collect()

返回RDD所有元素的数组。
/**
 * Return an array that contains all of the elements in this RDD.
 *
 * @note this method should only be used if the resulting array is expected to be small, as
 * all the data is loaded into the driver's memory.
 */
def collect(): Array[T]

scala> val rdd = sc.parallelize(1 to 10,3)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24

scala> rdd.collect

res8: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

toLocalIterator: Iterator[T]

返回一个包含所有算的迭代器。
/**
 * Return an iterator that contains all of the elements in this RDD.
 *
 * The iterator will consume as much memory as the largest partition in this RDD.
 *
 * Note: this results in multiple Spark jobs, and if the input RDD is the result
 * of a wide transformation (e.g. join with different partitioners), to avoid
 * recomputing the input RDD should be cached first.
 */
def toLocalIterator: Iterator[T]

scala> val rdd = sc.parallelize(1 to 10,2)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val it = rdd.toLocalIterator

it: Iterator[Int] = non-empty iterator

scala> while(it.hasNext){

     | println(it.next)

     | }

1

2

3

4

5

6

7

8

9

10

count()

返回RDD中元素的数量。
/**
 * Return the number of elements in the RDD.
 */
def count(): Long

scala> val rdd = sc.parallelize(1 to 10,2)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.count

res1: Long = 10

dependencies

返回该RDD的依赖RDD的地址。
/**
 * Get the list of dependencies of this RDD, taking into account whether the
 * RDD is checkpointed or not.
 */
final def dependencies: Seq[Dependency[_]]

scala> val rdd = sc.parallelize(1 to 10,2)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val rdd1 = rdd.filter(_>3)

rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at filter at <console>:26

scala> val rdd2 = rdd1.filter(_<6)

rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at <console>:28

scala> rdd2.dependencies

res2: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.OneToOneDependency@21c882b5)

partitions

以数组形式返回RDD各分区地址
/**
 * Get the array of partitions of this RDD, taking into account whether the
 * RDD is checkpointed or not.
 */
final def partitions: Array[Partition]

scala> val rdd = sc.parallelize(1 to 10,2)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd.partitions

res4: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@70c, org.apache.spark.rdd.ParallelCollectionPartition@70d)

first()

返回RDD的第一个元素。
/**
 * Return the first element in this RDD.
 */
def first(): T

scala> val rdd = sc.parallelize(1 to 10,2)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> rdd.first

res5: Int = 1

fold(zeroValue: T)(op: (T, T) => T)

使用zeroValue和每个分区的元素进行聚合运算，最后各分区结果和zeroValue再进行一次聚合运算。
/**
 * @param zeroValue the initial value for the accumulated result of each partition for the `op`
 *                  operator, and also the initial value for the combine results from different
 *                  partitions for the `op` operator - this will typically be the neutral
 *                  element (e.g. `Nil` for list concatenation or `0` for summation)
 * @param op an operator used to both accumulate results within a partition and combine results
 *                  from different partitions
 */
def fold(zeroValue: T)(op: (T, T) => T): T

scala> val rdd = sc.parallelize(1 to 5)

rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24

scala> rdd.fold(10)(_+_)

res13: Int = 35

Spark RDD Action 简单用例(一)的更多相关文章

Spark RDD Action 简单用例(二)
foreach(f: T => Unit) 对RDD的所有元素应用f函数进行处理,f无返回值./** * Applies a function f to all elements of this ...
Spark RDD Transformation 简单用例（三）
cache和persist 将RDD数据进行存储,persist(newLevel: StorageLevel)设置了存储级别,cache()和persist()是相同的,存储级别为MEMORY_ON ...
Spark RDD Transformation 简单用例（二）
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) aggregateByKey(zeroValue)(seqOp, combOp, [numTa ...
Spark RDD Transformation 简单用例（一）
map(func) /** * Return a new RDD by applying a function to all elements of this RDD. */ def map[U: C ...
spark RDD transformation与action函数整理
1.创建RDD val lines = sc.parallelize(List("pandas","i like pandas")) 2.加载本地文件到RDD ...
Apache Spark 2.2.0 中文文档 - Spark RDD（Resilient Distributed Datasets）论文 | ApacheCN
Spark RDD(Resilient Distributed Datasets)论文概要 1: 介绍 2: Resilient Distributed Datasets(RDDs) 2.1 RDD ...
Apache Spark RDD（Resilient Distributed Datasets）论文
Spark RDD(Resilient Distributed Datasets)论文概要 1: 介绍 2: Resilient Distributed Datasets(RDDs) 2.1 RDD ...
Spark RDD深度解析-RDD计算流程
Spark RDD深度解析-RDD计算流程摘要 RDD(Resilient Distributed Datasets)是Spark的核心数据结构,所有数据计算操作均基于该结构进行,包括Spark ...
spark RDD 常见操作
fold 操作区别与 co 1.mapValus 2.flatMapValues 3.comineByKey 4.foldByKey 5.reduceByKey 6.groupByKey 7.so ...

随机推荐

Unity Alpha融合参数（便查）
Alpha Blending,中文译作Alpha混合 Blending就是控制透明的.处于光栅化的最后阶段. 这里例如我们给一个模型贴一个材质,那么在某个点计算出来颜色值称为源,而该点之前累积的颜色值 ...
[Docker] Benefits of Multi-stage Builds
The benfits of multi-stage builds: Avoid manual creation of intermediate images Reduce complexity Se ...
pid 控制算法
http://blog.csdn.net/huangkangying/article/details/78129148 https://zh.wikipedia.org/wiki/PID%E6%8E% ...
vim Google style format
https://github.com/google/vim-codefmt https://github.com/rhysd/vim-clang-format http://pre.tir.tw/00 ...
ROS nodelet的使用
ROS是一种基于分布式网络通讯的操作系统,整个机器人控制系统是由一个Master主节点和若干个功能相对独立的Node子节点组成,这也是ROS系统最主要的特点就是分布式以及模块化的设计.在ROS通讯过程 ...
gcc下inline的一个问题
今天发现一个问题,与inline有关,也与编译时候是不是优化有关. 大概问题可以用下面的代码来描述: 先写一个libtest1,代码如下 libtest1.h #ifndef LIBTEST_H #d ...
安装Logtail（Linux系统）
Logtail客户端是日志服务提供的日志采集客户端,请参考本文档,在Linux服务器上安装Logtail客户端. 支持的系统支持如下版本的Linux x86-64(64位)服务器: Aliyun L ...
Easyui datagrid 特殊处理，记录笔记
1. 特殊的单元格样式 columns中 { field: , styler: function (value, row, index) { ') { return 'background-color ...
Spark 核心概念RDD
文章正文 RDD全称叫做弹性分布式数据集(Resilient Distributed Datasets),它是一种分布式的内存抽象,表示一个只读的记录分区的集合,它只能通过其他RDD转换而创建,为此, ...
为OLED屏添加GUI支持6：进度条控件
为OLED屏添加GUI支持6:进度条控件本文博客链接:http://blog.csdn.net/jdh99,作者:jdh,转载请注明. 环境: 主机:WIN10 开发环境:MDK5.13 MCU:S ...

Spark RDD Action 简单用例(一)

collectAsMap(): Map[K, V]

countByKey(): Map[K, Long]

countByValue()

lookup(key: K)

checkpoint()

collect()

toLocalIterator: Iterator[T]

count()

dependencies

partitions

first()

fold(zeroValue: T)(op: (T, T) => T)

Spark RDD Action 简单用例(一)的更多相关文章

随机推荐

热门专题