collectAsMap(): Map[K, V]

Returns the RDD's key-value pairs as a Map. Keys in the result are unique: if several elements share a key, only one of their values is kept.
/**
* Return the key-value pairs in this RDD to the master as a Map.
*
* Warning: this doesn't return a multimap (so if you have multiple values to the same key, only
* one value per key is preserved in the map returned)
*
* @note this method should only be used if the resulting data is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collectAsMap(): Map[K, V]
scala> val rdd = sc.parallelize(List(("A",1),("A",2),("A",3),("B",1),("B",2),("C",3)),3)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd.collectAsMap
res0: scala.collection.Map[String,Int] = Map(A -> 3, C -> 3, B -> 2)
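
The "only one value per key" warning can be reproduced without a cluster. The sketch below is a local model on plain Scala collections, not Spark code; `CollectAsMapModel`, `asMap` and `asMultiMap` are names made up for this illustration. It assumes pairs are inserted into the map one at a time, so a later value overwrites an earlier one.

```scala
// Local model of collectAsMap on plain Scala collections (no Spark needed).
object CollectAsMapModel {
  val pairs: List[(String, Int)] =
    List(("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2), ("C", 3))

  // Insert pairs one by one: a later value overwrites an earlier one for the same key.
  def asMap(kvs: Seq[(String, Int)]): Map[String, Int] =
    kvs.foldLeft(Map.empty[String, Int]) { case (m, (k, v)) => m + (k -> v) }

  // Group by key first, so every value of a key is kept.
  def asMultiMap(kvs: Seq[(String, Int)]): Map[String, List[Int]] =
    kvs.groupBy(_._1).map { case (k, group) => k -> group.map(_._2).toList }

  def main(args: Array[String]): Unit = {
    println(asMap(pairs))      // one value per key
    println(asMultiMap(pairs)) // all values per key
  }
}
```

In Spark itself, grouping before collecting (for example `rdd.groupByKey().collectAsMap()`) is one way to keep every value, at the cost of bringing more data to the driver.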

countByKey(): Map[K, Long]

Counts how many elements the RDD has for each key.
/**
* Count the number of elements for each key, collecting the results to a local Map.
*
* Note that this method should only be used if the resulting map is expected to be small, as
* the whole thing is loaded into the driver's memory.
* To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
* returns an RDD[T, Long] instead of a map.
*/
def countByKey(): Map[K, Long] = self.withScope {
self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}
scala> val rdd = sc.parallelize(List((1,1),(1,2),(1,3),(2,1),(2,2),(2,3)),3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[5] at parallelize at <console>:24
scala> rdd.countByKey
res5: scala.collection.Map[Int,Long] = Map(1 -> 3, 2 -> 3)

countByValue()

Counts how many times each distinct value occurs. The function first uses map to turn every element into a (value, null) key-value pair, then calls countByKey to do the counting.
/**
* Return the count of each unique value in this RDD as a local map of (value, count) pairs.
*
* Note that this method should only be used if the resulting map is expected to be small, as
* the whole thing is loaded into the driver's memory.
* To handle very large results, consider using rdd.map(x => (x, 1L)).reduceByKey(_ + _), which
* returns an RDD[T, Long] instead of a map.
*/
def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = withScope {
map(value => (value, null)).countByKey()
}
scala> val rdd = sc.parallelize(List(1,2,3,4,5,4,4,3,2,1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[18] at parallelize at <console>:24
scala> rdd.countByValue
res12: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 2, 2 -> 2, 3 -> 2, 4 -> 3)
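
The two-step implementation shown above (pair each element with null, then count by key) can be traced on ordinary collections. This is a rough local model, not Spark's code; `CountByValueModel` and its two helper methods are hypothetical names that simply mirror the RDD methods.

```scala
// Local model of countByValue built on a local countByKey (no Spark needed).
object CountByValueModel {
  // Count elements per key; the values are ignored, as in countByKey.
  def countByKey[K, V](kvs: Seq[(K, V)]): Map[K, Long] =
    kvs.foldLeft(Map.empty[K, Long]) { case (acc, (k, _)) =>
      acc + (k -> (acc.getOrElse(k, 0L) + 1L))
    }

  // Turn every element into a (value, null) pair, then count by key.
  def countByValue[T](xs: Seq[T]): Map[T, Long] =
    countByKey(xs.map(x => (x, null)))

  def main(args: Array[String]): Unit =
    println(countByValue(List(1, 2, 3, 4, 5, 4, 4, 3, 2, 1)))
}
```

The value 4 appears three times in the input, 5 once, and the rest twice, which matches the counts in the shell output above.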

lookup(key: K)

Returns all values associated with the given key.
/**
* Return the list of values in the RDD for key `key`. This operation is done efficiently if the
* RDD has a known partitioner by only searching the partition that the key maps to.
*/
def lookup(key: K): Seq[V]
scala> val rdd = sc.parallelize(List(("A",1),("A",2),("A",3),("B",1),("B",2),("C",3)),3)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd.lookup("A")
res2: Seq[Int] = WrappedArray(1, 2, 3)
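
The Scaladoc notes that lookup is efficient when the RDD has a known partitioner, because only the partition the key maps to is searched. The sketch below models that idea locally, assuming a hash-style partitioner (non-negative `hashCode` modulo the partition count). `LookupModel`, `partitionOf` and `partitionByKey` are illustrative names, not Spark APIs.

```scala
// Local model of lookup on hash-partitioned data (no Spark needed).
object LookupModel {
  // Map a key to a partition index: non-negative hashCode modulo partition count.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Place each pair in the partition its key maps to.
  def partitionByKey[K, V](kvs: Seq[(K, V)], numPartitions: Int): Vector[Vector[(K, V)]] = {
    val grouped = kvs.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty[(K, V)]).toVector)
  }

  // Because the partitioner is known, only one partition is scanned.
  def lookup[K, V](parts: Vector[Vector[(K, V)]], key: K): Seq[V] =
    parts(partitionOf(key, parts.length)).collect { case (k, v) if k == key => v }

  def main(args: Array[String]): Unit = {
    val pairs = List(("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2), ("C", 3))
    val parts = partitionByKey(pairs, 3)
    println(lookup(parts, "A"))
  }
}
```

The RDD in the shell session above comes straight from parallelize, so it presumably has no partitioner and lookup has to scan every partition; repartitioning by key first (for example with `partitionBy`) is what would enable the single-partition search.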

checkpoint()

Saves the RDD's data to disk under the configured checkpoint directory.

/**
* Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
* directory set with `SparkContext#setCheckpointDir` and all references to its parent
* RDDs will be removed. This function must be called before any job has been
* executed on this RDD. It is strongly recommended that this RDD is persisted in
* memory, otherwise saving it on a file will require recomputation.
*/
def checkpoint(): Unit
/* After creating the /home/check directory with a Linux command, set the checkpoint directory */
scala> sc.setCheckpointDir("/home/check")
scala> val rdd = sc.parallelize(List(("A",1),("A",2),("A",3),("B",1),("B",2),("C",3)),3)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[6] at parallelize at <console>:24
/*
* After the call below, /home/check contains an empty directory
* /home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49; checkpoint only marks the RDD, so no data is written yet
*/
scala> rdd.checkpoint
/*
* After count runs, an rdd directory is created under the directory above; the data files are inside it
*/
scala> rdd.count
res5: Long = 6
[root@localhost ~]# ll -a /home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49/
total
drwxr-xr-x. root root Sep : .
drwxr-xr-x. root root Sep : ..
[root@localhost ~]# ll -a /home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49/
total
drwxr-xr-x. root root Sep : .
drwxr-xr-x. root root Sep : ..
drwxr-xr-x. root root Sep : rdd-
[root@localhost ~]# ll -a /home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49/rdd-/
total
drwxr-xr-x. root root Sep : .
drwxr-xr-x. root root Sep : ..
-rw-r--r--. root root Sep : part-
-rw-r--r--. root root Sep : .part-.crc
-rw-r--r--. root root Sep : part-
-rw-r--r--. root root Sep : .part-.crc
-rw-r--r--. root root Sep : part-
-rw-r--r--. root root Sep : .part-.crc

collect()

Returns an array containing all elements of the RDD.
/**
* Return an array that contains all of the elements in this RDD.
*
* @note this method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collect(): Array[T]
scala> val rdd = sc.parallelize(1 to 10,3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24
scala> rdd.collect
res8: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

toLocalIterator: Iterator[T]

Returns an iterator over all elements of the RDD.
/**
* Return an iterator that contains all of the elements in this RDD.
*
* The iterator will consume as much memory as the largest partition in this RDD.
*
* Note: this results in multiple Spark jobs, and if the input RDD is the result
* of a wide transformation (e.g. join with different partitioners), to avoid
* recomputing the input RDD should be cached first.
*/
def toLocalIterator: Iterator[T]
scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val it = rdd.toLocalIterator
it: Iterator[Int] = non-empty iterator
scala> while(it.hasNext){
| println(it.next)
| }
1
2
3
4
5
6
7
8
9
10

count()

Returns the number of elements in the RDD.
/**
* Return the number of elements in the RDD.
*/
def count(): Long
scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd.count
res1: Long = 10

dependencies

Returns this RDD's dependencies on its parent RDDs as a sequence of Dependency objects.
/**
* Get the list of dependencies of this RDD, taking into account whether the
* RDD is checkpointed or not.
*/
final def dependencies: Seq[Dependency[_]]
scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val rdd1 = rdd.filter(_>3)
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at filter at <console>:26
scala> val rdd2 = rdd1.filter(_<6)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at <console>:28
scala> rdd2.dependencies
res2: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.OneToOneDependency@21c882b5)

partitions

Returns the RDD's partitions as an array of Partition objects.
/**
* Get the array of partitions of this RDD, taking into account whether the
* RDD is checkpointed or not.
*/
final def partitions: Array[Partition]
scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd.partitions
res4: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@70c, org.apache.spark.rdd.ParallelCollectionPartition@70d)

first()

Returns the first element of the RDD.
/**
* Return the first element in this RDD.
*/
def first(): T
scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd.first
res5: Int = 1

fold(zeroValue: T)(op: (T, T) => T)

Aggregates the elements of each partition starting from zeroValue, then combines the per-partition results, again starting from zeroValue.
/**
* @param zeroValue the initial value for the accumulated result of each partition for the `op`
* operator, and also the initial value for the combine results from different
* partitions for the `op` operator - this will typically be the neutral
* element (e.g. `Nil` for list concatenation or `0` for summation)
* @param op an operator used to both accumulate results within a partition and combine results
* from different partitions
*/
def fold(zeroValue: T)(op: (T, T) => T): T
scala> val rdd = sc.parallelize(1 to 5)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24
scala> rdd.fold(10)(_+_)
res13: Int = 35
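
Because zeroValue is applied once per partition and once more when the partition results are merged, a non-neutral zeroValue makes the result depend on the number of partitions. The sketch below models this with nested Scala sequences standing in for partitions; `FoldModel` is a hypothetical name and the code is a simplified model, not Spark's implementation. The result 35 above is consistent with a single partition (10 + 15 = 25 inside the partition, then 10 + 25 = 35), though the session does not state the partition count.

```scala
// Local model of RDD.fold: each inner Seq stands in for one partition (no Spark needed).
object FoldModel {
  // Fold every partition starting from zeroValue, then fold the
  // per-partition results starting from zeroValue once more.
  def fold[T](partitions: Seq[Seq[T]], zeroValue: T)(op: (T, T) => T): T = {
    val perPartition = partitions.map(_.foldLeft(zeroValue)(op))
    perPartition.foldLeft(zeroValue)(op)
  }

  def main(args: Array[String]): Unit = {
    // One partition: zeroValue is added twice.
    println(fold(Seq(Seq(1, 2, 3, 4, 5)), 10)(_ + _))
    // Two partitions: zeroValue is added three times.
    println(fold(Seq(Seq(1, 2), Seq(3, 4, 5)), 10)(_ + _))
    // Neutral element: the partitioning no longer matters.
    println(fold(Seq(Seq(1, 2), Seq(3, 4, 5)), 0)(_ + _))
  }
}
```

This is why the Scaladoc recommends a neutral element such as 0 for summation: with it, the same data gives the same answer however it is partitioned.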
