collectAsMap(): Map[K, V]

Returns the RDD's key-value pairs to the driver as a Map. Keys in the result are unique: if several elements of the RDD share the same key, only one value per key is kept.
/**
* Return the key-value pairs in this RDD to the master as a Map.
*
* Warning: this doesn't return a multimap (so if you have multiple values to the same key, only
* one value per key is preserved in the map returned)
*
* @note this method should only be used if the resulting data is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collectAsMap(): Map[K, V]
scala> val rdd = sc.parallelize(List(("A",1),("A",2),("A",3),("B",1),("B",2),("C",3)),3)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd.collectAsMap
res0: scala.collection.Map[String,Int] = Map(A -> 3, C -> 3, B -> 2)
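
As the warning above notes, duplicate keys lose values. If every value per key must be kept, a minimal sketch (not from the original post, reusing the rdd above) is to group first and collect the grouped pairs:
val multi: scala.collection.Map[String, Iterable[Int]] = rdd.groupByKey().collectAsMap()
// multi("A") contains all of 1, 2 and 3 instead of a single surviving value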

countByKey(): Map[K, Long]

Counts the number of elements for each key, collecting the results into a local Map of (key, count) pairs.
/**
* Count the number of elements for each key, collecting the results to a local Map.
*
* Note that this method should only be used if the resulting map is expected to be small, as
* the whole thing is loaded into the driver's memory.
* To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which
* returns an RDD[T, Long] instead of a map.
*/
def countByKey(): Map[K, Long] = self.withScope {
self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}
scala> val rdd = sc.parallelize(List((1,1),(1,2),(1,3),(2,1),(2,2),(2,3)),3)
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[5] at parallelize at <console>:24
scala> rdd.countByKey
res5: scala.collection.Map[Int,Long] = Map(1 -> 3, 2 -> 3)
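
When there are many distinct keys, the docstring above recommends keeping the counts distributed rather than collecting them to the driver. A hedged sketch with the same rdd:
val counts = rdd.mapValues(_ => 1L).reduceByKey(_ + _)  // RDD[(Int, Long)], stays on the cluster
counts.collect()  // e.g. Array((1,3), (2,3)); only collect when the result is known to be small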

countByValue()

Counts how many times each distinct value occurs. Internally it maps every element to a (value, null) pair and then calls countByKey to do the counting.
/**
* Return the count of each unique value in this RDD as a local map of (value, count) pairs.
*
* Note that this method should only be used if the resulting map is expected to be small, as
* the whole thing is loaded into the driver's memory.
* To handle very large results, consider using rdd.map(x => (x, 1L)).reduceByKey(_ + _), which
* returns an RDD[T, Long] instead of a map.
*/
def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = withScope {
map(value => (value, null)).countByKey()
}
scala> val rdd = sc.parallelize(List(1,2,3,4,5,4,4,3,2,1))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[18] at parallelize at <console>:24
scala> rdd.countByValue
res12: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 2, 2 -> 2, 3 -> 2, 4 -> 3)
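
The implementation shown above is easy to reproduce by hand, which makes the mechanism explicit: pair every element with null, then count by key. With the same rdd:
val manual = rdd.map(v => (v, null)).countByKey()  // same Map as rdd.countByValue()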

lookup(key: K)

Returns all values in the RDD for the given key.
/**
* Return the list of values in the RDD for key `key`. This operation is done efficiently if the
* RDD has a known partitioner by only searching the partition that the key maps to.
*/
def lookup(key: K): Seq[V]
scala> val rdd = sc.parallelize(List(("A",1),("A",2),("A",3),("B",1),("B",2),("C",3)),3)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd.lookup("A")
res2: Seq[Int] = WrappedArray(1, 2, 3)
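
Per the docstring, lookup only scans a single partition when the RDD has a known partitioner. A sketch of that fast path (the HashPartitioner choice here is illustrative):
import org.apache.spark.HashPartitioner
val partitioned = rdd.partitionBy(new HashPartitioner(3)).cache()
partitioned.lookup("A")  // Seq(1, 2, 3), answered from the one partition "A" hashes to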

checkpoint()

Marks the RDD for checkpointing; its data will be saved to disk under the checkpoint directory configured on the SparkContext.

/**
* Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
* directory set with `SparkContext#setCheckpointDir` and all references to its parent
* RDDs will be removed. This function must be called before any job has been
* executed on this RDD. It is strongly recommended that this RDD is persisted in
* memory, otherwise saving it on a file will require recomputation.
*/
def checkpoint(): Unit
/* After creating the /home/check directory with a Linux command, set it as the checkpoint directory. */
scala> sc.setCheckpointDir("/home/check")

scala> val rdd = sc.parallelize(List(("A",1),("A",2),("A",3),("B",1),("B",2),("C",3)),3)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[6] at parallelize at <console>:24
/*
 * Running the next line creates an empty directory under /home/check,
 * e.g. /home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49
 */
scala> rdd.checkpoint
/*
 * Executing count then creates an rdd directory under the directory above;
 * that rdd directory holds the data files.
 */
scala> rdd.count
res5: Long = 6
[root@localhost ~]# ll -a /home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49/
.  ..
[root@localhost ~]# ll -a /home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49/
.  ..  rdd-
[root@localhost ~]# ll -a /home/check/5545e4ca-d53d-4d93-aaf4-fd3c74f1ea49/rdd-/
.  ..  part-  .part-.crc  part-  .part-.crc  part-  .part-.crc
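
The docstring strongly recommends persisting before checkpointing; otherwise the first action computes the RDD once for its own result and again to write the checkpoint files. A minimal sketch of the recommended pattern (the data here is illustrative):
val data = sc.parallelize(1 to 100, 3).cache()  // persist first, per the docstring
data.checkpoint()
data.count()  // one materialization; the checkpoint is written from the cached partitions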

collect()

Returns an array containing all elements of the RDD.
/**
* Return an array that contains all of the elements in this RDD.
*
* @note this method should only be used if the resulting array is expected to be small, as
* all the data is loaded into the driver's memory.
*/
def collect(): Array[T]
scala> val rdd = sc.parallelize(1 to 10,3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24
scala> rdd.collect
res8: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

toLocalIterator: Iterator[T]

Returns an iterator over all of the RDD's elements.
/**
* Return an iterator that contains all of the elements in this RDD.
*
* The iterator will consume as much memory as the largest partition in this RDD.
*
* Note: this results in multiple Spark jobs, and if the input RDD is the result
* of a wide transformation (e.g. join with different partitioners), to avoid
* recomputing the input RDD should be cached first.
*/
def toLocalIterator: Iterator[T]
scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val it = rdd.toLocalIterator
it: Iterator[Int] = non-empty iterator
scala> while(it.hasNext){
| println(it.next)
| }
1
2
3
4
5
6
7
8
9
10
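
Following the note above: the iterator holds at most one partition in driver memory, but Spark runs one job per partition, so an RDD produced by a wide transformation should be cached first. A sketch under that assumption (the shuffle here is illustrative):
val wide = rdd.map(x => (x % 3, x)).reduceByKey(_ + _).cache()  // wide result, cached once
wide.toLocalIterator.foreach(println)  // iterates partition by partition without recomputation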

count()

Returns the number of elements in the RDD.
/**
* Return the number of elements in the RDD.
*/
def count(): Long
scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd.count
res1: Long = 10

dependencies

Returns the list of this RDD's dependencies (Dependency objects wrapping its parent RDDs).
/**
* Get the list of dependencies of this RDD, taking into account whether the
* RDD is checkpointed or not.
*/
final def dependencies: Seq[Dependency[_]]
scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val rdd1 = rdd.filter(_>3)
rdd1: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at filter at <console>:26
scala> val rdd2 = rdd1.filter(_<6)
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at <console>:28
scala> rdd2.dependencies
res2: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.OneToOneDependency@21c882b5)
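
For inspecting a longer chain, toDebugString prints the whole lineage at once, which is often easier to read than walking dependencies by hand:
println(rdd2.toDebugString)  // prints the full lineage chain: rdd2 <- rdd1 <- rdd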

partitions

Returns the RDD's partitions as an array of Partition objects.
/**
* Get the array of partitions of this RDD, taking into account whether the
* RDD is checkpointed or not.
*/
final def partitions: Array[Partition]
scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd.partitions
res4: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@70c, org.apache.spark.rdd.ParallelCollectionPartition@70d)

first()

Returns the first element of the RDD.
/**
* Return the first element in this RDD.
*/
def first(): T
scala> val rdd = sc.parallelize(1 to 10,2)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> rdd.first
res5: Int = 1

fold(zeroValue: T)(op: (T, T) => T)

Aggregates the elements within each partition starting from zeroValue, then merges the per-partition results with one more application of zeroValue.
/**
* @param zeroValue the initial value for the accumulated result of each partition for the `op`
* operator, and also the initial value for the combine results from different
* partitions for the `op` operator - this will typically be the neutral
* element (e.g. `Nil` for list concatenation or `0` for summation)
* @param op an operator used to both accumulate results within a partition and combine results
* from different partitions
*/
def fold(zeroValue: T)(op: (T, T) => T): T
scala> val rdd = sc.parallelize(1 to 5)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24
scala> rdd.fold(10)(_+_)
res13: Int = 35
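
The zero value participates once in every partition and once more in the final merge, so the answer depends on the partition count; res13 = 35 is consistent with a single partition here: (10 + 1+2+3+4+5) + 10 = 35. The same data over two partitions, as a check:
val r2 = sc.parallelize(1 to 5, 2)
r2.fold(10)(_ + _)  // (10+1+2) + (10+3+4+5) = 13 + 22, then 10 + 13 + 22 = 45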
