[Original] 1.1 RDD Source Code Walkthrough (Part 2)

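(6) Transformations. A transformation describes how data is processed by deriving one RDD from another. It does not trigger job execution, which is why transformations are called lazy operations.

Every transformation returns a new RDD, because RDDs are immutable. Even sortByKey, which might look like an in-place sort, builds and returns a new ShuffledRDD.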

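1) map: applies a function to every element, so one input element yields exactly one output element.

// Apply f to every element of this RDD and return a new RDD
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  // Apply the function to each partition of the parent RDD
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}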

2) flatMap: similar to map, but one input element may yield zero or more output elements, which are flattened into the result.

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}

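3) filter: drops the elements that do not satisfy the predicate and returns a new RDD. Because no element changes, the parent's partitioning is preserved.

def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}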
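
The three functions above all hand MapPartitionsRDD a closure that runs on one partition's iterator. The following is a minimal sketch of that per-partition behavior using plain Scala iterators, with no Spark dependency. The object name PartitionOps and its method names are made up for illustration and are not part of Spark.

```scala
// Plain-Scala sketch: a "partition" is modeled as an Iterator.
// PartitionOps and its methods are illustrative names, not Spark APIs.
object PartitionOps {
  // map: one element in, exactly one element out
  def mapPartition[T, U](iter: Iterator[T])(f: T => U): Iterator[U] =
    iter.map(f)

  // flatMap: one element in, zero or more elements out, flattened
  def flatMapPartition[T, U](iter: Iterator[T])(f: T => Iterable[U]): Iterator[U] =
    iter.flatMap(f)

  // filter: keep only the elements that satisfy the predicate
  def filterPartition[T](iter: Iterator[T])(f: T => Boolean): Iterator[T] =
    iter.filter(f)
}
```

For the input lines "a b" and "c", mapPartition with _.length yields one value per line, while flatMapPartition with a split on spaces yields one value per word.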

4) distinct: removes duplicate elements and returns a new RDD containing each distinct element once.

def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}

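The process works in three steps: each element x is mapped to the pair (x, null), reduceByKey keeps a single pair per key, and the final map extracts the key from each remaining pair.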
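
The same map, reduceByKey, map pipeline can be tried on a local collection. This is a sketch under assumptions: the reduceByKey below is a local stand-in written for this example, not PairRDDFunctions.reduceByKey, and DistinctSketch is an illustrative name.

```scala
// Local emulation of distinct = map -> reduceByKey -> map(_._1).
// reduceByKey here is a hypothetical stand-in, not the Spark method.
object DistinctSketch {
  // Merge all values that share a key into one value per key.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(merge: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(merge) }

  def distinct[T](xs: Seq[T]): Set[T] = {
    // Step 1: pair every element with a null placeholder value.
    val paired: Seq[(T, Null)] = xs.map(x => (x, null))
    // Step 2: keep one pair per key; Step 3: keep only the keys.
    reduceByKey[T, Null](paired)((x, _) => x).keySet
  }
}
```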

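5) repartition: redistributes the RDD into the given number of partitions and returns a new RDD.

This method can increase or decrease the parallelism of the RDD. Internally it always uses a shuffle to redistribute the data.

To reduce the number of partitions, consider calling coalesce directly, which avoids the shuffle.

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}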

6) coalesce: reduces the RDD to numPartitions partitions and returns a new RDD.

With the default shuffle = false this is a narrow dependency. For example, going from 1000 partitions to 100 involves no shuffle: each of the 100 new partitions simply claims 10 of the original partitions. With shuffle = true, as used by repartition, the data is redistributed through a shuffle.

def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  if (shuffle) {
    // Start from a random partition and distribute elements evenly across the output partitions
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions).values
  } else {
    new CoalescedRDD(this, numPartitions)
  }
}
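
To see why the shuffle branch spreads data evenly, the key assignment can be reproduced locally. The sketch below copies the arithmetic of distributePartition and assumes that the partitioner places an Int key at its non-negative hash code modulo numPartitions. CoalesceSketch, targetPartition, and partitionSizes are names invented for this example.

```scala
import scala.util.Random

// Local sketch of the shuffle branch of coalesce; not Spark code.
object CoalesceSketch {
  // Same logic as distributePartition: start at a position seeded by the
  // partition index, then assign consecutive integer keys.
  def distributePartition[T](index: Int, items: Iterator[T], numPartitions: Int): Iterator[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position = position + 1
      (position, t)
    }
  }

  // Assumed partitioner behavior: non-negative hashCode modulo numPartitions.
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // How many elements of one input partition land in each output partition.
  def partitionSizes[T](index: Int, items: Seq[T], numPartitions: Int): Map[Int, Int] =
    distributePartition(index, items.iterator, numPartitions)
      .toList
      .groupBy { case (key, _) => targetPartition(key, numPartitions) }
      .map { case (p, kvs) => p -> kvs.size }
}
```

Because the keys are consecutive integers, 10 elements spread over 5 output partitions land 2 per partition regardless of the random starting position.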

7) sample: returns a random sample of the RDD's elements. Sampling with replacement uses a PoissonSampler; sampling without replacement uses a BernoulliSampler.

def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = withScope {
  require(fraction >= 0.0, "Negative fraction value: " + fraction)
  if (withReplacement) {
    new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
  } else {
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
  }
}

8) sortBy: sorts the RDD by the given key function and returns the sorted result as a new RDD. The source RDD is not modified: keyBy, sortByKey, and values each produce a new RDD in turn.

def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)
    .sortByKey(ascending, numPartitions)
    .values
}

9) glom: gathers the elements of each partition into a single array and returns a new RDD of those arrays.

def glom(): RDD[Array[T]] = withScope {
  new MapPartitionsRDD[Array[T], T](this, (context, pid, iter) => Iterator(iter.toArray))
}

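10) groupBy: returns an RDD of groups, where each group is a key paired with all the elements that map to that key. Internally it keys each element with f and delegates to groupByKey.

This operation can be expensive. To compute a per-key aggregate such as a sum or an average, PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will perform better.

def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
    : RDD[(K, Iterable[T])] = withScope {
  val cleanF = sc.clean(f)
  this.map(t => (cleanF(t), t)).groupByKey(p)
}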
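
The cost difference can be illustrated on a local collection. This is only an analogy under assumptions, not Spark code: sumViaGroup collects every value for a key before summing, as grouping does, while sumViaReduce keeps a single running total per key, as a reduce-style aggregation does. SumByKeySketch and both method names are invented for this example.

```scala
// Local analogy for grouping versus reducing; not Spark code.
object SumByKeySketch {
  // Grouping style: materialize every value for a key, then sum the group.
  def sumViaGroup(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  // Reduce style: hold only one running total per key while scanning.
  def sumViaReduce(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }
}
```

Both produce the same totals; the second never holds more than one value per key in memory.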

(7) Actions. An action triggers job execution and returns its result to the user program.

1) foreach: applies the given function to every element of the RDD.

def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

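2) foreachPartition: applies the given function once to each partition of the RDD. This is useful for per-partition resources: for example, all elements of a partition can share a single database connection.

def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}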
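
The shared-connection pattern can be sketched without Spark by treating each inner Seq as one partition. FakeConnection is a hypothetical stand-in for a real database client, and saveAll plays the role of the function passed to foreachPartition; none of these names come from Spark.

```scala
// Local sketch of "one connection per partition"; not Spark code.
object ForeachPartitionSketch {
  // Hypothetical connection that only counts the records written to it.
  final class FakeConnection {
    private var count = 0
    def write(record: String): Unit = { count += 1 }
    def written: Int = count
    def close(): Unit = ()
  }

  // Returns (connections opened, records written).
  def saveAll(partitions: Seq[Seq[String]]): (Int, Int) = {
    var opened = 0
    var written = 0
    partitions.foreach { part =>
      // One connection for the whole partition, not one per element.
      val conn = new FakeConnection
      opened += 1
      try part.iterator.foreach(r => conn.write(r))
      finally {
        written += conn.written
        conn.close()
      }
    }
    (opened, written)
  }
}
```

With two partitions holding three records in total, only two connections are opened, whereas a per-element foreach would need three.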

3) collect: returns all elements of the RDD as an array.

def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

4) count: returns the number of elements in the RDD.

def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

5) take: returns the first num elements of the RDD. It scans one partition first and, if that does not yield enough elements, scans a growing number of further partitions in each subsequent job.

def take(num: Int): Array[T] = withScope {
  if (num == 0) {
    new Array[T](0)
  } else {
    val buf = new ArrayBuffer[T]
    val totalParts = this.partitions.length
    var partsScanned = 0
    while (buf.size < num && partsScanned < totalParts) {
      var numPartsToTry = 1
      if (partsScanned > 0) {
        if (buf.size == 0) {
          numPartsToTry = partsScanned * 4
        } else {
          numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
          numPartsToTry = Math.min(numPartsToTry, partsScanned * 4)
        }
      }
      val left = num - buf.size
      val p = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
      val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
      res.foreach(buf ++= _.take(num - buf.size))
      partsScanned += numPartsToTry
    }
    buf.toArray
  }
}
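
The growth rule inside the loop is easier to follow in isolation. The sketch below lifts the numPartsToTry arithmetic into a pure function and replays the loop over local sequences standing in for partitions. TakeScanSketch and the returned job count are additions for illustration; each pass of the while loop is counted as one job.

```scala
import scala.collection.mutable.ArrayBuffer

// Local replay of the scanning strategy in take; not Spark code.
object TakeScanSketch {
  // Same arithmetic as the loop body: 1 partition first, then up to
  // 4x the partitions already scanned, scaled by how full the buffer is.
  def numPartsToTry(num: Int, partsScanned: Int, bufSize: Int): Int =
    if (partsScanned == 0) 1
    else if (bufSize == 0) partsScanned * 4
    else {
      val estimate = math.max((1.5 * num * partsScanned / bufSize).toInt - partsScanned, 1)
      math.min(estimate, partsScanned * 4)
    }

  // Returns the taken elements and the number of simulated jobs.
  def take[T](partitions: Vector[Seq[T]], num: Int): (List[T], Int) = {
    val buf = new ArrayBuffer[T]
    var partsScanned = 0
    var jobs = 0
    while (buf.size < num && partsScanned < partitions.length) {
      val n = numPartsToTry(num, partsScanned, buf.size)
      val left = num - buf.size
      val range = partsScanned until math.min(partsScanned + n, partitions.length)
      // Each partition returns at most `left` elements; the buffer keeps
      // only as many as are still missing.
      range.foreach { i => buf ++= partitions(i).take(left).take(num - buf.size) }
      partsScanned += n
      jobs += 1
    }
    (buf.toList, jobs)
  }
}
```

With partitions (empty, empty, [1, 2], [3], [4, 5]) and num = 3, the first job scans one empty partition, and the second job scans the remaining four at once because nothing has been found yet.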

6) first: returns the first element of the RDD. It is implemented as take(1) and throws an exception if the RDD is empty.

def first(): T = withScope {
  take(1) match {
    case Array(t) => t
    case _ => throw new UnsupportedOperationException("empty collection")
  }
}

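7) top: returns the k largest elements of the RDD according to the implicit Ordering[T], in descending order. This is exactly the opposite of takeOrdered, which it calls with the ordering reversed.

def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  takeOrdered(num)(ord.reverse)
}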
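
The relationship between the two can be checked on a local sequence. In this sketch takeOrdered is a local stand-in that sorts the whole sequence, which is a simplification chosen for brevity and not a claim about how Spark computes the result. TopSketch is an illustrative name.

```scala
// Local sketch of top = takeOrdered with a reversed ordering; not Spark code.
object TopSketch {
  // Stand-in: the num smallest elements under ord, in ascending order.
  def takeOrdered[T](xs: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
    xs.sorted(ord).take(num)

  // The num largest elements, obtained by reversing the ordering.
  def top[T](xs: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
    takeOrdered(xs, num)(ord.reverse)
}
```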

8) saveAsTextFile: saves the RDD as a text file, using the string representation of each element.

def saveAsTextFile(path: String): Unit = withScope {
  val nullWritableClassTag = implicitly[ClassTag[NullWritable]]
  val textClassTag = implicitly[ClassTag[Text]]
  val r = this.mapPartitions { iter =>
    val text = new Text()
    iter.map { x =>
      text.set(x.toString)
      (NullWritable.get(), text)
    }
  }
  RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}

9) saveAsObjectFile: serializes the elements of the RDD in groups of 10 and saves them as a SequenceFile.

def saveAsObjectFile(path: String): Unit = withScope {
  this.mapPartitions(iter => iter.grouped(10).map(_.toArray))
    .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
    .saveAsSequenceFile(path)
}

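(8) Implicit conversions

The RDD companion object defines many implicit conversion functions. They give an RDD additional functionality that the RDD class itself does not provide.

For example, an RDD of key-value pairs is implicitly converted to PairRDDFunctions, which makes methods such as reduceByKey available on it.

implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
  new PairRDDFunctions(rdd)
}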
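
The mechanism can be reproduced on a local Seq of pairs. In this sketch PairFunctions and seqToPairFunctions are hypothetical analogues of PairRDDFunctions and rddToPairRDDFunctions written for this example; they operate on ordinary collections, not on RDDs.

```scala
import scala.language.implicitConversions

// Local sketch of enriching a type through an implicit conversion; not Spark code.
object ImplicitSketch {
  // Hypothetical wrapper that adds reduceByKey to a Seq of pairs.
  class PairFunctions[K, V](self: Seq[(K, V)]) {
    def reduceByKey(f: (V, V) => V): Map[K, V] =
      self.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
  }

  // With this conversion in scope, any Seq[(K, V)] gains reduceByKey.
  implicit def seqToPairFunctions[K, V](s: Seq[(K, V)]): PairFunctions[K, V] =
    new PairFunctions(s)

  // Seq has no reduceByKey method; the compiler inserts the conversion here.
  def totals(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.reduceByKey(_ + _)
}
```

The call pairs.reduceByKey(_ + _) compiles only because the compiler finds seqToPairFunctions and wraps the Seq, which is the same idea that lets an RDD of pairs call reduceByKey.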
