Spark Source Code Analysis -- RDD
For the details of RDD, refer to the Spark paper; below we look at the source code.
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
Represents an immutable, partitioned collection of elements that can be operated on in parallel.

Internally, each RDD is characterized by five main properties:
 - A list of partitions
 - A function for computing each split
 - A list of dependencies on other RDDs
 - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
 - Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
RDDs fall into the following categories:
basic(org.apache.spark.rdd.RDD): This class contains the basic operations available on all RDDs, such as `map`, `filter`, and `persist`.
org.apache.spark.rdd.PairRDDFunctions: contains operations available only on RDDs of key-value pairs, such as `groupByKey` and `join`
org.apache.spark.rdd.DoubleRDDFunctions: contains operations available only on RDDs of Doubles
org.apache.spark.rdd.SequenceFileRDDFunctions: contains operations available on RDDs that can be saved as SequenceFiles
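These extra operations are attached through implicit conversions rather than inheritance. A hedged sketch of how this looks in user code (in Spark of this vintage the implicits, such as rddToPairRDDFunctions, live in the SparkContext companion object, hence the import; "path/to/file" is just a placeholder):

import org.apache.spark.SparkContext._   // pulls in the implicit conversions to PairRDDFunctions etc.

val lines  = sc.textFile("path/to/file")                   // RDD[String]: only basic RDD operations available
val pairs  = lines.flatMap(_.split(" ")).map(w => (w, 1))  // RDD[(String, Int)]
val counts = pairs.reduceByKey(_ + _)                      // compiles only because the pair RDD is implicitly
                                                           // wrapped in PairRDDFunctions, which defines reduceByKey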
RDD is, first of all, a generic class; T is the type of the data it holds, and the data is always processed through Iterator[T].
It is initialized with a SparkContext and its dependencies, Seq deps.
The interfaces RDD provides already tell us roughly what an RDD is:
1. An RDD is a chunk of data, possibly a large one, so there is no guarantee it fits in a single machine's memory; it has to be split into partitions distributed across the memory of the cluster's machines.
Hence getPartitions, partitioner for deciding how to partition, and getPreferredLocations for taking locality into account per partition.
The definition of Partition is very simple: it carries only an index, no data.
trait Partition extends Serializable {
  /**
   * Get the split's index within its parent RDD
   */
  def index: Int

  // A better default implementation of HashCode
  override def hashCode(): Int = index
}
2. RDDs are related to one another: an RDD can transform its parent RDDs' data into its own data through its compute logic, so there is a causal (lineage) relationship between RDDs,
and all of the dependencies can be obtained through getDependencies.
3. An RDD can be persisted; the common case is cache, i.e. StorageLevel.MEMORY_ONLY.
4. An RDD can be checkpointed to make failover more efficient: when the RDD chain is long, relying purely on replay is inefficient.
5. RDD.iterator produces the Iterator[T] used to iterate over the actual data.
6. All kinds of transformations and actions can be performed on an RDD; see the usage sketch right after this list.
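To see points 3, 4, and 6 from the user side, here is a minimal usage sketch. It assumes an existing SparkContext sc and that a checkpoint directory has already been set with sc.setCheckpointDir; none of this comes from the source below.

val data    = sc.parallelize(1 to 1000000)  // a basic RDD
val squares = data.map(x => x.toLong * x)   // transformation: lazy, nothing runs yet

squares.cache()        // point 3: persist with the default StorageLevel.MEMORY_ONLY
squares.checkpoint()   // point 4: also save to reliable storage to truncate the lineage

println(squares.count())  // point 6: the action launches a job; results are cached and checkpointed
println(squares.count())  // the second action is served from the cache, no recomputation

Now the RDD class itself: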
abstract class RDD[T: ClassManifest](
    @transient private var sc: SparkContext,     // @transient: no need to serialize
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {

  /** Auxiliary constructor, used for RDDs with a one-to-one dependency on a single parent,
   *  which covers a lot of cases: filter, map, ...
   *  Construct an RDD with just a one-to-one dependency on one parent.
   *  Unlike the general case, there is only one parent here, so the parent RDD object is passed in directly. */
  def this(@transient oneParent: RDD[_]) =
    this(oneParent.context, List(new OneToOneDependency(oneParent)))

  // =======================================================================
  // Methods that should be implemented by subclasses of RDD
  // =======================================================================

  /** Implemented by subclasses to compute a given partition. */
  def compute(split: Partition, context: TaskContext): Iterator[T]

  /**
   * Implemented by subclasses to return the set of partitions in this RDD. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getPartitions: Array[Partition]

  /**
   * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
   * be called once, so it is safe to implement a time-consuming computation in it.
   */
  protected def getDependencies: Seq[Dependency[_]] = deps

  /** Optionally overridden by subclasses to specify placement preferences. */
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  /** Optionally overridden by subclasses to specify how they are partitioned. */
  val partitioner: Option[Partitioner] = None

  // =======================================================================
  // Methods and fields available on all RDDs
  // =======================================================================

  /** The SparkContext that created this RDD. */
  def sparkContext: SparkContext = sc

  /** A unique ID for this RDD (within its SparkContext). */
  val id: Int = sc.newRddId()

  /** A friendly name for this RDD */
  var name: String = null

  /**
   * Set this RDD's storage level to persist its values across operations after the first time
   * it is computed. This can only be used to assign a new storage level if the RDD does not
   * have a storage level set yet.
   */
  def persist(newLevel: StorageLevel): RDD[T] = {
    // TODO: Handle changes of StorageLevel
    if (storageLevel != StorageLevel.NONE && newLevel != storageLevel) {
      throw new UnsupportedOperationException(
        "Cannot change storage level of an RDD after it was already assigned a level")
    }
    storageLevel = newLevel
    // Register the RDD with the SparkContext
    sc.persistentRdds(id) = this
    this
  }

  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def persist(): RDD[T] = persist(StorageLevel.MEMORY_ONLY)

  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def cache(): RDD[T] = persist()

  /** Get the RDD's current storage level, or StorageLevel.NONE if none is set. */
  def getStorageLevel = storageLevel

  // Our dependencies and partitions will be gotten by calling subclass's methods below, and will
  // be overwritten when we're checkpointed
  private var dependencies_ : Seq[Dependency[_]] = null
  @transient private var partitions_ : Array[Partition] = null

  /** An Option holding our checkpoint RDD, if we are checkpointed.
   *  Checkpointing saves the RDD to files on disk to make failover more efficient than replay alone.
   *  In the RDD implementation, if checkpointRDD exists, the RDD's data can be read from it
   *  directly, with no need to compute. */
  private def checkpointRDD: Option[RDD[T]] = checkpointData.flatMap(_.checkpointRDD)

  /**
   * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
   * This should ''not'' be called by users directly, but is available for implementors of custom
   * subclasses of RDD.
   *
   * This is the core of data access in an RDD: a Partition contains only an index and no actual data.
   * So how is the RDD's data obtained? See the storage module:
   * cacheManager.getOrCompute maps the RDD and partition id to the corresponding block and reads the data from it.
   */
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      // A storage level other than NONE means this RDD has been persisted, so it can be read directly
      SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
    } else {
      // Never persisted: the partition has to be recomputed, or read from the checkpoint
      computeOrReadCheckpoint(split, context)
    }
  }

  // Transformations (return a new RDD)
  // ...... the various transformation interfaces: map, union, ...

  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassManifest](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))

  // Actions (launch a job to return a value to the user program)
  // ...... the various action interfaces: count, collect, ...

  /**
   * Return the number of elements in the RDD.
   */
  def count(): Long = {
    // Only an action actually calls runJob, which is why transformations are lazy
    sc.runJob(this, (iter: Iterator[T]) => {
      var result = 0L
      while (iter.hasNext) {
        result += 1L
        iter.next()
      }
      result
    }).sum
  }

  // =======================================================================
  // Other internal methods and fields
  // =======================================================================

  /** Returns the first parent RDD */
  protected[spark] def firstParent[U: ClassManifest] = {
    dependencies.head.rdd.asInstanceOf[RDD[U]]
  }

  // ................
}
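To make the subclass contract concrete: a new RDD only has to supply compute and getPartitions. Below is a minimal, purely illustrative sketch; SeqPartition and ParallelSeqRDD are made-up names (Spark's real equivalent is ParallelCollectionRDD), and shipping the whole Seq inside the RDD like this is acceptable only in a toy.

// A toy RDD that splits a local Seq into numSlices partitions.
private class SeqPartition(override val index: Int) extends Partition

private class ParallelSeqRDD[T: ClassManifest](
    sc: SparkContext, data: Seq[T], numSlices: Int)
  extends RDD[T](sc, Nil) {  // no parents, hence Nil dependencies

  override def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numSlices)(new SeqPartition(_))

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    // Deal elements out round-robin by partition index -- purely illustrative
    data.iterator.zipWithIndex
      .filter { case (_, i) => i % numSlices == split.index }
      .map(_._1)
  }
}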
Here we only discuss some basic RDDs; pair RDDs will be discussed separately.
FilteredRDD
One-to-one dependency
filter constructs a FilteredRDD with the current RDD as the first argument and the function f as the second; the return value is the RDD after filtering.
/**
* Return a new RDD containing only the elements that satisfy a predicate.
*/
def filter(f: T => Boolean): RDD[T] = new FilteredRDD(this, sc.clean(f))
In compute, the filter operation is applied to the parent RDD's Iterator[T].
private[spark] class FilteredRDD[T: ClassManifest](
    prev: RDD[T],      // the parent RDD
    f: T => Boolean)   // f, the filter predicate
  extends RDD[T](prev) {  // filter is the typical one-to-one dependency, so the auxiliary constructor is used

  // firstParent takes the first RDD out of deps, which is exactly the prev RDD passed in;
  // in a one-to-one dependency, parent and child share the same partition information
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override val partitioner = prev.partitioner  // Since filter cannot change a partition's keys

  // compute is the logic that actually produces this RDD's data
  override def compute(split: Partition, context: TaskContext) =
    firstParent[T].iterator(split, context).filter(f)
}
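A quick usage sketch (assuming a SparkContext sc) showing that FilteredRDD reuses the parent's partitions and that nothing runs until an action:

val nums  = sc.parallelize(1 to 10, 2)  // parent RDD with 2 partitions
val evens = nums.filter(_ % 2 == 0)     // builds a FilteredRDD; no job is launched yet

println(evens.partitions.size)  // 2: identical partition structure, one-to-one dependency
println(evens.count())          // 5: the action triggers compute, which filters the parent's iterator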
UnionRDD
Range dependency, which is still narrow
First, look at how union is used: UnionRDD is constructed with an array of the two RDDs as its second argument, and the return value is the new RDD produced by unioning them.
/**
* Return the union of this RDD and another one. Any identical elements will appear multiple
* times (use `.distinct()` to eliminate them).
*/
def union(other: RDD[T]): RDD[T] = new UnionRDD(sc, Array(this, other))
First, UnionPartition is defined. What characterizes union is that it merely collects the partitions of several RDDs into one RDD; the partitions themselves do not change, so the parent partitions can be reused directly.
It takes three parameters:
idx, the partition id, i.e. its index within the UnionRDD
rdd, the parent RDD
splitIndex, the id of the partition within the parent
private[spark] class UnionPartition[T: ClassManifest](idx: Int, rdd: RDD[T], splitIndex: Int)
  extends Partition {

  // Take the corresponding partition out of the parent RDD and reuse it
  var split: Partition = rdd.partitions(splitIndex)

  // The iterator can be reused as well
  def iterator(context: TaskContext) = rdd.iterator(split, context)

  def preferredLocations() = rdd.preferredLocations(split)

  // The partition id is new: once several RDDs are merged, the indices are bound to change
  override val index: Int = idx
}
Then UnionRDD itself:
class UnionRDD[T: ClassManifest](
    sc: SparkContext,
    @transient var rdds: Seq[RDD[T]])  // the Seq of parent RDDs
  extends RDD[T](sc, Nil) {  // Nil since we implement getDependencies

  override def getPartitions: Array[Partition] = {
    // The number of partitions in the UnionRDD is the sum of the partition counts of all parent RDDs
    val array = new Array[Partition](rdds.map(_.partitions.size).sum)
    var pos = 0
    for (rdd <- rdds; split <- rdd.partitions) {
      array(pos) = new UnionPartition(pos, rdd, split.index)  // create all the UnionPartitions
      pos += 1
    }
    array
  }

  override def getDependencies: Seq[Dependency[_]] = {
    val deps = new ArrayBuffer[Dependency[_]]
    var pos = 0
    for (rdd <- rdds) {
      deps += new RangeDependency(rdd, 0, pos, rdd.partitions.size)  // create a RangeDependency
      pos += rdd.partitions.size  // with a RangeDependency, pos advances by the size of the whole range
    }
    deps
  }

  // union's compute is trivial: nothing needs to be done beyond delegating to the reused parent iterator
  override def compute(s: Partition, context: TaskContext): Iterator[T] =
    s.asInstanceOf[UnionPartition[T]].iterator(context)

  override def getPreferredLocations(s: Partition): Seq[String] =
    s.asInstanceOf[UnionPartition[T]].preferredLocations()
}
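And the corresponding usage sketch for union (again assuming a SparkContext sc): the partition counts simply add up, and duplicates are kept:

val a = sc.parallelize(1 to 4, 2)  // 2 partitions
val b = sc.parallelize(3 to 6, 3)  // 3 partitions
val u = a.union(b)                 // UnionRDD with two RangeDependencies

println(u.partitions.size)  // 5 = 2 + 3: each parent partition is wrapped in a UnionPartition
println(u.count())          // 8: elements 3 and 4 appear twice; use distinct() to drop duplicates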