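DAGScheduler
An action triggers SparkContext.runJob, which calls dagScheduler.runJob, which in turn calls submitJob. submitJob wraps the request in a JobSubmitted event and puts it on an event queue. The event is later picked up in doOnReceive, which submits the job through dagScheduler.handleJobSubmitted.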
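1: An action triggers the job submission.
2: SparkContext submits the job (runJob).
3: SparkContext calls DAGScheduler.runJob.
4: DAGScheduler.runJob calls submitJob.
5: submitJob creates a JobSubmitted event and adds it to the queue through DAGSchedulerEventProcessLoop.post.
6: DAGSchedulerEventProcessLoop runs doOnReceive, which calls handleJobSubmitted.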
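The post/doOnReceive hand-off in the steps above can be sketched with a toy event loop. This is only an illustration: MiniEventLoop, its drain() method, and the two-field JobSubmitted are hypothetical stand-ins, and the real DAGSchedulerEventProcessLoop consumes its queue on a dedicated thread rather than synchronously.

```scala
import java.util.concurrent.LinkedBlockingDeque
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified event types; Spark's real JobSubmitted carries more fields.
sealed trait SchedulerEvent
final case class JobSubmitted(jobId: Int, description: String) extends SchedulerEvent

// Toy stand-in for DAGSchedulerEventProcessLoop: post() enqueues, drain() dequeues.
class MiniEventLoop {
  private val queue = new LinkedBlockingDeque[SchedulerEvent]()

  // Records what handleJobSubmitted saw, so the effect is observable.
  val handled: ArrayBuffer[String] = ArrayBuffer.empty[String]

  // The caller returns immediately; the event is only queued here.
  def post(event: SchedulerEvent): Unit = queue.put(event)

  // Dispatches one event to its handler.
  private def doOnReceive(event: SchedulerEvent): Unit = event match {
    case JobSubmitted(jobId, description) => handleJobSubmitted(jobId, description)
  }

  private def handleJobSubmitted(jobId: Int, description: String): Unit =
    handled += s"job $jobId: $description"

  // Assumption: events are drained synchronously here to keep the sketch deterministic.
  def drain(): Unit = {
    var next = queue.poll()
    while (next != null) {
      doOnReceive(next)
      next = queue.poll()
    }
  }
}
```

Posting an event does not run the handler; nothing happens until the loop takes the event off the queue, which is why submitJob can return before the job has been split into stages.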
Stage
handleJobSubmitted divides the job into stages.
First, look at the source of Stage:
private[scheduler] abstract class Stage(
    val id: Int,
    val rdd: RDD[_],
    val numTasks: Int,
    val parents: List[Stage],
    val firstJobId: Int,
    val callSite: CallSite)
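Stage has two subclasses, ResultStage and ShuffleMapStage.
Comparing their sources shows that:
ResultStage adds val func: (TaskContext, Iterator[_]) => _, which holds the function the action applies to each partition.
ShuffleMapStage adds val shuffleDep: ShuffleDependency[_, _, _], which holds the shuffle dependency information.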
Stage division
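Stage division is a key step in Spark job scheduling. The scheduler uses the DAG of RDD dependencies to cut the dependency chain into stages. The tasks inside a stage can run in parallel, stages run in dependency order, and the job is complete once the last stage finishes. The RDD dependencies of a real job can be very complex, so dividing stages from them is not trivial. Spark relies on the two dependency types described earlier: the scheduler starts at the end of the DAG and walks the dependency chain backwards, cutting at every ShuffleDependency (another name for a wide dependency) and adding every NarrowDependency to the current stage. The number of tasks in a stage is determined by the number of partitions of the last RDD in that stage. RDD transformations are coarse-grained, partition-based computations, so the result of running a stage is the RDD made up of those partitions.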
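The backward walk described above can be sketched on a toy dependency graph. Node, Narrow, Shuffle, and splitStages are hypothetical names for this illustration and are not Spark classes; the sketch only shows the rule "cut at a shuffle, merge across a narrow dependency".

```scala
import scala.collection.mutable

object StageSplit {
  // Toy model: a Node stands in for an RDD, a Dep for its dependency on a parent.
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  final case class Shuffle(parent: Node) extends Dep
  final case class Node(name: String, deps: List[Dep] = Nil)

  // Collects every node reachable from `last` through narrow dependencies only,
  // plus the nodes sitting on the far side of each shuffle boundary.
  private def stageOf(last: Node): (Set[String], List[Node]) = {
    val members = mutable.LinkedHashSet.empty[String]
    val boundaries = mutable.ListBuffer.empty[Node]
    val stack = mutable.Stack(last)
    while (stack.nonEmpty) {
      val node = stack.pop()
      if (members.add(node.name)) {
        node.deps.foreach {
          case Narrow(p)  => stack.push(p)   // same stage: keep walking
          case Shuffle(p) => boundaries += p // cut here: p ends a parent stage
        }
      }
    }
    (members.toSet, boundaries.toList)
  }

  // Walks backwards from the final node and returns the stages, final stage first.
  def splitStages(finalNode: Node): List[Set[String]] = {
    val stages = mutable.ListBuffer.empty[Set[String]]
    val pending = mutable.Queue(finalNode)
    val seen = mutable.Set.empty[String]
    while (pending.nonEmpty) {
      val last = pending.dequeue()
      if (seen.add(last.name)) {
        val (members, parents) = stageOf(last)
        stages += members
        pending ++= parents
      }
    }
    stages.toList
  }
}
```

For a chain textFile, map, reduceByKey, filter with a single shuffle before reduceByKey, the sketch yields two stages: one containing filter and reduceByKey, and one containing map and textFile.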
Back to handleJobSubmitted in DAGScheduler. Because the RDD graph is traversed backwards from the final RDD, the first stage created is a ResultStage named finalStage.
var finalStage: ResultStage = null
try {
  // New stage creation may throw an exception if, for example, jobs are run on a
  // HadoopRDD whose underlying HDFS files have been deleted.
  finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
  case e: Exception =>
    logWarning("Creating new stage failed due to exception - job: " + jobId, e)
    listener.jobFailed(e)
    return
}
The key code for stage division:
/**
 * Returns shuffle dependencies that are immediate parents of the given RDD.
 *
 * This function will not return more distant ancestors. For example, if C has a shuffle
 * dependency on B which has a shuffle dependency on A:
 *
 * A <-- B <-- C
 *
 * calling this function with rdd C will only return the B <-- C dependency.
 *
 * This function is scheduler-visible for the purpose of unit testing.
 */
private[scheduler] def getShuffleDependencies(
    rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
  val parents = new HashSet[ShuffleDependency[_, _, _]]
  val visited = new HashSet[RDD[_]]
  val waitingForVisit = new Stack[RDD[_]]
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    if (!visited(toVisit)) {
      visited += toVisit
      toVisit.dependencies.foreach {
        case shuffleDep: ShuffleDependency[_, _, _] =>
          parents += shuffleDep // wide (shuffle) dependency: record it, do not walk past it
        case dependency =>
          waitingForVisit.push(dependency.rdd)
      }
    }
  }
  parents
}
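When a dependency is a wide dependency, the ShuffleDependency itself is added to parents and the traversal does not go past it; narrow dependencies are followed further up the chain. The collected shuffle dependencies are the boundary points of the stage. Note that this function only returns shuffle dependencies. Which parent stages still have to run is worked out later, in submitStage.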
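The "immediate parents only" behaviour can be checked on a toy version of the A <-- B <-- C example from the doc comment. The types and the shuffleParents function are hypothetical stand-ins written for this sketch; the traversal mirrors the loop above but works on names instead of real RDDs.

```scala
import scala.collection.mutable

object ImmediateShuffleDeps {
  // Toy model: a Node stands in for an RDD.
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  final case class Shuffle(parent: Node) extends Dep
  final case class Node(name: String, deps: List[Dep] = Nil)

  // Returns the names of the nodes reached through the first shuffle
  // on each path from `node`; it never looks behind a shuffle.
  def shuffleParents(node: Node): Set[String] = {
    val parents = mutable.Set.empty[String]
    val visited = mutable.Set.empty[String]
    val waiting = mutable.Stack(node)
    while (waiting.nonEmpty) {
      val current = waiting.pop()
      if (visited.add(current.name)) {
        current.deps.foreach {
          case Shuffle(p) => parents += p.name // boundary: record and stop
          case Narrow(p)  => waiting.push(p)   // same stage: keep walking
        }
      }
    }
    parents.toSet
  }
}
```

With C depending on B through a shuffle and B depending on A through another shuffle, shuffleParents(C) contains only B, even though A is also an ancestor.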
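Every job produces a finalStage, from which tasks are created and submitted. In addition, early Spark versions ran certain simple jobs, those with no dependencies and only one partition, in a local thread instead of submitting them to the TaskScheduler; that local path does not appear in the handleJobSubmitted code quoted above.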
Once finalStage has been created, submitStage() is called. It derives the stage DAG from the dependencies between stages and processes the stages in dependency order:
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
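The recursion in submitStage can be sketched with a toy scheduler: a stage whose parents are all finished is launched, otherwise its missing parents are submitted first and the stage itself is parked in a waiting set. ToyStage and ToyScheduler are hypothetical names, and the sketch assumes parent availability is tracked in a simple set of finished stage ids.

```scala
import scala.collection.mutable

// Toy stage: just an id and a list of parent stages (not Spark's Stage class).
final case class ToyStage(id: Int, parents: List[ToyStage] = Nil)

class ToyScheduler {
  val finished: mutable.Set[Int] = mutable.Set.empty[Int]                      // output already available
  val waiting: mutable.LinkedHashSet[Int] = mutable.LinkedHashSet.empty[Int]   // parked until parents finish
  val launched: mutable.ListBuffer[Int] = mutable.ListBuffer.empty[Int]        // handed over for execution

  def submitStage(stage: ToyStage): Unit = {
    if (!waiting(stage.id) && !launched.contains(stage.id)) {
      // Parents whose output is not available yet, lowest id first.
      val missing = stage.parents.filterNot(p => finished(p.id)).sortBy(_.id)
      if (missing.isEmpty) {
        launched += stage.id
      } else {
        missing.foreach(submitStage)
        waiting += stage.id
      }
    }
  }
}
```

For a final stage 3 that depends on stage 2, which in turn depends on stages 0 and 1, submitting stage 3 launches only 0 and 1; stages 2 and 3 stay in the waiting set until their parents complete.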
For a newly submitted job, the parent stages of finalStage have not been resolved yet, so submitStage calls getMissingParentStages() to obtain them:
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList
}
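Parent stages are found by traversing the RDD dependencies, using an explicit stack instead of recursion. For a wide dependency (ShuffleDependency), Spark gets or creates a mapStage as a parent of finalStage; for a narrow dependency (NarrowDependency), no new stage is created and the traversal continues into the parent RDD. An RDD whose partitions are all cached is not traversed any further. Stages are divided by the rule described above, so a job without any shuffle dependency produces only a finalStage, while a job with a shuffle dependency produces a finalStage and a mapStage.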
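Two details of getMissingParentStages are easy to miss: a fully cached RDD stops the traversal, and a parent whose map output is already available is not reported as missing. The sketch below models both on a toy graph. The cached flag and the available set are hypothetical stand-ins for getCacheLocs and mapStage.isAvailable.

```scala
import scala.collection.mutable

object MissingParents {
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  final case class Shuffle(parent: Node) extends Dep
  // `cached` is an assumption of this sketch: true means every partition is cached.
  final case class Node(name: String, deps: List[Dep] = Nil, cached: Boolean = false)

  // `available` lists shuffle parents whose output already exists.
  def missingParents(root: Node, available: Set[String]): Set[String] = {
    val missing = mutable.Set.empty[String]
    val visited = mutable.Set.empty[String]
    val waiting = mutable.Stack(root)
    while (waiting.nonEmpty) {
      val node = waiting.pop()
      // A fully cached node can be read back directly, so its parents are not needed.
      if (visited.add(node.name) && !node.cached) {
        node.deps.foreach {
          case Shuffle(p) => if (!available(p.name)) missing += p.name
          case Narrow(p)  => waiting.push(p)
        }
      }
    }
    missing.toSet
  }
}
```

In this model a parent stage only has to run when it sits behind a shuffle, its output is not yet available, and no cached RDD lies between it and the stage being submitted.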
Once the stage DAG has been built, tasks have to be created for each stage, which is done by calling submitMissingTasks():
/** Called when stage's parents are available and we can now do its task. */
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // Get our pending tasks and remember them in our pendingTasks entry
  stage.pendingPartitions.clear()

  // First figure out the indexes of partition ids to compute.
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()

  // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
  // with this Stage
  val properties = jobIdToActiveJob(jobId).properties

  runningStages += stage
  // SparkListenerStageSubmitted should be posted before testing whether tasks are
  // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
  // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
  // event.
  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
      case s: ResultStage =>
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    case NonFatal(e) =>
      stage.makeNewStageAttempt(partitionsToCompute.size)
      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }

  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))

  // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
  // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
  // the serialized copy of the RDD and for each task we will deserialize it, which means each
  // task gets a different copy of the RDD. This provides stronger isolation between tasks that
  // might modify state of objects referenced in their closures. This is necessary in Hadoop
  // where the JobConf/Configuration object is not thread-safe.
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        JavaUtils.bufferToArray(
          closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
      case stage: ResultStage =>
        JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
    }

    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    // In the case of a failure during serialization, abort the stage.
    case e: NotSerializableException =>
      abortStage(stage, "Task not serializable: " + e.toString, Some(e))
      runningStages -= stage

      // Abort execution
      return
    case NonFatal(e) =>
      abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }

  val tasks: Seq[Task[_]] = try {
    stage match {
      case stage: ShuffleMapStage =>
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = stage.rdd.partitions(id)
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, stage.latestInfo.taskMetrics, properties, Option(jobId),
            Option(sc.applicationId), sc.applicationAttemptId)
        }

      case stage: ResultStage =>
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics,
            Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
        }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }

  if (tasks.size > 0) {
    logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
    stage.pendingPartitions ++= tasks.map(_.partitionId)
    logDebug("New pending partitions: " + stage.pendingPartitions)
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  } else {
    // Because we posted SparkListenerStageSubmitted earlier, we should mark
    // the stage as completed here in case there are no tasks to run
    markStageAsFinished(stage, None)

    val debugString = stage match {
      case stage: ShuffleMapStage =>
        s"Stage ${stage} is actually done; " +
          s"(available: ${stage.isAvailable}," +
          s"available outputs: ${stage.numAvailableOutputs}," +
          s"partitions: ${stage.numPartitions})"
      case stage : ResultStage =>
        s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
    }
    logDebug(debugString)

    submitWaitingChildStages(stage)
  }
}
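First, one task is created for each partition of the stage's RDD that still has to be computed, and each task is given preferred locations based on the locality of its partition. Second, the task type depends on the stage: a ResultStage (finalStage) produces ResultTasks and a ShuffleMapStage (mapStage) produces ShuffleMapTasks. Finally, all tasks are wrapped in a TaskSet and submitted to the TaskScheduler for execution.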
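The three steps above (one task per missing partition, a task type chosen by stage type, everything bundled into one set) can be sketched as follows. All Toy* types and buildTaskSet are hypothetical; the real tasks also carry the broadcast task binary, metrics, and job properties, which are left out here.

```scala
object TaskCreation {
  // Toy task types standing in for ShuffleMapTask and ResultTask.
  sealed trait ToyTask {
    def stageId: Int
    def partitionId: Int
    def preferredLocs: Seq[String]
  }
  final case class ToyShuffleMapTask(stageId: Int, partitionId: Int, preferredLocs: Seq[String])
      extends ToyTask
  final case class ToyResultTask(stageId: Int, partitionId: Int, preferredLocs: Seq[String])
      extends ToyTask

  // Toy stand-in for TaskSet.
  final case class ToyTaskSet(tasks: Seq[ToyTask], stageId: Int)

  // Builds one task per missing partition; returns None when nothing is left to run.
  def buildTaskSet(
      stageId: Int,
      isResultStage: Boolean,
      missingPartitions: Seq[Int],
      locations: Map[Int, Seq[String]]): Option[ToyTaskSet] = {
    val tasks: Seq[ToyTask] = missingPartitions.map { p =>
      // Partitions without locality information get an empty preference list.
      val preferred = locations.getOrElse(p, Nil)
      if (isResultStage) ToyResultTask(stageId, p, preferred)
      else ToyShuffleMapTask(stageId, p, preferred)
    }
    if (tasks.nonEmpty) Some(ToyTaskSet(tasks, stageId)) else None
  }
}
```

Returning None for an empty partition list corresponds to the else branch of submitMissingTasks, where the stage is marked as finished without submitting anything.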
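This completes the startup of a job inside DAGScheduler; the TaskScheduler now runs the tasks. When a task finishes, its result is reported back to DAGScheduler, which calls handleTaskCompletion() to process it. The abridged excerpt below appears to come from an older Spark version than the code above: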
private def handleTaskCompletion(event: CompletionEvent) {
  val task = event.task
  val stage = idToStage(task.stageId)
  def markStageAsFinished(stage: Stage) = {
    val serviceTime = stage.submissionTime match {
      case Some(t) => "%.03f".format((System.currentTimeMillis() - t) / 1000.0)
      case _ => "Unkown"
    }
    logInfo("%s (%s) finished in %s s".format(stage, stage.origin, serviceTime))
    running -= stage
  }
  event.reason match {
    case Success =>
      ...
      task match {
        case rt: ResultTask[_, _] =>
          ...
        case smt: ShuffleMapTask =>
          ...
      }
    case Resubmitted =>
      ...
    case FetchFailed(bmAddress, shuffleId, mapId, reduceId) =>
      ...
    case other =>
      abortStage(idToStage(task.stageId), task + " failed: " + other)
  }
}
Every finished task reports its result to DAGScheduler, which decides what to do next based on the reason the task ended.
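The dispatch on the end reason can be sketched with a small state machine. Everything here is a hypothetical simplification: the reason types carry fewer fields than Spark's, the returned strings stand in for the actions DAGScheduler would take, and the elided branches of the excerpt above are not reproduced, only a plausible reading of their intent.

```scala
object CompletionHandling {
  // Simplified end reasons (assumption: far fewer fields than Spark's TaskEndReason).
  sealed trait EndReason
  case object Success extends EndReason
  case object Resubmitted extends EndReason
  final case class FetchFailed(shuffleId: Int) extends EndReason
  final case class OtherFailure(message: String) extends EndReason

  // Tracks which partitions of one stage are still pending.
  final class ToyStageState(partitions: Set[Int]) {
    private var pending: Set[Int] = partitions

    // Returns a description of the follow-up action for this task end.
    def onTaskEnd(partition: Int, reason: EndReason): String = reason match {
      case Success =>
        pending -= partition
        if (pending.isEmpty) "stage finished" else s"${pending.size} task(s) pending"
      case Resubmitted =>
        pending += partition // the task will run again, so wait for it
        "task resubmitted"
      case FetchFailed(shuffleId) =>
        s"resubmit map stage for shuffle $shuffleId"
      case OtherFailure(message) =>
        s"abort stage: $message"
    }
  }
}
```

Because EndReason is sealed, the compiler can check that every reason is handled, which is one reason this kind of dispatch is usually written as a pattern match.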