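An action triggers SparkContext.runJob, which calls dagScheduler.runJob and then submitJob. submitJob wraps the job in a JobSubmitted event and puts it on an event queue. The event loop later picks the event up in doOnReceive and submits the job through dagScheduler.handleJobSubmitted.

1: An action triggers the job submission.

2: SparkContext submits the job (runJob).

3: SparkContext calls DAGScheduler.runJob.

4: DAGScheduler.runJob calls submitJob.

5: submitJob creates a JobSubmitted event and adds it to the queue through DAGSchedulerEventProcessLoop.post.

6: DAGSchedulerEventProcessLoop runs doOnReceive, which calls handleJobSubmitted.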
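
The hand-off in steps 5 and 6 can be sketched with a toy event loop. This is only an illustration under stated assumptions: ToyEventLoop, SchedulerEvent, drain and JobSubmissionSketch are hypothetical names, the JobSubmitted here carries far fewer fields than Spark's, and the queue is drained on the calling thread instead of a dedicated event thread.

```scala
import java.util.concurrent.LinkedBlockingDeque
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified stand-ins for the scheduler's events.
sealed trait SchedulerEvent
final case class JobSubmitted(jobId: Int, description: String) extends SchedulerEvent

// Toy event loop: post() only enqueues, handling happens later.
// Assumption: we drain on the caller's thread to keep the example deterministic.
class ToyEventLoop(onReceive: SchedulerEvent => Unit) {
  private val queue = new LinkedBlockingDeque[SchedulerEvent]()

  def post(event: SchedulerEvent): Unit = queue.put(event)

  // Handle everything currently queued, in FIFO order.
  def drain(): Unit = {
    var next = queue.poll()
    while (next != null) {
      onReceive(next)
      next = queue.poll()
    }
  }
}

object JobSubmissionSketch {
  // Returns the job ids in the order the handler saw them.
  def run(jobIds: Seq[Int]): List[Int] = {
    val handled = ArrayBuffer.empty[Int]
    val loop = new ToyEventLoop({
      case JobSubmitted(id, _) =>
        handled += id // plays the role of handleJobSubmitted
        ()
    })
    // Plays the role of submitJob: wrap each job in an event and post it.
    jobIds.foreach(id => loop.post(JobSubmitted(id, s"job-$id")))
    loop.drain()
    handled.toList
  }
}
```

The point of the queue is that the caller of post returns immediately, and all scheduling decisions are made by a single consumer.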

Stage

handleJobSubmitted is where the job is divided into stages.

First, a look at the source of Stage:

private[scheduler] abstract class Stage(
    val id: Int,
    val rdd: RDD[_],
    val numTasks: Int,
    val parents: List[Stage],
    val firstJobId: Int,
    val callSite: CallSite)

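Stage has two subclasses: ResultStage and ShuffleMapStage.

Comparing their source shows the difference:

ResultStage adds val func: (TaskContext, Iterator[_]) => _, which holds the function that the action applies to each partition.

ShuffleMapStage adds val shuffleDep: ShuffleDependency[_, _, _], which holds the shuffle dependency that the stage produces output for.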
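
As a rough illustration of that difference, here is a stripped-down model of the hierarchy. All names (ToyStage, ToyResultStage, ToyShuffleMapStage, StageKindSketch) are hypothetical, the function type is simplified to Iterator[Int] => Int, and the shuffle dependency is reduced to a plain shuffleId.

```scala
// Simplified stand-in for the common base class: id, task count, parent stages.
sealed abstract class ToyStage(
    val id: Int,
    val numTasks: Int,
    val parents: List[ToyStage])

// Final stage of a job: carries the function the action applies to a partition.
final class ToyResultStage(
    id: Int,
    numTasks: Int,
    parents: List[ToyStage],
    val func: Iterator[Int] => Int) extends ToyStage(id, numTasks, parents)

// Intermediate stage: carries (here only the id of) the shuffle it writes output for.
final class ToyShuffleMapStage(
    id: Int,
    numTasks: Int,
    parents: List[ToyStage],
    val shuffleId: Int) extends ToyStage(id, numTasks, parents)

object StageKindSketch {
  // The scheduler branches on the concrete stage type in the same way.
  def describe(stage: ToyStage): String = stage match {
    case r: ToyResultStage =>
      s"stage ${r.id}: result stage, ${r.numTasks} tasks"
    case m: ToyShuffleMapStage =>
      s"stage ${m.id}: shuffle map stage for shuffle ${m.shuffleId}"
  }
}
```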

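Stage division

Dividing a job into stages is a key step in Spark job scheduling. The scheduler uses the DAG of RDD dependencies to cut the dependency chain into stages. Tasks inside a stage can run in parallel, and the stages themselves run in dependency order until the whole job is complete. The RDD dependencies of a real job can be very complex, so Spark applies one simple rule based on the dependency types described earlier: the scheduler starts from the end of the DAG and walks the dependency chain backwards, cutting a new stage whenever it meets a ShuffleDependency (another name for a wide dependency) and adding the parent RDD to the current stage whenever it meets a NarrowDependency. The number of tasks in a stage is determined by the number of partitions of the last RDD in that stage. RDD transformations are coarse-grained, partition-based computations, so the result of running a stage is the RDD made up of those partitions.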
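
The backward walk can be tried out on a toy lineage graph. The sketch below is not Spark code: Node, Dep, Narrow, Shuffle and StageSplitSketch are hypothetical names, RDDs are identified by name only, and shared parent stages are not deduplicated the way Spark does by shuffle id.

```scala
import scala.collection.mutable

// Hypothetical toy model of an RDD lineage graph.
sealed trait Dep { def parent: Node }
final case class Narrow(parent: Node) extends Dep
final case class Shuffle(parent: Node) extends Dep
final case class Node(name: String, deps: List[Dep] = Nil)

object StageSplitSketch {
  // Collect the RDDs in the stage that ends at `last`, plus the RDDs found
  // on the far side of each shuffle boundary (they end the parent stages).
  def stageOf(last: Node): (Set[String], List[Node]) = {
    val members = mutable.LinkedHashSet.empty[String]
    val boundaries = mutable.ListBuffer.empty[Node]
    var pending = List(last)
    while (pending.nonEmpty) {
      val node = pending.head
      pending = pending.tail
      if (members.add(node.name)) { // add returns false if already visited
        node.deps.foreach {
          case Narrow(p)  => pending = p :: pending // same stage: keep walking
          case Shuffle(p) => boundaries += p        // cut here: p ends a parent stage
        }
      }
    }
    (members.toSet, boundaries.toList)
  }

  // Start at the final RDD and recurse through every boundary.
  // The first element is the final stage, followed by its ancestors.
  def allStages(finalNode: Node): List[Set[String]] = {
    val (members, parents) = stageOf(finalNode)
    members :: parents.flatMap(allStages)
  }
}
```

For a word-count-like lineage (lines, words, pairs joined by narrow dependencies, then a shuffle into counts), allStages returns two stages: one holding counts, and one holding pairs, words and lines.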

Back to DAGScheduler.handleJobSubmitted. Because the RDD graph is traversed backwards, the first stage to be created is the last one to run: a ResultStage named finalStage.

var finalStage: ResultStage = null
try {
  // New stage creation may throw an exception if, for example, jobs are run on a
  // HadoopRDD whose underlying HDFS files have been deleted.
  finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
} catch {
  case e: Exception =>
    logWarning("Creating new stage failed due to exception - job: " + jobId, e)
    listener.jobFailed(e)
    return
}

The key code for stage division is getShuffleDependencies:

/**
 * Returns shuffle dependencies that are immediate parents of the given RDD.
 *
 * This function will not return more distant ancestors. For example, if C has a shuffle
 * dependency on B which has a shuffle dependency on A:
 *
 * A <-- B <-- C
 *
 * calling this function with rdd C will only return the B <-- C dependency.
 *
 * This function is scheduler-visible for the purpose of unit testing.
 */
private[scheduler] def getShuffleDependencies(
    rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
  val parents = new HashSet[ShuffleDependency[_, _, _]]
  val visited = new HashSet[RDD[_]]
  val waitingForVisit = new Stack[RDD[_]]
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    if (!visited(toVisit)) {
      visited += toVisit
      toVisit.dependencies.foreach {
        case shuffleDep: ShuffleDependency[_, _, _] =>
          parents += shuffleDep // wide (shuffle) dependency: record it, do not descend
        case dependency =>
          waitingForVisit.push(dependency.rdd)
      }
    }
  }
  parents
}

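When a dependency is a wide (shuffle) dependency, it is added to parents and the walk does not go past it. For any other dependency, the parent RDD is pushed onto the stack and the walk continues. The returned shuffle dependencies are the boundary points of the stage. Note that this does not yet resolve each stage's dependencies; that actually happens in submitStage.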
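
The "immediate parents only" behaviour is easy to check on a toy graph. The following is a minimal sketch of the same traversal, assuming hypothetical types (ToyRdd, ToyDep, NarrowDep, ShuffleDep) and returning shuffle ids instead of dependency objects.

```scala
// Hypothetical toy lineage types; names are illustrative only.
sealed trait ToyDep { def rdd: ToyRdd }
final case class NarrowDep(rdd: ToyRdd) extends ToyDep
final case class ShuffleDep(shuffleId: Int, rdd: ToyRdd) extends ToyDep
final case class ToyRdd(name: String, dependencies: List[ToyDep] = Nil)

object ImmediateShuffleParents {
  // Same shape as the traversal above: explicit work list, visited set,
  // and no descent past a shuffle dependency.
  def shuffleDependencies(start: ToyRdd): Set[Int] = {
    var parents = Set.empty[Int]
    var visited = Set.empty[String]
    var waiting = List(start)
    while (waiting.nonEmpty) {
      val toVisit = waiting.head
      waiting = waiting.tail
      if (!visited(toVisit.name)) {
        visited += toVisit.name
        toVisit.dependencies.foreach {
          case ShuffleDep(id, _) => parents += id                // record, stop here
          case NarrowDep(parent) => waiting = parent :: waiting  // keep walking
        }
      }
    }
    parents
  }
}
```

With a lineage A, B (shuffle 0 on A), B2 (narrow on B), C (shuffle 1 on B2) and D (narrow on C), calling shuffleDependencies on D reaches C through the narrow dependency and reports only shuffle 1; shuffle 0 is a more distant ancestor and is not returned.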

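Every job produces a finalStage, from which tasks are created and submitted. In addition, early Spark releases could run certain simple jobs (no dependencies and only one partition) on a local thread instead of submitting them to the TaskScheduler; the handleJobSubmitted code shown here has no such local path.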

Once finalStage has been created, submitStage() is called. It works out the stage DAG from the dependencies between stages and submits the stages in dependency order:

/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}

For a newly submitted job the parent stages of finalStage are not yet known, so submitStage calls getMissingParentStages() to find them:

private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList
}

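Parent stages are found by traversing the RDD dependencies, using an explicit stack rather than recursion. For a wide dependency, that is a ShuffleDependency, Spark gets or creates a ShuffleMapStage as a parent of the current stage; for a NarrowDependency it does not create a new stage. Stages are therefore divided according to the rule described above. Of the two kinds of job mentioned earlier, the first (no shuffle dependency) produces only a finalStage, while the second produces a finalStage and one or more map stages.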
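
The parents-first ordering of submitStage can be sketched in isolation. This is a simplified model under stated assumptions: SimStage and SubmitOrderSketch are hypothetical names, parent stages are given directly instead of being discovered from RDD dependencies, and the available flag stands in for "the stage's shuffle output already exists".

```scala
import scala.collection.mutable

// Hypothetical toy stage.
final case class SimStage(id: Int, parents: List[SimStage] = Nil, available: Boolean = false)

class SubmitOrderSketch {
  val submitted = mutable.ListBuffer.empty[Int]  // stages whose tasks were launched
  val waiting = mutable.LinkedHashSet.empty[Int] // stages parked until parents finish

  // Stand-in for getMissingParentStages: parents whose output is not available yet.
  private def missingParents(stage: SimStage): List[SimStage] =
    stage.parents.filterNot(_.available).sortBy(_.id)

  def submitStage(stage: SimStage): Unit = {
    if (!waiting(stage.id) && !submitted.contains(stage.id)) {
      val missing = missingParents(stage)
      if (missing.isEmpty) {
        submitted += stage.id // stand-in for submitMissingTasks
      } else {
        missing.foreach(submitStage) // submit ancestors first
        waiting += stage.id          // revisit once the parents complete
      }
    }
  }
}
```

Submitting a final stage whose only runnable ancestor is stage 0 launches stage 0 and parks every stage between it and the final stage in waiting, which is why a stage never starts before its missing parents.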

Once the stage DAG has been built, tasks have to be created for each stage to run, so submitMissingTasks() is called:

/** Called when stage's parents are available and we can now do its task. */
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // Get our pending tasks and remember them in our pendingTasks entry
  stage.pendingPartitions.clear()

  // First figure out the indexes of partition ids to compute.
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()

  // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
  // with this Stage
  val properties = jobIdToActiveJob(jobId).properties

  runningStages += stage
  // SparkListenerStageSubmitted should be posted before testing whether tasks are
  // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
  // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
  // event.
  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
      case s: ResultStage =>
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    case NonFatal(e) =>
      stage.makeNewStageAttempt(partitionsToCompute.size)
      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }

  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))

  // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
  // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
  // the serialized copy of the RDD and for each task we will deserialize it, which means each
  // task gets a different copy of the RDD. This provides stronger isolation between tasks that
  // might modify state of objects referenced in their closures. This is necessary in Hadoop
  // where the JobConf/Configuration object is not thread-safe.
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        JavaUtils.bufferToArray(
          closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
      case stage: ResultStage =>
        JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
    }

    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    // In the case of a failure during serialization, abort the stage.
    case e: NotSerializableException =>
      abortStage(stage, "Task not serializable: " + e.toString, Some(e))
      runningStages -= stage

      // Abort execution
      return
    case NonFatal(e) =>
      abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }

  val tasks: Seq[Task[_]] = try {
    stage match {
      case stage: ShuffleMapStage =>
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = stage.rdd.partitions(id)
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, stage.latestInfo.taskMetrics, properties, Option(jobId),
            Option(sc.applicationId), sc.applicationAttemptId)
        }

      case stage: ResultStage =>
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics,
            Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
        }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }

  if (tasks.size > 0) {
    logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
    stage.pendingPartitions ++= tasks.map(_.partitionId)
    logDebug("New pending partitions: " + stage.pendingPartitions)
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  } else {
    // Because we posted SparkListenerStageSubmitted earlier, we should mark
    // the stage as completed here in case there are no tasks to run
    markStageAsFinished(stage, None)

    val debugString = stage match {
      case stage: ShuffleMapStage =>
        s"Stage ${stage} is actually done; " +
          s"(available: ${stage.isAvailable}," +
          s"available outputs: ${stage.numAvailableOutputs}," +
          s"partitions: ${stage.numPartitions})"
      case stage : ResultStage =>
        s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
    }
    logDebug(debugString)

    submitWaitingChildStages(stage)
  }
}

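First, based on the partitions of the RDD the stage depends on, one task is created for each partition that still has to be computed (for a fresh stage, as many tasks as partitions), and each task gets preferred locations according to the locality of its partition. Second, the task type depends on the stage: a ResultTask for finalStage, a ShuffleMapTask for a map stage. Finally, all tasks are wrapped in a TaskSet and submitted to the TaskScheduler for execution.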
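
Those three steps can be condensed into a small sketch. Everything here is an assumption made for illustration: ToyTask, ToyShuffleMapTask, ToyResultTask, ToyTaskSet and buildTaskSet are hypothetical, and the serialized task binary, metrics and job properties that Spark attaches to each task are left out.

```scala
// Hypothetical toy task types.
sealed trait ToyTask {
  def stageId: Int
  def partitionId: Int
  def preferredHosts: Seq[String]
}
final case class ToyShuffleMapTask(stageId: Int, partitionId: Int, preferredHosts: Seq[String])
  extends ToyTask
final case class ToyResultTask(stageId: Int, partitionId: Int, preferredHosts: Seq[String])
  extends ToyTask
final case class ToyTaskSet(stageId: Int, tasks: Seq[ToyTask])

object TaskCreationSketch {
  def buildTaskSet(
      stageId: Int,
      isResultStage: Boolean,
      numPartitions: Int,
      finished: Set[Int],            // partitions that already have output
      hostsOf: Int => Seq[String]    // stand-in for a preferred-location lookup
  ): ToyTaskSet = {
    // Step 1: only partitions that are still missing get a task.
    val partitionsToCompute = (0 until numPartitions).filterNot(finished)
    // Step 2: the task type follows the stage type; locality is attached per task.
    val tasks: Seq[ToyTask] = partitionsToCompute.map { p =>
      if (isResultStage) ToyResultTask(stageId, p, hostsOf(p))
      else ToyShuffleMapTask(stageId, p, hostsOf(p))
    }
    // Step 3: all tasks of the stage travel together as one task set.
    ToyTaskSet(stageId, tasks)
  }
}
```

For a map stage with four partitions of which partition 1 is already finished, buildTaskSet yields three ToyShuffleMapTask instances for partitions 0, 2 and 3.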

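At this point the job's start-up inside DAGScheduler is complete and the TaskScheduler takes over running the tasks. When a task finishes, its result is returned to DAGScheduler, which calls handleTaskCompletion() to process it. The abridged listing below appears to come from an older Spark release than the listings above; elided parts are marked with "...":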

private def handleTaskCompletion(event: CompletionEvent) {
  val task = event.task
  val stage = idToStage(task.stageId)
  def markStageAsFinished(stage: Stage) = {
    val serviceTime = stage.submissionTime match {
      case Some(t) => "%.03f".format((System.currentTimeMillis() - t) / 1000.0)
      case _ => "Unkown"
    }
    logInfo("%s (%s) finished in %s s".format(stage, stage.origin, serviceTime))
    running -= stage
  }
  event.reason match {
    case Success =>
      ...
      task match {
        case rt: ResultTask[_, _] =>
          ...
        case smt: ShuffleMapTask =>
          ...
      }
    case Resubmitted =>
      ...
    case FetchFailed(bmAddress, shuffleId, mapId, reduceId) =>
      ...
    case other =>
      abortStage(idToStage(task.stageId), task + " failed: " + other)
  }
}

Every completed task returns its result to DAGScheduler, and DAGScheduler decides what to do next based on that result.
