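An action triggers SparkContext.runJob, which calls dagScheduler.runJob, which in turn calls submitJob. submitJob wraps the job in a JobSubmitted event and puts it on a queue. The event is then picked up in doOnReceive and handed to dagScheduler.handleJobSubmitted.

1: An action triggers submission of the job.

2: SparkContext submits the job.

3: DAGScheduler.runJob is called to submit the job.

4: DAGScheduler.submitJob is called.

5: A JobSubmitted object is created and added to the queue through the post method of DAGSchedulerEventProcessLoop.

6: DAGSchedulerEventProcessLoop runs doOnReceive, which calls handleJobSubmitted.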
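The hand-off in steps 5 and 6 can be sketched with a toy event loop. This is only an illustration, not Spark's DAGSchedulerEventProcessLoop: ToyEventLoop, ToyJobSubmitted, handled and drain are hypothetical names, and the queue is drained on the calling thread, whereas the real loop processes events on its own thread.

```scala
import java.util.concurrent.LinkedBlockingDeque
import scala.collection.mutable.ArrayBuffer

object EventLoopSketch {
  // Hypothetical stand-ins for DAGSchedulerEvent and JobSubmitted.
  sealed trait ToyEvent
  final case class ToyJobSubmitted(jobId: Int) extends ToyEvent

  class ToyEventLoop {
    private val queue = new LinkedBlockingDeque[ToyEvent]()
    // Records the job ids that reached the handler, in order.
    val handled = ArrayBuffer.empty[Int]

    // post only enqueues the event; the caller never runs the handler itself.
    def post(event: ToyEvent): Unit = queue.put(event)

    // Plays the role of doOnReceive: dispatch on the event type.
    private def doOnReceive(event: ToyEvent): Unit = event match {
      case ToyJobSubmitted(jobId) => handleJobSubmitted(jobId)
    }

    private def handleJobSubmitted(jobId: Int): Unit = {
      handled += jobId
    }

    // Assumption for the sketch: process everything queued so far, synchronously.
    def drain(): Unit = {
      while (!queue.isEmpty) doOnReceive(queue.take())
    }
  }

  // Plays the role of submitJob: wrap the job in an event and post it.
  def submitJob(loop: ToyEventLoop, jobId: Int): Unit = loop.post(ToyJobSubmitted(jobId))
}
```

Nothing is handled at the moment submitJob returns; the handler only runs once the loop takes the event off the queue, which is the decoupling that the queue provides.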

Stage

handleJobSubmitted is where the job is divided into stages.

First, a look at the source of Stage:

private[scheduler] abstract class Stage(
    val id: Int,
    val rdd: RDD[_],
    val numTasks: Int,
    val parents: List[Stage],
    val firstJobId: Int,
    val callSite: CallSite)

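Stage has two subclasses, ResultStage and ShuffleMapStage.

Comparing their source shows the difference:

ResultStage adds val func: (TaskContext, Iterator[_]) => _, which holds the function that the action applies to each partition.

ShuffleMapStage adds val shuffleDep: ShuffleDependency[_, _, _], which holds the dependency information.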
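The difference between the two subclasses can be sketched with simplified stand-ins. The types below (ToyStage, ToyResultStage, ToyShuffleMapStage) are hypothetical and far smaller than Spark's real classes. In particular, the shuffle dependency is reduced to a bare shuffleId and func to a function over Int iterators.

```scala
object StageHierarchySketch {
  // Hypothetical, heavily simplified counterpart of Stage.
  abstract class ToyStage(val id: Int, val numTasks: Int, val parents: List[ToyStage])

  // A result stage keeps the function that the action applies to each partition.
  final class ToyResultStage(id: Int, numTasks: Int, parents: List[ToyStage],
      val func: Iterator[Int] => Int) extends ToyStage(id, numTasks, parents)

  // A shuffle map stage keeps a reference to the shuffle whose output it produces.
  final class ToyShuffleMapStage(id: Int, numTasks: Int, parents: List[ToyStage],
      val shuffleId: Int) extends ToyStage(id, numTasks, parents)

  // Scheduler code distinguishes the two kinds of stage by pattern matching.
  def describe(stage: ToyStage): String = stage match {
    case _: ToyResultStage => s"ResultStage ${stage.id} with ${stage.parents.size} parent(s)"
    case m: ToyShuffleMapStage => s"ShuffleMapStage ${stage.id} for shuffle ${m.shuffleId}"
    case _ => s"Stage ${stage.id}"
  }
}
```

The same pattern match on the stage type appears several times in the submitMissingTasks listing later in this post.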

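Stage division

Dividing a job into stages is a key step in Spark job scheduling. The scheduler determines the dependencies from the DAG and cuts the dependency chain into stages. The tasks inside a stage can run in parallel, and the stages of a job run in dependency order until the whole job completes. The RDD dependencies in a real job can be very complex, which makes the division hard to do by inspection, so Spark relies on the distinction between narrow and wide dependencies. Starting from the end of the DAG, the scheduler walks the dependency chain backwards. It cuts the chain whenever it meets a ShuffleDependency (another name for a wide dependency) and adds the parent RDD to the current stage whenever it meets a NarrowDependency. The number of tasks in a stage is determined by the number of partitions of the last RDD in that stage. RDD transformations are coarse-grained computations over partitions, and the result of running a stage is the RDD made up of those partitions.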
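The backward walk described above can be sketched on a toy lineage graph. This is a minimal sketch under stated assumptions, not Spark's algorithm as implemented: ToyRdd, ToyDep, ToyStage and splitStages are hypothetical names, and each dependency is reduced to a parent plus an isShuffle flag.

```scala
import scala.collection.mutable

object StageSplitSketch {
  // Hypothetical toy lineage: each RDD lists its parents and whether the edge is a shuffle.
  final case class ToyDep(parent: ToyRdd, isShuffle: Boolean)
  final case class ToyRdd(name: String, numPartitions: Int, deps: List[ToyDep] = Nil)
  final case class ToyStage(rdds: List[String], numTasks: Int)

  // Walk backwards from the final RDD. Narrow edges stay in the current stage;
  // each shuffle edge starts a new stage whose last RDD is the shuffle parent.
  def splitStages(finalRdd: ToyRdd): List[ToyStage] = {
    val stages = mutable.ListBuffer.empty[ToyStage]
    val stageRoots = mutable.Queue(finalRdd)
    val seenRoots = mutable.Set(finalRdd.name)
    while (stageRoots.nonEmpty) {
      val root = stageRoots.dequeue()
      val members = mutable.ListBuffer.empty[String]
      val visited = mutable.Set.empty[String]
      var stack = List(root) // explicit stack instead of recursion
      while (stack.nonEmpty) {
        val rdd = stack.head
        stack = stack.tail
        if (visited.add(rdd.name)) {
          members += rdd.name
          rdd.deps.foreach { dep =>
            if (!dep.isShuffle) {
              stack = dep.parent :: stack
            } else if (seenRoots.add(dep.parent.name)) {
              stageRoots.enqueue(dep.parent)
            }
          }
        }
      }
      // The task count of a stage is the partition count of its last RDD.
      stages += ToyStage(members.toList, root.numPartitions)
    }
    stages.toList
  }
}
```

For a lineage of the shape lines -> pairs -> (shuffle) -> counts, the sketch yields two stages: one containing only counts, and one containing pairs and lines.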

Back to DAGScheduler's handleJobSubmitted. Because the RDD graph is traversed from the end backwards, the first stage created is a ResultStage named finalStage.

    var finalStage: ResultStage = null
    try {
      // New stage creation may throw an exception if, for example, jobs are run on a
      // HadoopRDD whose underlying HDFS files have been deleted.
      finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
    } catch {
      case e: Exception =>
        logWarning("Creating new stage failed due to exception - job: " + jobId, e)
        listener.jobFailed(e)
        return
    }

The key code for stage division:

  /**
   * Returns shuffle dependencies that are immediate parents of the given RDD.
   *
   * This function will not return more distant ancestors.  For example, if C has a shuffle
   * dependency on B which has a shuffle dependency on A:
   *
   * A <-- B <-- C
   *
   * calling this function with rdd C will only return the B <-- C dependency.
   *
   * This function is scheduler-visible for the purpose of unit testing.
   */
  private[scheduler] def getShuffleDependencies(
      rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
    val parents = new HashSet[ShuffleDependency[_, _, _]]
    val visited = new HashSet[RDD[_]]
    val waitingForVisit = new Stack[RDD[_]]
    waitingForVisit.push(rdd)
    while (waitingForVisit.nonEmpty) {
      val toVisit = waitingForVisit.pop()
      if (!visited(toVisit)) {
        visited += toVisit
        toVisit.dependencies.foreach {
          case shuffleDep: ShuffleDependency[_, _, _] =>
            parents += shuffleDep  // wide dependency: record it and do not walk past it
          case dependency =>
            waitingForVisit.push(dependency.rdd)
        }
      }
    }
    parents
  }

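When a dependency is a ShuffleDependency, it is added to parents and the traversal does not continue past it. Narrow dependencies are followed further. The returned parents therefore mark the boundaries of the stage. The dependencies between stages are not resolved here; that actually happens in submitStage.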
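The "immediate parents only" behaviour from the A <-- B <-- C example can be reproduced on toy data. The sketch below mirrors the shape of getShuffleDependencies but uses hypothetical types (ToyNode, ToyShuffle, ToyNarrow) and returns parent names instead of ShuffleDependency objects.

```scala
object ShuffleDepsSketch {
  // Hypothetical stand-ins for RDD, ShuffleDependency and NarrowDependency.
  sealed trait ToyDependency { def parent: ToyNode }
  final case class ToyShuffle(parent: ToyNode) extends ToyDependency
  final case class ToyNarrow(parent: ToyNode) extends ToyDependency
  final case class ToyNode(name: String, dependencies: List[ToyDependency] = Nil)

  // Names of the RDDs reached through the nearest shuffle edges only.
  def immediateShuffleParents(rdd: ToyNode): Set[String] = {
    var parents = Set.empty[String]
    var visited = Set.empty[String]
    var waiting = List(rdd)
    while (waiting.nonEmpty) {
      val toVisit = waiting.head
      waiting = waiting.tail
      if (!visited(toVisit.name)) {
        visited += toVisit.name
        toVisit.dependencies.foreach {
          case ToyShuffle(p) => parents += p.name     // stage boundary: record, do not descend
          case ToyNarrow(p) => waiting = p :: waiting // same stage: keep walking
        }
      }
    }
    parents
  }
}
```

With C depending on B through a shuffle and B depending on A through a shuffle, calling the function on C (or on an RDD that narrowly depends on C) returns only B.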

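Every job produces a finalStage, from which tasks are created and submitted. In early Spark versions, a simple job with no dependencies and only one partition could be run on a local thread instead of being submitted to the TaskScheduler.

Once finalStage has been created, submitStage() is called. It works out the stage DAG from the dependencies between stages and processes the stages in dependency order: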

  /** Submits stage, but first recursively submits any missing parents. */
  private def submitStage(stage: Stage) {
    val jobId = activeJobForStage(stage)
    if (jobId.isDefined) {
      logDebug("submitStage(" + stage + ")")
      if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
        val missing = getMissingParentStages(stage).sortBy(_.id)
        logDebug("missing: " + missing)
        if (missing.isEmpty) {
          logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
          submitMissingTasks(stage, jobId.get)
        } else {
          for (parent <- missing) {
            submitStage(parent)
          }
          waitingStages += stage
        }
      }
    } else {
      abortStage(stage, "No active job for stage " + stage.id, None)
    }
  }

For a newly submitted job, the parent stages of finalStage have not been resolved yet, so submitStage calls getMissingParentStages() to find them:

  private def getMissingParentStages(stage: Stage): List[Stage] = {
    val missing = new HashSet[Stage]
    val visited = new HashSet[RDD[_]]
    // We are manually maintaining a stack here to prevent StackOverflowError
    // caused by recursively visiting
    val waitingForVisit = new Stack[RDD[_]]
    def visit(rdd: RDD[_]) {
      if (!visited(rdd)) {
        visited += rdd
        val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
        if (rddHasUncachedPartitions) {
          for (dep <- rdd.dependencies) {
            dep match {
              case shufDep: ShuffleDependency[_, _, _] =>
                val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
                if (!mapStage.isAvailable) {
                  missing += mapStage
                }
              case narrowDep: NarrowDependency[_] =>
                waitingForVisit.push(narrowDep.rdd)
            }
          }
        }
      }
    }
    waitingForVisit.push(stage.rdd)
    while (waitingForVisit.nonEmpty) {
      visit(waitingForVisit.pop())
    }
    missing.toList
  }

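Here the parent stages are found by traversing the RDD dependencies (with an explicit stack rather than recursion). For a wide dependency, that is a ShuffleDependency, Spark creates a new mapStage as a parent of finalStage. For a NarrowDependency, Spark does not create a new stage. Stages are divided according to the rule described above, so a job with only narrow dependencies produces just a finalStage, while a job that contains a shuffle produces a finalStage and a mapStage.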
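The order in which stages get submitted can be sketched as follows. This is a toy model of the submitStage recursion under stated assumptions: ToyStage, the finished set and the returned tuple are hypothetical, and "submitting tasks" is reduced to appending the stage id to a list.

```scala
import scala.collection.mutable

object SubmitOrderSketch {
  // Hypothetical stage with an id and its direct parent stages.
  final case class ToyStage(id: Int, parents: List[ToyStage] = Nil)

  // Returns (ids whose tasks were submitted, in order; ids left waiting).
  // `finished` holds the ids of stages whose output is already available.
  def submit(finalStage: ToyStage, finished: Set[Int]): (List[Int], Set[Int]) = {
    val submitted = mutable.ListBuffer.empty[Int]
    val waiting = mutable.Set.empty[Int]

    def submitStage(stage: ToyStage): Unit = {
      if (!waiting(stage.id) && !submitted.contains(stage.id)) {
        val missing = stage.parents.filterNot(p => finished(p.id)).sortBy(_.id)
        if (missing.isEmpty) {
          submitted += stage.id // plays the role of submitMissingTasks
        } else {
          missing.foreach(submitStage) // parents first
          waiting += stage.id          // the child waits until its parents finish
        }
      }
    }

    submitStage(finalStage)
    (submitted.toList, waiting.toSet)
  }
}
```

In this model only stages without missing parents are submitted straight away. Every other stage is parked in the waiting set, which corresponds to waitingStages in the listing above.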

Once the stage DAG has been built, tasks must be created for each stage, which is done by calling submitMissingTasks():

  /** Called when stage's parents are available and we can now do its task. */
  private def submitMissingTasks(stage: Stage, jobId: Int) {
    logDebug("submitMissingTasks(" + stage + ")")
    // Get our pending tasks and remember them in our pendingTasks entry
    stage.pendingPartitions.clear()

    // First figure out the indexes of partition ids to compute.
    val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()

    // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
    // with this Stage
    val properties = jobIdToActiveJob(jobId).properties

    runningStages += stage
    // SparkListenerStageSubmitted should be posted before testing whether tasks are
    // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
    // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
    // event.
    stage match {
      case s: ShuffleMapStage =>
        outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
      case s: ResultStage =>
        outputCommitCoordinator.stageStart(
          stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
    }
    val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
      stage match {
        case s: ShuffleMapStage =>
          partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
        case s: ResultStage =>
          partitionsToCompute.map { id =>
            val p = s.partitions(id)
            (id, getPreferredLocs(stage.rdd, p))
          }.toMap
      }
    } catch {
      case NonFatal(e) =>
        stage.makeNewStageAttempt(partitionsToCompute.size)
        listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
        abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
        runningStages -= stage
        return
    }

    stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
    listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))

    // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
    // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
    // the serialized copy of the RDD and for each task we will deserialize it, which means each
    // task gets a different copy of the RDD. This provides stronger isolation between tasks that
    // might modify state of objects referenced in their closures. This is necessary in Hadoop
    // where the JobConf/Configuration object is not thread-safe.
    var taskBinary: Broadcast[Array[Byte]] = null
    try {
      // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
      // For ResultTask, serialize and broadcast (rdd, func).
      val taskBinaryBytes: Array[Byte] = stage match {
        case stage: ShuffleMapStage =>
          JavaUtils.bufferToArray(
            closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
        case stage: ResultStage =>
          JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
      }

      taskBinary = sc.broadcast(taskBinaryBytes)
    } catch {
      // In the case of a failure during serialization, abort the stage.
      case e: NotSerializableException =>
        abortStage(stage, "Task not serializable: " + e.toString, Some(e))
        runningStages -= stage

        // Abort execution
        return
      case NonFatal(e) =>
        abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
        runningStages -= stage
        return
    }

    val tasks: Seq[Task[_]] = try {
      stage match {
        case stage: ShuffleMapStage =>
          partitionsToCompute.map { id =>
            val locs = taskIdToLocations(id)
            val part = stage.rdd.partitions(id)
            new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, stage.latestInfo.taskMetrics, properties, Option(jobId),
              Option(sc.applicationId), sc.applicationAttemptId)
          }

        case stage: ResultStage =>
          partitionsToCompute.map { id =>
            val p: Int = stage.partitions(id)
            val part = stage.rdd.partitions(p)
            val locs = taskIdToLocations(id)
            new ResultTask(stage.id, stage.latestInfo.attemptId,
              taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics,
              Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
          }
      }
    } catch {
      case NonFatal(e) =>
        abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
        runningStages -= stage
        return
    }

    if (tasks.size > 0) {
      logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
      stage.pendingPartitions ++= tasks.map(_.partitionId)
      logDebug("New pending partitions: " + stage.pendingPartitions)
      taskScheduler.submitTasks(new TaskSet(
        tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
      stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
    } else {
      // Because we posted SparkListenerStageSubmitted earlier, we should mark
      // the stage as completed here in case there are no tasks to run
      markStageAsFinished(stage, None)

      val debugString = stage match {
        case stage: ShuffleMapStage =>
          s"Stage ${stage} is actually done; " +
            s"(available: ${stage.isAvailable}," +
            s"available outputs: ${stage.numAvailableOutputs}," +
            s"partitions: ${stage.numPartitions})"
        case stage : ResultStage =>
          s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
      }
      logDebug(debugString)

      submitWaitingChildStages(stage)
    }
  }

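First, tasks are created from the partitions of the stage's RDD, one task per partition that still has to be computed, and each task is placed according to the locality of its partition. Second, finalStage and mapStage produce different kinds of task (ResultTask and ShuffleMapTask respectively). Finally, all tasks are wrapped in a TaskSet and submitted to the TaskScheduler for execution.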
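The "one task per missing partition, wrapped in a TaskSet" step can be sketched like this. It is a simplified model, not Spark's API: ToyTask, ToyTaskSet and buildTaskSet are hypothetical, preferred locations are plain host names, and returning None stands in for the branch where the stage has nothing left to run.

```scala
object TaskSetSketch {
  // Hypothetical stand-ins for Task and TaskSet.
  final case class ToyTask(stageId: Int, partitionId: Int, preferredLocs: Seq[String])
  final case class ToyTaskSet(stageId: Int, tasks: Seq[ToyTask])

  // One task per partition that still has to be computed; finished partitions are skipped.
  def buildTaskSet(
      stageId: Int,
      numPartitions: Int,
      finished: Set[Int],
      locality: Map[Int, Seq[String]]): Option[ToyTaskSet] = {
    val partitionsToCompute = (0 until numPartitions).filterNot(finished)
    val tasks = partitionsToCompute.map { p =>
      // A partition without a known location gets an empty preference list.
      ToyTask(stageId, p, locality.getOrElse(p, Nil))
    }
    if (tasks.nonEmpty) Some(ToyTaskSet(stageId, tasks)) else None
  }
}
```

For a stage with four partitions where partition 1 is already done, the sketch creates tasks for partitions 0, 2 and 3 only.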

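At this point the start-up of the job inside DAGScheduler is complete and the tasks are handed to the TaskScheduler. When a task finishes, its result is returned to DAGScheduler, which calls handleTaskCompletion() to process it: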

private def handleTaskCompletion(event: CompletionEvent) {
  val task = event.task
  val stage = idToStage(task.stageId)

  def markStageAsFinished(stage: Stage) = {
    val serviceTime = stage.submissionTime match {
      case Some(t) => "%.03f".format((System.currentTimeMillis() - t) / 1000.0)
      case _ => "Unknown"
    }
    logInfo("%s (%s) finished in %s s".format(stage, stage.origin, serviceTime))
    running -= stage
  }

  event.reason match {
    case Success =>
        ...
      task match {
        case rt: ResultTask[_, _] =>
          ...
        case smt: ShuffleMapTask =>
          ...
      }
    case Resubmitted =>
      ...
    case FetchFailed(bmAddress, shuffleId, mapId, reduceId) =>
      ...
    case other =>
      abortStage(idToStage(task.stageId), task + " failed: " + other)
  }
}

Every completed task returns its result to DAGScheduler, which decides what to do next based on that result.
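The dispatch on the completion reason can be sketched as a pattern match over a small sealed hierarchy. Everything here is hypothetical and heavily simplified: the reason types only echo the names in the listing above, and the returned strings are rough descriptions of the follow-up action, not Spark's actual behaviour in every case.

```scala
object CompletionSketch {
  // Hypothetical stand-ins for the task end reasons matched in handleTaskCompletion.
  sealed trait ToyReason
  case object ToySuccess extends ToyReason
  case object ToyResubmitted extends ToyReason
  final case class ToyFetchFailed(shuffleId: Int) extends ToyReason
  final case class ToyOtherFailure(message: String) extends ToyReason

  // Decide the next step from the reason and the number of tasks still pending in the stage.
  def nextAction(reason: ToyReason, pendingAfterThisTask: Int): String = reason match {
    case ToySuccess if pendingAfterThisTask == 0 => "stage finished"
    case ToySuccess => "wait for remaining tasks"
    case ToyResubmitted => "mark partition pending again"
    case ToyFetchFailed(id) => s"resubmit map stage for shuffle $id"
    case ToyOtherFailure(msg) => s"abort stage: $msg"
  }
}
```

Because ToyReason is sealed, the compiler can check that every reason is handled, which is one reason this style suits scheduler event handling.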
