The architecture of DAGScheduler is actually quite simple:

1. eventQueue: everything the DAGScheduler needs to handle is sent to the eventQueue as an event.

2. An event-loop thread that keeps taking events off the eventQueue and processing them.

3. The DAGScheduler implements TaskSchedulerListener and registers itself with the TaskScheduler, so the TaskScheduler can call the listener interface at any time to report status changes.
The TaskSchedulerListener implementation simply posts the corresponding events to the eventQueue, as illustrated in the sketch below.
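To make the pattern concrete, here is a minimal sketch of this producer/consumer structure. It is not the real Spark code: the class and event names (MiniScheduler, TaskDone, Stop) are made up for illustration, and only the bare skeleton of the three pieces above is kept.

import java.util.concurrent.LinkedBlockingQueue

// Hypothetical, stripped-down sketch of the DAGScheduler event-loop pattern
sealed trait Event
case class TaskDone(taskId: Long) extends Event
case object Stop extends Event

class MiniScheduler {
  private val eventQueue = new LinkedBlockingQueue[Event]()        // 1. the event queue

  // 3. listener-style callbacks (e.g. called by a TaskScheduler) just post events
  def taskEnded(taskId: Long) { eventQueue.put(TaskDone(taskId)) }
  def stop() { eventQueue.put(Stop) }

  // 2. the event-loop thread: take events off the queue and process them
  def start() {
    new Thread("MiniScheduler-event-loop") {
      setDaemon(true)
      override def run() {
        while (true) {
          eventQueue.take() match {
            case TaskDone(id) => println("task " + id + " finished")
            case Stop         => return
          }
        }
      }
    }.start()
  }
}

The real DAGScheduler below follows exactly this shape, just with a much richer set of events and handlers.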

/**
* The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of
* stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a
* minimal schedule to run the job. It then submits stages as TaskSets to an underlying
* TaskScheduler implementation that runs them on the cluster.
*
* In addition to coming up with a DAG of stages, this class also determines the preferred
* locations to run each task on, based on the current cache status, and passes these to the
* low-level TaskScheduler. Furthermore, it handles failures due to shuffle output files being
* lost, in which case old stages may need to be resubmitted. Failures *within* a stage that are
* not caused by shuffle file loss are handled by the TaskScheduler, which will retry each task
* a small number of times before cancelling the whole stage.
*
* THREADING: This class runs all its logic in a single thread executing the run() method, to which
* events are submitted using a synchronized queue (eventQueue). The public API methods, such as
* runJob, taskEnded and executorLost, post events asynchronously to this queue. All other methods
* should be private.
*/
private[spark]
class DAGScheduler(
    taskSched: TaskScheduler,               // the TaskScheduler this DAGScheduler is bound to
    mapOutputTracker: MapOutputTracker,
    blockManagerMaster: BlockManagerMaster,
    env: SparkEnv)
  extends TaskSchedulerListener with Logging {

  def this(taskSched: TaskScheduler) {
    this(taskSched, SparkEnv.get.mapOutputTracker, SparkEnv.get.blockManager.master, SparkEnv.get)
  }

  // Tasks need to report their execution status back to the DAGScheduler, so the DAGScheduler
  // registers itself as the TaskScheduler's listener
  taskSched.setListener(this)

  // It also implements the TaskSchedulerListener callbacks, so the TaskScheduler can call them
  // whenever the state changes.
  // Called by TaskScheduler to report task's starting.
  override def taskStarted(task: Task[_], taskInfo: TaskInfo) {
    eventQueue.put(BeginEvent(task, taskInfo))
  }
  // ...the other listener callbacks are omitted here

  private val eventQueue = new LinkedBlockingQueue[DAGSchedulerEvent]  // the DAGScheduler's core event queue

  val nextJobId = new AtomicInteger(0)
  val nextStageId = new AtomicInteger(0)
  val stageIdToStage = new TimeStampedHashMap[Int, Stage]
  val shuffleToMapStage = new TimeStampedHashMap[Int, Stage]
  private[spark] val stageToInfos = new TimeStampedHashMap[Stage, StageInfo]

  // The DAGScheduler also exposes a SparkListenerBus, so other modules can listen to it
  private val listenerBus = new SparkListenerBus()

  // Contains the locations that each RDD's partitions are cached on
  private val cacheLocs = new HashMap[Int, Array[Seq[TaskLocation]]]
 
  // Start a thread to run the DAGScheduler event loop
  def start() {
    new Thread("DAGScheduler") {   // create the event-processing thread
      setDaemon(true)
      override def run() {
        DAGScheduler.this.run()
      }
    }.start()
  }
 
  /**
   * The main event loop of the DAG scheduler, which waits for new-job / task-finished / failure
   * events and responds by launching tasks. This runs in a dedicated thread and receives events
   * via the eventQueue.
   */
  private def run() {
    SparkEnv.set(env)

    while (true) {
      val event = eventQueue.poll(POLL_TIMEOUT, TimeUnit.MILLISECONDS)
      if (event != null) {
        logDebug("Got event of type " + event.getClass.getName)
      }
      this.synchronized { // needed in case other threads makes calls into methods of this class
        if (event != null) {
          if (processEvent(event)) {
            return
          }
        }

        val time = System.currentTimeMillis() // TODO: use a pluggable clock for testability
        // Periodically resubmit failed stages if some map output fetches have failed and we have
        // waited at least RESUBMIT_TIMEOUT. We wait for this short time because when a node fails,
        // tasks on many other nodes are bound to get a fetch failure, and they won't all get it at
        // the same time, so we want to make sure we've identified all the reduce tasks that depend
        // on the failed node.
        if (failed.size > 0 && time > lastFetchFailureTime + RESUBMIT_TIMEOUT) {
          resubmitFailedStages()
        } else {
          submitWaitingStages()
        }
      }
    }
  }
 
  /**
   * Process one event retrieved from the event queue.
   * Returns true if we should stop the event loop.
   */
  private[scheduler] def processEvent(event: DAGSchedulerEvent): Boolean = {
    event match {
      case JobSubmitted(finalRDD, func, partitions, allowLocal, callSite, listener, properties) =>
        val jobId = nextJobId.getAndIncrement()   // allocate a new jobId; nextJobId is an AtomicInteger
        // Create the final stage from the final RDD; any earlier stages or RDDs are derived from its dependencies
        val finalStage = newStage(finalRDD, None, jobId, Some(callSite))
        // Create the Job from the final stage
        val job = new ActiveJob(jobId, finalStage, func, partitions, callSite, listener, properties)
        clearCacheLocs()
        if (allowLocal && finalStage.parents.size == 0 && partitions.length == 1) {
          // Compute very short actions like first() or take() with no parent stages locally.
          runLocally(job)   // simple jobs are just run locally
        } else {
          listenerBus.post(SparkListenerJobStart(job, properties))
          idToActiveJob(jobId) = job
          activeJobs += job
          resultStageToJob(finalStage) = job
          submitStage(finalStage)
        }

      // the handling of the other events is omitted here; we only look at JobSubmitted
    }
  }

 

1. dagScheduler.runJob

Continuing from before: calling runJob on the SparkContext ultimately ends up calling dagScheduler.runJob.

dagScheduler.runJob itself is simple: it puts the toSubmit (JobSubmitted) event onto the eventQueue and then waits for the job to finish.

prepareJob, in turn, just creates the JobWaiter and JobSubmitted objects.
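prepareJob itself is not shown in this excerpt; based on how it is used in runJob below, it looks roughly like the following (treat this as a sketch rather than the exact source of any particular Spark version):

  private[scheduler] def prepareJob[T, U: ClassManifest](
      finalRdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: String,
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit,
      properties: Properties = null): (JobSubmitted, JobWaiter[U]) = {
    assert(partitions.size > 0)
    // the JobWaiter is both the JobListener that receives task results and the handle runJob blocks on
    val waiter = new JobWaiter(partitions.size, resultHandler)
    val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
    // the JobSubmitted event carries everything processEvent needs, with the waiter as the job's listener
    val toSubmit = JobSubmitted(finalRdd, func2, partitions.toArray, allowLocal, callSite, waiter, properties)
    (toSubmit, waiter)
  }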

  def runJob[T, U: ClassManifest](
      finalRdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      callSite: String,
      allowLocal: Boolean,
      resultHandler: (Int, U) => Unit,
      properties: Properties = null)
  {
    if (partitions.size == 0) {
      return
    }

    val (toSubmit: JobSubmitted, waiter: JobWaiter[_]) = prepareJob(
      finalRdd, func, partitions, callSite, allowLocal, resultHandler, properties)
    eventQueue.put(toSubmit)
    waiter.awaitResult() match {
      case JobSucceeded => {}
      case JobFailed(exception: Exception, _) =>
        logInfo("Failed to run " + callSite)
        throw exception
    }
  }

 

1.1 JobWaiter

JobWaiter is fairly simple. It implements the JobListener callbacks taskSucceeded and jobFailed; when the DAGScheduler receives a task success or failure event, it calls the corresponding callback.

In taskSucceeded, once all tasks have succeeded, the job is considered finished and jobFinished is set.

awaitResult simply waits until jobFinished has been set.

private[spark] class JobWaiter[T](totalTasks: Int, resultHandler: (Int, T) => Unit)
  extends JobListener {

  override def taskSucceeded(index: Int, result: Any) {
    synchronized {
      if (jobFinished) {
        throw new UnsupportedOperationException("taskSucceeded() called on a finished JobWaiter")
      }
      resultHandler(index, result.asInstanceOf[T])  // handle the task result with the caller's resultHandler
      finishedTasks += 1
      if (finishedTasks == totalTasks) {            // when all tasks have succeeded, the job is done
        jobFinished = true
        jobResult = JobSucceeded
        this.notifyAll()                            // wake up awaitResult
      }
    }
  }

  override def jobFailed(exception: Exception) {……}

  def awaitResult(): JobResult = synchronized {
    while (!jobFinished) {
      this.wait()
    }
    return jobResult
  }
}

 

1.2 JobSubmitted

JobSubmitted is just one kind of DAGSchedulerEvent, a typical use case for pattern matching.

As you can see, there are many other DAGSchedulerEvents besides JobSubmitted.

private[spark] sealed trait DAGSchedulerEvent

private[spark] case class JobSubmitted(
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    allowLocal: Boolean,
    callSite: String,
    listener: JobListener,
    properties: Properties = null)
  extends DAGSchedulerEvent

private[spark] case class BeginEvent(task: Task[_], taskInfo: TaskInfo) extends DAGSchedulerEvent

private[spark] case class CompletionEvent(
    task: Task[_],
    reason: TaskEndReason,
    result: Any,
    accumUpdates: Map[Long, Any],
    taskInfo: TaskInfo,
    taskMetrics: TaskMetrics)
  extends DAGSchedulerEvent

private[spark] case class ExecutorGained(execId: String, host: String) extends DAGSchedulerEvent

private[spark] case class ExecutorLost(execId: String) extends DAGSchedulerEvent

private[spark] case class TaskSetFailed(taskSet: TaskSet, reason: String) extends DAGSchedulerEvent

private[spark] case object StopDAGScheduler extends DAGSchedulerEvent
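The remaining TaskSchedulerListener callbacks mirror taskStarted above: each one just wraps its arguments in one of these events and puts it on the eventQueue. A rough sketch, reconstructed from the event definitions (the exact callback signatures may differ slightly between Spark versions):

  // Called by TaskScheduler to report a finished task; it becomes a CompletionEvent
  override def taskEnded(
      task: Task[_],
      reason: TaskEndReason,
      result: Any,
      accumUpdates: Map[Long, Any],
      taskInfo: TaskInfo,
      taskMetrics: TaskMetrics) {
    eventQueue.put(CompletionEvent(task, reason, result, accumUpdates, taskInfo, taskMetrics))
  }

  // Called by TaskScheduler when a whole executor is lost, so its shuffle output may need recomputing
  override def executorLost(execId: String) {
    eventQueue.put(ExecutorLost(execId))
  }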

 

2 processEvent.JobSubmitted

Handling JobSubmitted comes down to two steps: first create the final stage, then submit the final stage.

For the stage-related operations, see "Spark 源码分析 -- Stage".
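As a concrete, hypothetical example (sc is assumed to be an existing SparkContext and the input path is a placeholder), the job below is cut into two stages at the shuffle introduced by reduceByKey: the map side becomes a ShuffleMap stage, and the part after the shuffle becomes the final result stage that newStage creates for the JobSubmitted event.

// Stage 1 (ShuffleMap stage): textFile -> flatMap -> map, writes shuffle output
// Stage 2 (final / result stage): the reduce side of reduceByKey, then collect
val counts = sc.textFile("hdfs://.../input")     // placeholder path
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                             // ShuffleDependency => stage boundary
  .collect()                                      // the action that triggers runJob / JobSubmitted

submitStage(finalStage) then walks this dependency chain backwards: the ShuffleMap stage is submitted first, while the result stage is parked in the waiting set until the shuffle output is available.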

2.1 submitStage

In submitStage, the DAG of stages is worked out first, and then the tasks of each stage are submitted in dependency order.

  /** Submits stage, but first recursively submits any missing parents. */
  private def submitStage(stage: Stage) {
    logDebug("submitStage(" + stage + ")")
    if (!waiting(stage) && !running(stage) && !failed(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)  // find any parent stages this stage still depends on
      logDebug("missing: " + missing)
      if (missing == Nil) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage)  // no parent stage left to run, so submit the current stage directly
        running += stage
      } else {
        for (parent <- missing) {
          // parents must be submitted first, because stages have to run in dependency order
          submitStage(parent)
        }
        waiting += stage  // park the current stage in the waiting set
      }
    }
  }

 

2.2 submitMissingTasks

For task-related details, see "Spark 源码分析 -- Task".

Regardless of the kind of stage, a task is created for each partition of that stage that still needs to be computed.

The tasks are then wrapped into a TaskSet, and submitting that TaskSet to the TaskScheduler is how the stage is handed off for execution.

/** Called when stage's parents are available and we can now do its task. */
private def submitMissingTasks(stage: Stage) {
  // Get our pending tasks and remember them in our pendingTasks entry
  var tasks = ArrayBuffer[Task[_]]()
  if (stage.isShuffleMap) {
    // ShuffleMap stage: one ShuffleMapTask per partition that has no output location yet
    for (p <- 0 until stage.numPartitions if stage.outputLocs(p) == Nil) {
      val locs = getPreferredLocs(stage.rdd, p)
      tasks += new ShuffleMapTask(stage.id, stage.rdd, stage.shuffleDep.get, p, locs)
    }
  } else {
    // Result stage
    // This is a final stage; figure out its job's missing partitions
    val job = resultStageToJob(stage)
    for (id <- 0 until job.numPartitions if !job.finished(id)) {
      val partition = job.partitions(id)
      val locs = getPreferredLocs(stage.rdd, partition)
      tasks += new ResultTask(stage.id, stage.rdd, job.func, partition, locs, id)
    }
  }

  if (tasks.size > 0) {   // only submit if there is actually work left to do
    taskSched.submitTasks(
      new TaskSet(tasks.toArray, stage.id, stage.newAttemptId(), stage.jobId, properties))
    if (!stage.submissionTime.isDefined) {
      stage.submissionTime = Some(System.currentTimeMillis())
    }
  } else {
    logDebug("Stage " + stage + " is actually done; %b %d %d".format(
      stage.isAvailable, stage.numAvailableOutputs, stage.numPartitions))
    running -= stage
  }
}
