Copyright notice: this is an original article and may not be reproduced without permission.

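A job in a Spark program is triggered by an action operator; each action is in fact an invocation of a runJob method. For details, see the article:

SparkContext Source Code Walkthrough (1): http://www.cnblogs.com/yourarebest/p/5326678.html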

1. Job submission in Spark

Taking a simple runJob as an example, the source is as follows:

def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  // Run the job through the DAGScheduler: post a JobSubmitted event to the DAGScheduler's
  // event queue and use a JobWaiter to wait for the result. See (1).
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  waiter.awaitResult() match {
    case JobSucceeded =>
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case JobFailed(exception: Exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
      val callerStackTrace = Thread.currentThread().getStackTrace.tail
      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
      throw exception
  }
}
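The submit-then-wait shape of runJob can be illustrated outside Spark. The following is a minimal sketch under stated assumptions, not Spark's implementation: ToyWaiter and the toy submitJob/runJob (which take a plain () => Unit instead of an RDD and a function) are hypothetical stand-ins, and a single background thread plays the role of the scheduler.

```scala
import java.util.concurrent.CountDownLatch

object RunJobSketch {
  // Hypothetical stand-ins for the JobSucceeded / JobFailed results matched in runJob.
  sealed trait JobResult
  case object JobSucceeded extends JobResult
  final case class JobFailed(exception: Exception) extends JobResult

  // Toy waiter: the "scheduler" thread calls complete(), the caller blocks in awaitResult().
  final class ToyWaiter {
    private val done = new CountDownLatch(1)
    @volatile private var result: JobResult = JobSucceeded
    def complete(r: JobResult): Unit = { result = r; done.countDown() }
    def awaitResult(): JobResult = { done.await(); result }
  }

  // Hands the work to another thread and returns a waiter immediately.
  def submitJob(work: () => Unit): ToyWaiter = {
    val waiter = new ToyWaiter
    val worker = new Thread(new Runnable {
      def run(): Unit =
        try { work(); waiter.complete(JobSucceeded) }
        catch { case e: Exception => waiter.complete(JobFailed(e)) }
    })
    worker.setDaemon(true)
    worker.start()
    waiter
  }

  // Blocks until the work finishes and rethrows any failure in the caller's thread.
  def runJob(work: () => Unit): Unit =
    submitJob(work).awaitResult() match {
      case JobSucceeded => ()
      case JobFailed(e) => throw e
    }
}
```

As in the real method, the caller sees a synchronous call even though the work happens elsewhere, and a failure surfaces as an exception thrown from runJob itself.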

1. The submitJob(rdd, func, partitions, callSite, resultHandler, properties) method is as follows:

def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }
  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // Post a JobSubmitted event to eventProcessLoop for processing; see (2)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
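The first two steps of submitJob (reject out-of-range partition indices, then allocate a job ID from an atomic counter) can be tried in isolation. This is a small sketch, assuming a hypothetical helper named validateAndAllocate; only the check and the counter are taken from the code above.

```scala
import java.util.concurrent.atomic.AtomicInteger

object SubmitJobSketch {
  // Monotonically increasing job ID source, like nextJobId in the excerpt above.
  private val nextJobId = new AtomicInteger(0)

  // Hypothetical helper: validates the requested partition indices against the
  // number of partitions, then returns a freshly allocated job ID.
  def validateAndAllocate(requested: Seq[Int], maxPartitions: Int): Int = {
    requested.find(p => p >= maxPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        s"Attempting to access a non-existent partition: $p. " +
          s"Total number of partitions: $maxPartitions")
    }
    nextJobId.getAndIncrement()
  }
}
```

Because the check runs before getAndIncrement, a rejected request does not consume a job ID, which matches the ordering of the statements in submitJob.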

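2. Post the JobSubmitted event to eventProcessLoop for processing:

eventProcessLoop.post(JobSubmitted(
  jobId, rdd, func2, partitions.toArray, callSite, waiter,
  SerializationUtils.clone(properties)))

where:
(1) JobSubmitted is one of the event types that DAGScheduler can handle; it is an implementation of the trait DAGSchedulerEvent. The case subclasses of DAGSchedulerEvent are shown in the figure below:

[Figure: event types handled by DAGScheduler]

(2) eventProcessLoop is of type DAGSchedulerEventProcessLoop, a subclass of the abstract class EventLoop. Its source is as follows: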

private[scheduler] class DAGSchedulerEventProcessLoop(dagScheduler: DAGScheduler)
  extends EventLoop[DAGSchedulerEvent]("dag-scheduler-event-loop") with Logging {
  override def onReceive(event: DAGSchedulerEvent): Unit = {
    val timerContext = timer.time()
    try {
      doOnReceive(event)
    } finally {
      timerContext.stop()
    }
  }
  private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
    // JobSubmitted events are handled by dagScheduler.handleJobSubmitted
    case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
    case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
      dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)
    case StageCancelled(stageId) =>
      dagScheduler.handleStageCancellation(stageId)
    case JobCancelled(jobId) =>
      dagScheduler.handleJobCancellation(jobId)
    case JobGroupCancelled(groupId) =>
      dagScheduler.handleJobGroupCancelled(groupId)
    case AllJobsCancelled =>
      dagScheduler.doCancelAllJobs()
    case ExecutorAdded(execId, host) =>
      dagScheduler.handleExecutorAdded(execId, host)
    case ExecutorLost(execId) =>
      dagScheduler.handleExecutorLost(execId, fetchFailed = false)
    case BeginEvent(task, taskInfo) =>
      dagScheduler.handleBeginEvent(task, taskInfo)
    case GettingResultEvent(taskInfo) =>
      dagScheduler.handleGetTaskResult(taskInfo)
    case completion @ CompletionEvent(task, reason, _, _, taskInfo, taskMetrics) =>
      dagScheduler.handleTaskCompletion(completion)
    case TaskSetFailed(taskSet, reason, exception) =>
      dagScheduler.handleTaskSetFailed(taskSet, reason, exception)
    case ResubmitFailedStages =>
      dagScheduler.resubmitFailedStages()
  }
  ...
}
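The pattern here is a queue drained by one thread, with each event routed by a pattern match. The sketch below shows that pattern in plain Scala; it is an illustration under assumptions, not Spark's EventLoop: ToyEvent, ToyEventLoop, the Stop sentinel, and the process helper are all hypothetical names.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable.ArrayBuffer

object EventLoopSketch {
  sealed trait ToyEvent
  final case class JobSubmitted(jobId: Int) extends ToyEvent
  final case class JobCancelled(jobId: Int) extends ToyEvent
  case object Stop extends ToyEvent // hypothetical sentinel that ends the loop

  // One daemon thread takes events off the queue and passes them to the handler.
  final class ToyEventLoop(handler: ToyEvent => Unit) {
    private val queue = new LinkedBlockingQueue[ToyEvent]()
    private val thread = new Thread(new Runnable {
      def run(): Unit = {
        var running = true
        while (running) {
          queue.take() match {
            case Stop  => running = false
            case event => handler(event)
          }
        }
      }
    })
    thread.setDaemon(true)

    def start(): Unit = thread.start()
    def post(event: ToyEvent): Unit = queue.put(event)
    def stopAndJoin(): Unit = { post(Stop); thread.join() }
  }

  // Posts the events, waits for the loop to drain, and returns what the handler saw.
  def process(events: Seq[ToyEvent]): List[String] = {
    val log = ArrayBuffer.empty[String]
    val handler: ToyEvent => Unit = {
      case JobSubmitted(id) => log += s"submitted $id"
      case JobCancelled(id) => log += s"cancelled $id"
      case Stop             => ()
    }
    val loop = new ToyEventLoop(handler)
    loop.start()
    events.foreach(loop.post)
    loop.stopAndJoin()
    log.toList
  }
}
```

Since a single thread consumes the queue, events are handled one at a time in posting order, so the handler needs no locking of its own.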

3. JobSubmitted events are handled by dagScheduler's handleJobSubmitted method, which is where the job's Stages and TaskSets (Tasks) are generated:

private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // (1) Create finalStage for this jobId; covered in detail later
    ...
  }
  // (2) Submit the job
  // Initialize the ActiveJob
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  // Clear the cached RDD location information
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))
  val jobSubmissionTime = clock.getTimeMillis()
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.resultOfJob = Some(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  ...
  // (3) Submit the stages, first recursively submitting any missing parent stages;
  //     a stage whose parents are missing is added to waitingStages
  ...
  ...
  // (4) Submit the TaskSet (tasks)
  ...
}
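The "parents first" ordering in step (3) can be sketched on a toy stage graph. This is a simplified illustration and not the DAGScheduler's algorithm: ToyStage and submissionOrder are hypothetical, every parent is assumed to be missing, and a parent is assumed to finish as soon as it is submitted, so there is no waiting set.

```scala
import scala.collection.mutable

object SubmitStageSketch {
  // Hypothetical stage: an ID plus the stages it depends on.
  final case class ToyStage(id: Int, parents: List[ToyStage] = Nil)

  // Returns stage IDs in the order they would run: every parent before its child.
  def submissionOrder(finalStage: ToyStage): List[Int] = {
    val order = mutable.LinkedHashSet.empty[Int]
    def submit(stage: ToyStage): Unit =
      if (!order.contains(stage.id)) {
        stage.parents.foreach(submit) // missing parents go first
        order += stage.id             // then the stage itself
      }
    submit(finalStage)
    order.toList
  }
}
```

For a diamond-shaped graph (stage 3 depends on stages 1 and 2, which both depend on stage 0), the shared parent is visited once and the final stage comes last.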

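At (2) in the code above, the SparkListenerJobStart event is posted to the listener bus, LiveListenerBus. Its parent trait SparkListenerBus defines the mapping from each event type to the corresponding listener callback, as shown below: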

private[spark] trait SparkListenerBus extends ListenerBus[SparkListener, SparkListenerEvent] {
  override def onPostEvent(listener: SparkListener, event: SparkListenerEvent): Unit = {
    event match {
      case stageSubmitted: SparkListenerStageSubmitted =>
        listener.onStageSubmitted(stageSubmitted)
      case stageCompleted: SparkListenerStageCompleted =>
        listener.onStageCompleted(stageCompleted)
      // Job start
      case jobStart: SparkListenerJobStart =>
        listener.onJobStart(jobStart)
      case jobEnd: SparkListenerJobEnd =>
        listener.onJobEnd(jobEnd)
      case taskStart: SparkListenerTaskStart =>
        listener.onTaskStart(taskStart)
      case taskGettingResult: SparkListenerTaskGettingResult =>
        listener.onTaskGettingResult(taskGettingResult)
      case taskEnd: SparkListenerTaskEnd =>
        listener.onTaskEnd(taskEnd)
      case environmentUpdate: SparkListenerEnvironmentUpdate =>
        listener.onEnvironmentUpdate(environmentUpdate)
      case blockManagerAdded: SparkListenerBlockManagerAdded =>
        listener.onBlockManagerAdded(blockManagerAdded)
      case blockManagerRemoved: SparkListenerBlockManagerRemoved =>
        listener.onBlockManagerRemoved(blockManagerRemoved)
      case unpersistRDD: SparkListenerUnpersistRDD =>
        listener.onUnpersistRDD(unpersistRDD)
      case applicationStart: SparkListenerApplicationStart =>
        listener.onApplicationStart(applicationStart)
      case applicationEnd: SparkListenerApplicationEnd =>
        listener.onApplicationEnd(applicationEnd)
      case metricsUpdate: SparkListenerExecutorMetricsUpdate =>
        listener.onExecutorMetricsUpdate(metricsUpdate)
      case executorAdded: SparkListenerExecutorAdded =>
        listener.onExecutorAdded(executorAdded)
      case executorRemoved: SparkListenerExecutorRemoved =>
        listener.onExecutorRemoved(executorRemoved)
      case blockUpdated: SparkListenerBlockUpdated =>
        listener.onBlockUpdated(blockUpdated)
      case logStart: SparkListenerLogStart => // ignore event log metadata
    }
  }
}
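The event-to-callback mapping can be reduced to a few lines. The following is a hedged sketch with hypothetical names (ToyListener, ToyListenerBus, RunningJobs) and only two event types; it delivers events synchronously on the posting thread, whereas the real LiveListenerBus, as far as I know, queues events and delivers them on a separate thread.

```scala
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

object ListenerBusSketch {
  sealed trait ToyEvent
  final case class JobStart(jobId: Int) extends ToyEvent
  final case class JobEnd(jobId: Int) extends ToyEvent

  // Callbacks default to no-ops, so a listener overrides only what it needs.
  trait ToyListener {
    def onJobStart(e: JobStart): Unit = ()
    def onJobEnd(e: JobEnd): Unit = ()
  }

  final class ToyListenerBus {
    private val listeners = ArrayBuffer.empty[ToyListener]
    def addListener(l: ToyListener): Unit = listeners += l
    // Deliver the event to every registered listener.
    def post(event: ToyEvent): Unit = listeners.foreach(onPostEvent(_, event))
    // The mapping from event type to listener callback.
    private def onPostEvent(l: ToyListener, event: ToyEvent): Unit = event match {
      case e: JobStart => l.onJobStart(e)
      case e: JobEnd   => l.onJobEnd(e)
    }
  }

  // Example listener that tracks which jobs are currently running.
  final class RunningJobs extends ToyListener {
    val active = mutable.Set.empty[Int]
    override def onJobStart(e: JobStart): Unit = active += e.jobId
    override def onJobEnd(e: JobEnd): Unit = active -= e.jobId
  }
}
```

Keeping the dispatch in one match on the bus means a new listener needs no knowledge of event routing; it only overrides the callbacks it cares about.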

4. The SparkListenerJobStart event is finally handled by the onJobStart method of the JobProgressListener listener, as follows:

override def onJobStart(jobStart: SparkListenerJobStart): Unit = synchronized {
  val jobGroup = for (
    props <- Option(jobStart.properties);
    // Read the value of the "spark.jobGroup.id" property
    group <- Option(props.getProperty(SparkContext.SPARK_JOB_GROUP_ID))
  ) yield group
  val jobData: JobUIData =
    new JobUIData(
      jobId = jobStart.jobId,
      submissionTime = Option(jobStart.time).filter(_ >= 0),
      stageIds = jobStart.stageIds,
      jobGroup = jobGroup,
      status = JobExecutionStatus.RUNNING)
  // A null jobGroupId is used for jobs that are run without a job group
  jobGroupToJobIds.getOrElseUpdate(jobGroup.orNull, new HashSet[JobId]).add(jobStart.jobId)
  jobStart.stageInfos.foreach(x => pendingStages(x.stageId) = x)
  // Compute the number of tasks this job will run. This may be an underestimate, because the
  // job start event references all of the result stage's dependencies.
  jobData.numTasks = {
    val allStages = jobStart.stageInfos
    // Filter out stages that have already completed or been cancelled
    val missingStages = allStages.filter(_.completionTime.isEmpty)
    missingStages.map(_.numTasks).sum
  }
  // Record the jobData under its job ID
  jobIdToData(jobStart.jobId) = jobData
  // Active jobs, i.e. jobs that are about to run
  activeJobs(jobStart.jobId) = jobData
  // Iterate over stageIds, updating stageIdToActiveJobIds (stage ID -> active job IDs)
  for (stageId <- jobStart.stageIds) {
    stageIdToActiveJobIds.getOrElseUpdate(stageId, new HashSet[StageId]).add(jobStart.jobId)
  }
  // Iterate over stageInfos
  for (stageInfo <- jobStart.stageInfos) {
    stageIdToInfo.getOrElseUpdate(stageInfo.stageId, stageInfo)
    stageIdToData.getOrElseUpdate((stageInfo.stageId, stageInfo.attemptId), new StageUIData)
  }
}
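Two small pieces of onJobStart are easy to check by hand: the null-safe lookup of the job group and the task count over stages that have not completed. The sketch below isolates them; ToyStageInfo, jobGroup, and pendingTaskCount are hypothetical names, and only the property key "spark.jobGroup.id" and the filter-then-sum logic come from the code above.

```scala
import java.util.Properties

object NumTasksSketch {
  // Hypothetical stand-in for the stage info carried by the job start event.
  final case class ToyStageInfo(
      stageId: Int,
      numTasks: Int,
      completionTime: Option[Long] = None)

  // Null-safe lookup: Option(...) turns a null Properties or a missing key into None.
  def jobGroup(properties: Properties): Option[String] =
    for {
      props <- Option(properties)
      group <- Option(props.getProperty("spark.jobGroup.id"))
    } yield group

  // Sum the task counts of the stages that have no completion time yet.
  def pendingTaskCount(stages: Seq[ToyStageInfo]): Int =
    stages.filter(_.completionTime.isEmpty).map(_.numTasks).sum
}
```

With stages of 4, 8, and 2 tasks where only the first has a completion time, pendingTaskCount counts the remaining two stages and returns 10.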

At this point the job has been started, and its information can be seen in the Web UI.
