DAGScheduler最终创建了task set,并提交给了taskScheduler。那先得看看task是怎么定义和执行的。
Task是execution执行的一个单元。

Task: executor执行的基本单元,也是spark操作的最小单位。和java executor的task基本上是相同含义的。
/**
* A unit of execution. We have two kinds of Task's in Spark:
* - [[org.apache.spark.scheduler.ShuffleMapTask]]
* - [[org.apache.spark.scheduler.ResultTask]]
*
* A Spark job consists of one or more stages. The very last stage in a job consists of multiple
* ResultTasks, while earlier stages consist of ShuffleMapTasks. A ResultTask executes the task
* and sends the task output back to the driver application. A ShuffleMapTask executes the task
* and divides the task output to multiple buckets (based on the task's partitioner).
*
* @param stageId id of the stage this task belongs to
* @param partitionId index of the number in the RDD
*/
private[spark] abstract class Task[T](val stageId: Int, var partitionId: Int) extends Serializable {
主要属性:
final def run(attemptId: Long): T = {
context = new TaskContext(stageId, partitionId, attemptId, runningLocally = false)
context.taskMetrics.hostname = Utils.localHostName()
taskThread = Thread.currentThread()
if (_killed) {
kill(interruptThread = false)
}
runTask(context)
def runTask(context: TaskContext): T
// Map output tracker epoch. Will be set by TaskScheduler.
var epoch: Long = -1

var metrics: Option[TaskMetrics] = None

// Task context, to be initialized in run().
@transient protected var context: TaskContext = _
// The actual Thread on which the task is running, if any. Initialized in run().
@volatile @transient private var taskThread: Thread = _

/**
* Handles transmission of tasks and their dependencies, because this can be slightly tricky. We
* need to send the list of JARs and files added to the SparkContext with each task to ensure that
* worker nodes find out about it, but we can't make it part of the Task because the user's code in
* the task might depend on one of the JARs. Thus we serialize each task as multiple objects, by
* first writing out its dependencies.
*/
private[spark] object Task {
/**
* Serialize a task and the current app dependencies (files and JARs added to the SparkContext)
*/
def serializeWithDependencies(
/**
* Deserialize the list of dependencies in a task serialized with serializeWithDependencies,
* and return the task itself as a serialized ByteBuffer. The caller can then update its
* ClassLoaders and deserialize the task.
*
* @return (taskFiles, taskJars, taskBytes)
*/
def deserializeWithDependencies(serializedTask: ByteBuffer)
ShuffleMapTask: 它是对应于transformation操作的task,主要更能是解决提供action操作所需要的数据。依旧是它是被action依赖的task,需要提前执行。
/**
* A ShuffleMapTask divides the elements of an RDD into multiple buckets (based on a partitioner
* specified in the ShuffleDependency).
*
* See [[org.apache.spark.scheduler.Task]] for more information.
*
* @param stageId id of the stage this task belongs to
* @param taskBinary broadcast version of of the RDD and the ShuffleDependency. Once deserialized,
* the type should be (RDD[_], ShuffleDependency[_, _, _]).
* @param partition partition of the RDD this task is associated with
* @param locs preferred task execution locations for locality scheduling
*/
private[spark] class ShuffleMapTask(
stageId: Int,
taskBinary: Broadcast[Array[Byte]],
partition: Partition,
@transient private var locs: Seq[TaskLocation])
extends Task[MapStatus](stageId, partition.index) with Logging {
override def runTask(context: TaskContext): MapStatus = {
// Deserialize the RDD using the broadcast variable.
val ser = SparkEnv.get.closureSerializer.newInstance()
val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)

metrics = Some(context.taskMetrics)
var writer: ShuffleWriter[Any, Any] = null
try {
val manager = SparkEnv.get.shuffleManager
writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
return writer.stop(success = true).get --HashShuffleWriter
} catch {
case e: Exception =>
if (writer != null) {
writer.stop(success = false)
}
throw e
} finally {
context.markTaskCom pleted()
}
}
ResultTask: 它是与action操作对应的,也就是依赖树的叶子节点上。
/**
* A task that sends back the output to the driver application.
*
* See [[Task]] for more information.
*
* @param stageId id of the stage this task belongs to
* @param taskBinary broadcasted version of the serialized RDD and the function to apply on each
* partition of the given RDD. Once deserialized, the type should be
* (RDD[T], (TaskContext, Iterator[T]) => U).
* @param partition partition of the RDD this task is associated with
* @param locs preferred task execution locations for locality scheduling
* @param outputId index of the task in this job (a job can launch tasks on only a subset of the
* input RDD's partitions).
*/
private[spark] class ResultTask[T, U](
stageId: Int,
taskBinary: Broadcast[Array[Byte]],
partition: Partition,
@transient locs: Seq[TaskLocation],
val outputId: Int)
extends Task[U](stageId, partition.index) with Serializable {
override def runTask(context: TaskContext): U = {
// Deserialize the RDD and the func using the broadcast variables.
val ser = SparkEnv.get.closureSerializer.newInstance()
val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)

metrics = Some(context.taskMetrics)
try {
func(context, rdd.iterator(partition, context))
} finally {
context.markTaskCompleted()
}
}







spark 笔记 9: Task/TaskContext的更多相关文章

  1. spark笔记 环境配置

    spark笔记 spark简介 saprk 有六个核心组件: SparkCore.SparkSQL.SparkStreaming.StructedStreaming.MLlib,Graphx Spar ...

  2. Spark分区数、task数目、core数目、worker节点数目、executor数目梳理

    Spark分区数.task数目.core数目.worker节点数目.executor数目梳理 spark隐式创建由操作组成的逻辑上的有向无环图.驱动器执行时,它会把这个逻辑图转换为物理执行计划,然后将 ...

  3. spark 笔记 15: ShuffleManager,shuffle map两端的stage/task的桥梁

    无论是Hadoop还是spark,shuffle操作都是决定其性能的重要因素.在不能减少shuffle的情况下,使用一个好的shuffle管理器也是优化性能的重要手段. ShuffleManager的 ...

  4. spark 笔记 12: Executor,task最后的归宿

    spark的Executor是执行task的容器.和java的executor概念类似. ===================start executor runs task============ ...

  5. spark 笔记 7: DAGScheduler

    在前面的sparkContex和RDD都可以看到,真正的计算工作都是同过调用DAGScheduler的runjob方法来实现的.这是一个很重要的类.在看这个类实现之前,需要对actor模式有一点了解: ...

  6. spark 笔记 5: SparkContext,SparkConf

    SparkContext 是spark的程序入口,相当于熟悉的'main'函数.它负责链接spark集群.创建RDD.创建累加计数器.创建广播变量. ) scheduler.initialize(ba ...

  7. Spark笔记——技术点汇总

    目录 概况 手工搭建集群 引言 安装Scala 配置文件 启动与测试 应用部署 部署架构 应用程序部署 核心原理 RDD概念 RDD核心组成 RDD依赖关系 DAG图 RDD故障恢复机制 Standa ...

  8. Spark技术内幕: Task向Executor提交的源码解析

    在上文<Spark技术内幕:Stage划分及提交源码分析>中,我们分析了Stage的生成和提交.但是Stage的提交,只是DAGScheduler完成了对DAG的划分,生成了一个计算拓扑, ...

  9. 大数据学习——spark笔记

    变量的定义 val a: Int = 1 var b = 2 方法和函数 区别:函数可以作为参数传递给方法 方法: def test(arg: Int): Int=>Int ={ 方法体 } v ...

随机推荐

  1. groovy程序设计

    /********* * groovy中Object类型存在隐式转换 可以不必使用as强转 */ Object munber = 9.343444 def number1 = 2 println mu ...

  2. 第一篇 HTML 认识HTML

    认识HTML 学习一门语言,我们要先了解它,可以不用太资深,但要做到别人问,你能回答得出来! 注:推荐大家去网址:www.w3school.com.cn 前端学习手册(免费的) HTML(超文本标记语 ...

  3. dubbo学习笔记一(服务注册)

    相关的资料 官方文档 官方博客 项目结构 项目说明 [lesson1-config-api] 是一个接口工程,编译后是jar包,被其他工程依赖 [lesson1-config-2-properties ...

  4. 多线程编程-- part5.1 互斥锁之非公平锁-获取与释放

    非公平锁之获取锁 非公平锁和公平锁在获取锁的方法上,流程是一样的:它们的区别主要表现在“尝试获取锁的机制不同”.简单点说,“公平锁”在每次尝试获取锁时,都是采用公平策略(根据等待队列依次排序等待):而 ...

  5. 批处理 使用默认浏览器 打开html文件

    @echo offfor /f "tokens=3,4" %%a in ('"reg query HKEY_CLASSES_ROOT\http\shell\open\co ...

  6. centos 7 私有云盘 OwnCloud 安装搭建脚本

    #!/bin/bash #Build LAMP Server Conf mysql_secure_installation service mariadb restart systemctl enab ...

  7. 2019-2020-1 20199319《Linux内核原理与分析》第四周作业

    MenuOS的构造 基础知识 1.操作系统的两把宝剑:①中断上下文的切换:保存现场和恢复现场:②进程上下文的切换. 2.Linux内核以A.B.C.D方式命名:A和B变得无关紧要,C是内核的真实版本, ...

  8. Firefox 的User Agent 将移除 CPU 架构信息

    Mozilla 计划从 Firefox 的 User Agent(用户代理)和几个支持的 API 中移除 CPU 架构信息,以减少 Firefox 用户的“数字指纹”.Web 浏览器会自动向用户在应用 ...

  9. centos8 网卡命令(centos7也可用)

    nmcli n 查看nmcli状态 nmcli n on 启动nmcli nmcli c  up eth0 启动网卡eth0 nmcli c down eth0 关闭网卡eth0 nmcli d c ...

  10. Flow-based model

    文章1:  NICE: NON-LINEAR INDEPENDENT COMPONENTS ESTIMATION 文章2:Real-valued Non-Volume Preserving (Real ...