spark 笔记 9: Task/TaskContext

DAGScheduler最终创建了task set，并提交给了taskScheduler。那先得看看task是怎么定义和执行的。

Task是execution执行的一个单元。

Task： executor执行的基本单元，也是spark操作的最小单位。和java executor的task基本上是相同含义的。

/**
 * A unit of execution. We have two kinds of Task's in Spark:
 * - [[org.apache.spark.scheduler.ShuffleMapTask]]
 * - [[org.apache.spark.scheduler.ResultTask]]
 *
 * A Spark job consists of one or more stages. The very last stage in a job consists of multiple
 * ResultTasks, while earlier stages consist of ShuffleMapTasks. A ResultTask executes the task
 * and sends the task output back to the driver application. A ShuffleMapTask executes the task
 * and divides the task output to multiple buckets (based on the task's partitioner).
 *
 * @param stageId id of the stage this task belongs to
 * @param partitionId index of the number in the RDD
 */
private[spark] abstract class Task[T](val stageId: Int, var partitionId: Int) extends Serializable {

主要属性：

final def run(attemptId: Long): T = {
  context = new TaskContext(stageId, partitionId, attemptId, runningLocally = false)
  context.taskMetrics.hostname = Utils.localHostName()
  taskThread = Thread.currentThread()
  if (_killed) {
    kill(interruptThread = false)
  }
  runTask(context)

def runTask(context: TaskContext): T

// Map output tracker epoch. Will be set by TaskScheduler.
var epoch: Long = -1

var metrics: Option[TaskMetrics] = None

// Task context, to be initialized in run().
@transient protected var context: TaskContext = _

// The actual Thread on which the task is running, if any. Initialized in run().
@volatile @transient private var taskThread: Thread = _

/**
 * Handles transmission of tasks and their dependencies, because this can be slightly tricky. We
 * need to send the list of JARs and files added to the SparkContext with each task to ensure that
 * worker nodes find out about it, but we can't make it part of the Task because the user's code in
 * the task might depend on one of the JARs. Thus we serialize each task as multiple objects, by
 * first writing out its dependencies.
 */
private[spark] object Task {
  /**
   * Serialize a task and the current app dependencies (files and JARs added to the SparkContext)
   */
  def serializeWithDependencies(

/**
 * Deserialize the list of dependencies in a task serialized with serializeWithDependencies,
 * and return the task itself as a serialized ByteBuffer. The caller can then update its
 * ClassLoaders and deserialize the task.
 *
 * @return (taskFiles, taskJars, taskBytes)
 */
def deserializeWithDependencies(serializedTask: ByteBuffer)

ShuffleMapTask：它是对应于transformation操作的task，主要更能是解决提供action操作所需要的数据。依旧是它是被action依赖的task，需要提前执行。

/**
* A ShuffleMapTask divides the elements of an RDD into multiple buckets (based on a partitioner
* specified in the ShuffleDependency).
*
* See [[org.apache.spark.scheduler.Task]] for more information.
*
 * @param stageId id of the stage this task belongs to
 * @param taskBinary broadcast version of of the RDD and the ShuffleDependency. Once deserialized,
 *                   the type should be (RDD[_], ShuffleDependency[_, _, _]).
 * @param partition partition of the RDD this task is associated with
 * @param locs preferred task execution locations for locality scheduling
 */
private[spark] class ShuffleMapTask(
    stageId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient private var locs: Seq[TaskLocation])
  extends Task[MapStatus](stageId, partition.index) with Logging {

override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)

  metrics = Some(context.taskMetrics)
  var writer: ShuffleWriter[Any, Any] = null
  try {
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    return writer.stop(success = true).get    --HashShuffleWriter
  } catch {
    case e: Exception =>
      if (writer != null) {
        writer.stop(success = false)
      }
      throw e
  } finally {
    context.markTaskCom pleted()
  }
}

ResultTask: 它是与action操作对应的，也就是依赖树的叶子节点上。

/**
 * A task that sends back the output to the driver application.
 *
 * See [[Task]] for more information.
 *
 * @param stageId id of the stage this task belongs to
 * @param taskBinary broadcasted version of the serialized RDD and the function to apply on each
 *                   partition of the given RDD. Once deserialized, the type should be
 *                   (RDD[T], (TaskContext, Iterator[T]) => U).
 * @param partition partition of the RDD this task is associated with
 * @param locs preferred task execution locations for locality scheduling
 * @param outputId index of the task in this job (a job can launch tasks on only a subset of the
 *                 input RDD's partitions).
 */
private[spark] class ResultTask[T, U](
    stageId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient locs: Seq[TaskLocation],
    val outputId: Int)
  extends Task[U](stageId, partition.index) with Serializable {

override def runTask(context: TaskContext): U = {
  // Deserialize the RDD and the func using the broadcast variables.
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)

  metrics = Some(context.taskMetrics)
  try {
    func(context, rdd.iterator(partition, context))
  } finally {
    context.markTaskCompleted()
  }
}

来自为知笔记(Wiz)

spark 笔记 9: Task/TaskContext的更多相关文章

spark笔记环境配置
spark笔记 spark简介 saprk 有六个核心组件: SparkCore.SparkSQL.SparkStreaming.StructedStreaming.MLlib,Graphx Spar ...
Spark分区数、task数目、core数目、worker节点数目、executor数目梳理
Spark分区数.task数目.core数目.worker节点数目.executor数目梳理 spark隐式创建由操作组成的逻辑上的有向无环图.驱动器执行时,它会把这个逻辑图转换为物理执行计划,然后将 ...
spark 笔记 15: ShuffleManager，shuffle map两端的stage/task的桥梁
无论是Hadoop还是spark,shuffle操作都是决定其性能的重要因素.在不能减少shuffle的情况下,使用一个好的shuffle管理器也是优化性能的重要手段. ShuffleManager的 ...
spark 笔记 12: Executor，task最后的归宿
spark的Executor是执行task的容器.和java的executor概念类似. ===================start executor runs task============ ...
spark 笔记 7: DAGScheduler
在前面的sparkContex和RDD都可以看到,真正的计算工作都是同过调用DAGScheduler的runjob方法来实现的.这是一个很重要的类.在看这个类实现之前,需要对actor模式有一点了解: ...
spark 笔记 5: SparkContext，SparkConf
SparkContext 是spark的程序入口,相当于熟悉的'main'函数.它负责链接spark集群.创建RDD.创建累加计数器.创建广播变量. ) scheduler.initialize(ba ...
Spark笔记——技术点汇总
目录概况手工搭建集群引言安装Scala 配置文件启动与测试应用部署部署架构应用程序部署核心原理 RDD概念 RDD核心组成 RDD依赖关系 DAG图 RDD故障恢复机制 Standa ...
Spark技术内幕: Task向Executor提交的源码解析
在上文<Spark技术内幕:Stage划分及提交源码分析>中,我们分析了Stage的生成和提交.但是Stage的提交,只是DAGScheduler完成了对DAG的划分,生成了一个计算拓扑, ...
大数据学习——spark笔记
变量的定义 val a: Int = 1 var b = 2 方法和函数区别:函数可以作为参数传递给方法方法: def test(arg: Int): Int=>Int ={ 方法体 } v ...

随机推荐

git 常用命令语句（个人笔记）
切换账户 git config user.name xxxxx 查看用户名 ex: git config user.name tongjiaojiao git config user.e ...
LLVM 安装教程(包安装)
LLVM 安装教程环境:ubuntu16.04 llvm-4.0 clang-4.0 步骤: 1.依赖库安装 $ sudo apt-get install build-essential curl ...
Linux--Shell 编程-bash,命令替换，if分支嵌套,运算，输入输出
SHELL 编程 shell 是一个命令解释器,侦听用户指令.启动这些指令.将结果返回给用户(交互式的shell) shell 也是一种简单的程序设计语言.利用它可以编写一些系统脚本. ...
Nginx，LVS，HAProxy详解
Nginx/LVS/HAProxy负载均衡软件的优缺点详解 PS:Nginx/LVS/HAProxy是目前使用最广泛的三种负载均衡软件,本人都在多个项目中实施过,参考了一些资料,结合自己的一些使用经验 ...
linux ftp 添加用户及权限管理
Linux下创建用户是很easy的事情了,只不过不经常去做这些操作,时间久了就容易忘记,顺便配置一下FTP.声明:使用Linux版本release 5.6,并以超级管理员root身份运行. 1.创建用 ...
安装haroopad
安装haroopad 1)官网下载安装包 http://pad.haroopress.com/user.html 2)执行安装命令: sudo dpkg -i haroopad-v0.13.1-x64 ...
Jmeter Beanshell 编程简介
简介 Jmeter除了提供丰富的组件以外,还提供脚本支持,可通过编写脚本来丰富Jmeter,实现普通组件无法完成的功能.Beanshell是一种轻量级的Java脚本语言,完全符合Java规范,并且内置 ...
TP5 中的redis 队列
首先我们看一下自己的TP5的框架中的 TP5\vendor\topthink ,这个文件中有没有think-queue这个文件夹,如果没有请安装, 安装这个是要用到Composer的如果没有安装co ...
958. Check Completeness of a Binary Tree
题目来源题目来源 C++代码实现 /** * Definition for a binary tree node. * struct TreeNode { * int val; * TreeNode ...
.NET界面控件DevExpress全新发布v19.1.5|改进Office 2019主题
DevExpress Universal Subscription(又名DevExpress宇宙版或DXperience Universal Suite)是全球使用广泛的.NET用户界面控件套包,De ...

spark 笔记 9: Task/TaskContext

spark 笔记 9: Task/TaskContext的更多相关文章

随机推荐

热门专题