spark 笔记 8: Stage

Stage 是一组独立的任务，他们在一个job中执行相同的功能（function），功能的划分是以shuffle为边界的。DAG调度器以拓扑顺序执行同一个Stage中的task。

/**
 * A stage is a set of independent tasks all computing the same function that need to run as part
 * of a Spark job, where all the tasks have the same shuffle dependencies. Each DAG of tasks run
 * by the scheduler is split up into stages at the boundaries where shuffle occurs, and then the
 * DAGScheduler runs these stages in topological order.
 *
 * Each Stage can either be a shuffle map stage, in which case its tasks' results are input for
 * another stage, or a result stage, in which case its tasks directly compute the action that
 * initiated a job (e.g. count(), save(), etc). For shuffle map stages, we also track the nodes
 * that each output partition is on.
 *
 * Each Stage also has a jobId, identifying the job that first submitted the stage.  When FIFO
 * scheduling is used, this allows Stages from earlier jobs to be computed first or recovered
 * faster on failure.
 *
 * The callSite provides a location in user code which relates to the stage. For a shuffle map
 * stage, the callSite gives the user code that created the RDD being shuffled. For a result
 * stage, the callSite gives the user code that executes the associated action (e.g. count()).
 *
 * A single stage can consist of multiple attempts. In that case, the latestInfo field will
 * be updated for each attempt.
 *
 */
private[spark] class Stage(
    val id: Int,
    val rdd: RDD[_],
    val numTasks: Int,
    val shuffleDep: Option[ShuffleDependency[_, _, _]],  // Output shuffle if stage is a map stage
    val parents: List[Stage],
    val jobId: Int,
    val callSite: CallSite)
  extends Logging {

重要属性：

val isShuffleMap = shuffleDep.isDefined
val numPartitions = rdd.partitions.size
val outputLocs = Array.fill[List[MapStatus]](numPartitions)(Nil)
var numAvailableOutputs = 0

/** Set of jobs that this stage belongs to. */
val jobIds = new HashSet[Int]

/** For stages that are the final (consists of only ResultTasks), link to the ActiveJob. */
var resultOfJob: Option[ActiveJob] = None
var pendingTasks = new HashSet[Task[_]]

def addOutputLoc(partition: Int, status: MapStatus) {

/**
 * Result returned by a ShuffleMapTask to a scheduler. Includes the block manager address that the
 * task ran on as well as the sizes of outputs for each reducer, for passing on to the reduce tasks.
 * The map output sizes are compressed using MapOutputTracker.compressSize.
 */
private[spark] class MapStatus(var location: BlockManagerId, var compressedSizes: Array[Byte])

来自为知笔记(Wiz)

spark 笔记 8: Stage的更多相关文章

spark笔记环境配置
spark笔记 spark简介 saprk 有六个核心组件: SparkCore.SparkSQL.SparkStreaming.StructedStreaming.MLlib,Graphx Spar ...
Spark 资源调度包 stage 类解析
spark 资源调度包 Stage(阶段) 类解析 Stage 概念 Spark 任务会根据 RDD 之间的依赖关系, 形成一个DAG有向无环图, DAG会被提交给DAGScheduler, DAGS ...
spark 笔记 15: ShuffleManager，shuffle map两端的stage/task的桥梁
无论是Hadoop还是spark,shuffle操作都是决定其性能的重要因素.在不能减少shuffle的情况下,使用一个好的shuffle管理器也是优化性能的重要手段. ShuffleManager的 ...
spark 笔记 13: 再看DAGScheduler，stage状态更新流程
当某个task完成后,某个shuffle Stage X可能已完成,那么就可能会一些仅依赖Stage X的Stage现在可以执行了,所以要有响应task完成的状态更新流程. ============= ...
大数据学习——spark笔记
变量的定义 val a: Int = 1 var b = 2 方法和函数区别:函数可以作为参数传递给方法方法: def test(arg: Int): Int=>Int ={ 方法体 } v ...
spark 笔记 16： BlockManager
先看一下原理性的文章:http://jerryshao.me/architecture/2013/10/08/spark-storage-module-analysis/ ,http://jerrys ...
spark 笔记 9: Task/TaskContext
DAGScheduler最终创建了task set,并提交给了taskScheduler.那先得看看task是怎么定义和执行的. Task是execution执行的一个单元. Task: execut ...
spark 笔记 7: DAGScheduler
在前面的sparkContex和RDD都可以看到,真正的计算工作都是同过调用DAGScheduler的runjob方法来实现的.这是一个很重要的类.在看这个类实现之前,需要对actor模式有一点了解: ...
spark 笔记 2： Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf ucb关于spark的论文,对spark中核心组件RDD最原始.本质的理解, ...

随机推荐

O003、准备 KVM 实验环境
参考https://www.cnblogs.com/CloudMan6/p/5240770.html KVM 是 OpenStack 使用的最广泛的Hypervisor,本节介绍如何搭建 KVM ...
Qt表格导出图片
概述:qt中把某个控件导出保存为图片导出并不复杂,网上也有一堆方法.但是对于tableview中数据很多的情况下势必会出现滚动条,用传统的截屏抓图势会有滚动条,图片数据展示不全.在这我使用了一种折中方 ...
分布式的几件小事（六）dubbo如何做服务治理、服务降级以及重试
1.服务治理服务治理主要作用是改变运行时服务的行为和选址逻辑,达到限流,权重配置等目的. ①调用链路自动生成一个大型的分布式系统,会由大量的服务组成,那么这些服务之间的依赖关系和调用链路会很复杂, ...
golang利用beego框架orm操作mysql
GO引入orm框架操作mysql 在beego框架中引入orm操作mysql需要进行的步骤: 第一步:导入orm框架依赖,导入mysql数据库的驱动依赖 import ( "github.c ...
11 Python之初识函数
---恢复内容开始--- 1. 什么是函数? f(x) = x + 1 y = x + 1 函数是对功能或者动作的封装 2. 函数的语法和定义 def 函数名(): 函数体调用: 函数名() 3. ...
Android官方网站！
Android官方网站,所有Android相关文档.官方工具.示例,全部都在上面!! http://www.android.com/
多线程编程-- part 9 信号量：Semaphore
Semaphore简介 Semaphore是一个计数信号量,它的本质是一个"共享锁". 信号量维护了一个信号量许可集.线程可以通过调用acquire()来获取信号量的许可:当信号量 ...
工具安装——linux下安装JDK1.8
1.查看Linux环境自带JDK 使用命令:# rpm -qa|grep gcj 显示内容其中包含相应信息# java-x.x.x-gcj-compat-x.x.x.x-xxjpp# java-x.x ...
求二叉搜索树的第k小的节点
题目描述: /** * 给定一棵二叉搜索树,请找出其中的第k小的结点. * 例如, (5,3,7,2,4,6,8)中, * 按结点数值大小顺序第三小结点的值为4. * 这是层序遍历: * 5 * 3 ...
【2017 北京集训 String 改编版】子串
题意你有一个字符串,你需要支持两种操作: 1:在字符串的末尾插入一个字符 \(c\) 2:询问当前字符串的 \([l,r]\) 子串中的不同子串个数为了加大难度,操作会被加密(强制在线). \(n ...

spark 笔记 8: Stage

spark 笔记 8: Stage的更多相关文章

随机推荐

热门专题