Spark Notes 8: Stage

/**
* A stage is a set of independent tasks all computing the same function that need to run as part
* of a Spark job, where all the tasks have the same shuffle dependencies. Each DAG of tasks run
* by the scheduler is split up into stages at the boundaries where shuffle occurs, and then the
* DAGScheduler runs these stages in topological order.
*
* Each Stage can either be a shuffle map stage, in which case its tasks' results are input for
* another stage, or a result stage, in which case its tasks directly compute the action that
* initiated a job (e.g. count(), save(), etc). For shuffle map stages, we also track the nodes
* that each output partition is on.
*
* Each Stage also has a jobId, identifying the job that first submitted the stage. When FIFO
* scheduling is used, this allows Stages from earlier jobs to be computed first or recovered
* faster on failure.
*
* The callSite provides a location in user code which relates to the stage. For a shuffle map
* stage, the callSite gives the user code that created the RDD being shuffled. For a result
* stage, the callSite gives the user code that executes the associated action (e.g. count()).
*
* A single stage can consist of multiple attempts. In that case, the latestInfo field will
* be updated for each attempt.
*
*/
private[spark] class Stage(
    val id: Int,
    val rdd: RDD[_],
    val numTasks: Int,
    val shuffleDep: Option[ShuffleDependency[_, _, _]],  // Output shuffle if stage is a map stage
    val parents: List[Stage],
    val jobId: Int,
    val callSite: CallSite)
  extends Logging {

  val isShuffleMap = shuffleDep.isDefined
  val numPartitions = rdd.partitions.size
  val outputLocs = Array.fill[List[MapStatus]](numPartitions)(Nil)
  var numAvailableOutputs = 0

  /** Set of jobs that this stage belongs to. */
  val jobIds = new HashSet[Int]

  /** For stages that are the final stage (consisting only of ResultTasks), link to the ActiveJob. */
  var resultOfJob: Option[ActiveJob] = None
  var pendingTasks = new HashSet[Task[_]]

  /** Record a map output location for one partition; the partition counts as available on its first output. */
  def addOutputLoc(partition: Int, status: MapStatus) {
    val prevList = outputLocs(partition)
    outputLocs(partition) = status :: prevList
    if (prevList == Nil) {
      numAvailableOutputs += 1
    }
  }
}
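To make the doc comment concrete, here is a minimal sketch (my own illustration, not from the Spark source or the original post; the object name, app name, local master, and the "README.md" path are assumptions) of a job whose DAG is cut at the shuffle boundary introduced by reduceByKey: the map side becomes a shuffle map stage whose task results feed the next stage, and the stage that computes count() is the result stage submitted for the action.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageBoundaryExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stage-demo").setMaster("local[2]"))

    // reduceByKey introduces a ShuffleDependency, so the DAGScheduler splits the DAG here:
    // Stage 0 (shuffle map stage): textFile -> flatMap -> map; its tasks write map output.
    // Stage 1 (result stage): the reading side of reduceByKey -> count(), triggered by the action.
    val counts = sc.textFile("README.md")      // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    println(counts.count())  // the action that submits the job to the DAGScheduler
    sc.stop()
  }
}
```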
/**
* Result returned by a ShuffleMapTask to a scheduler. Includes the block manager address that the
* task ran on as well as the sizes of outputs for each reducer, for passing on to the reduce tasks.
* The map output sizes are compressed using MapOutputTracker.compressSize.
*/
private[spark] class MapStatus(var location: BlockManagerId, var compressedSizes: Array[Byte])
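The one-byte compressed sizes mentioned in the MapStatus comment are a lossy logarithmic encoding. The sketch below shows the idea behind MapOutputTracker.compressSize as I read the 1.x code base; the log base of 1.1 and the standalone object name are assumptions, so treat this as an illustration rather than the exact implementation. A size is mapped to ceil(log_1.1(size)) capped at 255, which keeps the relative error within roughly 10% while letting a MapStatus ship a single byte per reducer.

```scala
// Illustrative re-implementation of the size compression idea used by MapOutputTracker.
// Assumption: the real code uses a log base of 1.1; names here are for the sketch only.
object SizeCompression {
  private val LOG_BASE = 1.1

  /** Compress a size in bytes to a single byte, with at most ~10% relative error. */
  def compressSize(size: Long): Byte = {
    if (size == 0) {
      0
    } else if (size <= 1L) {
      1
    } else {
      math.min(255, math.ceil(math.log(size) / math.log(LOG_BASE)).toInt).toByte
    }
  }

  /** Decompress a byte back to an approximate size in bytes. */
  def decompressSize(compressedSize: Byte): Long = {
    if (compressedSize == 0) {
      0
    } else {
      // Treat the byte as unsigned before exponentiating.
      math.pow(LOG_BASE, compressedSize & 0xFF).toLong
    }
  }
}
```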