Task Scheduling in Spark: Starting from SparkContext
SparkContext is the entry point for developing a Spark application. It is responsible for interacting with the entire cluster, and it is where RDDs, accumulators, and broadcast variables are created. To understand Spark's architecture, we need to start from this entry point. The diagram below comes from the official website.
The Driver Program is the program submitted by the user, and it is where the SparkContext instance is defined.
SparkContext is defined in core/src/main/scala/org/apache/spark/SparkContext.scala.
Spark's default constructor takes an org.apache.spark.SparkConf; through this parameter we can customize the settings for this submission, and these settings override the system's default configuration.
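As a quick illustration, here is a minimal sketch of creating a SparkContext with a customized SparkConf; the application name, master URL, and memory value are hypothetical placeholders, not from any particular deployment:

import org.apache.spark.{SparkConf, SparkContext}

// Settings made on SparkConf override the system defaults
// (e.g. those in conf/spark-defaults.conf).
val conf = new SparkConf()
  .setAppName("SchedulerDemo")          // hypothetical application name
  .setMaster("spark://master:7077")     // hypothetical standalone master URL
  .set("spark.executor.memory", "1g")   // illustrative value

val sc = new SparkContext(conf)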
First, a class diagram related to SparkContext:
Below are the definitions of SparkContext's most important data members:
// Create and start the scheduler
private[spark] var taskScheduler = SparkContext.createTaskScheduler(this, master)
private val heartbeatReceiver = env.actorSystem.actorOf(
  Props(new HeartbeatReceiver(taskScheduler)), "HeartbeatReceiver")
@volatile private[spark] var dagScheduler: DAGScheduler = _
try {
  dagScheduler = new DAGScheduler(this)
} catch {
  case e: Exception => throw
    new SparkException("DAGScheduler cannot be initialized due to %s".format(e.getMessage))
}

// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
taskScheduler.start()
Through createTaskScheduler we obtain the scheduler appropriate for a given resource manager or deployment mode. Let's look at the deployment modes currently supported:
/** Creates a task scheduler based on a given master URL. Extracted for testing. */
private def createTaskScheduler(sc: SparkContext, master: String): TaskScheduler = {
  // Regular expression used for local[N] and local[*] master formats
  val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
  // Regular expression for local[N, maxRetries], used in tests with failing tasks
  val LOCAL_N_FAILURES_REGEX = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r
  // Regular expression for simulating a Spark cluster of [N, cores, memory] locally
  val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r
  // Regular expression for connecting to Spark deploy clusters
  val SPARK_REGEX = """spark://(.*)""".r
  // Regular expression for connection to Mesos cluster by mesos:// or zk:// url
  val MESOS_REGEX = """(mesos|zk)://.*""".r
  // Regular expression for connection to Simr cluster
  val SIMR_REGEX = """simr://(.*)""".r

  // When running locally, don't try to re-execute tasks on failure.
  val MAX_LOCAL_TASK_FAILURES = 1

  master match {
    case "local" =>
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(scheduler, 1)
      scheduler.initialize(backend)
      scheduler

    case LOCAL_N_REGEX(threads) =>
      def localCpuCount = Runtime.getRuntime.availableProcessors()
      // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(scheduler, threadCount)
      scheduler.initialize(backend)
      scheduler

    case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
      def localCpuCount = Runtime.getRuntime.availableProcessors()
      // local[*, M] means the number of cores on the computer with M failures
      // local[N, M] means exactly N threads with M failures
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      val scheduler = new TaskSchedulerImpl(sc, maxFailures.toInt, isLocal = true)
      val backend = new LocalBackend(scheduler, threadCount)
      scheduler.initialize(backend)
      scheduler

    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      scheduler

    case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
      // Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
      val memoryPerSlaveInt = memoryPerSlave.toInt
      if (sc.executorMemory > memoryPerSlaveInt) {
        throw new SparkException(
          "Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
            memoryPerSlaveInt, sc.executorMemory))
      }

      val scheduler = new TaskSchedulerImpl(sc)
      val localCluster = new LocalSparkCluster(
        numSlaves.toInt, coresPerSlave.toInt, memoryPerSlaveInt)
      val masterUrls = localCluster.start()
      val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      backend.shutdownCallback = (backend: SparkDeploySchedulerBackend) => {
        localCluster.stop()
      }
      scheduler

    case "yarn-standalone" | "yarn-cluster" =>
      if (master == "yarn-standalone") {
        logWarning(
          "\"yarn-standalone\" is deprecated as of Spark 1.0. Use \"yarn-cluster\" instead.")
      }
      val scheduler = try {
        val clazz = Class.forName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
        val cons = clazz.getConstructor(classOf[SparkContext])
        cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
      } catch {
        // TODO: Enumerate the exact reasons why it can fail
        // But irrespective of it, it means we cannot proceed !
        case e: Exception => {
          throw new SparkException("YARN mode not available ?", e)
        }
      }
      val backend = try {
        val clazz =
          Class.forName("org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend")
        val cons = clazz.getConstructor(classOf[TaskSchedulerImpl], classOf[SparkContext])
        cons.newInstance(scheduler, sc).asInstanceOf[CoarseGrainedSchedulerBackend]
      } catch {
        case e: Exception => {
          throw new SparkException("YARN mode not available ?", e)
        }
      }
      scheduler.initialize(backend)
      scheduler

    case "yarn-client" =>
      val scheduler = try {
        val clazz =
          Class.forName("org.apache.spark.scheduler.cluster.YarnClientClusterScheduler")
        val cons = clazz.getConstructor(classOf[SparkContext])
        cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
      } catch {
        case e: Exception => {
          throw new SparkException("YARN mode not available ?", e)
        }
      }
      val backend = try {
        val clazz =
          Class.forName("org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend")
        val cons = clazz.getConstructor(classOf[TaskSchedulerImpl], classOf[SparkContext])
        cons.newInstance(scheduler, sc).asInstanceOf[CoarseGrainedSchedulerBackend]
      } catch {
        case e: Exception => {
          throw new SparkException("YARN mode not available ?", e)
        }
      }
      scheduler.initialize(backend)
      scheduler

    case mesosUrl @ MESOS_REGEX(_) =>
      MesosNativeLibrary.load()
      val scheduler = new TaskSchedulerImpl(sc)
      val coarseGrained = sc.conf.getBoolean("spark.mesos.coarse", false)
      val url = mesosUrl.stripPrefix("mesos://") // strip scheme from raw Mesos URLs
      val backend = if (coarseGrained) {
        new CoarseMesosSchedulerBackend(scheduler, sc, url)
      } else {
        new MesosSchedulerBackend(scheduler, sc, url)
      }
      scheduler.initialize(backend)
      scheduler

    case SIMR_REGEX(simrUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val backend = new SimrSchedulerBackend(scheduler, sc, simrUrl)
      scheduler.initialize(backend)
      scheduler

    case _ =>
      throw new SparkException("Could not parse Master URL: '" + master + "'")
  }
}
The core logic starts at the master match statement: the Scheduler and SchedulerBackend are created according to the Master URL that was passed in. For the common standalone deployment mode, let's look at which Scheduler and SchedulerBackend are created:
case SPARK_REGEX(sparkUrl) =>
  val scheduler = new TaskSchedulerImpl(sc)
  val masterUrls = sparkUrl.split(",").map("spark://" + _)
  val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
  scheduler.initialize(backend)
  scheduler
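For reference, here is a hedged sketch of how a few master URLs map onto the cases above; the host names and thread counts are placeholders chosen for illustration:

// Placeholder master URLs and the case they match in createTaskScheduler:
//   "local"                         -> "local" case: LocalBackend with 1 thread
//   "local[4]"                      -> LOCAL_N_REGEX: LocalBackend with 4 threads
//   "local[*]"                      -> LOCAL_N_REGEX: one thread per CPU core
//   "spark://host1:7077,host2:7077" -> SPARK_REGEX: SparkDeploySchedulerBackend;
//                                      the comma-separated list is split and each
//                                      entry re-prefixed with "spark://", which
//                                      supports standalone Master high availability
val sc = new SparkContext("spark://host1:7077,host2:7077", "demo")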
org.apache.spark.scheduler.TaskSchedulerImpl manages the scheduling of the entire cluster through a single SchedulerBackend, and it implements the generic scheduling logic. To understand system startup, two interfaces matter: initialize and start. Both are invoked during SparkContext initialization:
  def initialize(backend: SchedulerBackend) {
    this.backend = backend
    // temporarily set rootPool name to empty
    rootPool = new Pool("", schedulingMode, 0, 0)
    schedulableBuilder = {
      schedulingMode match {
        case SchedulingMode.FIFO =>
          new FIFOSchedulableBuilder(rootPool)
        case SchedulingMode.FAIR =>
          new FairSchedulableBuilder(rootPool, conf)
      }
    }
    schedulableBuilder.buildPools()
  }
As we can see, initialization is mainly about wiring in the SchedulerBackend and building the scheduling pools according to the scheduling mode read from the cluster configuration. The currently supported modes are FIFO and fair scheduling, with FIFO as the default.
// default scheduler is FIFO
private val schedulingModeConf = conf.get("spark.scheduler.mode", "FIFO")
val schedulingMode: SchedulingMode = try {
  SchedulingMode.withName(schedulingModeConf.toUpperCase)
} catch {
  case e: java.util.NoSuchElementException =>
    throw new SparkException(s"Unrecognized spark.scheduler.mode: $schedulingModeConf")
}
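As a hedged usage sketch, switching to fair scheduling is purely a configuration change; the allocation-file path below is hypothetical:

val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")
  // Optionally declare named pools in an XML allocation file;
  // the path here is a placeholder
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")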
The implementation of start is as follows:
  override def start() {
    backend.start()
    if (!isLocal && conf.getBoolean("spark.speculation", false)) {
      logInfo("Starting speculative execution thread")
      import sc.env.actorSystem.dispatcher
      sc.env.actorSystem.scheduler.schedule(SPECULATION_INTERVAL milliseconds,
            SPECULATION_INTERVAL milliseconds) {
        Utils.tryOrExit { checkSpeculatableTasks() }
      }
    }
  }
This mainly starts the backend. In non-local mode, if spark.speculation is set to true, a speculative copy will be launched for any task that has not completed within the specified interval. For typical applications this can indeed shorten job running time, but it also wastes cluster compute resources; for offline batch applications, this setting is therefore not recommended.
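As a hedged sketch of the knobs involved (the values are illustrative; in the Spark 1.x code shown here the interval is interpreted in milliseconds):

val conf = new SparkConf()
  .set("spark.speculation", "true")
  // How often to check for tasks to speculate, in milliseconds
  .set("spark.speculation.interval", "100")
  // Fraction of tasks in a stage that must finish before speculation starts
  .set("spark.speculation.quantile", "0.75")
  // How many times slower than the median a task must be to be re-launched
  .set("spark.speculation.multiplier", "1.5")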
org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend is the SchedulerBackend for standalone mode. It is defined as follows:
private[spark] class SparkDeploySchedulerBackend(
    scheduler: TaskSchedulerImpl,
    sc: SparkContext,
    masters: Array[String])
  extends CoarseGrainedSchedulerBackend(scheduler, sc.env.actorSystem)
  with AppClientListener
  with Logging {
Let's look at its start:
 override def start() {
    super.start()
    // The endpoint for executors to talk to us
    val driverUrl = "akka.tcp://%s@%s:%s/user/%s".format(
      SparkEnv.driverActorSystemName,
      conf.get("spark.driver.host"),
      conf.get("spark.driver.port"),
      CoarseGrainedSchedulerBackend.ACTOR_NAME)
    val args = Seq(driverUrl, "{{EXECUTOR_ID}}", "{{HOSTNAME}}", "{{CORES}}", "{{WORKER_URL}}")
    val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
      .map(Utils.splitCommandString).getOrElse(Seq.empty)
    val classPathEntries = sc.conf.getOption("spark.executor.extraClassPath").toSeq.flatMap { cp =>
      cp.split(java.io.File.pathSeparator)
    }
    val libraryPathEntries =
      sc.conf.getOption("spark.executor.extraLibraryPath").toSeq.flatMap { cp =>
        cp.split(java.io.File.pathSeparator)
      }
    // Start executors with a few necessary configs for registering with the scheduler
    val sparkJavaOpts = Utils.sparkJavaOpts(conf, SparkConf.isExecutorStartupConf)
    val javaOpts = sparkJavaOpts ++ extraJavaOpts
    val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
      args, sc.executorEnvs, classPathEntries, libraryPathEntries, javaOpts)
    val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
      sc.ui.appUIAddress, sc.eventLogger.map(_.logDir))
    client = new AppClient(sc.env.actorSystem, masters, appDesc, this, conf)
    client.start()
    waitForRegistration()
  }
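To make driverUrl concrete: assuming, hypothetically, that spark.driver.host is 192.168.1.10 and spark.driver.port is 50000, and given that SparkEnv.driverActorSystemName is "sparkDriver" and CoarseGrainedSchedulerBackend.ACTOR_NAME is "CoarseGrainedScheduler", the formatted endpoint would be akka.tcp://sparkDriver@192.168.1.10:50000/user/CoarseGrainedScheduler. Each executor process launched through org.apache.spark.executor.CoarseGrainedExecutorBackend registers itself back with the driver at this address.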
Next, we will explain TaskScheduler, SchedulerBackend, and DAGScheduler in detail, gradually lifting the veil on how they work.