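Spark Notes 5: SparkContext, SparkConf

SparkContext is the entry point of a Spark program, comparable to the familiar `main` function. It is responsible for connecting to the Spark cluster and for creating RDDs, accumulators, and broadcast variables.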

/**
 * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
 * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
 *
 * @param config a Spark Config object describing the application configuration. Any settings in
 *   this config overrides the default configs as well as system properties.
 */

class SparkContext(config: SparkConf) extends Logging {

The only argument needed to create a SparkContext is a SparkConf, which is a set of key-value properties. It is defined as follows:

/**
 * Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
 *
 * Most of the time, you would create a SparkConf object with `new SparkConf()`, which will load
 * values from any `spark.*` Java system properties set in your application as well. In this case,
 * parameters you set directly on the `SparkConf` object take priority over system properties.
 *
 * For unit tests, you can also call `new SparkConf(false)` to skip loading external settings and
 * get the same configuration no matter what the system properties are.
 *
 * All setter methods in this class support chaining. For example, you can write
 * `new SparkConf().setMaster("local").setAppName("My app")`.
 *
 * Note that once a SparkConf object is passed to Spark, it is cloned and can no longer be modified
 * by the user. Spark does not support modifying the configuration at runtime.
 *
 * @param loadDefaults whether to also load values from Java system properties
 */
class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {
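
The behaviour described in the comment above can be sketched as follows. This is only an illustration, assuming Spark is on the classpath and a local master is acceptable; the object name `ConfSketch` and the key `spark.example.flag` are made up for the example and are not part of Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ConfSketch {
  /** Returns (app name seen by the context, whether a late change reached it). */
  def run(): (String, Boolean) = {
    // `false` skips loading spark.* system properties, as suggested for unit tests.
    val conf = new SparkConf(false)
      .setMaster("local[2]")       // setters return the conf, so calls can be chained
      .setAppName("conf-sketch")

    val sc = new SparkContext(conf)
    try {
      // The context works on its own clone, so this late change should not reach it.
      conf.set("spark.example.flag", "true") // hypothetical key
      (sc.appName, sc.getConf.contains("spark.example.flag"))
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The point of the second tuple element is the last paragraph of the comment: configuration changes made after the context exists are not picked up.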

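The properties that can be passed to SparkContext separately (master, application name, Spark home, jars, executor environment) are visible in `updatedConf`: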

/**
 * Creates a modified version of a SparkConf with the parameters that can be passed separately
 * to SparkContext, to make it easier to write SparkContext's constructors. This ignores
 * parameters that are passed as the default value of null, instead of throwing an exception
 * like SparkConf would.
 */
private[spark] def updatedConf(
    conf: SparkConf,
    master: String,
    appName: String,
    sparkHome: String = null,
    jars: Seq[String] = Nil,
    environment: Map[String, String] = Map()): SparkConf =
{
  val res = conf.clone()
  res.setMaster(master)
  res.setAppName(appName)
  if (sparkHome != null) {
    res.setSparkHome(sparkHome)
  }
  if (jars != null && !jars.isEmpty) {
    res.setJars(jars)
  }
  res.setExecutorEnv(environment.toSeq)
  res
}

Creating RDDs is its main function:

Type 1) Creating an RDD from a Scala object (a local collection)

// Methods for creating RDDs

/** Distribute a local Scala collection to form an RDD.
 *
 * @note Parallelize acts lazily. If `seq` is a mutable collection and is
 * altered after the call to parallelize and before the first action on the
 * RDD, the resultant RDD will reflect the modified collection. Pass a copy of
 * the argument to avoid this.
 */
def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] = {
  new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
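
As a rough usage sketch of `parallelize`, assuming Spark is on the classpath and a local master is used (`ParallelizeSketch` is a name chosen for this example only):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeSketch {
  /** Returns (number of partitions, sum of the doubled elements). */
  def run(): (Int, Int) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("parallelize-sketch")
    val sc = new SparkContext(conf)
    try {
      // An immutable collection is passed, so the laziness caveat in the @note does not apply.
      val rdd = sc.parallelize(1 to 10, numSlices = 4)
      (rdd.partitions.length, rdd.map(_ * 2).reduce(_ + _))
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

When `numSlices` is omitted, the call falls back to `defaultParallelism`, as the signature above shows.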

Type 2) Creating an RDD by reading data from a storage device.

/** Get an RDD for a Hadoop file with an arbitrary InputFormat
  *
  * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
  * record, directly caching the returned RDD will create many references to the same object.
  * If you plan to directly cache Hadoop writable objects, you should first copy them using
  * a `map` function.
  * */
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions
    ): RDD[(K, V)] = {
  // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
  val confBroadcast = broadcast(new SerializableWritable(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}
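
The advice in the note (copy Hadoop writables with `map` before caching or collecting) might look like the sketch below. It assumes Spark and the Hadoop client libraries are on the classpath and that the default file system is the local one; the temporary file and the object name `HadoopFileSketch` exist only for this illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HadoopFileSketch {
  /** Writes a small temporary text file and reads its lines back through hadoopFile. */
  def run(): Seq[String] = {
    val tmp = Files.createTempFile("hadoop-file-sketch", ".txt")
    Files.write(tmp, "a\nb\nc\n".getBytes(StandardCharsets.UTF_8))

    val conf = new SparkConf().setMaster("local[2]").setAppName("hadoop-file-sketch")
    val sc = new SparkContext(conf)
    try {
      sc.hadoopFile(tmp.toString, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
        // The RecordReader reuses one Text object, so copy it into a String first.
        .map { case (_, line) => line.toString }
        .collect()
        .toSeq
        .sorted
    } finally {
      sc.stop()
      Files.delete(tmp)
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The key of each pair is the byte offset of the line and is dropped here; only the copied line text is kept.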

Type 3) Creating a new RDD from other RDDs

/** Build the union of a list of RDDs. */
def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = new UnionRDD(this, rdds)

/** Build the union of a list of RDDs passed as variable-length arguments. */
def union[T: ClassTag](first: RDD[T], rest: RDD[T]*): RDD[T] =
  new UnionRDD(this, Seq(first) ++ rest)
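
Both `union` overloads can be exercised as in this sketch, which assumes Spark on the classpath and a local master (`UnionSketch` is an illustrative name):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnionSketch {
  /** Returns (element count of the union, element count after removing duplicates). */
  def run(): (Long, Long) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("union-sketch")
    val sc = new SparkContext(conf)
    try {
      val a = sc.parallelize(Seq(1, 2, 3))
      val b = sc.parallelize(Seq(3, 4, 5))
      val viaVarargs = sc.union(a, b)   // overload taking first: RDD[T], rest: RDD[T]*
      val viaSeq = sc.union(Seq(a, b))  // overload taking rdds: Seq[RDD[T]]
      // union keeps duplicates: the shared element 3 appears twice.
      (viaVarargs.count(), viaSeq.distinct().count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Note that `union` concatenates its inputs without removing duplicates, which is why `distinct()` is applied separately in the second count.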

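Creating an accumulator variable (Accumulable): the tasks of an application can only update it with `+=` and cannot read its value; only the driver side, where the SparkContext lives, can read the value.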

/**
 * A data type that can be accumulated, ie has an commutative and associative "add" operation,
 * but where the result type, `R`, may be different from the element type being added, `T`.
 *
 * You must define how to add data, and how to merge two of these together.  For some data types,
 * such as a counter, these might be the same operation. In that case, you can use the simpler
 * [[org.apache.spark.Accumulator]]. They won't always be the same, though -- e.g., imagine you are
 * accumulating a set. You will add items to the set, and you will union two sets together.
 *
 * @param initialValue initial value of accumulator
 * @param param helper object defining how to add elements of type `R` and `T`
 * @param name human-readable name for use in Spark's web UI
 * @tparam R the full accumulated data (result type)
 * @tparam T partial data that can be added in
 */
class Accumulable[R, T] (
    @transient initialValue: R,
    param: AccumulableParam[R, T],
    val name: Option[String])
  extends Serializable {
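
A small sketch of the add-only contract, using the simpler `Accumulator` mentioned in the comment above. It assumes the Spark 1.x accumulator API that matches the source quoted in these notes; on Spark 2.0 and later the AccumulatorV2 API (for example `sc.longAccumulator`) supersedes it. `AccumulatorSketch` is an illustrative name.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // brings accumulator implicits into scope on early 1.x releases

object AccumulatorSketch {
  /** Counts the even numbers in 1..100 with an accumulator. */
  def run(): Int = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("accumulator-sketch")
    val sc = new SparkContext(conf)
    try {
      val evens = sc.accumulator(0) // Accumulator[Int], a special case of Accumulable
      sc.parallelize(1 to 100).foreach { x =>
        if (x % 2 == 0) evens += 1 // tasks may only add to it
      }
      evens.value // read on the driver, after the action has finished
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```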

It can also run a job directly. Note its parameters, and note that it really just delegates to dagScheduler.runJob.

/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark. The allowLocal
 * flag specifies whether the scheduler can run the computation on the driver rather than
 * shipping it out to the cluster, for short actions like first().
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    allowLocal: Boolean,
    resultHandler: (Int, U) => Unit) {
  if (dagScheduler == null) {
    throw new SparkException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  val start = System.nanoTime
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
    resultHandler, localProperties.get)
  logInfo(
    "Job finished: " + callSite.shortForm + ", took " + (System.nanoTime - start) / 1e9 + " s")
  rdd.doCheckpoint()
}
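
From user code, the simplest public `runJob` overload takes an RDD and a function over each partition's iterator and returns one result per partition. A minimal sketch, assuming Spark on the classpath and a local master (`RunJobSketch` is an illustrative name):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RunJobSketch {
  /** Returns one partial sum per partition. */
  def run(): Array[Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("run-job-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 10, 2)
      // Applies the function to every partition and gathers the results in an array.
      sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run().mkString(", "))
}
```

Actions such as `count` and `reduce` are built on top of this kind of call: a per-partition function plus a final combination step on the driver.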

/**
 * :: Experimental ::
 * Submit a job for execution and return a FutureJob holding the result.
 */
@Experimental
def submitJob[T, U, R](
    rdd: RDD[T],
    processPartition: Iterator[T] => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit,
    resultFunc: => R): SimpleFutureAction[R] =
{
  val cleanF = clean(processPartition)
  val callSite = getCallSite
  val waiter = dagScheduler.submitJob(
    rdd,
    (context: TaskContext, iter: Iterator[T]) => cleanF(iter),
    partitions,
    callSite,
    allowLocal = false,
    resultHandler,
    localProperties.get)
  new SimpleFutureAction(waiter, resultFunc)
}

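The SparkContext companion object exposes some of its implementation details, such as how user input (the master URL) is translated into the internal scheduler and backend; this is worth noting.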

/**
 * The SparkContext object contains a number of implicit conversions and parameters for use with
 * various Spark features.
 */
object SparkContext extends Logging {

/** Creates a task scheduler based on a given master URL. Extracted for testing. */
private def createTaskScheduler(sc: SparkContext, master: String): TaskScheduler = {
  // Regular expression used for local[N] and local[*] master formats
  val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
  // Regular expression for local[N, maxRetries], used in tests with failing tasks
  val LOCAL_N_FAILURES_REGEX = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r
  // Regular expression for simulating a Spark cluster of [N, cores, memory] locally
  val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r
  // Regular expression for connecting to Spark deploy clusters
  val SPARK_REGEX = """spark://(.*)""".r
  // Regular expression for connection to Mesos cluster by mesos:// or zk:// url
  val MESOS_REGEX = """(mesos|zk)://.*""".r
  // Regular expression for connection to Simr cluster
  val SIMR_REGEX = """simr://(.*)""".r

  // When running locally, don't try to re-execute tasks on failure.
  val MAX_LOCAL_TASK_FAILURES = 1

  master match {
    case "local" =>
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(scheduler, 1)
      scheduler.initialize(backend)
      scheduler

    case LOCAL_N_REGEX(threads) =>
      def localCpuCount = Runtime.getRuntime.availableProcessors()
      // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(scheduler, threadCount)
      scheduler.initialize(backend)
      scheduler

    case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
      def localCpuCount = Runtime.getRuntime.availableProcessors()
      // local[*, M] means the number of cores on the computer with M failures
      // local[N, M] means exactly N threads with M failures
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      val scheduler = new TaskSchedulerImpl(sc, maxFailures.toInt, isLocal = true)
      val backend = new LocalBackend(scheduler, threadCount)
      scheduler.initialize(backend)
      scheduler

    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      scheduler

    case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
      // Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
      val memoryPerSlaveInt = memoryPerSlave.toInt
      if (sc.executorMemory > memoryPerSlaveInt) {
        throw new SparkException(
          "Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
            memoryPerSlaveInt, sc.executorMemory))
      }

      val scheduler = new TaskSchedulerImpl(sc)
      val localCluster = new LocalSparkCluster(
        numSlaves.toInt, coresPerSlave.toInt, memoryPerSlaveInt)
      val masterUrls = localCluster.start()
      val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      backend.shutdownCallback = (backend: SparkDeploySchedulerBackend) => {
        localCluster.stop()
      }
      scheduler

    case "yarn-standalone" | "yarn-cluster" =>
      if (master == "yarn-standalone") {
        logWarning(
          "\"yarn-standalone\" is deprecated as of Spark 1.0. Use \"yarn-cluster\" instead.")
      }
      val scheduler = try {
        val clazz = Class.forName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
        val cons = clazz.getConstructor(classOf[SparkContext])
        cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
      } catch {
        // TODO: Enumerate the exact reasons why it can fail
        // But irrespective of it, it means we cannot proceed !
        case e: Exception => {
          throw new SparkException("YARN mode not available ?", e)
        }
      }
      val backend = try {
        val clazz =
          Class.forName("org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend")
        val cons = clazz.getConstructor(classOf[TaskSchedulerImpl], classOf[SparkContext])
        cons.newInstance(scheduler, sc).asInstanceOf[CoarseGrainedSchedulerBackend]
      } catch {
        case e: Exception => {
          throw new SparkException("YARN mode not available ?", e)
        }
      }
      scheduler.initialize(backend)
      scheduler

    case "yarn-client" =>
      val scheduler = try {
        val clazz =
          Class.forName("org.apache.spark.scheduler.cluster.YarnClientClusterScheduler")
        val cons = clazz.getConstructor(classOf[SparkContext])
        cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]

      } catch {
        case e: Exception => {
          throw new SparkException("YARN mode not available ?", e)
        }
      }

      val backend = try {
        val clazz =
          Class.forName("org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend")
        val cons = clazz.getConstructor(classOf[TaskSchedulerImpl], classOf[SparkContext])
        cons.newInstance(scheduler, sc).asInstanceOf[CoarseGrainedSchedulerBackend]
      } catch {
        case e: Exception => {
          throw new SparkException("YARN mode not available ?", e)
        }
      }

      scheduler.initialize(backend)
      scheduler

    case mesosUrl @ MESOS_REGEX(_) =>
      MesosNativeLibrary.load()
      val scheduler = new TaskSchedulerImpl(sc)
      val coarseGrained = sc.conf.getBoolean("spark.mesos.coarse", false)
      val url = mesosUrl.stripPrefix("mesos://") // strip scheme from raw Mesos URLs
      val backend = if (coarseGrained) {
        new CoarseMesosSchedulerBackend(scheduler, sc, url)
      } else {
        new MesosSchedulerBackend(scheduler, sc, url)
      }
      scheduler.initialize(backend)
      scheduler

    case SIMR_REGEX(simrUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val backend = new SimrSchedulerBackend(scheduler, sc, simrUrl)
      scheduler.initialize(backend)
      scheduler

    case _ =>
      throw new SparkException("Could not parse Master URL: '" + master + "'")
  }
}
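
The dispatch in `createTaskScheduler` relies on Scala regex extractors, which only match when the whole string fits the pattern. The standalone sketch below reuses a few of the patterns from the listing to show which branch a master string would select. It needs no Spark dependency; `MasterUrlSketch` and `describe` are hypothetical helpers written for this note, not Spark APIs.

```scala
object MasterUrlSketch {
  // Patterns copied from createTaskScheduler in the listing above.
  private val LocalN = """local\[([0-9]+|\*)\]""".r
  private val LocalNFailures = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r
  private val SparkUrl = """spark://(.*)""".r
  private val MesosUrl = """(mesos|zk)://.*""".r

  /** Describes which branch a master string would select. */
  def describe(master: String): String = master match {
    case "local" => "local, 1 thread"
    case LocalN(threads) => "local, threads=" + threads
    case LocalNFailures(threads, maxFailures) =>
      "local, threads=" + threads + ", maxFailures=" + maxFailures
    case SparkUrl(hostPorts) =>
      // A comma-separated list becomes one spark:// URL per master.
      "standalone: " + hostPorts.split(",").map("spark://" + _).mkString(" ")
    case MesosUrl(scheme) => "mesos via " + scheme
    case _ => "could not parse: " + master
  }

  def main(args: Array[String]): Unit =
    Seq("local", "local[*]", "local[2, 3]", "spark://h1:7077,h2:7077", "bogus")
      .foreach(m => println(m + " -> " + describe(m)))
}
```

Because each extractor requires a full match, `local[2, 3]` falls through the `LocalN` case and is picked up by `LocalNFailures`, mirroring the order of the cases in the original method.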

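Overall, SparkContext is the trigger point of the whole Spark program and handles the important initialization work. The RDD and DAGScheduler that it involves are where the real action is.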
