SparkContext 是spark的程序入口,相当于熟悉的‘main’函数。它负责链接spark集群、创建RDD、创建累加计数器、创建广播变量。
/**
* Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
* cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
*
* @param config a Spark Config object describing the application configuration. Any settings in
* this config overrides the default configs as well as system properties.
*/

class SparkContext(config: SparkConf) extends Logging {
创建sarpkContext唯一需要的参数就是sparkConf。它是一组K-V属性对,定义如下:
/*
* Configuration for a Spark application. Used to set various Spark parameters as key-value pairs.
*
* Most of the time, you would create a SparkConf object with `new SparkConf()`, which will load
* values from any `spark.*` Java system properties set in your application as well. In this case,
* parameters you set directly on the `SparkConf` object take priority over system properties.
*
* For unit tests, you can also call `new SparkConf(false)` to skip loading external settings and
* get the same configuration no matter what the system properties are.
*
* All setter methods in this class support chaining. For example, you can write
* `new SparkConf().setMaster("local").setAppName("My app")`.
*
* Note that once a SparkConf object is passed to Spark, it is cloned and can no longer be modified
* by the user. Spark does not support modifying the configuration at runtime.
*
* @param loadDefaults whether to also load values from Java system properties
*/
class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {
所有可以配置的属性如下:
/**
* Creates a modified version of a SparkConf with the parameters that can be passed separately
* to SparkContext, to make it easier to write SparkContext's constructors. This ignores
* parameters that are passed as the default value of null, instead of throwing an exception
* like SparkConf would.
*/
private[spark] def updatedConf(
conf: SparkConf,
master: String,
appName: String,
sparkHome: String = null,
jars: Seq[String] = Nil,
environment: Map[String, String] = Map()): SparkConf =
{
val res = conf.clone()
res.setMaster(master)
res.setAppName(appName)
if (sparkHome != null) {
res.setSparkHome(sparkHome)
}
if (jars != null && !jars.isEmpty) {
res.setJars(jars)
}
res.setExecutorEnv(environment.toSeq)
res
}
创建RDD的方法是它的主要功能:
类型1)根据scala 的对象创建RDD
// Methods for creating RDDs

/** Distribute a local Scala collection to form an RDD.
*
* @note Parallelize acts lazily. If `seq` is a mutable collection and is
* altered after the call to parallelize and before the first action on the
* RDD, the resultant RDD will reflect the modified collection. Pass a copy of
* the argument to avoid this.
*/
def parallelize[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] = {
new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
}
类型2):从存储设备读取数据来创建RDD。
/** Get an RDD for a Hadoop file with an arbitrary InputFormat
*
* '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
* record, directly caching the returned RDD will create many references to the same object.
* If you plan to directly cache Hadoop writable objects, you should first copy them using
* a `map` function.
* */
def hadoopFile[K, V](
path: String,
inputFormatClass: Class[_ <: InputFormat[K, V]],
keyClass: Class[K],
valueClass: Class[V],
minPartitions: Int = defaultMinPartitions
): RDD[(K, V)] = {
// A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
val confBroadcast = broadcast(new SerializableWritable(hadoopConfiguration))
val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
new HadoopRDD(
this,
confBroadcast,
Some(setInputPathsFunc),
inputFormatClass,
keyClass,
valueClass,
minPartitions).setName(path)
}
类型3)从其他RDD创建新的RDD
/** Build the union of a list of RDDs. */
def union[T: ClassTag](rdds: Seq[RDD[T]]): RDD[T] = new UnionRDD(this, rdds)

/** Build the union of a list of RDDs passed as variable-length arguments. */
def union[T: ClassTag](first: RDD[T], rest: RDD[T]*): RDD[T] =
new UnionRDD(this, Seq(first) ++ rest)

创建累加变量Accumulable: 应用程序只能对它最“+=”更新操作但是不能读它的值,只有sparkContex才能使用它的值。
/**
* A data type that can be accumulated, ie has an commutative and associative "add" operation,
* but where the result type, `R`, may be different from the element type being added, `T`.
*
* You must define how to add data, and how to merge two of these together. For some data types,
* such as a counter, these might be the same operation. In that case, you can use the simpler
* [[org.apache.spark.Accumulator]]. They won't always be the same, though -- e.g., imagine you are
* accumulating a set. You will add items to the set, and you will union two sets together.
*
* @param initialValue initial value of accumulator
* @param param helper object defining how to add elements of type `R` and `T`
* @param name human-readable name for use in Spark's web UI
* @tparam R the full accumulated data (result type)
* @tparam T partial data that can be added in
*/
class Accumulable[R, T] (
@transient initialValue: R,
param: AccumulableParam[R, T],
val name: Option[String])
extends Serializable {
它能直接执行一个job:注意它的参数,以及它其实只是调用dagScheduler.runJob
/**
* Run a function on a given set of partitions in an RDD and pass the results to the given
* handler function. This is the main entry point for all actions in Spark. The allowLocal
* flag specifies whether the scheduler can run the computation on the driver rather than
* shipping it out to the cluster, for short actions like first().
*/
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
allowLocal: Boolean,
resultHandler: (Int, U) => Unit) {
if (dagScheduler == null) {
throw new SparkException("SparkContext has been shutdown")
}
val callSite = getCallSite
val cleanedFunc = clean(func)
logInfo("Starting job: " + callSite.shortForm)
val start = System.nanoTime
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, allowLocal,
resultHandler, localProperties.get)
logInfo(
"Job finished: " + callSite.shortForm + ", took " + (System.nanoTime - start) / 1e9 + " s")
rdd.doCheckpoint()
}
/**
* :: Experimental ::
* Submit a job for execution and return a FutureJob holding the result.
*/
@Experimental
def submitJob[T, U, R](
rdd: RDD[T],
processPartition: Iterator[T] => U,
partitions: Seq[Int],
resultHandler: (Int, U) => Unit,
resultFunc: => R): SimpleFutureAction[R] =
{
val cleanF = clean(processPartition)
val callSite = getCallSite
val waiter = dagScheduler.submitJob(
rdd,
(context: TaskContext, iter: Iterator[T]) => cleanF(iter),
partitions,
callSite,
allowLocal = false,
resultHandler,
localProperties.get)
new SimpleFutureAction(waiter, resultFunc)
}
sparkContex的半生对象暴露了它的一些实现方式,比如如何从用户的输入转化到内部实现,值得留意。

/**
* The SparkContext object contains a number of implicit conversions and parameters for use with
* various Spark features.
*/
object SparkContext extends Logging {
/** Creates a task scheduler based on a given master URL. Extracted for testing. */
private def createTaskScheduler(sc: SparkContext, master: String): TaskScheduler = {
// Regular expression used for local[N] and local[*] master formats
val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
// Regular expression for local[N, maxRetries], used in tests with failing tasks
val LOCAL_N_FAILURES_REGEX = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r
// Regular expression for simulating a Spark cluster of [N, cores, memory] locally
val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r
// Regular expression for connecting to Spark deploy clusters
val SPARK_REGEX = """spark://(.*)""".r
// Regular expression for connection to Mesos cluster by mesos:// or zk:// url
val MESOS_REGEX = """(mesos|zk)://.*""".r
// Regular expression for connection to Simr cluster
val SIMR_REGEX = """simr://(.*)""".r

// When running locally, don't try to re-execute tasks on failure.
val MAX_LOCAL_TASK_FAILURES = 1

master match {
case "local" =>
val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
val backend = new LocalBackend(scheduler, 1)
scheduler.initialize(backend)
scheduler

case LOCAL_N_REGEX(threads) =>
def localCpuCount = Runtime.getRuntime.availableProcessors()
// local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
val threadCount = if (threads == "*") localCpuCount else threads.toInt
val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
val backend = new LocalBackend(scheduler, threadCount)
scheduler.initialize(backend)
scheduler

case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
def localCpuCount = Runtime.getRuntime.availableProcessors()
// local[*, M] means the number of cores on the computer with M failures
// local[N, M] means exactly N threads with M failures
val threadCount = if (threads == "*") localCpuCount else threads.toInt
val scheduler = new TaskSchedulerImpl(sc, maxFailures.toInt, isLocal = true)
val backend = new LocalBackend(scheduler, threadCount)
scheduler.initialize(backend)
scheduler

case SPARK_REGEX(sparkUrl) =>
val scheduler = new TaskSchedulerImpl(sc)
val masterUrls = sparkUrl.split(",").map("spark://" + _)
val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
scheduler.initialize(backend)
scheduler

case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
// Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
val memoryPerSlaveInt = memoryPerSlave.toInt
if (sc.executorMemory > memoryPerSlaveInt) {
throw new SparkException(
"Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
memoryPerSlaveInt, sc.executorMemory))
}

val scheduler = new TaskSchedulerImpl(sc)
val localCluster = new LocalSparkCluster(
numSlaves.toInt, coresPerSlave.toInt, memoryPerSlaveInt)
val masterUrls = localCluster.start()
val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
scheduler.initialize(backend)
backend.shutdownCallback = (backend: SparkDeploySchedulerBackend) => {
localCluster.stop()
}
scheduler

case "yarn-standalone" | "yarn-cluster" =>
if (master == "yarn-standalone") {
logWarning(
"\"yarn-standalone\" is deprecated as of Spark 1.0. Use \"yarn-cluster\" instead.")
}
val scheduler = try {
val clazz = Class.forName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
val cons = clazz.getConstructor(classOf[SparkContext])
cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
} catch {
// TODO: Enumerate the exact reasons why it can fail
// But irrespective of it, it means we cannot proceed !
case e: Exception => {
throw new SparkException("YARN mode not available ?", e)
}
}
val backend = try {
val clazz =
Class.forName("org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend")
val cons = clazz.getConstructor(classOf[TaskSchedulerImpl], classOf[SparkContext])
cons.newInstance(scheduler, sc).asInstanceOf[CoarseGrainedSchedulerBackend]
} catch {
case e: Exception => {
throw new SparkException("YARN mode not available ?", e)
}
}
scheduler.initialize(backend)
scheduler

case "yarn-client" =>
val scheduler = try {
val clazz =
Class.forName("org.apache.spark.scheduler.cluster.YarnClientClusterScheduler")
val cons = clazz.getConstructor(classOf[SparkContext])
cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]

} catch {
case e: Exception => {
throw new SparkException("YARN mode not available ?", e)
}
}

val backend = try {
val clazz =
Class.forName("org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend")
val cons = clazz.getConstructor(classOf[TaskSchedulerImpl], classOf[SparkContext])
cons.newInstance(scheduler, sc).asInstanceOf[CoarseGrainedSchedulerBackend]
} catch {
case e: Exception => {
throw new SparkException("YARN mode not available ?", e)
}
}

scheduler.initialize(backend)
scheduler

case mesosUrl @ MESOS_REGEX(_) =>
MesosNativeLibrary.load()
val scheduler = new TaskSchedulerImpl(sc)
val coarseGrained = sc.conf.getBoolean("spark.mesos.coarse", false)
val url = mesosUrl.stripPrefix("mesos://") // strip scheme from raw Mesos URLs
val backend = if (coarseGrained) {
new CoarseMesosSchedulerBackend(scheduler, sc, url)
} else {
new MesosSchedulerBackend(scheduler, sc, url)
}
scheduler.initialize(backend)
scheduler

case SIMR_REGEX(simrUrl) =>
val scheduler = new TaskSchedulerImpl(sc)
val backend = new SimrSchedulerBackend(scheduler, sc, simrUrl)
scheduler.initialize(backend)
scheduler

case _ =>
throw new SparkException("Could not parse Master URL: '" + master + "'")
}
}
总的来说,sparkContex是整个spark程序的触发点,负责重要的初始化初始化工作。而它设计到的RDD和DAGScheduler才是重头戏。







spark 笔记 5: SparkContext,SparkConf的更多相关文章

  1. spark快速大数据分析学习笔记*初始化sparkcontext(一)

    初始化SparkContext 1// 在java中初始化spark import org.apache.spark.SparkConf; import org.apache.spark.api.ja ...

  2. Spark 核心篇-SparkContext

    本章内容: 1.功能描述 本篇文章就要根据源码分析SparkContext所做的一些事情,用过Spark的开发者都知道SparkContext是编写Spark程序用到的第一个类,足以说明SparkCo ...

  3. Spark分析之SparkContext启动过程分析

    SparkContext作为整个Spark的入口,不管是spark.sparkstreaming.spark sql都需要首先创建一个SparkContext对象,然后基于这个SparkContext ...

  4. spark[源码]-sparkContext概述

    SparkContext概述 sparkContext是所有的spark应用程序的发动机引擎,就是说你想要运行spark程序就必须创建一个,不然就没的玩了.sparkContext负责初始化很多东西, ...

  5. Spark源码(1): SparkConf

    1. 简介 SparkConf类负责管理Spark的所有配置项.在我们使用Spark的过程中,经常需要灵活配置各种参数,来使程序更好.更快地运行,因此也必然要与SparkConf类频繁打交道.了解它的 ...

  6. spark教程(四)-SparkContext 和 RDD 算子

    SparkContext SparkContext 是在 spark 库中定义的一个类,作为 spark 库的入口点: 它表示连接到 spark,在进行 spark 操作之前必须先创建一个 Spark ...

  7. Spark笔记(一)

    简介 Apache Spark 是专为大规模数据处理而设计的快速通用的计算引擎.Spark是UC Berkeley AMP lab (加州大学伯克利分校的AMP实验室)所开源的类Hadoop MapR ...

  8. spark[源码]-sparkContext详解[一]

    spark简述 sparkContext在Spark应用程序的执行过程中起着主导作用,它负责与程序和spark集群进行交互,包括申请集群资源.创建RDD.accumulators及广播变量等.spar ...

  9. spark笔记 环境配置

    spark笔记 spark简介 saprk 有六个核心组件: SparkCore.SparkSQL.SparkStreaming.StructedStreaming.MLlib,Graphx Spar ...

随机推荐

  1. js介绍及语法结构

    javaScript它是一门动态的,弱类型的,解释型面向Web的编程语言.虽然名字里有Java但其它与Java无关.它可以用来增强页面动态效果,实现页面与用户之间的实时,动态交互. javascrip ...

  2. 帝国 cms 修改登录次数的两种方法

    1.找到数据库表 注:我把这里的5改成50了. 2.找打e ==>> config ==>>  config.php ==>> loginnum的5修改一下即可

  3. groovy程序设计

    /********* * groovy中Object类型存在隐式转换 可以不必使用as强转 */ Object munber = 9.343444 def number1 = 2 println mu ...

  4. MyCat配置简述以及mycat全局ID

    Mycat可以直接下载解压,简单配置后可以使用,主要配置项如下: 1. log4j2.xml:配置MyCat日志,包括位置,格式,单个文件大小 2. rule.xml: 配置分片规则 3. schem ...

  5. jquery判断cookie是否存在

    首先请加载jquery库与jquery cookie插件 http://code.jquery.com/jquery-latest.js http://files.cnblogs.com/afish/ ...

  6. Django学习系列10:保存用户输入——编写表单,发送POST请求

    要获取用户输入的待办事项,发送给服务器,这样才能使用某种方式保存待办事项,然后在显示给用户查看. 上次运行测试指出无法保存用户的输入.现在,要使用HTML post请求. 若想让浏览器发送POST请求 ...

  7. 使用SpringBoot做Javaweb时,数据交互遇到的问题

    有段时间没做过javaweb了,有点生疏了,js也忘记得差不多,所以今天下午做前后端交互的时候,传到后台的参数总是为空,前端控制台了报一个String parameter “xxx” is not p ...

  8. 阿里云-docker安装mysql

    1.检查内核版本,必须是3.10及以上 uname ‐r 2.安装docker yum install docker 3.输入y确认安装 4.启动docker:service docker start ...

  9. [LeetCode 92] Reverse Linked List II 翻转单链表II

    对于链表的问题,根据以往的经验一般都是要建一个dummy node,连上原链表的头结点,这样的话就算头结点变动了,我们还可以通过dummy->next来获得新链表的头结点.这道题的要求是只通过一 ...

  10. 胡昊—第6次作业—static关键字、对象

    #题目1:编写一个类Computer,类中含有一个求n的阶乘的方法.将该类打包,并在另一包中的Java文件App.java中引入包,在主类中定义Computer类的对象,调用求n的阶乘的方法(n值由参 ...