【原创】大数据基础之Spark(1)Spark Submit即Spark任务提交过程
Spark2.1.1
一 Spark Submit本地解析
1.1 现象
提交命令:
spark-submit --master local[10] --driver-memory 30g --class app.package.AppClass app-1.0.jar
进程:
hadoop 225653 0.0 0.0 11256 364 ? S Aug24 0:00 bash /$spark-dir/bin/spark-class org.apache.spark.deploy.SparkSubmit --master local[10] --driver-memory 30g --class app.package.AppClass app-1.0.jar
hadoop 225654 0.0 0.0 34424 2860 ? Sl Aug24 0:00 /$jdk_dir/bin/java -Xmx128m -cp /spark-dir/jars/* org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --master local[10] --driver-memory 30g --class app.package.AppClass app-1.0.jar
1.2 执行过程
1.2.1 脚本执行
-bash-4.1$ cat bin/spark-submit
#!/usr/bin/env bash
if [ -z "${SPARK_HOME}" ]; then
source "$(dirname "$0")"/find-spark-home
fi# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
注释:这里执行了另一个脚本spark-class,具体如下:
-bash-4.1$ cat bin/spark-class
...
build_command() {
"$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
printf "%d\0" $?
}CMD=()
while IFS= read -d '' -r ARG; do
CMD+=("$ARG")
done < <(build_command "$@")...
CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"
注释:这里执行java class: org.apache.spark.launcher.Main,并传入参数,具体如下:
1.2.2 代码执行
org.apache.spark.launcher.Main
...
builder = new SparkSubmitCommandBuilder(help);
...
List<String> cmd = builder.buildCommand(env);
...
List<String> bashCmd = prepareBashCommand(cmd, env);
for (String c : bashCmd) {
System.out.print(c);
System.out.print('\0');
}
...
注释:其中会调用SparkSubmitCommandBuilder来生成Spark Submit命令,具体如下:
org.apache.spark.launcher.SparkSubmitCommandBuilder
... private List<String> buildSparkSubmitCommand(Map<String, String> env)
...
addOptionString(cmd, System.getenv("SPARK_SUBMIT_OPTS"));
addOptionString(cmd, System.getenv("SPARK_JAVA_OPTS"));
...
String driverExtraJavaOptions = config.get(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS);
...
if (isClientMode) {
...
addOptionString(cmd, driverExtraJavaOptions);
...
}
... addPermGenSizeOpt(cmd); cmd.add("org.apache.spark.deploy.SparkSubmit"); cmd.addAll(buildSparkSubmitArgs()); return cmd; ...
注释:这里创建了本地命令,其中java class:org.apache.spark.deploy.SparkSubmit,同时会把各种JavaOptions放到启动命令里(比如SPARK_JAVA_OPTS,DRIVER_EXTRA_JAVA_OPTIONS等),具体如下:
org.apache.spark.deploy.SparkSubmit
def main(args: Array[String]): Unit = {
val appArgs = new SparkSubmitArguments(args) //parse command line parameter
if (appArgs.verbose) {
// scalastyle:off println
printStream.println(appArgs)
// scalastyle:on println
}
appArgs.action match {
case SparkSubmitAction.SUBMIT => submit(appArgs)
case SparkSubmitAction.KILL => kill(appArgs)
case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
}
}
private def submit(args: SparkSubmitArguments): Unit = {
val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args) //merge all parameters from: command line, properties file, system property, etc...
def doRunMain(): Unit = {
...
runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
...
}
...
private[deploy] def prepareSubmitEnvironment(args: SparkSubmitArguments)
: (Seq[String], Seq[String], Map[String, String], String) = {
if (deployMode == CLIENT || isYarnCluster) {
childMainClass = args.mainClass
...
if (isYarnCluster) {
childMainClass = "org.apache.spark.deploy.yarn.Client"
...
private def runMain(
childArgs: Seq[String],
childClasspath: Seq[String],
sysProps: Map[String, String],
childMainClass: String,
verbose: Boolean): Unit = {
// scalastyle:off println
if (verbose) {
printStream.println(s"Main class:\n$childMainClass")
printStream.println(s"Arguments:\n${childArgs.mkString("\n")}")
printStream.println(s"System properties:\n${sysProps.mkString("\n")}")
printStream.println(s"Classpath elements:\n${childClasspath.mkString("\n")}")
printStream.println("\n")
}
// scalastyle:on println
val loader =
if (sysProps.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {
new ChildFirstURLClassLoader(new Array[URL](0),
Thread.currentThread.getContextClassLoader)
} else {
new MutableURLClassLoader(new Array[URL](0),
Thread.currentThread.getContextClassLoader)
}
Thread.currentThread.setContextClassLoader(loader)
for (jar <- childClasspath) {
addJarToClasspath(jar, loader)
}
for ((key, value) <- sysProps) {
System.setProperty(key, value)
}
var mainClass: Class[_] = null
try {
mainClass = Utils.classForName(childMainClass)
} catch {
...
val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
...
mainMethod.invoke(null, childArgs.toArray)
...
注释:这里首先会解析命令行参数,比如mainClass,准备运行环境包括System Property以及classpath等,然后使用一个新的classloader:ChildFirstURLClassLoader来加载用户的mainClass,然后反射调用mainClass的main方法,这样用户的app.package.AppClass的main方法就开始执行了。
org.apache.spark.SparkConf
class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Serializable {
import SparkConf._
/** Create a SparkConf that loads defaults from system properties and the classpath */
def this() = this(true)
...
if (loadDefaults) {
loadFromSystemProperties(false)
}
private[spark] def loadFromSystemProperties(silent: Boolean): SparkConf = {
// Load any spark.* system properties
for ((key, value) <- Utils.getSystemProperties if key.startsWith("spark.")) {
set(key, value, silent)
}
this
}
注释:这里可以看到spark是怎样加载配置的
1.2.3 --verbose
spark-submit --master local[*] --class app.package.AppClass --jars /$other-dir/other.jar --driver-memory 1g --verbose app-1.0.jar
输出示例:
Main class:
app.package.AppClass
Arguments:System properties:
spark.executor.logs.rolling.maxSize -> 1073741824
spark.driver.memory -> 1g
spark.driver.extraLibraryPath -> /$hadoop-dir/lib/native
spark.eventLog.enabled -> true
spark.eventLog.compress -> true
spark.executor.logs.rolling.time.interval -> daily
SPARK_SUBMIT -> true
spark.app.name -> app.package.AppClass
spark.driver.extraJavaOptions -> -XX:+PrintGCDetails -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError -XX:-UseCompressedClassPointers -XX:CompressedClassSpaceSize=3G -XX:+PrintGCTimeStamps -Xloggc:/export/Logs/hadoop/g1gc.log
spark.jars -> file:/$other-dir/other.jar
spark.sql.adaptive.enabled -> true
spark.submit.deployMode -> client
spark.executor.logs.rolling.maxRetainedFiles -> 10
spark.executor.extraClassPath -> /usr/lib/hadoop/lib/hadoop-lzo.jar
spark.eventLog.dir -> hdfs://myhdfs/spark/history
spark.master -> local[*]
spark.sql.crossJoin.enabled -> true
spark.driver.extraClassPath -> /usr/lib/hadoop/lib/hadoop-lzo.jar
Classpath elements:
file:/$other-dir/other.jar
file:/app-1.0.jar
启动时添加--verbose参数后,可以输出所有的运行时信息,有助于判断问题。
【原创】大数据基础之Spark(1)Spark Submit即Spark任务提交过程的更多相关文章
- 【原创】大数据基础之Zookeeper(2)源代码解析
核心枚举 public enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING; } zookeeper服务器状态:刚启动LOOKING,f ...
- 【原创】大数据基础之Spark(6)Spark Rdd Sort实现原理
spark 2.1.1 spark中可以通过RDD.sortBy来对分布式数据进行排序,具体是如何实现的?来看代码: org.apache.spark.rdd.RDD /** * Return thi ...
- 【原创】大数据基础之Spark(2)Spark on Yarn:container memory allocation容器内存分配
spark 2.1.1 最近spark任务(spark on yarn)有一个报错 Diagnostics: Container [pid=5901,containerID=container_154 ...
- 【原创】大数据基础之Hive(5)hive on spark
hive 2.3.4 on spark 2.4.0 Hive on Spark provides Hive with the ability to utilize Apache Spark as it ...
- 【原创】大数据基础之Spark(9)spark部署方式yarn/mesos
1 下载解压 https://spark.apache.org/downloads.html $ wget http://mirrors.shu.edu.cn/apache/spark/spark-2 ...
- 大数据基础知识问答----spark篇,大数据生态圈
Spark相关知识点 1.Spark基础知识 1.Spark是什么? UCBerkeley AMPlab所开源的类HadoopMapReduce的通用的并行计算框架 dfsSpark基于mapredu ...
- 大数据与可靠性会碰撞出什么样的Spark?
可靠性工程领域的可靠性评估,可靠性仿真计算,健康检测与预管理(PHM)技术,可靠性试验,都需要大规模数据来进行支撑才能产生好的效果,以往这些数据都是不全并且收集困难,而随着互联网+的大数据时代的来临, ...
- 【原创】大数据基础之词频统计Word Count
对文件进行词频统计,是一个大数据领域的hello word级别的应用,来看下实现有多简单: 1 Linux单机处理 egrep -o "\b[[:alpha:]]+\b" test ...
- 【原创】大数据基础之Flink(1)简介、安装、使用
Flink 1.7 官方:https://flink.apache.org/ 一 简介 Apache Flink is an open source platform for distributed ...
随机推荐
- 在Winform开发框架中使用DevExpress的内置图标资源
在开发Winform程序界面的时候,我们往往会使用一些较好看的图表,以便能够为我们的程序界面增色,良好的图标设置可以让界面看起来更加美观舒服,而且也比较容易理解,图标我们可以通过一些网站获取各种场景的 ...
- OI用语一览表
术语 含义 A/AC 通过 AAA树 Top-tree ABC AtCoder Beginner Contest AFO 退役 AG 银牌 AGC AtCoder Grand Contest AK 通 ...
- (hdu)4858 项目管理 (vector)
题目链接:http://acm.hdu.edu.cn/showproblem.php?pid=4858 Problem Description 我们建造了一个大项目!这个项目有n个节点,用很多边连接起 ...
- win10设置操作备忘
添加密码, 更改密码: Win键-->左侧用户图标-->更改帐户设置-->登陆选项-->添加密码 | 更改密码
- vue中使用Base64和md5和rsa
https://blog.csdn.net/benben513624/article/details/88113459(copy) https://www.cnblogs.com/myfate/p/1 ...
- CodeForces 461B Appleman and T
题目链接:http://codeforces.com/contest/461/problem/B 题目大意: 给定一课树,树上的节点有黑的也有白的,有这样一种分割树的方案,分割后每个子图只含有一个黑色 ...
- Java多线程处理某个线程超时的问题
ExecutorService exec = Executors.newFixedThreadPool(4); List<Future<Integer>> futures = ...
- AVL树探秘
本文首发于我的公众号 Linux云计算网络(id: cloud_dev) ,专注于干货分享,号内有 10T 书籍和视频资源,后台回复 「1024」 即可领取,欢迎大家关注,二维码文末可以扫. 一.AV ...
- <Android基础> (五) 广播机制
1)接收系统广播:a.动态注册监听网络变化 b.静态注册实现开机启动 2)发送自定义广播:a.发送标准广播 b.发送有序广播 3)使用本地广播 第五章 5.1 广播机制 Android中的每个程序都可 ...
- Pandas系列(六)-时间序列详解
内容目录 1. 基础概述 2. 转换时间戳 3. 生成时间戳范围 4. DatetimeIndex 5. DateOffset对象 6. 与时间序列相关的方法 6.1 移动 6.2 频率转换 6.3 ...