Source-code analysis: passing a Spark job's dependency files via spark.files

Version: Spark 2.3

Relevant source: org.apache.spark.SparkContext

A Spark job often depends on a few extra files. Usually these are supplied with --files /path/to/file in the spark-submit script.

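Our company's product, however, launches Spark jobs through Livy. Livy is essentially a wrapper around spark-submit, so how dependency files are specified still comes down to Spark itself. Since --files cannot be passed on the command line here, how do we specify the files programmatically? And how do tasks running on each node get hold of those files?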

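Tracing how spark-submit passes its arguments shows that the value of spark-submit --files ends up in the configuration property "spark.files", so the same property can be set on SparkConf in code.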

For example:

SparkConf conf = new SparkConf();

conf.set("spark.files","/path/to/file");

// If the file is on HDFS, use conf.set("spark.files","hdfs:/path/to/file"). Note that only the hdfs scheme needs to be added; no IP or port is required.

The official Spark documentation describes this property as follows:

spark.files: Comma-separated list of files to be placed in the working directory of each executor. Globs are allowed.

The code that reads the user-specified files is in SparkContext.scala, shown below (dependency jars given with --jars are handled the same way):

def jars: Seq[String] = _jars

def files: Seq[String] = _files

...

_jars = Utils.getUserJars(_conf)

_files = _conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.nonEmpty))

  .toSeq.flatten

...

// Add each JAR given through the constructor

if (jars != null) {

  jars.foreach(addJar)

}

if (files != null) {

  files.foreach(addFile)

}
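The way the "spark.files" value is turned into a list can be sketched in plain Java. This is only an illustration of the split-and-filter step in the excerpt above, not Spark's own code; the class name SparkFilesParsing and the method parseFileList are hypothetical, and a null argument stands in for the case where the property is not set.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class SparkFilesParsing {
    // Mirrors the Scala expression: split on ",", then drop empty entries.
    // Entries are not trimmed, so whitespace around commas is preserved.
    public static List<String> parseFileList(String value) {
        if (value == null) {
            // Property not set: behave like an empty list of files.
            return Collections.emptyList();
        }
        return Arrays.stream(value.split(","))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The doubled comma produces an empty entry, which is filtered out.
        System.out.println(parseFileList("/path/a.txt,,hdfs:/path/b.txt"));
        System.out.println(parseFileList(null));
    }
}
```

Each surviving entry is then handed to addFile one by one, which is what files.foreach(addFile) does in the excerpt.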

addFile is implemented as follows:

/**

* Add a file to be downloaded with this Spark job on every node.

*

* If a file is added during execution, it will not be available until the next TaskSet starts.

*

* @param path can be either a local file, a file in HDFS (or other Hadoop-supported

* filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,

* use `SparkFiles.get(fileName)` to find its download location.

* @param recursive if true, a directory can be given in `path`. Currently directories are

* only supported for Hadoop-supported filesystems.

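*    1. The file is downloaded to every node.

*    2. If a file is added while the job is running, it only becomes available when the next TaskSet starts.

*    3. The file can be a local file, a file in HDFS or another Hadoop-supported filesystem, or an HTTP, HTTPS or FTP URI. Inside Spark jobs,

*        use SparkFiles.get(fileName) to access it.

*    4. To add files recursively, a directory can be given, but this only works for Hadoop-supported filesystems.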

*/

def addFile(path: String, recursive: Boolean): Unit = {

val uri = new Path(path).toUri

val schemeCorrectedPath = uri.getScheme match {

    // The path has no scheme, i.e. the scheme is null.

    // On the command line, --files /home/kong/log4j.properties is treated the same as --files local:/home/kong/log4j.properties

  case null | "local" => new File(path).getCanonicalFile.toURI.toString

  case _ => path

}

val hadoopPath = new Path(schemeCorrectedPath)

val scheme = new URI(schemeCorrectedPath).getScheme

if (!Array("http", "https", "ftp").contains(scheme)) {

  val fs = hadoopPath.getFileSystem(hadoopConfiguration)

  val isDir = fs.getFileStatus(hadoopPath).isDirectory

  if (!isLocal && scheme == "file" && isDir) {

    throw new SparkException(s"addFile does not support local directories when not running " +

      "local mode.")

  }

  if (!recursive && isDir) {

    throw new SparkException(s"Added file $hadoopPath is a directory and recursive is not " +

      "turned on.")

  }

} else {

  // SPARK-17650: Make sure this is a valid URL before adding it to the list of dependencies

  Utils.validateURL(uri)

}

val key = if (!isLocal && scheme == "file") {

  env.rpcEnv.fileServer.addFile(new File(uri.getPath))

} else {

  schemeCorrectedPath

}

val timestamp = System.currentTimeMillis

if (addedFiles.putIfAbsent(key, timestamp).isEmpty) {

  logInfo(s"Added file $path at $key with timestamp $timestamp")

  // Fetch the file locally so that closures which are run on the driver can still use the

  // SparkFiles API to access files.

  Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,

    env.securityManager, hadoopConfiguration, timestamp, useCache = false)

  postEnvironmentUpdate()

}

}
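The scheme handling at the top of addFile can be approximated in plain Java. This is a sketch under stated assumptions: java.net.URI is used in place of Hadoop's Path to read the scheme, and the names SchemeCorrection, correct and isRemote are hypothetical. It only shows how a path is classified, not how the file is fetched.

```java
import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;

public class SchemeCorrection {
    // Schemes that addFile validates as URLs instead of opening through a Hadoop FileSystem.
    private static final List<String> REMOTE_SCHEMES = Arrays.asList("http", "https", "ftp");

    // Paths with no scheme (or the "local" scheme) are rewritten to a canonical file: URI;
    // every other path (hdfs:, http:, ...) is returned unchanged.
    public static String correct(String path) throws URISyntaxException, IOException {
        String scheme = new URI(path).getScheme();
        if (scheme == null || scheme.equals("local")) {
            return new File(path).getCanonicalFile().toURI().toString();
        }
        return path;
    }

    // True when the corrected path would take the URL-validation branch.
    public static boolean isRemote(String correctedPath) throws URISyntaxException {
        return REMOTE_SCHEMES.contains(new URI(correctedPath).getScheme());
    }

    public static void main(String[] args) throws Exception {
        for (String p : new String[] {"/tmp/data.txt", "hdfs:/path/to/file", "http://example.com/a.txt"}) {
            String corrected = correct(p);
            System.out.println(p + " -> " + corrected + " (remote: " + isRemote(corrected) + ")");
        }
    }
}
```

In the real method, the corrected path is also what gets recorded in addedFiles, unless the job is not in local mode and the scheme is "file", in which case the key is the address returned by the driver's RPC file server.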

Both addJar and addFile call postEnvironmentUpdate at the end, and postEnvironmentUpdate is also called at the end of SparkContext initialization. The code is as follows:

  /** Post the environment update event once the task scheduler is ready */

  private def postEnvironmentUpdate() {

    if (taskScheduler != null) {

      val schedulingMode = getSchedulingMode.toString

      val addedJarPaths = addedJars.keys.toSeq

      val addedFilePaths = addedFiles.keys.toSeq

        // SparkEnv.environmentDetails gathers the JVM information, Spark properties, system properties, classpath entries, etc. into the environment details.

      val environmentDetails = SparkEnv.environmentDetails(conf, schedulingMode, addedJarPaths,

        addedFilePaths)

        // Build a SparkListenerEnvironmentUpdate event and post it to the listener bus

      val environmentUpdate = SparkListenerEnvironmentUpdate(environmentDetails)

      listenerBus.post(environmentUpdate)

    }

  }

The environmentDetails method:

  /**

   * Return a map representation of jvm information, Spark properties, system properties, and

   * class paths. Map keys define the category, and map values represent the corresponding

   * attributes as a sequence of KV pairs. This is used mainly for SparkListenerEnvironmentUpdate.

   */

  private[spark]

  def environmentDetails(

      conf: SparkConf,

      schedulingMode: String,

      addedJars: Seq[String],

      addedFiles: Seq[String]): Map[String, Seq[(String, String)]] = {

    import Properties._

    val jvmInformation = Seq(

      ("Java Version", s"$javaVersion ($javaVendor)"),

      ("Java Home", javaHome),

      ("Scala Version", versionString)

    ).sorted

    // Spark properties

    // This includes the scheduling mode whether or not it is configured (used by SparkUI)

    val schedulerMode =

      if (!conf.contains("spark.scheduler.mode")) {

        Seq(("spark.scheduler.mode", schedulingMode))

      } else {

        Seq.empty[(String, String)]

      }

    val sparkProperties = (conf.getAll ++ schedulerMode).sorted

    // System properties that are not java classpaths

    val systemProperties = Utils.getSystemProperties.toSeq

    val otherProperties = systemProperties.filter { case (k, _) =>

      k != "java.class.path" && !k.startsWith("spark.")

    }.sorted

    // Class paths including all added jars and files

    val classPathEntries = javaClassPath

      .split(File.pathSeparator)

      .filterNot(_.isEmpty)

      .map((_, "System Classpath"))

    val addedJarsAndFiles = (addedJars ++ addedFiles).map((_, "Added By User"))

    val classPaths = (addedJarsAndFiles ++ classPathEntries).sorted

    Map[String, Seq[(String, String)]](

      "JVM Information" -> jvmInformation,

      "Spark Properties" -> sparkProperties,

      "System Properties" -> otherProperties,

      "Classpath Entries" -> classPaths)

  }
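The last part of environmentDetails is where files added through spark.files become visible: they are listed under "Classpath Entries" with the tag "Added By User". A minimal Java sketch of that assembly step follows. The class name ClasspathEntries and the method build are hypothetical, and the class path is passed in as a parameter rather than read from the JVM so the example is deterministic.

```java
import java.io.File;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class ClasspathEntries {
    // Tags user-added jars and files as "Added By User", tags non-empty class path
    // segments as "System Classpath", and sorts the combined list by path, then tag.
    public static List<Map.Entry<String, String>> build(
            String javaClassPath, List<String> addedJars, List<String> addedFiles) {
        List<Map.Entry<String, String>> entries = new ArrayList<>();
        for (String jar : addedJars) {
            entries.add(new SimpleEntry<>(jar, "Added By User"));
        }
        for (String file : addedFiles) {
            entries.add(new SimpleEntry<>(file, "Added By User"));
        }
        for (String segment : javaClassPath.split(File.pathSeparator)) {
            if (!segment.isEmpty()) {
                entries.add(new SimpleEntry<>(segment, "System Classpath"));
            }
        }
        Comparator<Map.Entry<String, String>> byPath = Map.Entry.comparingByKey();
        Comparator<Map.Entry<String, String>> byTag = Map.Entry.comparingByValue();
        entries.sort(byPath.thenComparing(byTag));
        return entries;
    }

    public static void main(String[] args) {
        String classPath = "b.jar" + File.pathSeparator + File.pathSeparator + "a.jar";
        List<Map.Entry<String, String>> entries =
                build(classPath, Arrays.asList("z.jar"), Arrays.asList("c.txt"));
        for (Map.Entry<String, String> e : entries) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Because the resulting map is carried by the SparkListenerEnvironmentUpdate event, listeners on the bus see the added files alongside the system class path.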
