How SparkR works

In Spark 2.0, the RDD backend code lives in org.apache.spark.rdd, and the R-related code lives in org.apache.spark.api.r.

Start from the entry point. ./bin/sparkR contains only four statements, and the one that does the work is this:

exec "$SPARK_HOME"/bin/spark-submit sparkr-shell-main "$@"

spark-submit is itself a one-line shell script:

exec "$SPARK_HOME"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

So the real entry point is the class org.apache.spark.deploy.SparkSubmit, whose main method dispatches to the concrete action:

case SparkSubmitAction.SUBMIT => submit(appArgs)

/**
 * Submit the application using the provided parameters.
 *
 * This runs in two steps. First, we prepare the launch environment by setting up
 * the appropriate classpath, system properties, and application arguments for
 * running the child main class based on the cluster manager and the deploy mode.
 * Second, we use this launch environment to invoke the main method of the child
 * main class.
 */
private def submit(args: SparkSubmitArguments): Unit = {

The submit method prepares the classpath, system properties, and run arguments, then uses them to call the following method:

runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)

The method has two main steps. The first step calls the following to prepare the environment:

val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)

The second step calls

runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)

to run the child main class.

In the first step, the R code of SparkR is packaged into a zip file, and the main class to run is then set.

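If the primary resource is SPARKR-SHELL, the main class is org.apache.spark.api.r.RBackend.

In plain client mode with an R file, the main class is org.apache.spark.deploy.RRunner, which is invoked in the following form, for example: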

Usage: RRunner <main R file> [app arguments]

sun.java.command=com.aliyun.odps.cupid.runtime.Main --class org.apache.spark.deploy.RRunner --primary-r-file testOdpsRdd.R --arg testOdpsRdd.R

RBackend is built on Netty and handles communication between R and Java.

RRunner starts an RBackend and then launches a ProcessBuilder to execute the R script, which is this line:

new ProcessBuilder(Seq(rCommand, rFileNormalized) ++ otherArgs)
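
The command assembled on that line can be mirrored in a few lines of Python. This is only a sketch of the same idea (interpreter first, then the script path, then the remaining arguments), not Spark's code; the name `build_r_command` and the default interpreter `Rscript` are assumptions made for illustration.

```python
import os


def build_r_command(r_file, other_args, r_command="Rscript"):
    """Return the argv list for running an R script.

    Mirrors Seq(rCommand, rFileNormalized) ++ otherArgs from the text.
    The "Rscript" default is an assumption for this sketch.
    """
    # Normalize the script path, as the name rFileNormalized suggests.
    r_file_normalized = os.path.normpath(r_file)
    return [r_command, r_file_normalized] + list(other_args)
```

The resulting list could then be passed to `subprocess.Popen`, which plays the role that ProcessBuilder plays on the JVM side.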

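How do Spark workers pick up the SparkR code? R has an environment variable, R_PROFILE_USER, that is used to initialize the R runtime. After the SparkR code has been packaged and shipped to the cluster, each compute node first points this variable at the initialization script ${SPARK_HOME}/sparkr/SparkR/profile/general.R. That script resolves the package path and adds the unpacked SparkR code to the library paths of the current R environment. Its code is shown below: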

.First <- function() {
  packageDir <- Sys.getenv("SPARKR_PACKAGE_DIR")
  .libPaths(c(packageDir, .libPaths()))
  Sys.setenv(NOAWT=1)
}
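
What `.First` does is plain list manipulation: it puts the shipped package directory ahead of the existing library paths. A minimal Python sketch of that step follows, together with the environment a launcher might hand to the R process. The helper names and any example paths are hypothetical; only the variable names `R_PROFILE_USER` and `SPARKR_PACKAGE_DIR` and the profile script location come from the text.

```python
def prepend_lib_path(package_dir, lib_paths):
    """Equivalent of .libPaths(c(packageDir, .libPaths())) on a plain list."""
    return [package_dir] + list(lib_paths)


def r_process_env(base_env, spark_home, package_dir):
    """Build the environment for the R process (hypothetical helper)."""
    env = dict(base_env)  # copy so the caller's mapping is not mutated
    env["R_PROFILE_USER"] = spark_home + "/sparkr/SparkR/profile/general.R"
    env["SPARKR_PACKAGE_DIR"] = package_dir
    return env
```

Putting the package directory first means the shipped SparkR package is found before any copy already installed on the node.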

The following code comes from prepareSubmitEnvironment:

// In YARN mode for an R app, add the SparkR package archive to archives
// that can be distributed with the job
if (args.isR && clusterManager == YARN) {
  val rPackagePath = RUtils.localSparkRPackagePath
  if (rPackagePath.isEmpty) {
    printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.")
  }
  val rPackageFile =
    RPackageUtils.zipRLibraries(new File(rPackagePath.get), SPARKR_PACKAGE_ARCHIVE)
  if (!rPackageFile.exists()) {
    printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.")
  }
  val localURI = Utils.resolveURI(rPackageFile.getAbsolutePath)

  // Assigns a symbol link name "sparkr" to the shipped package.
  args.archives = mergeFileLists(args.archives, localURI.toString + "#sparkr")
}

// If we're running a R app, set the main class to our specific R runner
if (args.isR && deployMode == CLIENT) {
  if (args.primaryResource == SPARKR_SHELL) {
    args.mainClass = "org.apache.spark.api.r.RBackend"
  } else {
    // If a R file is provided, add it to the child arguments and list of files to deploy.
    // Usage: RRunner <main R file> [app arguments]
    args.mainClass = "org.apache.spark.deploy.RRunner"
    args.childArgs = ArrayBuffer(args.primaryResource) ++ args.childArgs
    args.files = mergeFileLists(args.files, args.primaryResource)
  }
}

if (isYarnCluster && args.isR) {
  // In yarn-cluster mode for a R app, add primary resource to files
  // that can be distributed with the job
  args.files = mergeFileLists(args.files, args.primaryResource)
}
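
The R-specific branches above come down to two small decisions: how comma-separated file lists are merged, and which main class is chosen in client mode. The Python sketch below restates that logic under stated assumptions: `merge_file_lists` is a guess at the behavior of `mergeFileLists` (join the non-empty lists with commas), and the value given to `SPARKR_SHELL` is assumed rather than taken from the text.

```python
SPARKR_SHELL = "sparkr-shell"  # assumed value of the constant


def merge_file_lists(*lists):
    """Join comma-separated lists, skipping None and empty entries."""
    parts = [item for lst in lists if lst for item in lst.split(",") if item]
    return ",".join(parts) if parts else None


def r_main_class(primary_resource):
    """Pick the main class for an R app in client mode."""
    if primary_resource == SPARKR_SHELL:
        return "org.apache.spark.api.r.RBackend"
    return "org.apache.spark.deploy.RRunner"
```

With this sketch, `merge_file_lists(None, "/tmp/sparkr.zip#sparkr")` yields the single archive entry, where the `#sparkr` fragment is the link name under which the archive is exposed to the job.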

For ordinary Scala/Java jobs in standalone cluster mode, the following class is used directly:

// In legacy standalone cluster mode, use Client as a wrapper around the user class
childMainClass = "org.apache.spark.deploy.Client"

In client mode, the user application's main class is launched directly. For SPARKR_SHELL, that main class is org.apache.spark.api.r.RBackend.

When an R file is submitted directly, it is org.apache.spark.deploy.RRunner.

// In client mode, launch the application main class directly
// In addition, add the main application jar and any added jars (if any) to the classpath
if (deployMode == CLIENT) {
  childMainClass = args.mainClass
  if (isUserJar(args.primaryResource)) {
    childClasspath += args.primaryResource
  }
  if (args.jars != null) { childClasspath ++= args.jars.split(",") }
  if (args.childArgs != null) { childArgs ++= args.childArgs }
}

In yarn-cluster mode, org.apache.spark.deploy.yarn.Client is used. This class wraps the user's class for submission.

// In yarn-cluster mode, use yarn.Client as a wrapper around the user class
if (isYarnCluster) {
  childMainClass = "org.apache.spark.deploy.yarn.Client"
  if (args.isPython) {
    childArgs += ("--primary-py-file", args.primaryResource)
    if (args.pyFiles != null) {
      childArgs += ("--py-files", args.pyFiles)
    }
    childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
  } else if (args.isR) {
    val mainFile = new Path(args.primaryResource).getName
    childArgs += ("--primary-r-file", mainFile)
    childArgs += ("--class", "org.apache.spark.deploy.RRunner")
  } else {
    if (args.primaryResource != SPARK_INTERNAL) {
      childArgs += ("--jar", args.primaryResource)
    }
    childArgs += ("--class", args.mainClass)
  }
  if (args.childArgs != null) {
    args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
  }
}
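
The argument list handed to yarn.Client in the R case can be reproduced with a short Python sketch. It follows only the R branch of the Scala code above; the function name is hypothetical, and `posixpath.basename` stands in for Hadoop's `Path.getName`.

```python
import posixpath


def yarn_cluster_r_args(primary_resource, child_args=None):
    """Build the child arguments for an R app in yarn-cluster mode."""
    # Only the file name is passed on, not the full local path.
    main_file = posixpath.basename(primary_resource)
    args = ["--primary-r-file", main_file,
            "--class", "org.apache.spark.deploy.RRunner"]
    # Each user argument is forwarded behind its own --arg flag.
    for arg in child_args or []:
        args += ["--arg", arg]
    return args
```

The same flags (`--class`, `--primary-r-file`, `--arg`) appear in the sun.java.command line quoted earlier, though in a different order there.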

org.apache.spark.deploy.yarn.Client

For Python calling into Spark, python/pyspark/context.py contains

class SparkContext(object)

Its _jvm member is the Py4J entry point into the JVM. It is initialized as follows:

if not SparkContext._gateway:
    SparkContext._gateway = gateway or launch_gateway()
    SparkContext._jvm = SparkContext._gateway.jvm

It calls backend methods like this:

# Create a temporary directory inside spark.local.dir:
local_dir = self._jvm.org.apache.spark.util.Utils.getLocalDir(self._jsc.sc().conf())
self._temp_dir = \
    self._jvm.org.apache.spark.util.Utils.createTempDir(local_dir, "pyspark") \
        .getAbsolutePath()
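
The gateway initialization in context.py is a lazy, class-level singleton: the gateway is launched once and every later context reuses it. The sketch below shows that pattern in isolation with a stand-in gateway, so it runs without Py4J or a JVM. `FakeGateway`, `Context`, and this `launch_gateway` are hypothetical names for illustration, not PySpark's classes.

```python
class FakeGateway:
    """Stand-in for a Py4J gateway; jvm is just a placeholder object."""
    launches = 0

    def __init__(self):
        FakeGateway.launches += 1
        self.jvm = object()


def launch_gateway():
    """Hypothetical launcher; the real one starts a JVM process."""
    return FakeGateway()


class Context:
    _gateway = None
    _jvm = None

    def __init__(self, gateway=None):
        # Launch the gateway only if no earlier instance has done so.
        if not Context._gateway:
            Context._gateway = gateway or launch_gateway()
            Context._jvm = Context._gateway.jvm
```

Creating two `Context` objects in a row launches the stand-in gateway only once, and both see the same `_jvm` object.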
