Spark Code Walkthrough: Cache

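Spark is an in-memory computation model, but when the compute chain is very long or a particular computation is very expensive, it is convenient to cache some intermediate results. Spark provides two ways to do this: cache and checkpoint. This chapter covers only cache (based on spark-core_2.10); checkpoint is covered in a later chapter.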

We look at it from three angles:

What happens on persist
How data is cached and read back when an action runs
How the cache is released

Defining the cache

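Spark evaluates lazily: the data of each RDD is computed only when an action runs. To cache an RDD, call RDD.persist() before running an action. This defines the cache but does not yet cache anything. RDD.persist calls SparkContext.persistRDD(rdd) and also registers the RDD with the ContextCleaner (discussed later).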

def persist(newLevel: StorageLevel): this.type = {
  // TODO: Handle changes of StorageLevel
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel) {
    throw new UnsupportedOperationException(
      "Cannot change storage level of an RDD after it was already assigned a level")
  }
  sc.persistRDD(this)
  // Register the RDD with the ContextCleaner for automatic GC-based cleanup
  sc.cleaner.foreach(_.registerRDDForCleanup(this))
  storageLevel = newLevel
  this
}

sc.persistRDD is simple: it adds (rdd.id, rdd) to persistentRdds. persistentRdds is a HashMap whose key is rdd.id and whose value is a timestamped weak reference to the RDD. It is used to keep track of the RDDs that have been marked as persisted.

So the definition phase does two things: it sets the RDD's StorageLevel, and it adds the RDD to persistentRdds and registers it with the ContextCleaner.
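The two bookkeeping steps can be mimicked without Spark. The following is a minimal sketch under stated assumptions: ToyContext, ToyRdd and the Level objects are hypothetical stand-ins, not Spark classes, and the ContextCleaner registration is left out. It only illustrates that persist records a level and registers the RDD, without computing any data.

```scala
import scala.collection.mutable

// Toy stand-ins for illustration only; these are NOT Spark's classes.
object PersistSketch {
  sealed trait Level
  case object NoneLevel extends Level
  case object MemoryOnly extends Level
  case object DiskOnly extends Level

  // Hypothetical registry playing the role of SparkContext.persistentRdds.
  class ToyContext {
    val persistentRdds = mutable.HashMap.empty[Int, ToyRdd]
    def persistRDD(rdd: ToyRdd): Unit = { persistentRdds(rdd.id) = rdd }
  }

  class ToyRdd(val id: Int, sc: ToyContext) {
    private var storageLevel: Level = NoneLevel
    def getStorageLevel: Level = storageLevel

    // Same guard as the excerpt above: a level can be assigned once, not changed.
    def persist(newLevel: Level): this.type = {
      if (storageLevel != NoneLevel && newLevel != storageLevel) {
        throw new UnsupportedOperationException(
          "Cannot change storage level of an RDD after it was already assigned a level")
      }
      sc.persistRDD(this) // bookkeeping only, no data is computed here
      storageLevel = newLevel
      this
    }
  }
}
```

Calling persist twice with the same level is harmless in this sketch, while a second call with a different level throws, which matches the guard in the excerpt.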

Caching

When an action is executed, the actual computation starts: DAGScheduler.submitJob is called to submit the job, and each partition is computed through rdd.iterator().

final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}

The logic of iterator is clear: if a storageLevel has been set, go through the CacheManager; otherwise compute the partition directly or read it from the checkpoint.

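cacheManager.getOrCompute uses an RDDBlockId to try to fetch the cached data from the BlockManager. If nothing is cached yet (the first computation), it calls computeOrReadCheckpoint to compute the partition and caches the result through putInBlockManager. Depending on the StorageLevel, a result cached in memory is stored in a HashMap inside MemoryStore, while a result cached on disk is written by DiskStore.put to a directory on disk. That directory is ultimately determined by the following method in Utils: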

private def getOrCreateLocalRootDirsImpl(conf: SparkConf): Array[String] = {
  if (isRunningInYarnContainer(conf)) {
    // If we are in yarn mode, systems can have different disk layouts so we must set it
    // to what Yarn on this system said was available. Note this assumes that Yarn has
    // created the directories already, and that they are secured so that only the
    // user has access to them.
    getYarnLocalDirs(conf).split(",")
  } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
    conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
  } else {
    // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
    // configuration to point to a secure directory. So create a subdirectory with restricted
    // permissions under each listed directory.
    Option(conf.getenv("SPARK_LOCAL_DIRS"))
      .getOrElse(conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")))
      .split(",")
      .flatMap { root =>
        try {
          val rootDir = new File(root)
          if (rootDir.exists || rootDir.mkdirs()) {
            val dir = createTempDir(root)
            chmod700(dir)
            Some(dir.getAbsolutePath)
          } else {
            logError(s"Failed to create dir in $root. Ignoring this directory.")
            None
          }
        } catch {
          case e: IOException =>
            logError(s"Failed to create local root dir in $root. Ignoring this directory.")
            None
        }
      }
      .toArray
  }
}
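The order in which that method picks its candidate root directories is easy to lose among the error handling. The sketch below isolates just that lookup order as a pure function. It is an illustration, not Spark code: rootDirCandidates is a hypothetical helper, the environment and configuration are passed in as plain maps, the YARN branch is omitted, and no directories are created or chmod-ed.

```scala
object LocalDirSketch {
  // Hypothetical helper: env and conf are plain maps instead of a SparkConf.
  // The YARN container branch of the real method is deliberately left out.
  def rootDirCandidates(
      env: Map[String, String],
      conf: Map[String, String],
      tmpDir: String): Seq[String] =
    env.get("SPARK_EXECUTOR_DIRS") match {
      // SPARK_EXECUTOR_DIRS is split on the platform path separator.
      case Some(dirs) => dirs.split(java.io.File.pathSeparator).toSeq
      case None =>
        // Otherwise: SPARK_LOCAL_DIRS, then spark.local.dir, then the JVM temp dir,
        // each treated as a comma-separated list.
        env.get("SPARK_LOCAL_DIRS")
          .getOrElse(conf.getOrElse("spark.local.dir", tmpDir))
          .split(",")
          .toSeq
    }
}
```

For example, with neither environment variable set and spark.local.dir absent, the function falls back to the single temp directory it was given.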

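If the partition is already cached, the call to blockManager.get(RDDBlockId) inside cacheManager.getOrCompute returns it. get first calls getLocal to look locally; if the block is not there, it calls getRemote, which asks BlockManagerMaster.getLocations for the locations of the cached block.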
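The get-or-compute flow described in this section can be modelled in a few lines. This is a toy sketch, assuming a single in-process "local" map and an immutable "remote" map in place of real executors; ToyBlockManager, ToyCacheManager and BlockId are hypothetical names that merely echo Spark's, and storage levels are ignored.

```scala
import scala.collection.mutable

// Toy model for illustration; none of this is Spark code.
object CacheSketch {
  // Plays the role of RDDBlockId(rddId, splitIndex).
  final case class BlockId(rddId: Int, split: Int)

  class ToyBlockManager(remote: Map[BlockId, Seq[Int]]) {
    private val local = mutable.HashMap.empty[BlockId, Seq[Int]]

    // Local store first, then the simulated remote executors.
    def get(id: BlockId): Option[Seq[Int]] = local.get(id).orElse(remote.get(id))
    def put(id: BlockId, data: Seq[Int]): Unit = { local(id) = data }
  }

  class ToyCacheManager(blocks: ToyBlockManager) {
    var computeCalls = 0 // counts cache misses that fell through to compute

    def getOrCompute(id: BlockId)(compute: => Seq[Int]): Seq[Int] =
      blocks.get(id) match {
        case Some(cached) => cached
        case None =>
          computeCalls += 1
          val result = compute   // evaluated only on a miss
          blocks.put(id, result) // cached for the next action
          result
      }
  }
}
```

The compute argument is passed by name, so on a cache hit it is never evaluated, which is the whole point of caching an expensive partition.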

Releasing the cache

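Spark releases a cache through rdd.unpersist, which is implemented by SparkContext.unpersistRDD. unpersistRDD removes the RDD from persistentRdds and tells the BlockManagerMaster to delete the cached data. The BlockManagerMaster then sends messages telling the executors to delete the cached data from memory or disk.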
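As a rough picture of that fan-out, the sketch below removes an RDD from a driver-side registry and then asks every executor to drop the matching blocks. It assumes direct method calls instead of messages, and ToyDriver and ToyExecutor are hypothetical classes invented for the illustration.

```scala
import scala.collection.mutable

// Toy model for illustration; real Spark sends messages between driver and executors.
object UnpersistSketch {
  class ToyExecutor {
    // (rddId, split) -> cached partition data
    val blocks = mutable.HashMap.empty[(Int, Int), Seq[Int]]

    // Drops every block of the given RDD and reports how many were removed.
    def removeRdd(rddId: Int): Int = {
      val doomed = blocks.keys.filter(_._1 == rddId).toList
      doomed.foreach(k => blocks.remove(k))
      doomed.size
    }
  }

  class ToyDriver(executors: Seq[ToyExecutor]) {
    val persistentRdds = mutable.HashMap.empty[Int, String]

    def unpersistRDD(rddId: Int): Int = {
      persistentRdds.remove(rddId)              // stop tracking the RDD
      executors.map(_.removeRdd(rddId)).sum     // tell every executor to drop its blocks
    }
  }
}
```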

This raises a question: if the user never calls unpersist manually, won't the cache keep growing until it blows up?

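That is a reasonable concern, which is why Spark automatically removes caches that have gone out of scope. "Out of scope" means the user program no longer holds any reference to the RDD, so its data can no longer be read. This is where the ContextCleaner mentioned earlier comes in. The ContextCleaner keeps CleanupTaskWeakReference weak references together with the reference queue they are registered on. When a GC reclaims an RDD object that has no strong references left, its weak reference is added to the queue. The ContextCleaner runs a dedicated thread that polls this queue, takes each weak reference off it, and triggers sc.unpersistRDD with the rddId stored in the reference. In this way Spark promptly releases the cache of RDDs that have been garbage collected. Note the relationship between an RDD and its dataset: an RDD is only an object that defines the computation logic. It does not contain the data it represents, which is obtained through rdd.compute. So when the system reclaims an RDD, it reclaims only the RDD object, not the dataset the RDD represents.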
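The weak-reference-plus-queue pattern is a standard JDK mechanism, so it can be sketched with java.lang.ref alone. The code below is a simplified illustration and not Spark's ContextCleaner: CleanupRef and ToyCleaner are hypothetical names, cleanup is a single polling pass rather than a background thread, and the unpersist action is just a callback.

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}
import scala.collection.mutable

// Simplified sketch, not Spark's ContextCleaner; all names are illustrative.
object CleanerSketch {
  // The weak reference carries the id, so cleanup can still run after the referent is gone.
  class CleanupRef(referent: AnyRef, val rddId: Int, queue: ReferenceQueue[AnyRef])
    extends WeakReference[AnyRef](referent, queue)

  class ToyCleaner(unpersist: Int => Unit) {
    private val queue = new ReferenceQueue[AnyRef]
    // Strong references to the CleanupRef objects themselves (not to the RDDs),
    // so the references survive until they have been processed.
    private val pending = mutable.HashSet.empty[CleanupRef]

    def register(rdd: AnyRef, rddId: Int): CleanupRef = {
      val ref = new CleanupRef(rdd, rddId, queue)
      pending += ref
      ref
    }

    // One polling pass over the queue; returns how many entries were cleaned.
    def drain(): Int = {
      var cleaned = 0
      var next = queue.poll()
      while (next != null) {
        val ref = next.asInstanceOf[CleanupRef]
        pending -= ref
        unpersist(ref.rddId)
        cleaned += 1
        next = queue.poll()
      }
      cleaned
    }
  }
}
```

In a real run the garbage collector enqueues the reference at some point after the RDD object becomes unreachable, and the timing is not predictable. For a deterministic experiment, calling enqueue() on the returned reference can stand in for the GC.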

In addition, SparkContext has a MetadataCleaner that removes expired RDDs from persistentRdds. (The author has never been clear on how this removal relates to releasing the cache.)
