Spark Code Walkthrough: Cache

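Spark's computing model is memory-based, but when the compute chain is very long or a particular computation is very expensive, it is convenient to cache some intermediate results. Spark offers two mechanisms for this: cache and checkpoint. This chapter covers only cache (based on spark-core_2.10); checkpoint is covered in a later chapter.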

We look at it from three angles:

What happens when persist is called
How the cache is written and read when an action runs
How the cache is released

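Defining the Cache

Spark evaluates lazily: the data of each RDD is computed only when an action runs. To cache an RDD, call RDD.persist() before running the action. That call only declares the cache; nothing is actually cached yet. RDD.persist calls SparkContext.persistRDD(rdd) and also registers the RDD with the ContextCleaner (discussed later).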

def persist(newLevel: StorageLevel): this.type = {
  // TODO: Handle changes of StorageLevel
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel) {
    throw new UnsupportedOperationException(
      "Cannot change storage level of an RDD after it was already assigned a level")
  }
  sc.persistRDD(this)
  // Register the RDD with the ContextCleaner for automatic GC-based cleanup
  sc.cleaner.foreach(_.registerRDDForCleanup(this))
  storageLevel = newLevel
  this
}

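sc.persistRDD is simple: it adds (rdd.id, rdd) to persistentRdds. persistentRdds is a HashMap whose key is rdd.id and whose value is a timestamped weak reference to the RDD. It tracks references to the RDDs that have been marked as persisted.

So the definition stage does two things: it sets the RDD's StorageLevel, and it adds the RDD to persistentRdds and registers it with the ContextCleaner.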
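The bookkeeping done at definition time can be sketched in plain Scala without a Spark dependency. This is only a toy model: ToyContext, ToyRdd and the Level objects are hypothetical names invented for illustration, and a plain map stands in for the weak-reference map that Spark uses.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for StorageLevel; not Spark's real API.
sealed trait Level
case object NoLevel extends Level
case object MemoryOnly extends Level
case object DiskOnly extends Level

class ToyContext {
  // Spark stores weak references here; a plain map keeps the sketch short.
  val persistentRdds = mutable.Map.empty[Int, ToyRdd]
  def persistRDD(rdd: ToyRdd): Unit = persistentRdds.update(rdd.id, rdd)
}

class ToyRdd(val id: Int, sc: ToyContext) {
  private var level: Level = NoLevel
  def storageLevel: Level = level

  // Only records the intent to cache; no data is computed or stored here.
  def persist(newLevel: Level): this.type = {
    if (level != NoLevel && newLevel != level) {
      throw new UnsupportedOperationException(
        "Cannot change storage level of an RDD after it was already assigned a level")
    }
    sc.persistRDD(this)
    level = newLevel
    this
  }
}
```

Calling persist a second time with the same level is a no-op apart from re-registering the RDD, while asking for a different level throws, which matches the check in the snippet quoted from RDD.persist.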

Caching

The real computation starts when an action is executed. The action calls DAGScheduler.submitJob to submit a job, and each partition is then computed through rdd.iterator().

final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}

The logic of iterator is straightforward: if a storageLevel has been set, it goes to the CacheManager; otherwise it computes the partition itself or reads it from a checkpoint.
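To make the "compute once, then serve from the cache" behaviour concrete, here is a minimal sketch in plain Scala. ToyBlockId, ToyCacheManager and ToyPartitionedRdd are hypothetical names, and the cache is just an in-memory map; the real CacheManager delegates to the BlockManager and also handles locking and storage levels.

```scala
import scala.collection.mutable

// Hypothetical miniature of the cache path; names are illustrative only.
final case class ToyBlockId(rddId: Int, partition: Int)

class ToyCacheManager {
  private val blocks = mutable.Map.empty[ToyBlockId, Vector[Int]]

  // Return the cached partition if present; otherwise compute it once and store it.
  def getOrCompute(id: ToyBlockId)(compute: => Vector[Int]): Iterator[Int] =
    blocks.getOrElseUpdate(id, compute).iterator
}

class ToyPartitionedRdd(val id: Int, cached: Boolean, cache: ToyCacheManager) {
  var computeCalls = 0 // counts how often the expensive path runs

  private def compute(partition: Int): Vector[Int] = {
    computeCalls += 1
    Vector.tabulate(3)(i => partition * 10 + i)
  }

  // Mirrors the branch in RDD.iterator: use the cache only when a level is set.
  def iterator(partition: Int): Iterator[Int] =
    if (cached) cache.getOrCompute(ToyBlockId(id, partition))(compute(partition))
    else compute(partition).iterator
}
```

With cached set to true, reading the same partition twice increments computeCalls only once; with cached set to false, every read recomputes.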

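cacheManager.getOrCompute uses the RDDBlockId to try to fetch the cached data from the BlockManager. If nothing is cached yet (the first computation), it calls computeOrReadCheckpoint to compute the partition and caches the result through putInBlockManager. Depending on the StorageLevel, an in-memory result is stored in a HashMap inside MemoryStore, while an on-disk result is written by DiskStore.put to a file in a directory on disk. That directory is ultimately determined by this method in Utils: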

private def getOrCreateLocalRootDirsImpl(conf: SparkConf): Array[String] = {
  if (isRunningInYarnContainer(conf)) {
    // If we are in yarn mode, systems can have different disk layouts so we must set it
    // to what Yarn on this system said was available. Note this assumes that Yarn has
    // created the directories already, and that they are secured so that only the
    // user has access to them.
    getYarnLocalDirs(conf).split(",")
  } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
    conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
  } else {
    // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
    // configuration to point to a secure directory. So create a subdirectory with restricted
    // permissions under each listed directory.
    Option(conf.getenv("SPARK_LOCAL_DIRS"))
      .getOrElse(conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")))
      .split(",")
      .flatMap { root =>
        try {
          val rootDir = new File(root)
          if (rootDir.exists || rootDir.mkdirs()) {
            val dir = createTempDir(root)
            chmod700(dir)
            Some(dir.getAbsolutePath)
          } else {
            logError(s"Failed to create dir in $root. Ignoring this directory.")
            None
          }
        } catch {
          case e: IOException =>
            logError(s"Failed to create local root dir in $root. Ignoring this directory.")
            None
        }
      }
      .toArray
  }
}
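The precedence in the last branch is easy to misread, so here is a small sketch of it alone: the SPARK_LOCAL_DIRS environment variable wins over spark.local.dir, which in turn wins over java.io.tmpdir. LocalDirSketch and candidateRoots are hypothetical names; the environment and configuration are passed in as plain maps, and the YARN and SPARK_EXECUTOR_DIRS branches, the directory creation and the chmod are all left out.

```scala
object LocalDirSketch {
  // Returns the candidate root directories for the non-YARN fallback branch only.
  // env and conf are plain maps here (an assumption) so the precedence can be checked directly.
  def candidateRoots(
      env: Map[String, String],
      conf: Map[String, String],
      tmpDir: String): List[String] =
    env.get("SPARK_LOCAL_DIRS")
      .getOrElse(conf.getOrElse("spark.local.dir", tmpDir))
      .split(",")
      .toList
}
```

For example, with SPARK_LOCAL_DIRS set to /a,/b the function returns /a and /b regardless of what spark.local.dir says.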

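If the partition is already cached, the blockManager.get(RDDBlockId) call inside cacheManager.getOrCompute returns the result. get first calls getLocal to look locally; if the block is not there, it calls getRemote, which asks BlockManagerMaster.getLocations for the location of the cached block.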
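The local-then-remote lookup order can be sketched as follows. ToyBlockStore is a hypothetical class that uses two immutable maps in place of the local stores and the remote executors; the real BlockManager.get returns a richer result and performs network fetches.

```scala
// Hypothetical model of the lookup order only; not Spark's BlockManager API.
class ToyBlockStore(local: Map[String, String], remote: Map[String, String]) {
  def getLocal(id: String): Option[String] = local.get(id)
  def getRemote(id: String): Option[String] = remote.get(id)

  // Try the local store first and fall back to the remote one; tag where the hit came from.
  def get(id: String): Option[(String, String)] =
    getLocal(id).map(v => ("local", v))
      .orElse(getRemote(id).map(v => ("remote", v)))
}
```

A block present in both places is served locally, and a block present in neither yields None, which in the real code path is what triggers the computation.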

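Releasing the Cache

Spark releases a cache through rdd.unpersist, which is implemented by SparkContext.unpersistRDD. unpersistRDD removes the RDD from persistentRdds and tells the BlockManagerMaster to delete the cached data. The BlockManagerMaster then sends messages to the executors, telling them to delete the cached data in memory or on disk.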
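As a rough picture of that fan-out, the sketch below has a driver-side object drop its own bookkeeping entry and then ask every executor to remove the blocks of one RDD. ToyExecutor and ToyDriver are hypothetical names, and direct method calls stand in for the messages Spark sends.

```scala
import scala.collection.mutable

// Hypothetical executor-side store keyed by (rddId, partition).
class ToyExecutor {
  val blocks = mutable.Map.empty[(Int, Int), String]

  // Remove every block that belongs to the given RDD; return how many were removed.
  def removeRdd(rddId: Int): Int = {
    val doomed = blocks.keys.filter(_._1 == rddId).toList
    doomed.foreach(k => blocks.remove(k))
    doomed.size
  }
}

// Hypothetical driver-side view; direct calls replace the real message passing.
class ToyDriver(executors: Seq[ToyExecutor]) {
  val persistentRdds = mutable.Set.empty[Int]

  def unpersistRDD(rddId: Int): Int = {
    persistentRdds -= rddId                   // forget the RDD on the driver
    executors.map(_.removeRdd(rddId)).sum     // ask each executor to drop its blocks
  }
}
```

Blocks of other RDDs on the same executor are left untouched, which is the property the per-RDD removal relies on.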

This raises a question: if the user never calls unpersist manually, won't the caches keep piling up until something blows up?

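The concern is reasonable, which is why Spark automatically removes caches that are out of scope. "Out of scope" means that the user program no longer holds any reference to the RDD, so the RDD's data can no longer be read. This is where the ContextCleaner mentioned earlier comes in. The ContextCleaner holds CleanupTaskWeakReference weak references together with the queue that receives them. When a GC reclaims an RDD object that has no strong references left, its weak reference is added to the queue. The ContextCleaner runs a separate thread that polls the queue, takes each weak reference out, and triggers sc.unpersistRDD with the rddId stored in the reference. This lets Spark promptly release the cache of any RDD that has been garbage collected.

It is important to be clear about the relationship between an RDD and its dataset. An RDD is only an object that defines computation logic; the object itself does not contain the data it represents, which is produced by rdd.compute. So when the system reclaims an RDD, it reclaims only the RDD object, not the dataset the RDD represents.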
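The weak-reference mechanism can be sketched with the JDK's own java.lang.ref classes. ToyCleanupRef and ToyCleaner are hypothetical names, and drain performs a single pass of what the text describes as a polling thread; this is a simplified model under those assumptions, not the ContextCleaner source.

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}
import scala.collection.mutable

// A weak reference that remembers which RDD id it stands for (hypothetical name).
class ToyCleanupRef(val rddId: Int, referent: AnyRef, queue: ReferenceQueue[AnyRef])
  extends WeakReference[AnyRef](referent, queue)

class ToyCleaner(unpersist: Int => Unit) {
  private val queue = new ReferenceQueue[AnyRef]
  // Strong references to the weak-reference objects themselves,
  // so they are not collected before they have been processed.
  private val pending = mutable.Set.empty[ToyCleanupRef]

  def register(rddId: Int, rdd: AnyRef): ToyCleanupRef = {
    val ref = new ToyCleanupRef(rddId, rdd, queue)
    pending += ref
    ref
  }

  // One pass over the queue; returns how many RDDs were unpersisted.
  def drain(): Int = {
    var cleaned = 0
    var next = queue.poll()
    while (next != null) {
      next match {
        case ref: ToyCleanupRef =>
          pending -= ref
          unpersist(ref.rddId)
          cleaned += 1
        case _ => // not one of ours; ignore it
      }
      next = queue.poll()
    }
    cleaned
  }
}
```

Because garbage collection timing is not deterministic, a test can call enqueue() on the returned reference to simulate the collector placing it on the queue, then call drain() and check that the unpersist callback received the expected id.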

In addition, SparkContext has a MetadataCleaner that removes expired RDDs from persistentRdds. (The author has never worked out how this removal relates to releasing the cache.)

Reference:

https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

http://jerryshao.me/architecture/2013/10/08/spark-storage-module-analysis/

https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md

http://blog.csdn.net/yueqian_zhu/article/details/48177353

http://www.cnblogs.com/jiaan-geng/p/5189177.html
