首先rdd.checkpoint()本身并没有执行任何的写操作,只是做checkpointDir是否为空,然后生成一个ReliableRDDCheckpointData对象checkpointData,这个对象完成checkpoint的大部分工作。

/**
* 只是生成了一个ReliableRDDCheckpointData的对象,并没有具体的实质操作
* Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
* directory set with `SparkContext#setCheckpointDir` and all references to its parent
* RDDs will be removed. This function must be called before any job has been
* executed on this RDD. It is strongly recommended that this RDD is persisted in
* memory, otherwise saving it on a file will require recomputation.
*/
def checkpoint(): Unit = RDDCheckpointData.synchronized {
// NOTE: we use a global lock here due to complexities downstream with ensuring
// children RDD partitions point to the correct parent partitions. In the future
// we should revisit this consideration.
if (context.checkpointDir.isEmpty) {
throw new SparkException("Checkpoint directory has not been set in the SparkContext")
} else if (checkpointData.isEmpty) {
checkpointData = Some(new ReliableRDDCheckpointData(this))
}
}

真正触发checkpoint操作的是rdd调用完checkpoint之后执行完的第一个action操作。

  /**
* Run a function on a given set of partitions in an RDD and pass the results to the given
* handler function. This is the main entry point for all actions in Spark.
*/
def runJob[T, U: ClassTag](
rdd: RDD[T],
func: (TaskContext, Iterator[T]) => U,
partitions: Seq[Int],
resultHandler: (Int, U) => Unit): Unit = {
if (stopped.get()) {
throw new IllegalStateException("SparkContext has been shutdown")
}
val callSite = getCallSite
val cleanedFunc = clean(func)
logInfo("Starting job: " + callSite.shortForm)
if (conf.getBoolean("spark.logLineage", false)) {
logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
}
dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
progressBar.foreach(_.finishAll())
rdd.doCheckpoint()
}

其中调用rdd.doCheckpoint(),doCheckpoint代码如下:

/**
* Performs the checkpointing of this RDD by saving this. It is called after a job using this RDD
* has completed (therefore the RDD has been materialized and potentially stored in memory).
* doCheckpoint() is called recursively on the parent RDDs.
*
* checkpointData.get.checkpoint()方法执行具体的写操作,由sc的action触发。如果本身没有checkpoint就根据依赖关系依次往上找。
*/
private[spark] def doCheckpoint(): Unit = {
RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
if (!doCheckpointCalled) {
doCheckpointCalled = true
if (checkpointData.isDefined) {
if (checkpointAllMarkedAncestors) {
// TODO We can collect all the RDDs that needs to be checkpointed, and then checkpoint
// them in parallel.
// Checkpoint parents first because our lineage will be truncated after we
// checkpoint ourselves
dependencies.foreach(_.rdd.doCheckpoint())
}
checkpointData.get.checkpoint()
} else {
dependencies.foreach(_.rdd.doCheckpoint())
}
}
}
}

其中checkpointData.get.checkpoint执行了最基本的写任务,docheckpoint的任务职能是如果该rdd执行过checkpoint操作,如果是把该RDD的祖先都checkpoint了,那么就根据依赖关系一次checkpoint操作。如果RDD本身没有

调用过checkpoint操作,那么就根据依赖关系一次checkpoint操作。

接下来看checkpointData.get.checkpoint的具体实现,其中主要功能在于ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)方法。

  /**
* Materialize this RDD and write its content to a reliable DFS.
* This is called immediately after the first action invoked on this RDD has completed.
*
* writeRDDToCheckpointDirectory方法将RDD写到指定目录
*/
protected override def doCheckpoint(): CheckpointRDD[T] = {
val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir) // Optionally clean our checkpoint files if the reference is out of scope
if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
rdd.context.cleaner.foreach { cleaner =>
cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
}
} logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
newRDD
}

以下是ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)的方法实现。主要包含两本分,写partition数据和写partitioner。具体如下:

  /**
* Write RDD to checkpoint files and return a ReliableCheckpointRDD representing the RDD.
* 写RDD到hdfs,包括partition数据和partitioner数据
*/
def writeRDDToCheckpointDirectory[T: ClassTag](
originalRDD: RDD[T],
checkpointDir: String,
blockSize: Int = -1): ReliableCheckpointRDD[T] = { val sc = originalRDD.sparkContext // Create the output path for the checkpoint
val checkpointDirPath = new Path(checkpointDir)
val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
if (!fs.mkdirs(checkpointDirPath)) {
throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
} // Save to file, and reload it as an RDD
val broadcastedConf = sc.broadcast(
new SerializableConfiguration(sc.hadoopConfiguration))
// TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
sc.runJob(originalRDD,
writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _) if (originalRDD.partitioner.nonEmpty) {
writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
} val newRDD = new ReliableCheckpointRDD[T](
sc, checkpointDirPath.toString, originalRDD.partitioner)
if (newRDD.partitions.length != originalRDD.partitions.length) {
throw new SparkException(
s"Checkpoint RDD $newRDD(${newRDD.partitions.length}) has different " +
s"number of partitions from original RDD $originalRDD(${originalRDD.partitions.length})")
}
newRDD
}

写partition数据:

sc.runJob(originalRDD,
writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)
/**
* Write an RDD partition's data to a checkpoint file.
*/
def writePartitionToCheckpointFile[T: ClassTag](
path: String,
broadcastedConf: Broadcast[SerializableConfiguration],
blockSize: Int = -1)(ctx: TaskContext, iterator: Iterator[T]) {
val env = SparkEnv.get
val outputDir = new Path(path)
val fs = outputDir.getFileSystem(broadcastedConf.value.value) val finalOutputName = ReliableCheckpointRDD.checkpointFileName(ctx.partitionId())
val finalOutputPath = new Path(outputDir, finalOutputName)
val tempOutputPath =
new Path(outputDir, s".$finalOutputName-attempt-${ctx.attemptNumber()}") val bufferSize = env.conf.getInt("spark.buffer.size", 65536) val fileOutputStream = if (blockSize < 0) {
fs.create(tempOutputPath, false, bufferSize)
} else {
// This is mainly for testing purpose
fs.create(tempOutputPath, false, bufferSize,
fs.getDefaultReplication(fs.getWorkingDirectory), blockSize)
}
val serializer = env.serializer.newInstance()
val serializeStream = serializer.serializeStream(fileOutputStream)
Utils.tryWithSafeFinally {
serializeStream.writeAll(iterator)
} {
serializeStream.close()
} if (!fs.rename(tempOutputPath, finalOutputPath)) {
if (!fs.exists(finalOutputPath)) {
logInfo(s"Deleting tempOutputPath $tempOutputPath")
fs.delete(tempOutputPath, false)
throw new IOException("Checkpoint failed: failed to save output of task: " +
s"${ctx.attemptNumber()} and final output path does not exist: $finalOutputPath")
} else {
// Some other copy of this task must've finished before us and renamed it
logInfo(s"Final output path $finalOutputPath already exists; not overwriting it")
if (!fs.delete(tempOutputPath, false)) {
logWarning(s"Error deleting ${tempOutputPath}")
}
}
}
}

111

写partitioner如下:

/**
* Write a partitioner to the given RDD checkpoint directory. This is done on a best-effort
* basis; any exception while writing the partitioner is caught, logged and ignored.
*/
private def writePartitionerToCheckpointDir(
sc: SparkContext, partitioner: Partitioner, checkpointDirPath: Path): Unit = {
try {
val partitionerFilePath = new Path(checkpointDirPath, checkpointPartitionerFileName)
val bufferSize = sc.conf.getInt("spark.buffer.size", 65536)
val fs = partitionerFilePath.getFileSystem(sc.hadoopConfiguration)
val fileOutputStream = fs.create(partitionerFilePath, false, bufferSize)
val serializer = SparkEnv.get.serializer.newInstance()
val serializeStream = serializer.serializeStream(fileOutputStream)
Utils.tryWithSafeFinally {
serializeStream.writeObject(partitioner)
} {
serializeStream.close()
}
logDebug(s"Written partitioner to $partitionerFilePath")
} catch {
case NonFatal(e) =>
logWarning(s"Error writing partitioner $partitioner to $checkpointDirPath")
}
}

spark checkpoint机制的更多相关文章

  1. Spark checkpoint机制简述

    本文主要简述spark checkpoint机制,快速把握checkpoint机制的来龙去脉,至于源码可以参考我的下一篇文章. 1.Spark core的checkpoint 1)为什么checkpo ...

  2. 深入浅出Spark的Checkpoint机制

    1 Overview 当第一次碰到 Spark,尤其是 Checkpoint 的时候难免有点一脸懵逼,不禁要问,Checkpoint 到底是什么.所以,当我们在说 Checkpoint 的时候,我们到 ...

  3. Spark cache、checkpoint机制笔记

    Spark学习笔记总结 03. Spark cache和checkpoint机制 1. RDD cache缓存 当持久化某个RDD后,每一个节点都将把计算的分片结果保存在内存中,并在对此RDD或衍生出 ...

  4. 60、Spark Streaming:缓存与持久化机制、Checkpoint机制

    一.缓存与持久化机制 与RDD类似,Spark Streaming也可以让开发人员手动控制,将数据流中的数据持久化到内存中.对DStream调用persist()方法,就可以让Spark Stream ...

  5. RDD之七:Spark容错机制

    引入 一般来说,分布式数据集的容错性有两种方式:数据检查点和记录数据的更新. 面向大规模数据分析,数据检查点操作成本很高,需要通过数据中心的网络连接在机器之间复制庞大的数据集,而网络带宽往往比内存带宽 ...

  6. 【Spark】Spark容错机制

    引入 一般来说,分布式数据集的容错性有两种方式:数据检查点和记录数据的更新. 面向大规模数据分析,数据检查点操作成本非常高,须要通过数据中心的网络连接在机器之间复制庞大的数据集,而网络带宽往往比内存带 ...

  7. Spark检查点机制

    Spark中对于数据的保存除了持久化操作之外,还提供了一种检查点的机制,检查点(本质是通过将RDD写入Disk做检查点)是为了通过lineage(血统)做容错的辅助,lineage过长会造成容错成本过 ...

  8. 【mysql】关于checkpoint机制

    一.简介 思考一下这个场景:如果重做日志可以无限地增大,同时缓冲池也足够大,那么是不需要将缓冲池中页的新版本刷新回磁盘.因为当发生宕机时,完全可以通过重做日志来恢复整个数据库系统中的数据到宕机发生的时 ...

  9. Spark工作机制简述

    Spark工作机制 主要模块 调度与任务分配 I/O模块 通信控制模块 容错模块 Shuffle模块 调度层次 应用 作业 Stage Task 调度算法 FIFO FAIR(公平调度) Spark应 ...

随机推荐

  1. NGINX: 限制连接的实践 (Defense DDOS)

    参考: [ nginx防止DDOS攻击配置 ] 关于限制用户连接,Nginx 提供的模块: [ ngx_http_limit_req_module ] [ ngx_http_limit_conn_mo ...

  2. 04-plis属性列表

      源代码下载链接:04-plis属性列表.zip27.8 KB // MJPerson.h // //  MJPerson.h //  04-plis属性列表 // //  Created by a ...

  3. 【洛谷 P1653】 猴子 (并查集)

    题目链接 没删除调试输出,原地炸裂,\(80\)->\(0\).如果你要问剩下的\(20\)呢?答:数组开小了. 这题正向删边判连通性是很不好做的,因为我们并不会并查集的逆操作.于是可以考虑把断 ...

  4. [bzoj1251]序列终结者——splay

    题目大意 网上有许多题,就是给定一个序列,要你支持几种操作:A.B.C.D.一看另一道题,又是一个序列 要支持几种操作:D.C.B.A.尤其是我们这里的某人,出模拟试题,居然还出了一道这样的,真是没技 ...

  5. php发送请求

    $opts = array( 'http'=>array( 'method'=>"GET", 'timeout'=>10, ));$context = strea ...

  6. pandas 读写sql数据库

    如何从数据库中读取数据到DataFrame中? 使用pandas.io.sql模块中的sql.read_sql_query(sql_str,conn)和sql.read_sql_table(table ...

  7. 热安装NGINX并支持多站点SSL

    https://www.moonfly.net/801.html http://www.centoscn.com/image-text/config/2015/0423/5251.html 1.查看n ...

  8. 复选框回显、全选、非全选、cookie处理数据、json数组对象转换处理学习笔记参考的页面

    <%@include file="/common/head.jsp"%> <%@ page contentType="text/html; charse ...

  9. 我的一次安装oracle的过程

    1.在装oracle之前,先安装.net3.5 2.然后正常安装oracle,一直next 3.装完oracle后,安装plsql dev工具,打开工具,发现没有connect as,是需要进行一些配 ...

  10. Image.FromStream与Image.FromFile使用区别

    将一个图片加载并显示在picturebox上,一般情况下得到预期的结果,然而对于同一个filepath, 若连续两次调用下面的语句系统将会报错(如用户多次选择加载同一张图片使用Image.FromFi ...