16. Spark Streaming Source Code Walkthrough: The Data Cleanup Mechanism
Original article. When reposting, please credit: 听风居士's blog (http://www.cnblogs.com/zhouyf/)
Topics covered in this post:
1. An overview of data cleanup in Spark Streaming
2. The data cleanup process in detail
3. What triggers data cleanup
Spark Streaming differs from an ordinary Spark application. An ordinary Spark program runs to completion, and its intermediate data is destroyed when the SparkContext is closed. A Spark Streaming application, by contrast, keeps running and keeps computing: every batch interval it produces a large amount of intermediate data, so objects and metadata must be cleaned up periodically. Each batch duration, after the triggered jobs finish, the corresponding RDDs and metadata need to be cleared. Below we walk through Spark Streaming's data cleanup mechanism in the source code.
1. Overview of Data Cleanup
As a Spark Streaming application runs, jobs are generated continuously over time; once a job has finished, the data belonging to its batch (RDDs, metadata, checkpoint data) must be cleaned up. Jobs are produced periodically by JobGenerator, and JobGenerator is also responsible for the cleanup.
The cleanup control logic in JobGenerator lives in a message loop, eventLoop:
eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
  override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

  override protected def onError(e: Throwable): Unit = {
    jobScheduler.reportError("Error in job generator", e)
  }
}
eventLoop.start()
Events posted to this loop are dispatched in processEvent; the two cleanup-related events are ClearMetadata and ClearCheckpointData:

/** Processes all events */
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}
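To make this control flow concrete, here is a minimal, self-contained sketch of the same pattern: a dispatcher that handles generate and clear events, dropping per-batch state older than a remember duration. Everything here (ToyGenerator, the Event case classes, the plain Long timestamps) is invented for illustration and is not the Spark API; the real eventLoop also posts events asynchronously, while this sketch dispatches synchronously.

```scala
import scala.collection.mutable

// Invented stand-ins for Spark's JobGeneratorEvent hierarchy (not the real API)
sealed trait Event
case class GenerateJobs(time: Long) extends Event
case class ClearMetadata(time: Long) extends Event

// A toy generator: keeps per-batch state and clears entries older than
// rememberDuration -- the same filter shape used by DStream.clearMetadata
class ToyGenerator(rememberDuration: Long) {
  val generated = mutable.Map[Long, String]()   // batch time -> per-batch state

  def processEvent(event: Event): Unit = event match {
    case GenerateJobs(time) =>
      generated(time) = s"job-$time"
      // In Spark this would be eventLoop.post(ClearMetadata(time)) after the
      // batch's jobs complete; here we dispatch synchronously for simplicity
      processEvent(ClearMetadata(time))
    case ClearMetadata(time) =>
      // drop every batch at or before (time - rememberDuration)
      generated --= generated.keys.filter(_ <= time - rememberDuration).toSeq
  }
}

object ToyGeneratorDemo {
  def main(args: Array[String]): Unit = {
    val g = new ToyGenerator(rememberDuration = 2000L)
    (0L to 5L).foreach(t => g.processEvent(GenerateJobs(t * 1000)))
    println(g.generated.keys.toSeq.sorted)  // → List(4000, 5000)
  }
}
```

Only the batches newer than `time - rememberDuration` survive, which is exactly the invariant the real cleanup maintains.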
2. The Data Cleanup Process in Detail

When a ClearMetadata(time) event arrives, JobGenerator.clearMetadata does the work. It first clears the DStream graph's metadata, then either kicks off a checkpoint (whose completion will trigger further cleanup) or, if checkpointing is disabled, deletes the received-block metadata directly:

/** Clear DStream metadata for the given `time`. */
private def clearMetadata(time: Time) {
  ssc.graph.clearMetadata(time)

  // If checkpointing is enabled, then checkpoint,
  // else mark batch to be fully processed
  if (shouldCheckpoint) {
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = true))
  } else {
    // If checkpointing is not enabled, then delete metadata information about
    // received blocks (block data not saved in any case). Otherwise, wait for
    // checkpointing of this batch to complete.
    val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
    jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
    jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
    markBatchFullyProcessed(time)
  }
}
ssc.graph.clearMetadata simply fans out to every output stream in the DStreamGraph:

def clearMetadata(time: Time) {
  logDebug("Clearing metadata for time " + time)
  this.synchronized {
    outputStreams.foreach(_.clearMetadata(time))
  }
  logDebug("Cleared old metadata for time " + time)
}
Each output DStream then clears the RDDs it generated for batches older than rememberDuration, and recurses into the DStreams it depends on:

private[streaming] def clearMetadata(time: Time) {
  val unpersistData = ssc.conf.getBoolean("spark.streaming.unpersist", true)
  // Collect the RDDs that are old enough to be cleared
  val oldRDDs = generatedRDDs.filter(_._1 <= (time - rememberDuration))
  logDebug("Clearing references to old RDDs: [" +
    oldRDDs.map(x => s"${x._1} -> ${x._2.id}").mkString(", ") + "]")
  // Remove them from generatedRDDs
  generatedRDDs --= oldRDDs.keys
  if (unpersistData) {
    logDebug(s"Unpersisting old RDDs: ${oldRDDs.values.map(_.id).mkString(", ")}")
    oldRDDs.values.foreach { rdd =>
      // Remove the RDD from the persistence list
      rdd.unpersist(false)
      // Explicitly remove blocks of BlockRDD
      rdd match {
        case b: BlockRDD[_] =>
          logInfo(s"Removing blocks of RDD $b of time $time")
          // Remove the RDD's block data
          b.removeBlocks()
        case _ =>
      }
    }
  }
  logDebug(s"Cleared ${oldRDDs.size} RDDs that were older than " +
    s"${time - rememberDuration}: ${oldRDDs.keys.mkString(", ")}")
  // Recursively clear the metadata of the DStreams this one depends on
  dependencies.foreach(_.clearMetadata(time))
}
Back in JobGenerator.clearMetadata, look again at the branch after the graph has been cleared. If checkpointing is disabled, the block and batch metadata held by ReceiverTracker and InputInfoTracker is deleted immediately; otherwise the deletion is deferred until this batch's checkpoint completes:

if (shouldCheckpoint) {
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = true))
} else {
  // If checkpointing is not enabled, then delete metadata information about
  // received blocks (block data not saved in any case). Otherwise, wait for
  // checkpointing of this batch to complete.
  val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
  jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
  jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
  markBatchFullyProcessed(time)
}
receiverTracker.cleanupOldBlocksAndBatches ends up in ReceivedBlockTracker.cleanupOldBatches, which removes the per-batch block allocation records and cleans the write-ahead log:

def cleanupOldBatches(cleanupThreshTime: Time, waitForCompletion: Boolean): Unit = synchronized {
  require(cleanupThreshTime.milliseconds < clock.getTimeMillis())
  val timesToCleanup = timeToAllocatedBlocks.keys.filter { _ < cleanupThreshTime }.toSeq
  logInfo(s"Deleting batches: ${timesToCleanup.mkString(" ")}")
  if (writeToLog(BatchCleanupEvent(timesToCleanup))) {
    // Remove the allocated-block records of the batches being deleted
    timeToAllocatedBlocks --= timesToCleanup
    // Clean up the write-ahead log
    writeAheadLogOption.foreach(_.clean(cleanupThreshTime.milliseconds, waitForCompletion))
  } else {
    logWarning("Failed to acknowledge batch clean up in the Write Ahead Log.")
  }
}
InputInfoTracker.cleanup does the same for the per-batch input statistics:

def cleanup(batchThreshTime: Time): Unit = synchronized {
  val timesToCleanup = batchTimeToInputInfos.keys.filter(_ < batchThreshTime)
  logInfo(s"remove old batch metadata: ${timesToCleanup.mkString(" ")}")
  batchTimeToInputInfos --= timesToCleanup
}
Checkpoint data is cleaned on the ClearCheckpointData(time) event, handled by JobGenerator.clearCheckpointData. Because the batch's state is already safely checkpointed at this point, the block metadata and WAL files can be deleted as well:

/** Clear DStream checkpoint data for the given `time`. */
private def clearCheckpointData(time: Time) {
  ssc.graph.clearCheckpointData(time)

  // All the checkpoint information about which batches have been processed, etc have
  // been saved to checkpoints, so its safe to delete block metadata and data WAL files
  val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
  jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
  jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
  markBatchFullyProcessed(time)
}
Again the graph fans out to its output streams:

def clearCheckpointData(time: Time) {
  logInfo("Clearing checkpoint data for time " + time)
  this.synchronized {
    outputStreams.foreach(_.clearCheckpointData(time))
  }
  logInfo("Cleared checkpoint data for time " + time)
}
and each DStream delegates to its DStreamCheckpointData before recursing into its dependencies:

private[streaming] def clearCheckpointData(time: Time) {
  logDebug("Clearing checkpoint data")
  checkpointData.cleanup(time)
  dependencies.foreach(_.clearCheckpointData(time))
  logDebug("Cleared checkpoint data")
}
DStreamCheckpointData.cleanup finally deletes the obsolete checkpoint files from the file system:

def cleanup(time: Time) {
  // Get the time of the oldest checkpoint file that must still be kept
  timeToOldestCheckpointFileTime.remove(time) match {
    case Some(lastCheckpointFileTime) =>
      // Collect the checkpoint files that can be deleted
      val filesToDelete = timeToCheckpointFile.filter(_._1 < lastCheckpointFileTime)
      logDebug("Files to delete:\n" + filesToDelete.mkString(","))
      filesToDelete.foreach {
        case (time, file) =>
          try {
            val path = new Path(file)
            if (fileSystem == null) {
              fileSystem = path.getFileSystem(dstream.ssc.sparkContext.hadoopConfiguration)
            }
            // Delete the checkpoint file
            fileSystem.delete(path, true)
            timeToCheckpointFile -= time
            logInfo("Deleted checkpoint file '" + file + "' for time " + time)
          } catch {
            case e: Exception =>
              logWarning("Error deleting old checkpoint file '" + file + "' for time " + time, e)
              fileSystem = null
          }
      }
    case None =>
      logDebug("Nothing to delete")
  }
}
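The same keep-only-files-newer-than-a-threshold logic can be sketched without Hadoop's FileSystem API. This toy version (CheckpointPruner and all its names are invented) prunes a map of batch time to checkpoint file the way cleanup does above, but deletes from the local disk via java.nio instead of HDFS:

```scala
import java.nio.file.{Files, Path}
import scala.collection.mutable

// Toy pruner: batch time -> checkpoint file, pruned on the local file system
object CheckpointPruner {
  val timeToCheckpointFile = mutable.Map[Long, Path]()

  def write(time: Long, dir: Path): Unit = {
    val p = dir.resolve(s"checkpoint-$time")
    Files.write(p, Array[Byte](1, 2, 3))          // a dummy checkpoint payload
    timeToCheckpointFile(time) = p
  }

  // Delete every checkpoint file strictly older than lastCheckpointFileTime,
  // mirroring the filter(_._1 < lastCheckpointFileTime) in the real cleanup
  def cleanup(lastCheckpointFileTime: Long): Unit = {
    val filesToDelete = timeToCheckpointFile.filter(_._1 < lastCheckpointFileTime).toSeq
    filesToDelete.foreach { case (t, p) =>
      Files.deleteIfExists(p)
      timeToCheckpointFile -= t
    }
  }
}

object PrunerDemo {
  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("ckpt")
    Seq(1000L, 2000L, 3000L).foreach(t => CheckpointPruner.write(t, dir))
    CheckpointPruner.cleanup(lastCheckpointFileTime = 3000L)
    println(CheckpointPruner.timeToCheckpointFile.keys.toSeq.sorted)  // → List(3000)
  }
}
```

Note that the comparison is strict (`<`), so the file at exactly lastCheckpointFileTime is kept, just as in the Spark code.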
3. What Triggers Data Cleanup

Metadata cleanup is triggered by job completion. When a job finishes running, JobScheduler's JobHandler posts a JobCompleted event:

_eventLoop = eventLoop
if (_eventLoop != null) {
  _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
}
JobScheduler.processEvent dispatches it to handleJobCompletion:

private def processEvent(event: JobSchedulerEvent) {
  try {
    event match {
      case JobStarted(job, startTime) => handleJobStart(job, startTime)
      case JobCompleted(job, completedTime) => handleJobCompletion(job, completedTime)
      case ErrorReported(m, e) => handleError(m, e)
    }
  } catch {
    case e: Throwable =>
      reportError("Error in job scheduler", e)
  }
}
and handleJobCompletion eventually calls JobGenerator.onBatchCompletion, which posts the ClearMetadata event:

def onBatchCompletion(time: Time) {
  eventLoop.post(ClearMetadata(time))
}
Checkpoint-data cleanup, in turn, is triggered by checkpoint completion. At the end of CheckpointWriteHandler's run, once the checkpoint file has been written, JobGenerator is notified:

// All done, print success
val finishTime = System.currentTimeMillis()
logInfo("Checkpoint for time " + checkpointTime + " saved to file '" + checkpointFile +
  "', took " + bytes.length + " bytes and " + (finishTime - startTime) + " ms")
// Ask JobGenerator to clean up the checkpoint data
jobGenerator.onCheckpointCompletion(checkpointTime, clearCheckpointDataLater)
return
onCheckpointCompletion posts ClearCheckpointData only when the clearCheckpointDataLater flag is set:

def onCheckpointCompletion(time: Time, clearCheckpointDataLater: Boolean) {
  if (clearCheckpointDataLater) {
    eventLoop.post(ClearCheckpointData(time))
  }
}
Where does that flag come from? From clearMetadata itself: when checkpointing is enabled, instead of cleaning up immediately it posts DoCheckpoint with clearCheckpointDataLater = true:

private def clearMetadata(time: Time) {
  ssc.graph.clearMetadata(time)
  if (shouldCheckpoint) {
    // Post a DoCheckpoint message and request checkpoint-data cleanup afterwards
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = true))
  } else {
    val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
    jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
    jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
    markBatchFullyProcessed(time)
  }
}
doCheckpoint passes the flag to CheckpointWriter.write:

private def doCheckpoint(time: Time, clearCheckpointDataLater: Boolean) {
  if (shouldCheckpoint && (time - graph.zeroTime).isMultipleOf(ssc.checkpointDuration)) {
    logInfo("Checkpointing graph for time " + time)
    ssc.graph.updateCheckpointData(time)
    checkpointWriter.write(new Checkpoint(ssc, time), clearCheckpointDataLater)
  }
}
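The guard `(time - graph.zeroTime).isMultipleOf(ssc.checkpointDuration)` means a checkpoint is only written for batches that fall on a checkpoint-interval boundary. A small sketch of that arithmetic (zeroTime, batch times, and durations are plain Long milliseconds here, not Spark's Time class; all names are made up):

```scala
object CheckpointSchedule {
  // A batch is checkpointed iff its offset from the start of the streaming
  // context is an exact multiple of the checkpoint interval
  def shouldWriteCheckpoint(time: Long, zeroTime: Long, checkpointDuration: Long): Boolean =
    (time - zeroTime) % checkpointDuration == 0

  def main(args: Array[String]): Unit = {
    val zeroTime = 5000L         // when the streaming context started
    val batchInterval = 1000L    // 1 s batches
    val checkpointEvery = 3000L  // checkpoint every 3 s

    // The first six batch end times, and which of them trigger a checkpoint
    val batchTimes = (1 to 6).map(i => zeroTime + i * batchInterval)
    val checkpointed = batchTimes.filter(t => shouldWriteCheckpoint(t, zeroTime, checkpointEvery))
    println(checkpointed)  // → Vector(8000, 11000)
  }
}
```

So with a 1 s batch interval and a 3 s checkpoint interval, only every third batch writes a checkpoint, and only those batches carry the clearCheckpointDataLater flag forward.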
and write hands it to CheckpointWriteHandler:

def write(checkpoint: Checkpoint, clearCheckpointDataLater: Boolean) {
  try {
    val bytes = Checkpoint.serialize(checkpoint, conf)
    // Pass clearCheckpointDataLater on to CheckpointWriteHandler
    executor.execute(new CheckpointWriteHandler(
      checkpoint.checkpointTime, bytes, clearCheckpointDataLater))
    logInfo("Submitted checkpoint of time " + checkpoint.checkpointTime + " writer queue")
  } catch {
    case rej: RejectedExecutionException =>
      logError("Could not submit checkpoint task to the thread pool executor", rej)
  }
}

This closes the loop: a completed job triggers metadata cleanup; when checkpointing is enabled, metadata cleanup triggers a checkpoint; and the completed checkpoint finally triggers the checkpoint-data cleanup.