# if open wal
org.apache.spark.SparkException: Could not read data from write ahead log record FileBasedWriteAheadLogSegment

SparkStreaming开启了checkpoint wal后有时会出现如上报错,但不会影响整体程序,只会丢失报错的那个job的数据。其根本原因是wal文件被删了,被sparkstreaming自己的清除机制删掉了。通常意味着一定程度流式程序上存在速率不匹配或堆积问题。

查看driver日志可发现类似如下的日志:

-- :: INFO  [Logging.scala:] Attempting to clear  old log files in hdfs://alps-cluster/tmp/banyan/checkpoint/RhinoWechatConsumer/receivedBlockMetadata older than 1490248380000:
-- :: INFO [Logging.scala:] Attempting to clear old log files in hdfs://alps-cluster/tmp/banyan/checkpoint/RhinoWechatConsumer/receivedBlockMetadata older than 1490248470000: hdfs://alps-cluster/tmp/banyan/checkpoint/RhinoWechatConsumer/receivedBlockMetadata/log-1490248404471-1490248464471
-- :: INFO [Logging.scala:] Cleared log files in hdfs://alps-cluster/tmp/banyan/checkpoint/RhinoWechatConsumer/receivedBlockMetadata older than 1490248470000
-- :: ERROR [Logging.scala:] Task in stage 35.0 failed times; aborting job
-- :: ERROR [Logging.scala:] Error running job streaming job ms.
org.apache.spark.SparkException: Job aborted due to stage failure: Task in stage 35.0 failed times, most recent failure: Lost task 41.3 in stage 35.0 (TID , alps60): org.apache.spark.SparkException: Could not read data from write ahead log record FileBasedWriteAheadLogSegment(hdfs://alps-cluster/tmp/banyan/checkpoint/RhinoWechatConsumer/receivedData/0/log-1490248403649-1490248463649,44333482,118014)
at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$(WriteAheadLogBackedBlockRDD.scala:)

可以发现 1490248403649 的日志被删除程序删除了(cleared log older than 1490248470000),然后这个wal就报错了。

Spark官方文档没有任何关于这个的配置,因此直接看源码。(spark很多这样的坑,得看源码才知道如何hack或有些隐藏配置)。

1.FileBasedWriteAheadLogSegment 类中根据日志搜索发现了clean方法(后面的逻辑就是具体删除逻辑,暂不关心),核心就是如何调整这个threshTime了。

 /**
* Delete the log files that are older than the threshold time.
*
* Its important to note that the threshold time is based on the time stamps used in the log
* files, which is usually based on the local system time. So if there is coordination necessary
* between the node calculating the threshTime (say, driver node), and the local system time
* (say, worker node), the caller has to take account of possible time skew.
*
* If waitForCompletion is set to true, this method will return only after old logs have been
* deleted. This should be set to true only for testing. Else the files will be deleted
* asynchronously.
*/
def clean(threshTime: Long, waitForCompletion: Boolean): Unit = {
val oldLogFiles = synchronized {
val expiredLogs = pastLogs.filter { _.endTime < threshTime }
pastLogs --= expiredLogs
expiredLogs
}
logInfo(s"Attempting to clear ${oldLogFiles.size} old log files in $logDirectory " +
s"older than $threshTime: ${oldLogFiles.map { _.path }.mkString("\n")}")

2.一步步看调用追踪出去,ReceivedBlockHandler -> ReceiverSupervisorImpl -> CleanUpOldBlocks 。这里有个和ReceiverTracker通信的rpc,因此直接搜索CleanUpOldBlocks -> ReceiverTracker -> JobGenerator

在JobGenerator.clearCheckpointData 中有这么一段逻辑

 /** Clear DStream checkpoint data for the given `time`. */
private def clearCheckpointData(time: Time) {
ssc.graph.clearCheckpointData(time) // All the checkpoint information about which batches have been processed, etc have
// been saved to checkpoints, so its safe to delete block metadata and data WAL files
val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
markBatchFullyProcessed(time)
}

发现了 ssc.graph有个 maxRememberDuration 的成员属性!这就意味着有机会通过ssc去修改它。

搜索一下代码便发现了相关方法:

jssc.remember(new Duration(2 * 3600 * 1000));

反思:

从之前的日志我们发现默认的清除间隔是几十秒左右,但是在代码中我们可以发现这个参数只能被设置一次(每次设置都会检查当前为null才生效,初始值为null)。所以问题来了,这几十秒在哪里设置的?代码一时没找到,于是项目直接搜索 remember,发现了在DStream里的初始化代码(其中slideDuration初始化来自InputDStream)。根据计算,我们的batchInterval为15s,其他两个没有设置,则checkpointDuration 为15s,rememberDuration为30s。

override def slideDuration: Duration = {
if (ssc == null) throw new Exception("ssc is null")
if (ssc.graph.batchDuration == null) throw new Exception("batchDuration is null")
ssc.graph.batchDuration
} /**
* Initialize the DStream by setting the "zero" time, based on which
* the validity of future times is calculated. This method also recursively initializes
* its parent DStreams.
*/
private[streaming] def initialize(time: Time) {
if (zeroTime != null && zeroTime != time) {
throw new SparkException("ZeroTime is already initialized to " + zeroTime
+ ", cannot initialize it again to " + time)
}
zeroTime = time // Set the checkpoint interval to be slideDuration or 10 seconds, which ever is larger
if (mustCheckpoint && checkpointDuration == null) {
checkpointDuration = slideDuration * math.ceil(Seconds(10) / slideDuration).toInt
logInfo("Checkpoint interval automatically set to " + checkpointDuration)
} // Set the minimum value of the rememberDuration if not already set
var minRememberDuration = slideDuration
if (checkpointDuration != null && minRememberDuration <= checkpointDuration) {
// times 2 just to be sure that the latest checkpoint is not forgotten (#paranoia)
minRememberDuration = checkpointDuration * 2
}
if (rememberDuration == null || rememberDuration < minRememberDuration) {
rememberDuration = minRememberDuration
} // Initialize the dependencies
dependencies.foreach(_.initialize(zeroTime))
}

SparkStreaming “Could not read data from write ahead log record” 报错分析解决的更多相关文章

  1. sass-loader使用data引入公用文件或全局变量报错

    报错信息: ValidationError: Invalid options object. Sass Loader has been initialised using an options obj ...

  2. vue调用组件,组件回调给data中的数组赋值,报错Invalid prop type check failed for prop value. Expecte

    报错信息: 代码信息:调用一个tree组件,选择一些信息 <componentsTree ref="typeTreeComponent" @treeCheck="t ...

  3. 详细解读 :java.sql.SQLException: Connection is read-only. Queries leading to data modification are not allowed,Java报错之Connection is read-only.

    问题分析: 实际开发项目中,进行insert的时候,产生这个问题是Spring框架的一个安全权限保护方法,对于方法调用的事物保护,一般配置如下: <!-- 事务管理 属性 --> < ...

  4. filebeat+kafka+SparkStreaming程序报错及解决办法

    // :: WARN RandomBlockReplicationPolicy: Expecting replicas with only peer/s. // :: WARN BlockManage ...

  5. @Data注解使用后get set报错解决方法

    Maven项目中已经导入相关的lombok.jar包但是使用后仍提示无set/get方法 .在idea中安装如下插件,安装后重启idea可用不报错. 转载于:https://www.cnblogs.c ...

  6. The data property "dialogVisble" is already declared as a prop. Use prop default value instead报错原因

    vue中使用props传递数据就不能在子组件的data中用同样的名字(比如dialogVisble)了,否则会报错.解决方法直接去掉data中的相同名字改为其他的.

  7. jQuery Ajax请求(关于火狐下SyntaxError: missing ] after element list ajax返回json,var json = eval("("+data+")"); 报错)

    $.ajax({    contentType: "application/x-www-form-urlencoded;charset=UTF-8" ,    type: &quo ...

  8. 1125MySQL Sending data导致查询很慢的问题详细分析

    -- 问题1 tablename使用主键索引反而比idx_ref_id慢的原因EXPLAIN SELECT SQL_NO_CACHE COUNT(id) FROM dbname.tbname FORC ...

  9. HBase的Write Ahead Log (WAL) —— 整体架构、线程模型

    解决的问题 HBase的Write Ahead Log (WAL)提供了一种高并发.持久化的日志保存与回放机制.每一个业务数据的写入操作(PUT / DELETE)执行前,都会记账在WAL中. 如果出 ...

随机推荐

  1. 清北合肥day1

    题目: 1.给出一个由0,1组成的环 求最少多少次交换(任意两个位置)使得0,1靠在一起 n<=1000 2.两个数列,支持在第一个数列上区间+1,-1 每次花费为1 求a变成b的最小代价 n& ...

  2. 【Ruby】Mac gem的一些坑

    前言 自上一次升级MacOS系统后出现jekyll无法构建的问题,当时处理半天.谁知道最近又升级了MacOS,荒废博客多时,今天吝啬写了一篇准备发布,构建报错,问题重新.还是记录下,以防下次升级出问题 ...

  3. 关于mac的一些常用操作记录

    之前记录过一个关于mac远程连接window机,实现共享文件的记录,今天记录一些常用的操作,会持续更新. 1.谷歌浏览器 f12的操作 command+option+i 打开调试面板 2.打开指定位置 ...

  4. Codeforces 1000G Two-Paths 树形动态规划 LCA

    原文链接https://www.cnblogs.com/zhouzhendong/p/9246484.html 题目传送门 - Codeforces 1000G Two-Paths 题意 给定一棵有 ...

  5. Hadoop |集群的搭建

    Hadoop组成 HDFS(Hadoop Distributed File System)架构概述 NameNode目录--主刀医生(nn):  DataNode(dn)数据: Secondary N ...

  6. 李宏毅机器学习笔记6:Why deep、Semi-supervised

    李宏毅老师的机器学习课程和吴恩达老师的机器学习课程都是都是ML和DL非常好的入门资料,在YouTube.网易云课堂.B站都能观看到相应的课程视频,接下来这一系列的博客我都将记录老师上课的笔记以及自己对 ...

  7. linux学习笔记 4建立用户

    一般用法  #useradd mysql 含义 创建 mysql用户 特殊用法 1> #useradd -d /usr/cjh -m cjh 含义:创建cjh用户 产生一个主目录 /usr/cj ...

  8. java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration] with root cause

    在用Java web和hbase连接时,报出来这个错误,一开始我以为是hbase导入的包位置不对,不从build path那里导入,而是从WEB-INF下的lib中导入,尝试了之后结果仍是这个错误 十 ...

  9. 彻底理解this 的值到底是什么?

    作者:方应杭 来源:知乎 你可能遇到过这样的 JS 面试题: var obj = { foo: function(){ console.log(this) } } var bar = obj.foo ...

  10. padding和margin设置成百分比

    Margin和Padding是我们在网页设计经常使用到的CSS样式,他们分别是间距和填充,一个作用于盒子外面,一个作用于盒子里面,默认的情况下,这些属性的值都会被计算在盒子的面积里面,在网页开发中的流 ...