Spark Streaming "Could not read data from write ahead log record": error analysis and fix

# With the WAL enabled

org.apache.spark.SparkException: Could not read data from write ahead log record FileBasedWriteAheadLogSegment

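With checkpointing and the write ahead log (WAL) enabled, Spark Streaming occasionally fails with the error above. The application as a whole keeps running; only the data of the failed job is lost. The root cause is that the WAL file was deleted by Spark Streaming's own cleanup mechanism. This usually indicates some degree of rate mismatch or backlog in the streaming application.

The driver log contains entries like the following: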

-- :: INFO  [Logging.scala:] Attempting to clear  old log files in hdfs://alps-cluster/tmp/banyan/checkpoint/RhinoWechatConsumer/receivedBlockMetadata older than 1490248380000:

-- :: INFO  [Logging.scala:] Attempting to clear  old log files in hdfs://alps-cluster/tmp/banyan/checkpoint/RhinoWechatConsumer/receivedBlockMetadata older than 1490248470000: hdfs://alps-cluster/tmp/banyan/checkpoint/RhinoWechatConsumer/receivedBlockMetadata/log-1490248404471-1490248464471

-- :: INFO  [Logging.scala:] Cleared log files in hdfs://alps-cluster/tmp/banyan/checkpoint/RhinoWechatConsumer/receivedBlockMetadata older than 1490248470000

-- :: ERROR [Logging.scala:] Task  in stage 35.0 failed  times; aborting job

-- :: ERROR [Logging.scala:] Error running job streaming job  ms.

org.apache.spark.SparkException: Job aborted due to stage failure: Task  in stage 35.0 failed  times, most recent failure: Lost task 41.3 in stage 35.0 (TID , alps60): org.apache.spark.SparkException: Could not read data from write ahead log record FileBasedWriteAheadLogSegment(hdfs://alps-cluster/tmp/banyan/checkpoint/RhinoWechatConsumer/receivedData/0/log-1490248403649-1490248463649,44333482,118014)

        at org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$(WriteAheadLogBackedBlockRDD.scala:)

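The WAL segment log-1490248403649-1490248463649 ends at 1490248463649, which is earlier than the cleanup threshold 1490248470000 ("Cleared log files ... older than 1490248470000"). The cleanup therefore removed it, and the later read of that WAL record failed.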
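The comparison can be reproduced by hand. The following is a minimal sketch in plain Java with no Spark dependency. It assumes the file name pattern log-<startTime>-<endTime> seen in the log above and assumes a file counts as expired when its end time is earlier than the threshold; the class and method names are hypothetical.

```java
public class WalExpiryCheck {
    /** Extracts the end time from a WAL file name such as "log-1490248403649-1490248463649". */
    static long endTimeOf(String fileName) {
        String[] parts = fileName.split("-");
        if (parts.length != 3 || !parts[0].equals("log")) {
            throw new IllegalArgumentException("Unexpected WAL file name: " + fileName);
        }
        return Long.parseLong(parts[2]);
    }

    /** A file is treated as expired when its end time is earlier than the threshold. */
    static boolean isExpired(String fileName, long threshTime) {
        return endTimeOf(fileName) < threshTime;
    }

    public static void main(String[] args) {
        String segment = "log-1490248403649-1490248463649";
        // Threshold from the second cleanup in the driver log: the segment is expired.
        System.out.println(isExpired(segment, 1490248470000L)); // true
        // Threshold from the first cleanup in the driver log: the segment is still kept.
        System.out.println(isExpired(segment, 1490248380000L)); // false
    }
}
```

The two thresholds are the ones printed by the two "Attempting to clear" lines, which is why the segment survives the first cleanup and disappears in the second.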

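I could not find any configuration for this in the official Spark documentation, so I went straight to the source. (Spark has many pitfalls like this, where only the source reveals how to work around a problem or which hidden settings exist.)

1. Searching for the log message leads to the clean method in the FileBasedWriteAheadLog class (what follows the excerpt is the actual deletion logic, which does not matter here). The key question is how to adjust threshTime.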

  /**
   * Delete the log files that are older than the threshold time.
   *
   * Its important to note that the threshold time is based on the time stamps used in the log
   * files, which is usually based on the local system time. So if there is coordination necessary
   * between the node calculating the threshTime (say, driver node), and the local system time
   * (say, worker node), the caller has to take account of possible time skew.
   *
   * If waitForCompletion is set to true, this method will return only after old logs have been
   * deleted. This should be set to true only for testing. Else the files will be deleted
   * asynchronously.
   */
  def clean(threshTime: Long, waitForCompletion: Boolean): Unit = {
    val oldLogFiles = synchronized {
      val expiredLogs = pastLogs.filter { _.endTime < threshTime }
      pastLogs --= expiredLogs
      expiredLogs
    }
    logInfo(s"Attempting to clear ${oldLogFiles.size} old log files in $logDirectory " +
      s"older than $threshTime: ${oldLogFiles.map { _.path }.mkString("\n")}")
    // ... (the deletion logic that follows is omitted from this excerpt)
  }

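2. Tracing the callers step by step leads to ReceivedBlockHandler -> ReceiverSupervisorImpl -> CleanupOldBlocks. CleanupOldBlocks is an RPC message exchanged with ReceiverTracker, so searching for it directly gives CleanupOldBlocks -> ReceiverTracker -> JobGenerator.

JobGenerator.clearCheckpointData contains the following logic: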

  /** Clear DStream checkpoint data for the given `time`. */
  private def clearCheckpointData(time: Time) {
    ssc.graph.clearCheckpointData(time)
    // All the checkpoint information about which batches have been processed, etc have
    // been saved to checkpoints, so its safe to delete block metadata and data WAL files
    val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
    jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
    jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
    markBatchFullyProcessed(time)
  }

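So the cleanup threshold is derived from the maximum remember duration held by ssc.graph (maxRememberDuration). That means there is a chance to change it through ssc.

A quick search of the code turns up the relevant method: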

jssc.remember(new Duration(2 * 3600 * 1000));
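To see what this changes, recall that the cleanup threshold is `time - maxRememberDuration`. The sketch below is plain Java with no Spark dependency. The batch time 1490248500000 is an assumed value, chosen only because a 30 s remember duration then reproduces the threshold 1490248470000 from the log; the class and method names are hypothetical.

```java
public class CleanupThreshold {
    /** Mirrors the expression `time - maxRememberDuration`, in milliseconds. */
    static long threshold(long batchTimeMs, long maxRememberMs) {
        return batchTimeMs - maxRememberMs;
    }

    public static void main(String[] args) {
        long batchTime = 1490248500000L;   // assumed batch time, not taken from the log
        long segmentEnd = 1490248463649L;  // end time of the WAL segment that failed to read

        long defaultThreshold = threshold(batchTime, 30_000L);
        long twoHourThreshold = threshold(batchTime, 2 * 3600 * 1000L);

        System.out.println(defaultThreshold);              // 1490248470000
        System.out.println(segmentEnd < defaultThreshold); // true: the segment is deleted
        System.out.println(twoHourThreshold);              // 1490241300000
        System.out.println(segmentEnd < twoHourThreshold); // false: the segment is kept
    }
}
```

Under these assumptions, a two-hour remember duration moves the threshold back far enough that the segment would still exist when the delayed job reads it. The trade-off is that WAL files are retained for two hours.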

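Reflection:

The earlier log shows that the default cleanup window is a few tens of seconds, yet the code shows that this parameter can be set only once (every set checks that the current value is null, and the initial value is null). So where do those tens of seconds come from? I could not find it right away, so I searched the project for remember and found the initialization code in DStream (where slideDuration is initialized from InputDStream). Working through it: our batchInterval is 15s and the other two values were not set. checkpointDuration is therefore 15s * ceil(10s / 15s) = 15s, and because slideDuration (15s) is not larger than checkpointDuration, rememberDuration is doubled to 30s.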

  override def slideDuration: Duration = {
    if (ssc == null) throw new Exception("ssc is null")
    if (ssc.graph.batchDuration == null) throw new Exception("batchDuration is null")
    ssc.graph.batchDuration
  }

  /**
   * Initialize the DStream by setting the "zero" time, based on which
   * the validity of future times is calculated. This method also recursively initializes
   * its parent DStreams.
   */
  private[streaming] def initialize(time: Time) {
    if (zeroTime != null && zeroTime != time) {
      throw new SparkException("ZeroTime is already initialized to " + zeroTime
        + ", cannot initialize it again to " + time)
    }
    zeroTime = time

    // Set the checkpoint interval to be slideDuration or 10 seconds, which ever is larger
    if (mustCheckpoint && checkpointDuration == null) {
      checkpointDuration = slideDuration * math.ceil(Seconds(10) / slideDuration).toInt
      logInfo("Checkpoint interval automatically set to " + checkpointDuration)
    }

    // Set the minimum value of the rememberDuration if not already set
    var minRememberDuration = slideDuration
    if (checkpointDuration != null && minRememberDuration <= checkpointDuration) {
      // times 2 just to be sure that the latest checkpoint is not forgotten (#paranoia)
      minRememberDuration = checkpointDuration * 2
    }
    if (rememberDuration == null || rememberDuration < minRememberDuration) {
      rememberDuration = minRememberDuration
    }

    // Initialize the dependencies
    dependencies.foreach(_.initialize(zeroTime))
  }
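The arithmetic in initialize can be checked with a small sketch. This is plain Java with no Spark dependency that only mirrors the two computations quoted above, using milliseconds instead of Duration. The class and method names are hypothetical, and the automatic checkpoint interval is assumed to apply (mustCheckpoint is true and no checkpoint interval was set), which is the case that matches the 15s and 30s figures.

```java
public class RememberDefaults {
    /** Automatic checkpoint interval: the smallest multiple of the slide duration that is at least 10 s. */
    static long defaultCheckpointMs(long slideMs) {
        return slideMs * (long) Math.ceil(10_000.0 / slideMs);
    }

    /**
     * Remember duration after initialization.
     * Pass null for checkpointMs or userRememberMs when the value was never set.
     */
    static long rememberMs(long slideMs, Long checkpointMs, Long userRememberMs) {
        long minRemember = slideMs;
        if (checkpointMs != null && minRemember <= checkpointMs) {
            // Doubled so that the latest checkpoint is not forgotten.
            minRemember = checkpointMs * 2;
        }
        if (userRememberMs == null || userRememberMs < minRemember) {
            return minRemember;
        }
        return userRememberMs;
    }

    public static void main(String[] args) {
        long slide = 15_000L; // batchInterval of 15 s, as in the article
        long checkpoint = defaultCheckpointMs(slide);
        System.out.println(checkpoint);                                // 15000
        System.out.println(rememberMs(slide, checkpoint, null));       // 30000
        System.out.println(rememberMs(slide, checkpoint, 7_200_000L)); // 7200000
    }
}
```

The last line corresponds to the two-hour value passed to remember: a user-supplied duration is kept as long as it is not smaller than the computed minimum.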
