Stateful computation in Spark Streaming: the updateStateByKey source code

When reposting, please credit the original address: https://www.cnblogs.com/dongxiao-yang/p/11358781.html

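This article is based on Spark source version 2.4.3.

Stream processing often requires stateful computation: the current result depends not only on the data just received but also has to be merged with earlier results. Because Spark Streaming works in mini-batches, the previous state must be stored in an RDD and read back when the next batch is computed so that the two can be merged. This is what the updateStateByKey method is for.

A simple example: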

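  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.dstream.DStream

  def main(args: Array[String]): Unit = {
    val host = "localhost"
    val port = "8001"
    // StreamingExamples is the logging helper from the Spark examples package
    StreamingExamples.setStreamingLogLevels()
    // Create the context with a 10 second batch size
    val sparkConf = new SparkConf().setMaster("local[4]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    // Stateful operators require a checkpoint directory
    ssc.checkpoint("/Users/dyang/Desktop/checkpoittmp")
    val lines = ssc.socketTextStream(host, port.toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts: DStream[(String, Int)] = words.map(x => (x, 1))
    // .reduceByKey(_ + _)
    // New total for a key = sum of this batch's values + previously stored total
    val totalCounts = wordCounts.updateStateByKey { (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0))
    }
    totalCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }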

The example above is a simple stateful word count: with updateStateByKey, the application remembers the running total for each word and adds the counts from each new batch to it.
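To see what this update function does from batch to batch without starting a streaming job, here is a minimal sketch in plain Scala. The object RunningCountSketch and the helper step are hypothetical names used only for illustration; they model the per-key semantics with an in-memory Map and are not how Spark stores state.

```scala
object RunningCountSketch {
  // Same shape as the function passed to updateStateByKey in the example.
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (values, state) => Some(values.sum + state.getOrElse(0))

  // Hypothetical helper: applies updateFunc to one batch of words.
  // Every key with new values OR existing state is visited, so keys
  // without new data are still passed through updateFunc.
  def step(state: Map[String, Int], batch: Seq[String]): Map[String, Int] = {
    val newValues: Map[String, Seq[Int]] =
      batch.groupBy(identity).map { case (w, ws) => (w, ws.map(_ => 1)) }
    val keys = state.keySet ++ newValues.keySet
    keys.flatMap { k =>
      updateFunc(newValues.getOrElse(k, Seq.empty[Int]), state.get(k)).map(s => (k, s))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Three batches; the last one is empty.
    val batches = Seq(Seq("a", "b", "a"), Seq("b", "c"), Seq.empty[String])
    val finalState = batches.foldLeft(Map.empty[String, Int])(step)
    println(finalState)
  }
}
```

After the three batches the state holds a = 2, b = 2 and c = 1; the empty third batch leaves every total unchanged because updateFunc receives an empty Seq for each key.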

updateStateByKey has several overloads that wrap different parameter sets. One of the more complete definitions is:

  /**
   * Return a new "state" DStream where the state for each key is updated by applying
   * the given function on the previous state of the key and the new values of each key.
   * In every batch the updateFunc will be called for each state even if there are no new values.
   * [[org.apache.spark.Partitioner]] is used to control the partitioning of each RDD.
   * @param updateFunc State update function. Note, that this function may generate a different
   *                   tuple with a different key than the input key. Therefore keys may be removed
   *                   or added in this way. It is up to the developer to decide whether to
   *                   remember the partitioner despite the key being changed.
   * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
   *                    DStream
   * @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
   * @tparam S State type
   */
  def updateStateByKey[S: ClassTag](
      updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
      partitioner: Partitioner,
      rememberPartitioner: Boolean): DStream[(K, S)] = ssc.withScope {
    val cleanedFunc = ssc.sc.clean(updateFunc)
    val newUpdateFunc = (_: Time, it: Iterator[(K, Seq[V], Option[S])]) => {
      cleanedFunc(it)
    }
    new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, None)
  }

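Here, the updateFunc parameter is a converted form of the function the user originally passed in, updateFunc: (Seq[V], Option[S]) => Option[S]. The conversion happens in the simpler overload: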

    val cleanedUpdateF: (Seq[V], Option[S]) => Option[S] = sparkContext.clean(updateFunc)
    val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
      iterator.flatMap(t => {
        cleanedUpdateF(t._2, t._3).map(s => (t._1, s))
      })
    }
    updateStateByKey(newUpdateFunc, partitioner, true)
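The two wrapping steps can be tried in isolation. The sketch below is plain Scala with no Spark dependency: Time is a stand-in case class, and liftToIterator, addTime and demo are hypothetical names that only mirror the shape of the two newUpdateFunc closures. It also shows that a key disappears from the output when the user function returns None.

```scala
object UpdateFuncWrappingSketch {
  // Stand-in for Spark's Time, present only so the sketch compiles without Spark.
  final case class Time(millis: Long)

  // Step 1: lift the per-key user function to a function over an iterator
  // of (key, new values, previous state). A None result drops the key.
  def liftToIterator[K, V, S](
      updateFunc: (Seq[V], Option[S]) => Option[S]
  ): Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)] =
    iterator =>
      iterator.flatMap { case (k, vs, s) => updateFunc(vs, s).map(ns => (k, ns)) }

  // Step 2: add a batch-time argument that is simply ignored.
  def addTime[K, V, S](
      f: Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)]
  ): (Time, Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)] =
    (_, it) => f(it)

  def demo(): List[(String, Int)] = {
    // Illustrative user function: keep a running sum, drop keys whose sum is 0.
    val sumOrDrop: (Seq[Int], Option[Int]) => Option[Int] = (vs, s) => {
      val total = vs.sum + s.getOrElse(0)
      if (total == 0) None else Some(total)
    }
    val wrapped = addTime(liftToIterator[String, Int, Int](sumOrDrop))
    val input = Iterator(
      ("a", Seq(1, 2), Option(1)),
      ("b", Seq.empty[Int], Option.empty[Int])
    )
    wrapped(Time(0L), input).toList
  }
}
```

Calling demo() keeps "a" with a total of 4 and drops "b", because sumOrDrop returns None for it.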

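In the end, updateStateByKey returns a StateDStream built on top of the DStream wrapped by PairDStreamFunctions. For every DStream, the compute(time) method is the concrete implementation that produces the RDD for each batch interval: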

  override def compute(validTime: Time): Option[RDD[(K, S)]] = {
    // Try to get the previous state RDD
    getOrCompute(validTime - slideDuration) match {
      case Some(prevStateRDD) =>    // If previous state RDD exists
        // Try to get the parent RDD
        parent.getOrCompute(validTime) match {
          case Some(parentRDD) =>    // If parent RDD exists, then compute as usual
            computeUsingPreviousRDD (validTime, parentRDD, prevStateRDD)
          case None =>     // If parent RDD does not exist
            // Re-apply the update function to the old state RDD
            val updateFuncLocal = updateFunc
            val finalFunc = (iterator: Iterator[(K, S)]) => {
              val i = iterator.map(t => (t._1, Seq.empty[V], Option(t._2)))
              updateFuncLocal(validTime, i)
            }
            val stateRDD = prevStateRDD.mapPartitions(finalFunc, preservePartitioning)
            Some(stateRDD)
        }
      case None =>    // If previous session RDD does not exist (first input data)
        // Try to get the parent RDD
        parent.getOrCompute(validTime) match {
          case Some(parentRDD) =>   // If parent RDD exists, then compute as usual
            initialRDD match {
              case None =>
                // Define the function for the mapPartition operation on grouped RDD;
                // first map the grouped tuple to tuples of required type,
                // and then apply the update function
                val updateFuncLocal = updateFunc
                val finalFunc = (iterator: Iterator[(K, Iterable[V])]) => {
                  updateFuncLocal (validTime,
                    iterator.map (tuple => (tuple._1, tuple._2.toSeq, None)))
                }
                val groupedRDD = parentRDD.groupByKey(partitioner)
                val sessionRDD = groupedRDD.mapPartitions(finalFunc, preservePartitioning)
                // logDebug("Generating state RDD for time " + validTime + " (first)")
                Some (sessionRDD)
              case Some (initialStateRDD) =>
                computeUsingPreviousRDD(validTime, parentRDD, initialStateRDD)
            }
          case None => // If parent RDD does not exist, then nothing to do!
            // logDebug("Not generating state RDD (no previous state, no parent)")
            None
        }
    }
  }
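The branches of compute can be condensed into a small decision function. This is only a summary sketch: the Action values and the choose function are hypothetical names introduced here, and the three booleans stand for whether the previous state RDD, the parent RDD and initialRDD are present.

```scala
object ComputeBranchesSketch {
  sealed trait Action
  // prev state + new data: computeUsingPreviousRDD(validTime, parentRDD, prevStateRDD)
  case object CogroupWithPrevState extends Action
  // prev state, no new data: re-apply updateFunc with Seq.empty for every key
  case object ReapplyOnOldState extends Action
  // first data, no initialRDD: groupByKey on the new data, previous state is None
  case object GroupNewDataOnly extends Action
  // first data with initialRDD: computeUsingPreviousRDD(validTime, parentRDD, initialStateRDD)
  case object CogroupWithInitialState extends Action
  // neither state nor data: compute returns None
  case object NoStateRdd extends Action

  def choose(hasPrevState: Boolean, hasParent: Boolean, hasInitial: Boolean): Action =
    (hasPrevState, hasParent) match {
      case (true, true)   => CogroupWithPrevState
      case (true, false)  => ReapplyOnOldState
      case (false, true)  => if (hasInitial) CogroupWithInitialState else GroupNewDataOnly
      case (false, false) => NoStateRdd
    }
}
```

Note that initialRDD is consulted only on the very first batch that has data; once a state RDD exists, it is no longer looked at.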

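A word on what parent means: parent is the upstream DStream that this DStream depends on. As the StateDStream instantiation at the end of updateStateByKey shows, self, the DStream that PairDStreamFunctions wraps, is passed in as the parent, which builds the DAG relationship between DStreams.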

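Each DStream keeps the RDDs it has already generated in a HashMap[Time, RDD[T]] (the generatedRDDs field). The key is a Time aligned to the batchDuration specified by the user: with one batch every 15s, the keys are 08h:00m:00s, 08h:00m:15s and so on, so each key effectively identifies the batch number. The value is the RDD instance. The call parent.getOrCompute(validTime) therefore fetches the RDD that the upstream DStream produces for that batch after applying its transformations, computing it first if it has not been generated yet.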
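The caching behaviour of getOrCompute can be illustrated with a heavily simplified model. In this sketch MiniStream, Time, computeCalls and demo are all hypothetical stand-ins, a Seq[T] plays the role of an RDD, and the persistence and checkpoint handling of the real method are left out.

```scala
import scala.collection.mutable

object GeneratedRddCacheSketch {
  // Stand-in for Spark's Time.
  final case class Time(millis: Long)

  // Hypothetical miniature of a DStream: compute builds the batch for a
  // given time, getOrCompute memoizes the result per batch time.
  abstract class MiniStream[T](batchMillis: Long) {
    private val generated = mutable.HashMap.empty[Time, Seq[T]]
    var computeCalls = 0

    protected def compute(time: Time): Option[Seq[T]]

    def getOrCompute(time: Time): Option[Seq[T]] = {
      require(time.millis % batchMillis == 0, "time must be aligned to the batch duration")
      generated.get(time).orElse {
        // Cache miss: compute once and remember the result for this batch time.
        computeCalls += 1
        val result = compute(time)
        result.foreach(r => generated.put(time, r))
        result
      }
    }
  }

  def demo(): (Option[Seq[Long]], Int) = {
    val stream = new MiniStream[Long](15000L) {
      protected def compute(time: Time): Option[Seq[Long]] = Some(Seq(time.millis))
    }
    stream.getOrCompute(Time(30000L))              // computes and caches
    val second = stream.getOrCompute(Time(30000L)) // served from the map
    (second, stream.computeCalls)
  }
}
```

In demo() the second lookup for the same batch time is answered from the map, so compute runs only once.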

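The source above is already commented in detail. Setting aside the boundary cases where parentRDD or prevStateRDD/initialRDD is missing, the method ends up in computeUsingPreviousRDD, which merges the current data with the historical state: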

  private [this] def computeUsingPreviousRDD(
      batchTime: Time,
      parentRDD: RDD[(K, V)],
      prevStateRDD: RDD[(K, S)]) = {
    // Define the function for the mapPartition operation on cogrouped RDD;
    // first map the cogrouped tuple to tuples of required type,
    // and then apply the update function
    val updateFuncLocal = updateFunc
    val finalFunc = (iterator: Iterator[(K, (Iterable[V], Iterable[S]))]) => {
      val i = iterator.map { t =>
        val itr = t._2._2.iterator
        val headOption = if (itr.hasNext) Some(itr.next()) else None
        (t._1, t._2._1.toSeq, headOption)
      }
      updateFuncLocal(batchTime, i)
    }
    val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)
    val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)
    Some(stateRDD)
  }

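This method first cogroups the current data parentRDD with prevStateRDD, yielding an RDD[(K, (Iterable[V], Iterable[S]))], where K is the key type of the DStream and the value is a 2-tuple of the current data Iterable[V] and the historical state Iterable[S]. To match this type, Spark wraps the earlier updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)] once more: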

   val finalFunc = (iterator: Iterator[(K, (Iterable[V], Iterable[S]))])

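Read in the other direction: RDD data that starts out as (K, (Iterable[V], Iterable[S])) is first converted to the Iterator[(K, Seq[V], Option[S])] form, which in turn becomes a call to the user-defined state function updateFunc: (Seq[V], Option[S]) => Option[S], and the result is an RDD[(K, S)].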
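The same data flow can be reproduced on ordinary collections. In the sketch below, cogroup is a hypothetical in-memory stand-in for RDD.cogroup and updateState is an illustrative name; only the headOption step is taken directly from the Spark code.

```scala
object CogroupStateSketch {
  // Hypothetical in-memory stand-in for RDD.cogroup: for every key seen on
  // either side, collect its current values and its previous states.
  def cogroup[K, V, S](
      current: Seq[(K, V)],
      prevState: Seq[(K, S)]
  ): Map[K, (Iterable[V], Iterable[S])] = {
    val keys = (current.map(_._1) ++ prevState.map(_._1)).distinct
    keys.map { k =>
      val vs: Iterable[V] = current.filter(_._1 == k).map(_._2)
      val ss: Iterable[S] = prevState.filter(_._1 == k).map(_._2)
      (k, (vs, ss))
    }.toMap
  }

  // Mirrors finalFunc: (K, (Iterable[V], Iterable[S])) becomes
  // (K, Seq[V], Option[S]), which is then fed to the user's per-key function.
  def updateState[K, V, S](
      current: Seq[(K, V)],
      prevState: Seq[(K, S)]
  )(updateFunc: (Seq[V], Option[S]) => Option[S]): Map[K, S] =
    cogroup(current, prevState).iterator.flatMap { case (k, (vs, ss)) =>
      val itr = ss.iterator
      // A state RDD holds at most one state per key, so only the head is used.
      val headOption = if (itr.hasNext) Some(itr.next()) else None
      updateFunc(vs.toSeq, headOption).map(s => (k, s))
    }.toMap

  def demo(): Map[String, Int] =
    updateState(
      Seq(("a", 1), ("a", 1), ("b", 1)), // current batch
      Seq(("a", 5), ("c", 2))            // previous state
    )((vs, s) => Some(vs.sum + s.getOrElse(0)))
}
```

demo() yields a = 7, b = 1 and c = 2: "a" merges new values with old state, "b" has no previous state, and "c" has no new values but is still passed through the update function.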

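Notes:

1 The Spark source relies heavily on implicit conversions. For example, updateStateByKey is not defined on DStream but on PairDStreamFunctions; it can be called on a DStream because the DStream companion object contains this implicit conversion: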

  implicit def toPairDStreamFunctions[K, V](stream: DStream[(K, V)])
      (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null):
    PairDStreamFunctions[K, V] = {
    new PairDStreamFunctions[K, V](stream)
  }

Any DStream whose element type is a key-value pair, DStream[(K, V)], is adapted to a PairDStreamFunctions object through this implicit conversion.
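The mechanism is ordinary Scala and can be reproduced in a few lines. In this sketch MiniStream, PairMiniStreamFunctions and countByKey are hypothetical names that imitate the DStream / PairDStreamFunctions pairing; none of them are Spark classes.

```scala
import scala.language.implicitConversions

object ImplicitPairFunctionsSketch {
  // Hypothetical miniature stream type; it has no key-value methods itself.
  class MiniStream[T](val items: Seq[T])

  // Extra methods that only make sense when the elements are (K, V) pairs.
  class PairMiniStreamFunctions[K, V](self: MiniStream[(K, V)]) {
    def countByKey(): Map[K, Int] =
      self.items.groupBy(_._1).map { case (k, vs) => (k, vs.size) }
  }

  // An implicit conversion in the companion object is found automatically
  // whenever a pair-only method is called on a MiniStream[(K, V)].
  object MiniStream {
    implicit def toPairFunctions[K, V](
        stream: MiniStream[(K, V)]): PairMiniStreamFunctions[K, V] =
      new PairMiniStreamFunctions[K, V](stream)
  }

  def demo(): Map[String, Int] = {
    val pairs = new MiniStream(Seq(("a", 1), ("b", 1), ("a", 1)))
    pairs.countByKey() // compiles only because of the implicit conversion
  }
}
```

Because the conversion lives in the companion object, callers need no import for it, which is why updateStateByKey appears to be a method of DStream itself.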

2 Checkpointing must be enabled when using stateful operators; otherwise the application fails a requirement check at startup and reports:

java.lang.IllegalArgumentException: requirement failed: The checkpoint directory has not been set. Please set it by StreamingContext.checkpoint()
