14: Spark Streaming Source Code Walkthrough: State Management with updateStateByKey and mapWithState
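First, a brief explanation of what state management is, using word count as an example. Each batchInterval computes the word counts for the current batch only. So how do we compute the number of times each word has appeared from the start of the stream up to now? Spark Streaming provides two methods for this: updateStateByKey and mapWithState. mapWithState was added in version 1.6 and is currently experimental. According to official statements, mapWithState performs up to 10 times better than updateStateByKey. Let's look at how each of them is implemented.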
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object UpdateStateByKeyDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("UpdateStateByKeyDemo")
        val ssc = new StreamingContext(conf, Seconds(20))
        // updateStateByKey requires a checkpoint directory to be set.
        ssc.checkpoint("/checkpoint/")
        val socketLines = ssc.socketTextStream("localhost", 9999)

        socketLines.flatMap(_.split(","))
          .map(word => (word, 1))
          .updateStateByKey((currValues: Seq[Int], preValue: Option[Int]) => {
            val currValue = currValues.sum // sum of the values in the current batch
            Some(currValue + preValue.getOrElse(0)) // current sum plus the historical count
          })
          .print()

        ssc.start()
        ssc.awaitTermination()
        ssc.stop()
      }
    }
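updateStateByKey is not defined on DStream itself. A DStream[(K, V)] is implicitly converted to PairDStreamFunctions, which is where the method lives:

    implicit def toPairDStreamFunctions[K, V](stream: DStream[(K, V)])
        (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null):
      PairDStreamFunctions[K, V] = {
      new PairDStreamFunctions[K, V](stream)
    }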
    def updateStateByKey[S: ClassTag](
        updateFunc: (Seq[V], Option[S]) => Option[S]
      ): DStream[(K, S)] = ssc.withScope {
      updateStateByKey(updateFunc, defaultPartitioner())
    }
    def updateStateByKey[S: ClassTag](
        updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
        partitioner: Partitioner,
        rememberPartitioner: Boolean
      ): DStream[(K, S)] = ssc.withScope {
      new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)
    }
    private[this] def computeUsingPreviousRDD(
        parentRDD: RDD[(K, V)], prevStateRDD: RDD[(K, S)]) = {
      // Define the function for the mapPartition operation on cogrouped RDD;
      // first map the cogrouped tuple to tuples of required type,
      // and then apply the update function
      val updateFuncLocal = updateFunc
      val finalFunc = (iterator: Iterator[(K, (Iterable[V], Iterable[S]))]) => {
        val i = iterator.map { t =>
          val itr = t._2._2.iterator
          val headOption = if (itr.hasNext) Some(itr.next()) else None
          (t._1, t._2._1.toSeq, headOption)
        }
        updateFuncLocal(i)
      }
      val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)
      val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)
      Some(stateRDD)
    }
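The cost of this approach can be illustrated with a minimal sketch that uses only plain Scala collections (no Spark). Because computeUsingPreviousRDD cogroups the new batch with the whole previous state, the update function is applied to every key, including keys that received no data in the current batch. The names UpdateStateByKeySketch, UpdateFunc and step are hypothetical and exist only for this illustration; they model the behavior and are not Spark code.

```scala
object UpdateStateByKeySketch {
  // Same shape as the user function passed to updateStateByKey.
  type UpdateFunc[V, S] = (Seq[V], Option[S]) => Option[S]

  /** One batch step (hypothetical helper): "cogroup" the batch with the
    * previous state, then apply the update function to EVERY key. */
  def step[K, V, S](batch: Seq[(K, V)], prevState: Map[K, S])(
      update: UpdateFunc[V, S]): Map[K, S] = {
    // Group the values of the current batch by key.
    val grouped: Map[K, Seq[V]] =
      batch.groupBy(_._1).map { case (key, pairs) => key -> pairs.map(_._2) }
    // The union of keys mirrors what cogroup produces.
    val allKeys = grouped.keySet ++ prevState.keySet
    allKeys.toList.flatMap { key =>
      // Keys without new data still go through the update function,
      // with an empty Seq of values.
      update(grouped.getOrElse(key, Seq.empty[V]), prevState.get(key))
        .map(s => key -> s)
        .toList
    }.toMap
  }
}
```

Calling step with a batch that contains only one key still invokes the update function once for each key already held in the state, so the work per batch grows with the total number of keys rather than with the batch size.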
    object StatefulNetworkWordCount {
      def main(args: Array[String]) {
        if (args.length < 2) {
          System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
          System.exit(1)
        }

        StreamingExamples.setStreamingLogLevels()

        val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
        // Create the context with a 1 second batch size
        val ssc = new StreamingContext(sparkConf, Seconds(1))
        ssc.checkpoint(".")

        // Initial state RDD for mapWithState operation
        val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))

        // Create a ReceiverInputDStream on target ip:port and count the
        // words in input stream of \n delimited text (eg. generated by 'nc')
        val lines = ssc.socketTextStream(args(0), args(1).toInt)
        val words = lines.flatMap(_.split(" "))
        val wordDstream = words.map(x => (x, 1))

        // Update the cumulative count using mapWithState
        // This will give a DStream made of state (which is the cumulative count of the words)
        val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
          val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
          val output = (word, sum)
          state.update(sum)
          output
        }

        val stateDstream = wordDstream.mapWithState(
          StateSpec.function(mappingFunc).initialState(initialRDD))
        stateDstream.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }
mapWithState takes a StateSpec object as its argument. The StateSpec wraps the state management function.
Inside mapWithState, a MapWithStateDStreamImpl object is created:
    def mapWithState[StateType: ClassTag, MappedType: ClassTag](
        spec: StateSpec[K, V, StateType, MappedType]
      ): MapWithStateDStream[K, V, StateType, MappedType] = {
      new MapWithStateDStreamImpl[K, V, StateType, MappedType](
        self,
        spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]
      )
    }
    /** Internal implementation of the [[MapWithStateDStream]] */
    private[streaming] class MapWithStateDStreamImpl[
        KeyType: ClassTag, ValueType: ClassTag, StateType: ClassTag, MappedType: ClassTag](
        dataStream: DStream[(KeyType, ValueType)],
        spec: StateSpecImpl[KeyType, ValueType, StateType, MappedType])
      extends MapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream.context) {

      private val internalStream =
        new InternalMapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream, spec)

      override def slideDuration: Duration = internalStream.slideDuration

      override def dependencies: List[DStream[_]] = List(internalStream)

      override def compute(validTime: Time): Option[RDD[MappedType]] = {
        internalStream.getOrCompute(validTime).map { _.flatMap[MappedType] { _.mappedData } }
      }
      // (the excerpt ends here; the rest of the class is not shown)
    }
    /** Method that generates a RDD for the given time */
    override def compute(validTime: Time): Option[RDD[MapWithStateRDDRecord[K, S, E]]] = {
      // Get the previous state or create a new empty state RDD
      val prevStateRDD = getOrCompute(validTime - slideDuration) match {
        case Some(rdd) =>
          if (rdd.partitioner != Some(partitioner)) {
            // If the RDD is not partitioned the right way, let us repartition it using the
            // partition index as the key. This is to ensure that state RDD is always partitioned
            // before creating another state RDD using it
            MapWithStateRDD.createFromRDD[K, V, S, E](
              rdd.flatMap { _.stateMap.getAll() }, partitioner, validTime)
          } else {
            rdd
          }
        case None =>
          MapWithStateRDD.createFromPairRDD[K, V, S, E](
            spec.getInitialStateRDD().getOrElse(new EmptyRDD[(K, S)](ssc.sparkContext)),
            partitioner,
            validTime
          )
      }

      // Compute the new state RDD with previous state RDD and partitioned data RDD
      // Even if there is no data RDD, use an empty one to create a new state RDD
      val dataRDD = parent.getOrCompute(validTime).getOrElse {
        context.sparkContext.emptyRDD[(K, V)]
      }
      val partitionedDataRDD = dataRDD.partitionBy(partitioner)
      val timeoutThresholdTime = spec.getTimeoutInterval().map { interval =>
        (validTime - interval).milliseconds
      }
      Some(new MapWithStateRDD(
        prevStateRDD, partitionedDataRDD, mappingFunction, validTime, timeoutThresholdTime))
    }
    override def compute(
        partition: Partition, context: TaskContext): Iterator[MapWithStateRDDRecord[K, S, E]] = {
      val stateRDDPartition = partition.asInstanceOf[MapWithStateRDDPartition]
      val prevStateRDDIterator = prevStateRDD.iterator(
        stateRDDPartition.previousSessionRDDPartition, context)
      val dataIterator = partitionedDataRDD.iterator(
        stateRDDPartition.partitionedDataRDDPartition, context)

      // prevRecord holds the data of one whole partition
      val prevRecord = if (prevStateRDDIterator.hasNext) Some(prevStateRDDIterator.next()) else None
      val newRecord = MapWithStateRDDRecord.updateRecordWithData(
        prevRecord,
        dataIterator,
        mappingFunction,
        batchTime,
        timeoutThresholdTime,
        removeTimedoutData = doFullScan // remove timedout data only when full scan is enabled
      )
      Iterator(newRecord)
    }
    private[streaming] case class MapWithStateRDDRecord[K, S, E](
        var stateMap: StateMap[K, S], var mappedData: Seq[E])
    def updateRecordWithData[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
        prevRecord: Option[MapWithStateRDDRecord[K, S, E]],
        dataIterator: Iterator[(K, V)],
        mappingFunction: (Time, K, Option[V], State[S]) => Option[E],
        batchTime: Time,
        timeoutThresholdTime: Option[Long],
        removeTimedoutData: Boolean
      ): MapWithStateRDDRecord[K, S, E] = {
      // Create a new state map by copying the previous record's map (if it exists);
      // otherwise create an empty StateMap
      val newStateMap = prevRecord.map { _.stateMap.copy() }.getOrElse { new EmptyStateMap[K, S]() }

      val mappedData = new ArrayBuffer[E]
      // Reusable wrapper around the state of the key being processed
      val wrappedState = new StateImpl[S]()

      // Call the mapping function on each record in the data iterator, and accordingly
      // update the states touched, and collect the data returned by the mapping function
      dataIterator.foreach { case (key, value) =>
        // Look up the state for this key
        wrappedState.wrap(newStateMap.get(key))
        // Call mappingFunction and keep its return value
        val returned = mappingFunction(batchTime, key, Some(value), wrappedState)
        // Apply the resulting change to newStateMap
        if (wrappedState.isRemoved) {
          newStateMap.remove(key)
        } else if (wrappedState.isUpdated
            || (wrappedState.exists && timeoutThresholdTime.isDefined)) {
          newStateMap.put(key, wrappedState.get(), batchTime.milliseconds)
        }
        mappedData ++= returned
      }

      // Get the timed out state records, call the mapping function on each and collect the
      // data returned
      if (removeTimedoutData && timeoutThresholdTime.isDefined) {
        newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
          wrappedState.wrapTimingOutState(state)
          val returned = mappingFunction(batchTime, key, None, wrappedState)
          mappedData ++= returned
          newStateMap.remove(key)
        }
      }

      MapWithStateRDDRecord(newStateMap, mappedData)
    }
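The core of updateRecordWithData can be modeled in a few lines of plain Scala (no Spark). This is only a sketch under simplifying assumptions: the timeout branch and the batch time are left out, a mutable Map stands in for StateMap, and SimpleState is a hypothetical stand-in for Spark's State/StateImpl. What it tries to show is that the mapping function runs once per incoming record, so only the keys present in the batch are touched and the rest of the state is left alone.

```scala
import scala.collection.mutable

object MapWithStateSketch {
  /** Hypothetical stand-in for Spark's State[S]; it records what the
    * mapping function did so the caller can update the state map. */
  final class SimpleState[S](private var value: Option[S]) {
    private var updated = false
    private var removed = false
    def getOption: Option[S] = value
    def update(s: S): Unit = { value = Some(s); updated = true; removed = false }
    def remove(): Unit = { value = None; removed = true; updated = false }
    def isUpdated: Boolean = updated
    def isRemoved: Boolean = removed
  }

  /** Applies `mapping` once per record in `data` and mutates only the
    * entries of `state` whose keys appear in `data`. */
  def updateWithData[K, V, S, E](state: mutable.Map[K, S], data: Seq[(K, V)])(
      mapping: (K, Option[V], SimpleState[S]) => Option[E]): Seq[E] = {
    val mapped = mutable.ArrayBuffer.empty[E]
    data.foreach { case (key, value) =>
      // Wrap the current state of this key (None if the key is new).
      val wrapped = new SimpleState[S](state.get(key))
      val returned = mapping(key, Some(value), wrapped)
      // Reflect what the mapping function did back into the state map.
      if (wrapped.isRemoved) state.remove(key)
      else if (wrapped.isUpdated) wrapped.getOption.foreach(s => state.update(key, s))
      // Collect the mapped output, if any.
      returned.foreach(e => mapped += e)
    }
    mapped.toList
  }
}
```

With a word-count mapping function like the one in StatefulNetworkWordCount, a batch of three records leads to exactly three calls of the mapping function, no matter how many other keys the state map already holds.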