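1. What updateStateByKey does:

    It reduces the data in a DStream by key and then accumulates the results across batches.

    When new data arrives or existing data is updated, it lets you maintain whatever state you need. Using it takes two steps:

    1) Define the state: it can be of any data type.

    2) Define the state update function: a function that specifies how to update the state from the previous state and the new values from the input stream.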
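    The two steps can be tried out without a cluster. The following is a minimal sketch in plain Scala, not Spark code: the object name `RunningCountSketch` and the helper `replay` are made up for illustration, and `replay` only imitates how the update function would be called once per batch for a single key.

```scala
object RunningCountSketch {
  // Step 1: the state. Here it is a plain Int, the running count for one key.
  type State = Int

  // Step 2: the state update function. `newValues` holds the values that arrived
  // for one key in the current batch; `previous` is None the first time the key is seen.
  val updateFunc: (Seq[Int], Option[State]) => Option[State] =
    (newValues, previous) => Some(previous.getOrElse(0) + newValues.sum)

  // Hypothetical helper (not a Spark API): feeds successive batches of values
  // for a single key through updateFunc, carrying the state along.
  def replay(batches: Seq[Seq[Int]]): Option[State] =
    batches.foldLeft(Option.empty[State])((state, batch) => updateFunc(batch, state))

  def main(args: Array[String]): Unit = {
    // Three batches for the same key: two values, no values, one value.
    println(replay(Seq(Seq(1, 1), Seq.empty, Seq(1)))) // prints Some(3)
  }
}
```

    The second batch is empty, yet the state is still carried forward; that is the behaviour a running count needs.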

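    For a stateful operation, the RDDs of the current and all earlier time slices have to be accumulated continuously, so the amount of data involved in the computation keeps growing as time passes.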
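    One way to keep that growth in check is to let the update function return None for keys that are no longer interesting, since a None result removes the key from the state. The sketch below is plain Scala under assumptions of my own: the `CountState` case class and the `maxIdleBatches` threshold are hypothetical, chosen only to illustrate the idea of dropping keys that have been idle for a few batches.

```scala
object ExpiringStateSketch {
  // State per key: the running count plus the number of consecutive batches
  // in which the key received no new values.
  final case class CountState(count: Int, idleBatches: Int)

  // Hypothetical threshold, chosen for illustration only.
  val maxIdleBatches = 3

  val updateFunc: (Seq[Int], Option[CountState]) => Option[CountState] =
    (newValues, previous) => {
      val prev = previous.getOrElse(CountState(0, 0))
      if (newValues.nonEmpty) {
        // New data: add it to the count and reset the idle counter.
        Some(CountState(prev.count + newValues.sum, 0))
      } else if (prev.idleBatches + 1 >= maxIdleBatches) {
        // Idle for too long: returning None drops this key from the state.
        None
      } else {
        Some(prev.copy(idleBatches = prev.idleBatches + 1))
      }
    }

  def main(args: Array[String]): Unit = {
    println(updateFunc(Seq(1, 1), None))                 // new key with two values
    println(updateFunc(Seq.empty, Some(CountState(5, 2)))) // third idle batch in a row
  }
}
```

    Whether three idle batches is a sensible cut-off depends entirely on the application; the point is only that the state type can carry whatever bookkeeping the eviction rule needs.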

  2. updateStateByKey source code:

    /**
     * Return a new "state" DStream where the state for each key is updated by applying
     * the given function on the previous state of the key and the new values of the key.
     * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
     * @param updateFunc State update function. If this function returns None, then
     *                   corresponding state key-value pair will be eliminated.
     * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
     *                    DStream.
     * @param initialRDD initial state value of each key.
     * @tparam S State type
     */
    def updateStateByKey[S: ClassTag](
        updateFunc: (Seq[V], Option[S]) => Option[S],
        partitioner: Partitioner,
        initialRDD: RDD[(K, S)]
      ): DStream[(K, S)] = {
      val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
        iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
      }
      updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
    }
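
    To see what the wrapper in the source does, it helps to run the same flatMap on an ordinary iterator. The sketch below is plain Scala with the key type fixed to String; the object name and the sample update function (which drops a key once its sum reaches zero) are made up for illustration.

```scala
object UpdateWrapperSketch {
  // Per-key update function: keep a running sum, but return None
  // (and thereby drop the key) once the sum reaches zero.
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (values, state) => {
      val total = state.getOrElse(0) + values.sum
      if (total == 0) None else Some(total)
    }

  // Same shape as newUpdateFunc in the source above, with K = String, V = S = Int.
  // flatMap over an Option keeps the Some results and silently skips the None ones.
  val newUpdateFunc: Iterator[(String, Seq[Int], Option[Int])] => Iterator[(String, Int)] =
    iterator => iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))

  def main(args: Array[String]): Unit = {
    val input = Iterator(
      ("a", Seq(1, 2), Some(3)), // 3 + 1 + 2 = 6, kept
      ("b", Seq(-1), Some(1)),   // 1 - 1 = 0, dropped
      ("c", Seq(4), None)        // new key, 4, kept
    )
    println(newUpdateFunc(input).toList) // prints List((a,6), (c,4))
  }
}
```

    This is why a None from the per-key function makes the key disappear from the resulting state: the wrapper simply never emits a pair for it.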
  3. Code examples

    • StatefulNetworkWordCount

      import org.apache.log4j.{Level, Logger}
      import org.apache.spark.{HashPartitioner, SparkConf}
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object StatefulNetworkWordCount {
        def main(args: Array[String]): Unit = {
          if (args.length < 2) {
            System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
            System.exit(1)
          }

          Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

          // Per-key update: add this batch's counts to the previous total
          val updateFunc = (values: Seq[Int], state: Option[Int]) => {
            val currentCount = values.sum
            val previousCount = state.getOrElse(0)
            Some(currentCount + previousCount)
          }

          // Iterator form expected by the updateStateByKey overload used below
          val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
            iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
          }

          // local[2]: the socket receiver occupies one thread, so at least one more is needed for processing
          val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount").setMaster("local[2]")
          // Create the context with a 1 second batch size
          val ssc = new StreamingContext(sparkConf, Seconds(1))
          ssc.checkpoint(".")

          // Initial RDD input to updateStateByKey
          val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))

          // Create a ReceiverInputDStream on target ip:port and count the
          // words in input stream of \n delimited text (e.g. generated by 'nc')
          val lines = ssc.socketTextStream(args(0), args(1).toInt)
          val words = lines.flatMap(_.split(" "))
          val wordDstream = words.map(x => (x, 1))

          // Update the cumulative count using updateStateByKey
          // This will give a DStream made of state (which is the cumulative count of the words)
          val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
            new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)
          stateDstream.print()
          ssc.start()
          ssc.awaitTermination()
        }
      }
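
      The effect of the initial RDD is easiest to see on a small example. The sketch below imitates one batch of updateStateByKey on plain Scala collections; it is not Spark code, `applyBatch` is a hypothetical helper, and it assumes that the initial RDD simply serves as the previous state for the first batch.

```scala
object InitialStateSketch {
  // Same per-key logic as updateFunc in StatefulNetworkWordCount.
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (values, state) => Some(values.sum + state.getOrElse(0))

  // Hypothetical stand-in for one updateStateByKey batch on plain collections.
  def applyBatch(state: Map[String, Int], line: String): Map[String, Int] = {
    // (word, 1) pairs grouped by word: word -> Seq(1, 1, ...)
    val newValues: Map[String, Seq[Int]] =
      line.split(" ").filter(_.nonEmpty).groupBy(identity)
        .map { case (word, occurrences) => word -> Seq.fill(occurrences.length)(1) }
    // Keys that already have state are updated even when the batch has no values for them.
    (state.keySet ++ newValues.keySet).toSeq.flatMap { word =>
      updateFunc(newValues.getOrElse(word, Seq.empty), state.get(word)).map(count => word -> count)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val initialState = Map("hello" -> 1, "world" -> 1) // mirrors initialRDD
    // "hello" starts from 1, so one occurrence brings it to 2; "spark" starts from nothing.
    println(applyBatch(initialState, "hello spark"))
  }
}
```

      With the seeded state, the first "hello" typed into nc is reported with a count of 2 rather than 1, and "world" shows up with its initial count even before it is ever received.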
    • NetworkWordCount

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.StreamingContext._

      object NetworkWordCount {
        def main(args: Array[String]): Unit = {
          if (args.length < 2) {
            System.err.println("Usage: NetworkWordCount <hostname> <port>")
            System.exit(1)
          }

          val sparkConf = new SparkConf().setAppName("NetworkWordCount")
          val ssc = new StreamingContext(sparkConf, Seconds(10))
          // A checkpoint directory must be set before using updateStateByKey
          ssc.checkpoint("hdfs://master:8020/spark/checkpoint")

          val addFunc = (currValues: Seq[Int], prevValueState: Option[Int]) => {
            // Spark groups the values by key internally and passes in the Seq of
            // values for one key in the current batch; sum them to get this batch's total
            val currentCount = currValues.sum
            // The total accumulated so far
            val previousCount = prevValueState.getOrElse(0)
            // Return the new total, wrapped as an Option[Int]
            Some(currentCount + previousCount)
          }

          val lines = ssc.socketTextStream(args(0), args(1).toInt)
          val words = lines.flatMap(_.split(" "))
          val pairs = words.map(word => (word, 1))

          //val currWordCounts = pairs.reduceByKey(_ + _)
          //currWordCounts.print()

          val totalWordCounts = pairs.updateStateByKey[Int](addFunc)
          totalWordCounts.print()

          ssc.start()
          ssc.awaitTermination()
        }
      }
    • WebPagePopularityValueCalculator

      package com.spark.streaming

      import org.apache.spark.{HashPartitioner, SparkConf}
      import org.apache.spark.streaming.kafka.KafkaUtils
      import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}

      /**
       * Module Desc:
       * User: wangyue
       * DateTime: 15-11-9 10:50 AM
       */
      object WebPagePopularityValueCalculator {

        private val checkpointDir = "popularity-data-checkpoint"
        private val msgConsumerGroup = "user-behavior-topic-message-consumer-group"

        def main(args: Array[String]): Unit = {
          if (args.length < 2) {
            println("Usage: WebPagePopularityValueCalculator zkserver1:2181,zkserver2:2181,zkserver3:2181 consumeMsgDataTimeInterval(secs)")
            System.exit(1)
          }

          val Array(zkServers, processingInterval) = args
          val conf = new SparkConf().setAppName("Web Page Popularity Value Calculator")
          val ssc = new StreamingContext(conf, Seconds(processingInterval.toInt))
          // using updateStateByKey requires checkpointing to be enabled
          ssc.checkpoint(checkpointDir)

          val kafkaStream = KafkaUtils.createStream(
            // Spark streaming context
            ssc,
            // ZooKeeper quorum, e.g. zkserver1:2181,zkserver2:2181,...
            zkServers,
            // Kafka message consumer group ID
            msgConsumerGroup,
            // Map of (topic_name -> numPartitions) to consume. Each partition is consumed in its own thread
            Map("user-behavior-topic" -> 3))
          val msgDataRDD = kafkaStream.map(_._2)

          // for debug use only
          //println("Coming data in this interval...")
          //msgDataRDD.print()
          // e.g. page37|5|1.5119122|-1
          val popularityData = msgDataRDD.map { msgLine =>
            val dataArr: Array[String] = msgLine.split("\\|")
            val pageID = dataArr(0)
            // calculate the popularity value
            val popValue: Double = dataArr(1).toDouble * 0.8 + dataArr(2).toDouble * 0.8 + dataArr(3).toDouble * 1
            (pageID, popValue)
          }

          // Anonymous function that adds a page's previous popularity value to the
          // newly computed values, giving the latest popularity value
          val updatePopularityValue = (iterator: Iterator[(String, Seq[Double], Option[Double])]) => {
            iterator.flatMap { t =>
              val newValue: Double = t._2.sum
              val stateValue: Double = t._3.getOrElse(0.0)
              Some(newValue + stateValue).map(summedValue => (t._1, summedValue))
            }
          }

          val initialRDD = ssc.sparkContext.parallelize(List(("page1", 0.00)))

          // Call updateStateByKey with the function defined above to update the popularity values
          val stateDStream = popularityData.updateStateByKey[Double](updatePopularityValue,
            new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)

          // set the checkpoint interval to avoid checkpointing too frequently,
          // which may significantly reduce operation throughput
          stateDStream.checkpoint(Duration(8 * processingInterval.toInt * 1000))

          // after the calculation, sort the result and print only the top 10 hot pages
          stateDStream.foreachRDD { rdd =>
            val sortedData = rdd.map { case (k, v) => (v, k) }.sortByKey(false)
            val topKData = sortedData.take(10).map { case (v, k) => (k, v) }
            topKData.foreach(println)
          }

          ssc.start()
          ssc.awaitTermination()
        }
      }
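
      The popularity formula and the top-N step can be checked by hand on the sample message from the listing. The sketch below is plain Scala, not Spark code; the names `PopularitySketch`, `popularity` and `topK` are hypothetical, and it parses the fields with `toDouble` as a choice of this sketch.

```scala
object PopularitySketch {
  // Parses one "pageID|x|y|z" message and applies the weights 0.8, 0.8 and 1
  // used in the listing above.
  def popularity(msgLine: String): (String, Double) = {
    val dataArr = msgLine.split("\\|")
    val popValue = dataArr(1).toDouble * 0.8 + dataArr(2).toDouble * 0.8 + dataArr(3).toDouble * 1
    (dataArr(0), popValue)
  }

  // Returns the k pages with the highest popularity value, highest first.
  def topK(pages: Seq[(String, Double)], k: Int): Seq[(String, Double)] =
    pages.sortBy { case (_, value) => -value }.take(k)

  def main(args: Array[String]): Unit = {
    // 5 * 0.8 + 1.5119122 * 0.8 + (-1) * 1 = 4.0 + 1.20952976 - 1.0, about 4.2095
    println(popularity("page37|5|1.5119122|-1"))
    println(topK(Seq(("page1", 1.0), ("page2", 3.0), ("page3", 2.0)), 2))
  }
}
```

      On a real stream the sort runs on the whole state RDD every batch, so for a large number of pages it may be worth looking at cheaper ways to pick the top entries; that is outside the scope of this sketch.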

References:

http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/

https://github.com/apache/spark/blob/branch-1.3/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala

http://stackoverflow.com/questions/28998408/spark-streaming-example-calls-updatestatebykey-with-additional-parameters

http://stackoverflow.com/questions/27535668/spark-streaming-groupbykey-and-updatestatebykey-implementation

Please respect the original work; do not repost without permission:

http://blog.csdn.net/stark_summer/article/details/47666337

