1. updateStateByKey 解释:

    以DStream中的数据进行按key做reduce操作,然后对各个批次的数据进行累加

    在有新的数据信息进入或更新时。能够让用户保持想要的不论什么状。使用这个功能须要完毕两步:

    1) 定义状态:能够是随意数据类型

    2) 定义状态更新函数:用一个函数指定怎样使用先前的状态。从输入流中的新值更新状态。

    对于有状态操作,要不断的把当前和历史的时间切片的RDD累加计算,随着时间的流失,计算的数据规模会变得越来越大。

  2. updateStateByKey源代码:

    /**

    • Return a new “state” DStream where the state for each key is updated by applying
    • the given function on the previous state of the key and the new values of the key.
    • org.apache.spark.Partitioner is used to control the partitioning of each RDD.
    • @param updateFunc State update function. If this function returns None, then
    • corresponding state key-value pair will be eliminated.
    • @param partitioner Partitioner for controlling the partitioning of each RDD in the new
    • DStream.
    • @param initialRDD initial state value of each key.
    • @tparam S State type

      */

      def updateStateByKey[S: ClassTag](

      updateFunc: (Seq[V], Option[S]) => Option[S],

      partitioner: Partitioner,

      initialRDD: RDD[(K, S)]

      ): DStream[(K, S)] = {

      val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {

      iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))

      }

      updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)

      }
  3. 代码实现

    • StatefulNetworkWordCount

      object StatefulNetworkWordCount {
      def main(args: Array[String]) {
      if (args.length < 2) {
      System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
      System.exit(1)
      } Logger.getLogger("org.apache.spark").setLevel(Level.WARN) val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.sum val previousCount = state.getOrElse(0) Some(currentCount + previousCount)
      } val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
      } val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount").setMaster("local")
      // Create the context with a 1 second batch size
      val ssc = new StreamingContext(sparkConf, Seconds(1))
      ssc.checkpoint(".") // Initial RDD input to updateStateByKey
      val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1))) // Create a ReceiverInputDStream on target ip:port and count the
      // words in input stream of \n delimited test (eg. generated by 'nc')
      val lines = ssc.socketTextStream(args(0), args(1).toInt)
      val words = lines.flatMap(_.split(" "))
      val wordDstream = words.map(x => (x, 1)) // Update the cumulative count using updateStateByKey
      // This will give a Dstream made of state (which is the cumulative count of the words)
      val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
      new HashPartitioner (ssc.sparkContext.defaultParallelism), true, initialRDD)
      stateDstream.print()
      ssc.start()
      ssc.awaitTermination()
      }
      }
    • NetworkWordCount

import org.apache.spark.SparkConf
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ object NetworkWordCount {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println("Usage: NetworkWordCount <hostname> <port>")
System.exit(1)
} val sparkConf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(10))
//使用updateStateByKey前须要设置checkpoint
ssc.checkpoint("hdfs://master:8020/spark/checkpoint") val addFunc = (currValues: Seq[Int], prevValueState: Option[Int]) => {
//通过Spark内部的reduceByKey按key规约。然后这里传入某key当前批次的Seq/List,再计算当前批次的总和
val currentCount = currValues.sum
// 已累加的值
val previousCount = prevValueState.getOrElse(0)
// 返回累加后的结果。是一个Option[Int]类型
Some(currentCount + previousCount)
} val lines = ssc.socketTextStream(args(0), args(1).toInt)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1)) //val currWordCounts = pairs.reduceByKey(_ + _)
//currWordCounts.print() val totalWordCounts = pairs.updateStateByKey[Int](addFunc)
totalWordCounts.print() ssc.start()
ssc.awaitTermination()
}
}
  • WebPagePopularityValueCalculator
package com.spark.streaming

import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext} /**
* ━━━━━━神兽出没━━━━━━
*    ┏┓   ┏┓
*   ┏┛┻━━━┛┻┓
*   ┃       ┃
*   ┃   ━   ┃
*   ┃ ┳┛ ┗┳ ┃
*   ┃       ┃
*   ┃   ┻   ┃
*   ┃       ┃
*   ┗━┓   ┏━┛
*     ┃   ┃神兽保佑, 永无BUG!
*      ┃   ┃Code is far away from bug with the animal protecting
*     ┃   ┗━━━┓
*     ┃       ┣┓
*     ┃       ┏┛
*     ┗┓┓┏━┳┓┏┛
*      ┃┫┫ ┃┫┫
*      ┗┻┛ ┗┻┛
* ━━━━━━感觉萌萌哒━━━━━━
* Module Desc:
* User: wangyue
* DateTime: 15-11-9上午10:50
*/
object WebPagePopularityValueCalculator { private val checkpointDir = "popularity-data-checkpoint"
private val msgConsumerGroup = "user-behavior-topic-message-consumer-group" def main(args: Array[String]) { if (args.length < 2) {
println("Usage:WebPagePopularityValueCalculator zkserver1:2181, zkserver2: 2181, zkserver3: 2181 consumeMsgDataTimeInterval (secs) ")
System.exit(1)
} val Array(zkServers, processingInterval) = args
val conf = new SparkConf().setAppName("Web Page Popularity Value Calculator") val ssc = new StreamingContext(conf, Seconds(processingInterval.toInt))
//using updateStateByKey asks for enabling checkpoint
ssc.checkpoint(checkpointDir) val kafkaStream = KafkaUtils.createStream(
//Spark streaming context
ssc,
//zookeeper quorum. e.g zkserver1:2181,zkserver2:2181,...
zkServers,
//kafka message consumer group ID
msgConsumerGroup,
//Map of (topic_name -> numPartitions) to consume. Each partition is consumed in its own thread
Map("user-behavior-topic" -> 3))
val msgDataRDD = kafkaStream.map(_._2) //for debug use only
//println("Coming data in this interval...")
//msgDataRDD.print()
// e.g page37|5|1.5119122|-1
val popularityData = msgDataRDD.map { msgLine => {
val dataArr: Array[String] = msgLine.split("\\|")
val pageID = dataArr(0)
//calculate the popularity value
val popValue: Double = dataArr(1).toFloat * 0.8 + dataArr(2).toFloat * 0.8 + dataArr(3).toFloat * 1
(pageID, popValue)
}
} //sum the previous popularity value and current value
//定义一个匿名函数去把网页热度上一次的计算结果值和新计算的值相加,得到最新的热度值。 val updatePopularityValue = (iterator: Iterator[(String, Seq[Double], Option[Double])]) => {
iterator.flatMap(t => {
val newValue: Double = t._2.sum
val stateValue: Double = t._3.getOrElse(0);
Some(newValue + stateValue)
}.map(sumedValue => (t._1, sumedValue)))
} val initialRDD = ssc.sparkContext.parallelize(List(("page1", 0.00))) //调用 updateStateByKey 原语并传入上面定义的匿名函数更新网页热度值。
val stateDStream = popularityData.updateStateByKey[Double](updatePopularityValue,
new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD) //set the checkpoint interval to avoid too frequently data checkpoint which may
//may significantly reduce operation throughput
stateDStream.checkpoint(Duration(8 * processingInterval.toInt * 1000)) //after calculation, we need to sort the result and only show the top 10 hot pages
//最后得到最新结果后,须要对结果进行排序。最后打印热度值最高的 10 个网页。 stateDStream.foreachRDD { rdd => {
val sortedData = rdd.map { case (k, v) => (v, k) }.sortByKey(false)
val topKData = sortedData.take(10).map { case (v, k) => (k, v) }
topKData.foreach(x => {
println(x)
})
}
} ssc.start()
ssc.awaitTermination()
}
}

參考文章:

http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/

https://github.com/apache/spark/blob/branch-1.3/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala

http://stackoverflow.com/questions/28998408/spark-streaming-example-calls-updatestatebykey-with-additional-parameters

http://stackoverflow.com/questions/27535668/spark-streaming-groupbykey-and-updatestatebykey-implementation

尊重原创,未经同意不得转载:

http://blog.csdn.net/stark_summer/article/details/47666337

spark streaming updateStateByKey 使用方法的更多相关文章

  1. Spark Streaming updateStateByKey案例实战和内幕源码解密

    本节课程主要分二个部分: 一.Spark Streaming updateStateByKey案例实战二.Spark Streaming updateStateByKey源码解密 第一部分: upda ...

  2. spark streaming updateStateByKey 用法

    object NetworkWordCount { def main(args: Array[String]) { ) { System.err.println("Usage: Networ ...

  3. Spark Streaming updateStateByKey和mapWithState源码解密

    本篇从二个方面进行源码分析: 一.updateStateByKey解密 二.mapWithState解密 通过对Spark研究角度来研究jvm.分布式.图计算.架构设计.软件工程思想,可以学到很多东西 ...

  4. 55、Spark Streaming:updateStateByKey以及基于缓存的实时wordcount程序

    一.updateStateByKey 1.概述 SparkStreaming 7*24 小时不间断的运行,有时需要管理一些状态,比如wordCount,每个batch的数据不是独立的而是需要累加的,这 ...

  5. Spark Streaming状态管理函数updateStateByKey和mapWithState

    Spark Streaming状态管理函数updateStateByKey和mapWithState 一.状态管理函数 二.mapWithState 2.1关于mapWithState 2.2mapW ...

  6. spark streaming - kafka updateStateByKey 统计用户消费金额

    场景 餐厅老板想要统计每个用户来他的店里总共消费了多少金额,我们可以使用updateStateByKey来实现 从kafka接收用户消费json数据,统计每分钟用户的消费情况,并且统计所有时间所有用户 ...

  7. Spark Streaming中空batches处理的两种方法(转)

    原文链接:Spark Streaming中空batches处理的两种方法 Spark Streaming是近实时(near real time)的小批处理系统.对给定的时间间隔(interval),S ...

  8. Spark之 Spark Streaming整合kafka(并演示reduceByKeyAndWindow、updateStateByKey算子使用)

    Kafka0.8版本基于receiver接受器去接受kafka topic中的数据(并演示reduceByKeyAndWindow的使用) 依赖 <dependency> <grou ...

  9. kafka broker Leader -1引起spark Streaming不能消费的故障解决方法

    一.问题描述:Kafka生产集群中有一台机器cdh-003由于物理故障原因挂掉了,并且系统起不来了,使得线上的spark Streaming实时任务不能正常消费,重启实时任务都不行.查看kafka t ...

随机推荐

  1. POJ 3855 计算几何·多边形重心

    思路: 多边形面积->任选一个点,把多边形拆成三角,叉积一下 三角形重心->(x1+x2+x3)/3,(y1+y2+y3)/3 多边形重心公式题目中有,套一下就好了 计算多边形重心方法: ...

  2. mariadb的安装

    mysql (分支 mariadb)1.安装mariadb -yum -源码编译安装 -下载rpm安装 yum和源码编译安装的区别? 1.路径区别-yum安装的软件是他自定义的,源码安装的软件./co ...

  3. UNIX环境高级编程--6

    系统数据文件和信息    数据文件都是ASCII文本文件,并且使用标准I/O库读这些文件,例如口令文件/etc/passwd和组文件/etc/group就是经常被多个程序频繁使用的两个文件.    口 ...

  4. 元组Tuple、数组Array、映射Map

    一.元组Tuple 元组Tuple是不同类型的值的聚集,元组的值将单个的值包含在圆括号中来构成,元组可以包含一个不同类型的元素 如 val riple = (100, "Scala" ...

  5. [转]Wote用python语言写的imgHash.py

    #!/usr/bin/python import glob import os import sys from PIL import Image EXTS = 'jpg', 'jpeg', 'JPG' ...

  6. Android各种函数单位

    该博客随时更新.因为每次写函数都会考虑这个单位是什么,所以嫌比较麻烦,在这里总结一下,方便以后使用. paint.setStrokeWidth() 单位是 px . paint.getStrokeWi ...

  7. Less——less基本使用

    基本概况 Less 是一门 CSS 预处理语言,它扩充了 CSS 语言,增加了诸如变量.混合(mixin).函数等功能,让 CSS 更易维护.方便制作主题.扩充.Less 可以运行在 Node.浏览器 ...

  8. 控制台——屏蔽Ctrl+C快捷键对窗体的关闭功能

    导入SetCtrlHandlerHandler API //定义处理程序委托 public delegate bool ConsoleCtrlDelegate(int ctrlType); //导入S ...

  9. java实现搜索附近地点或人的功能

    前言 当前大多数app都有查找附近的功能, 简单的有查找周围的运动场馆, 复杂的有滴滴, 摩拜查找周围的车辆. 本文主要阐述查找附近地点的一般实现. 方案比较 方案1 (性能还不错) 数据库直接存经纬 ...

  10. Quartz实战

    https://my.oschina.net/yinxiaoling/blog/542336?fromerr=s3ko7u33 Quartz实战 > 一.内存型(1) <bean name ...