spark streaming updateStateByKey 使用方法
updateStateByKey 解释:
以DStream中的数据进行按key做reduce操作,然后对各个批次的数据进行累加
在有新的数据信息进入或更新时。能够让用户保持想要的不论什么状。使用这个功能须要完毕两步:
1) 定义状态:能够是随意数据类型
2) 定义状态更新函数:用一个函数指定怎样使用先前的状态。从输入流中的新值更新状态。
对于有状态操作,要不断的把当前和历史的时间切片的RDD累加计算,随着时间的流失,计算的数据规模会变得越来越大。updateStateByKey源代码:
/**
- Return a new “state” DStream where the state for each key is updated by applying
- the given function on the previous state of the key and the new values of the key.
- org.apache.spark.Partitioner is used to control the partitioning of each RDD.
- @param updateFunc State update function. If
thisfunction returns None, then - corresponding state key-value pair will be eliminated.
- @param partitioner Partitioner for controlling the partitioning of each RDD in the new
- DStream.
- @param initialRDD initial state value of each key.
- @tparam S State type
*/
def updateStateByKey[S: ClassTag](
updateFunc: (Seq[V], Option[S]) => Option[S],
partitioner: Partitioner,
initialRDD: RDD[(K, S)]
): DStream[(K, S)] = {
val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
}
updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
}
代码实现
StatefulNetworkWordCount
object StatefulNetworkWordCount {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
System.exit(1)
} Logger.getLogger("org.apache.spark").setLevel(Level.WARN) val updateFunc = (values: Seq[Int], state: Option[Int]) => {
val currentCount = values.sum val previousCount = state.getOrElse(0) Some(currentCount + previousCount)
} val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
} val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount").setMaster("local")
// Create the context with a 1 second batch size
val ssc = new StreamingContext(sparkConf, Seconds(1))
ssc.checkpoint(".") // Initial RDD input to updateStateByKey
val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1))) // Create a ReceiverInputDStream on target ip:port and count the
// words in input stream of \n delimited test (eg. generated by 'nc')
val lines = ssc.socketTextStream(args(0), args(1).toInt)
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(x => (x, 1)) // Update the cumulative count using updateStateByKey
// This will give a Dstream made of state (which is the cumulative count of the words)
val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
new HashPartitioner (ssc.sparkContext.defaultParallelism), true, initialRDD)
stateDstream.print()
ssc.start()
ssc.awaitTermination()
}
}NetworkWordCount
import org.apache.spark.SparkConf
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
object NetworkWordCount {
def main(args: Array[String]) {
if (args.length < 2) {
System.err.println("Usage: NetworkWordCount <hostname> <port>")
System.exit(1)
}
val sparkConf = new SparkConf().setAppName("NetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(10))
//使用updateStateByKey前须要设置checkpoint
ssc.checkpoint("hdfs://master:8020/spark/checkpoint")
val addFunc = (currValues: Seq[Int], prevValueState: Option[Int]) => {
//通过Spark内部的reduceByKey按key规约。然后这里传入某key当前批次的Seq/List,再计算当前批次的总和
val currentCount = currValues.sum
// 已累加的值
val previousCount = prevValueState.getOrElse(0)
// 返回累加后的结果。是一个Option[Int]类型
Some(currentCount + previousCount)
}
val lines = ssc.socketTextStream(args(0), args(1).toInt)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
//val currWordCounts = pairs.reduceByKey(_ + _)
//currWordCounts.print()
val totalWordCounts = pairs.updateStateByKey[Int](addFunc)
totalWordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}
- WebPagePopularityValueCalculator
package com.spark.streaming
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Duration, Seconds, StreamingContext}
/**
* ━━━━━━神兽出没━━━━━━
* ┏┓ ┏┓
* ┏┛┻━━━┛┻┓
* ┃ ┃
* ┃ ━ ┃
* ┃ ┳┛ ┗┳ ┃
* ┃ ┃
* ┃ ┻ ┃
* ┃ ┃
* ┗━┓ ┏━┛
* ┃ ┃神兽保佑, 永无BUG!
* ┃ ┃Code is far away from bug with the animal protecting
* ┃ ┗━━━┓
* ┃ ┣┓
* ┃ ┏┛
* ┗┓┓┏━┳┓┏┛
* ┃┫┫ ┃┫┫
* ┗┻┛ ┗┻┛
* ━━━━━━感觉萌萌哒━━━━━━
* Module Desc:
* User: wangyue
* DateTime: 15-11-9上午10:50
*/
object WebPagePopularityValueCalculator {
private val checkpointDir = "popularity-data-checkpoint"
private val msgConsumerGroup = "user-behavior-topic-message-consumer-group"
def main(args: Array[String]) {
if (args.length < 2) {
println("Usage:WebPagePopularityValueCalculator zkserver1:2181, zkserver2: 2181, zkserver3: 2181 consumeMsgDataTimeInterval (secs) ")
System.exit(1)
}
val Array(zkServers, processingInterval) = args
val conf = new SparkConf().setAppName("Web Page Popularity Value Calculator")
val ssc = new StreamingContext(conf, Seconds(processingInterval.toInt))
//using updateStateByKey asks for enabling checkpoint
ssc.checkpoint(checkpointDir)
val kafkaStream = KafkaUtils.createStream(
//Spark streaming context
ssc,
//zookeeper quorum. e.g zkserver1:2181,zkserver2:2181,...
zkServers,
//kafka message consumer group ID
msgConsumerGroup,
//Map of (topic_name -> numPartitions) to consume. Each partition is consumed in its own thread
Map("user-behavior-topic" -> 3))
val msgDataRDD = kafkaStream.map(_._2)
//for debug use only
//println("Coming data in this interval...")
//msgDataRDD.print()
// e.g page37|5|1.5119122|-1
val popularityData = msgDataRDD.map { msgLine => {
val dataArr: Array[String] = msgLine.split("\\|")
val pageID = dataArr(0)
//calculate the popularity value
val popValue: Double = dataArr(1).toFloat * 0.8 + dataArr(2).toFloat * 0.8 + dataArr(3).toFloat * 1
(pageID, popValue)
}
}
//sum the previous popularity value and current value
//定义一个匿名函数去把网页热度上一次的计算结果值和新计算的值相加,得到最新的热度值。
val updatePopularityValue = (iterator: Iterator[(String, Seq[Double], Option[Double])]) => {
iterator.flatMap(t => {
val newValue: Double = t._2.sum
val stateValue: Double = t._3.getOrElse(0);
Some(newValue + stateValue)
}.map(sumedValue => (t._1, sumedValue)))
}
val initialRDD = ssc.sparkContext.parallelize(List(("page1", 0.00)))
//调用 updateStateByKey 原语并传入上面定义的匿名函数更新网页热度值。
val stateDStream = popularityData.updateStateByKey[Double](updatePopularityValue,
new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)
//set the checkpoint interval to avoid too frequently data checkpoint which may
//may significantly reduce operation throughput
stateDStream.checkpoint(Duration(8 * processingInterval.toInt * 1000))
//after calculation, we need to sort the result and only show the top 10 hot pages
//最后得到最新结果后,须要对结果进行排序。最后打印热度值最高的 10 个网页。
stateDStream.foreachRDD { rdd => {
val sortedData = rdd.map { case (k, v) => (v, k) }.sortByKey(false)
val topKData = sortedData.take(10).map { case (v, k) => (k, v) }
topKData.foreach(x => {
println(x)
})
}
}
ssc.start()
ssc.awaitTermination()
}
}
參考文章:
http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/
https://github.com/apache/spark/blob/branch-1.3/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala
http://stackoverflow.com/questions/28998408/spark-streaming-example-calls-updatestatebykey-with-additional-parameters
http://stackoverflow.com/questions/27535668/spark-streaming-groupbykey-and-updatestatebykey-implementation
尊重原创,未经同意不得转载:
http://blog.csdn.net/stark_summer/article/details/47666337
spark streaming updateStateByKey 使用方法的更多相关文章
- Spark Streaming updateStateByKey案例实战和内幕源码解密
本节课程主要分二个部分: 一.Spark Streaming updateStateByKey案例实战二.Spark Streaming updateStateByKey源码解密 第一部分: upda ...
- spark streaming updateStateByKey 用法
object NetworkWordCount { def main(args: Array[String]) { ) { System.err.println("Usage: Networ ...
- Spark Streaming updateStateByKey和mapWithState源码解密
本篇从二个方面进行源码分析: 一.updateStateByKey解密 二.mapWithState解密 通过对Spark研究角度来研究jvm.分布式.图计算.架构设计.软件工程思想,可以学到很多东西 ...
- 55、Spark Streaming:updateStateByKey以及基于缓存的实时wordcount程序
一.updateStateByKey 1.概述 SparkStreaming 7*24 小时不间断的运行,有时需要管理一些状态,比如wordCount,每个batch的数据不是独立的而是需要累加的,这 ...
- Spark Streaming状态管理函数updateStateByKey和mapWithState
Spark Streaming状态管理函数updateStateByKey和mapWithState 一.状态管理函数 二.mapWithState 2.1关于mapWithState 2.2mapW ...
- spark streaming - kafka updateStateByKey 统计用户消费金额
场景 餐厅老板想要统计每个用户来他的店里总共消费了多少金额,我们可以使用updateStateByKey来实现 从kafka接收用户消费json数据,统计每分钟用户的消费情况,并且统计所有时间所有用户 ...
- Spark Streaming中空batches处理的两种方法(转)
原文链接:Spark Streaming中空batches处理的两种方法 Spark Streaming是近实时(near real time)的小批处理系统.对给定的时间间隔(interval),S ...
- Spark之 Spark Streaming整合kafka(并演示reduceByKeyAndWindow、updateStateByKey算子使用)
Kafka0.8版本基于receiver接受器去接受kafka topic中的数据(并演示reduceByKeyAndWindow的使用) 依赖 <dependency> <grou ...
- kafka broker Leader -1引起spark Streaming不能消费的故障解决方法
一.问题描述:Kafka生产集群中有一台机器cdh-003由于物理故障原因挂掉了,并且系统起不来了,使得线上的spark Streaming实时任务不能正常消费,重启实时任务都不行.查看kafka t ...
随机推荐
- blockhouses
题意 : 给你一张图上面" X " 代表墙 , " . " 代表空地 , 让你在空地上放置炮台 , 条件是 不能 让彼此的炮台 可以互相看见 ( 隔着墙就看不见 ...
- HTML--使用下拉列表框,节省空间
下拉列表在网页中也常会用到,它可以有效的节省网页空间.既可以单选.又可以多选.如下代码: 讲解: 1.value: 2.selected="selected": 设置selecte ...
- 【知识总结】后缀数组(Suffix_Array)
又是一个学了n遍还没学会的算法-- 后缀数组是一种常用的处理字符串问题的数据结构,主要由\(sa\)和\(rank\)两个数组组成.以下给出一些定义: \(str\)表示处理的字符串,长度为\(len ...
- ACM_小游戏(棋盘博弈)
Problem Description: 最近kiki无事可做,于是他想玩棋盘游戏.棋盘的大小是n * m.首先,棋子放置在右上角(1,m). 每次可以将棋子向左方,下方或左下方移动一个位置.当移动到 ...
- day03_12/13/2016_bean属性的设置之构造器方式注入
- 胖ap和瘦ap的区别
一,什么是AP,胖瘦AP如何区分? 先说说AP的概念.AP是Access Point的简称,即无线接入点,其作用是把局域网里通过双绞线传输的有线信号(即电信号)经过编译,转换成无线电信号传 ...
- 正则表达式 \D 元字符
\D元字符可以匹配非数字字符,等价于"[^0-9]". 语法结构: (1).构造函数方式: new RegExp("\\D") (2).对象直接量方式: /\D ...
- JQuery中常用的$.get(),$.post(),$.ajax(),$.getJSON(),load()的详解与区别
背景:因为最近需要获取本地的数据件进行项目测试,需要用到JQuery实现数据文件的读取,但是由于对JQuery内的获取文件方式不太了解,这次趁着机会进行一下总结.因为该总结是本人根据平常的使用及网上的 ...
- PHP魔术法__set和__get
__set: 在给不可访问属性赋值时,__set() 会被调用.语法如下: public void __set ( string $name , mixed $value ) __get: 读取不可访 ...
- JS——筋斗云案例
需求: 1.鼠标移动到哪里,云彩移动到哪里 2.鼠标离开,云彩回到原点 3.鼠标离开,云彩回到之前点击的地方 <!DOCTYPE html> <html lang="en& ...