Spark Streaming: Window Functions and Stateful Transformation Functions

Stream processing has three main usage patterns: stateless operations, window operations, and stateful operations.
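To make the three patterns concrete, here is a minimal sketch in plain Scala (no Spark dependency). It models a stream as a sequence of batches; the object and method names are hypothetical and chosen only for illustration.

```scala
object StreamPatternsSketch {
  // Stateless: every batch is processed on its own.
  def perBatchCounts(batches: Seq[Seq[String]]): Seq[Int] =
    batches.map(_.size)

  // Window: each result covers the last `windowLen` batches.
  def windowCounts(batches: Seq[Seq[String]], windowLen: Int): Seq[Int] =
    batches.indices.map { i =>
      batches.slice(math.max(0, i - windowLen + 1), i + 1).map(_.size).sum
    }

  // Stateful: each result depends on state carried over from all earlier batches.
  def runningCounts(batches: Seq[Seq[String]]): Seq[Int] =
    batches.scanLeft(0)(_ + _.size).tail
}
```

The only difference between the three methods is how much history each result is allowed to see: none, a bounded window, or everything so far.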

reduceByKeyAndWindow

import kafka.serializer.StringDecoder
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming._
import org.apache.spark.{SparkContext, SparkConf}

object ClickStream {
  def main(args: Array[String]): Unit = {
    // Keep unnecessary log output off the terminal
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // Create the SparkConf and set the application name, which is shown in the monitoring UI
    val conf = new SparkConf().setAppName("ClickStream").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // The batch interval is the basic unit of time at which Spark Streaming generates jobs;
    // the window length and the slide interval must both be integer multiples of it
    val ssc = new StreamingContext(sc, Seconds(args().toLong))

    // Window functions reuse RDDs from earlier batches, so checkpointing is required;
    // note that the reused RDDs are otherwise independent of one another
    ssc.checkpoint(args())

    val topics = Set("clickstream")    // Kafka topic to read from
    val brokers = "yz4203.hadoop.data.sina.com.cn:19092,yz4202.hadoop.data.sina.com.cn:19092,10.39.4.212:19092,10.39.4.201:19092"
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    //val offset = "largest"    // values: smallest, largest; controls whether the newest or the oldest data is read, default is largest

    // Since Spark 1.3, data can be read from Kafka efficiently with the direct approach below
    val kvsTemp = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    val kvs = kvsTemp.map(line => line._2)    // the first element is the key (null here); the second is the payload, a String

    // Clean and transform the incoming records as required
    val data = kvs.map(_.split("\\t")).filter(_() == "finance").map(_()).map(_.split("\\?")()).filter(! _.contains("iframe")).map((_, 1))

    // Window length 1 hour, slide interval 10 minutes: yields the url -> pv mapping for the past hour
    //val pvWindow = data.reduceByKeyAndWindow((v1: Int, v2: Int) => v1+v2, Minutes(60), Minutes(10))

    // Same window length (1 hour) and slide interval (10 minutes), and the same url -> pv mapping,
    // but computed incrementally ("add the new, subtract the old"): the first function adds the values
    // entering the window, the second subtracts the values that have left it.
    // The previous version recomputes the whole window on every slide; this incremental version is more efficient.
    val pvWindow = data.reduceByKeyAndWindow(_ + _, _ - _, Minutes(60), Minutes(10))
    pvWindow.print()

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
    ssc.stop(true, true)    // Stop gracefully
  }
}
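The difference between the two variants of reduceByKeyAndWindow can be sketched without Spark. The following is a simplified model, not Spark's implementation: each batch is assumed to be already reduced to a per-url count map, the slide interval is assumed to be one batch, and all names are hypothetical.

```scala
object IncrementalWindowSketch {
  type Counts = Map[String, Int]

  // Merge two per-key count maps with `f`, dropping keys whose result is 0.
  def combine(a: Counts, b: Counts)(f: (Int, Int) => Int): Counts =
    (a.keySet ++ b.keySet).iterator
      .map(k => k -> f(a.getOrElse(k, 0), b.getOrElse(k, 0)))
      .filter(_._2 != 0)
      .toMap

  // Full recomputation: re-reduce every batch currently inside the window.
  def recompute(batches: Seq[Counts], windowLen: Int): Seq[Counts] =
    batches.indices.map { i =>
      batches.slice(math.max(0, i - windowLen + 1), i + 1)
        .foldLeft(Map.empty[String, Int])((acc, b) => combine(acc, b)(_ + _))
    }

  // Incremental: previous window, plus the batch that entered, minus the batch that left.
  def incremental(batches: Seq[Counts], windowLen: Int): Seq[Counts] =
    batches.indices.scanLeft(Map.empty[String, Int]) { (prev, i) =>
      val added = combine(prev, batches(i))(_ + _)
      if (i >= windowLen) combine(added, batches(i - windowLen))(_ - _) else added
    }.tail
}
```

Both methods produce the same windows, but `recompute` touches `windowLen` batches per slide while `incremental` touches at most two. The incremental form only works because subtraction undoes addition, which is why the inverse function has to be supplied explicitly.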

countByValueAndWindow

The source code of countByValueAndWindow is shown below:

  /**
   * Return a new DStream in which each RDD contains the count of distinct elements in
   * RDDs in a sliding window over this DStream. Hash partitioning is used to generate
   * the RDDs with `numPartitions` partitions (Spark's default number of partitions if
   * `numPartitions` not specified).
   * @param windowDuration width of the window; must be a multiple of this DStream's
   *                       batching interval
   * @param slideDuration  sliding interval of the window (i.e., the interval after which
   *                       the new DStream will generate RDDs); must be a multiple of this
   *                       DStream's batching interval
   * @param numPartitions  number of partitions of each RDD in the new DStream.
   */
  def countByValueAndWindow(
      windowDuration: Duration,
      slideDuration: Duration,
      numPartitions: Int = ssc.sc.defaultParallelism)
      (implicit ord: Ordering[T] = null)
      : DStream[(T, Long)] = ssc.withScope {
    this.map((_, 1L)).reduceByKeyAndWindow(
      (x: Long, y: Long) => x + y,
      (x: Long, y: Long) => x - y,
      windowDuration,
      slideDuration,
      numPartitions,
      (x: (T, Long)) => x._2 != 0L
    )
  }
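The last argument in the listing above, the filter `x._2 != 0L`, matters because incremental subtraction would otherwise leave behind keys whose count has dropped to zero. A minimal plain-Scala sketch of one window slide shows the effect; the `slide` helper and its `dropZeros` flag are hypothetical and exist only for this illustration.

```scala
object CountByValueSketch {
  // Apply one slide of the window: add one count per entering value,
  // subtract one count per leaving value, then optionally drop zero counts.
  def slide(
      prev: Map[String, Long],
      entering: Seq[String],
      leaving: Seq[String],
      dropZeros: Boolean): Map[String, Long] = {
    val added = entering.foldLeft(prev) { (m, v) =>
      m.updated(v, m.getOrElse(v, 0L) + 1L)
    }
    val subtracted = leaving.foldLeft(added) { (m, v) =>
      m.updated(v, m.getOrElse(v, 0L) - 1L)
    }
    if (dropZeros) subtracted.filter { case (_, n) => n != 0L } else subtracted
  }
}
```

Without the filter, every value that ever appeared would stay in the result with a count of 0, so the state would grow without bound on a stream with many distinct values.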

reduceByWindow

The source code of reduceByWindow is shown below:

  /**
   * Return a new DStream in which each RDD has a single element generated by reducing all
   * elements in a sliding window over this DStream. However, the reduction is done incrementally
   * using the old window's reduced value :
   *  1. reduce the new values that entered the window (e.g., adding new counts)
   *  2. "inverse reduce" the old values that left the window (e.g., subtracting old counts)
   *  This is more efficient than reduceByWindow without "inverse reduce" function.
   *  However, it is applicable to only "invertible reduce functions".
   * @param reduceFunc associative and commutative reduce function
   * @param invReduceFunc inverse reduce function; such that for all y, invertible x:
   *                      `invReduceFunc(reduceFunc(x, y), x) = y`
   * @param windowDuration width of the window; must be a multiple of this DStream's
   *                       batching interval
   * @param slideDuration  sliding interval of the window (i.e., the interval after which
   *                       the new DStream will generate RDDs); must be a multiple of this
   *                       DStream's batching interval
   */
  def reduceByWindow(
      reduceFunc: (T, T) => T,
      invReduceFunc: (T, T) => T,
      windowDuration: Duration,
      slideDuration: Duration
    ): DStream[T] = ssc.withScope {
      this.map((1, _))
          .reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, 1)
          .map(_._2)
  }

countByWindow

The source code of countByWindow is shown below:

  /**
   * Return a new DStream in which each RDD has a single element generated by counting the number
   * of elements in a sliding window over this DStream. Hash partitioning is used to generate
   * the RDDs with Spark's default number of partitions.
   * @param windowDuration width of the window; must be a multiple of this DStream's
   *                       batching interval
   * @param slideDuration  sliding interval of the window (i.e., the interval after which
   *                       the new DStream will generate RDDs); must be a multiple of this
   *                       DStream's batching interval
   */
  def countByWindow(
      windowDuration: Duration,
      slideDuration: Duration): DStream[Long] = ssc.withScope {
    this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
  }

As the listings show, countByValueAndWindow, reduceByWindow, and countByWindow are all built on the "add the new, subtract the old" version of reduceByKeyAndWindow.
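The way a global reduction is expressed through a keyed one can be sketched in plain Scala: every element is paired with the same dummy key, the keyed reduction runs, and the key is dropped again. The helper names below are hypothetical, and `reduceByKey` here is a local stand-in over an in-memory sequence, not the Spark method.

```scala
object ConstantKeySketch {
  // Local stand-in for a keyed reduction: group by key, then reduce the values per key.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // Global reduction via the keyed one: attach dummy key 1, reduce, drop the key.
  def reduceAll[V](values: Seq[V])(f: (V, V) => V): Option[V] =
    reduceByKey(values.map((1, _)))(f).values.headOption

  // Counting is a global reduction over a sequence of ones.
  def countAll[V](values: Seq[V]): Long =
    reduceAll(values.map(_ => 1L))(_ + _).getOrElse(0L)
}
```

Because all elements share one key, the keyed reduction ends up with a single partition's worth of work for the final value, which is consistent with the listing passing a partition count of 1.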

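The program above computes the url-to-PV mapping for each one-hour window. To go one step further and compare, for the same url, its PV in the previous window with its PV in the current window, a stateful operator is needed: updateStateByKey or mapWithState. Since mapWithState is roughly 10 times faster than updateStateByKey, only mapWithState is covered here.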

mapWithState

import kafka.serializer.StringDecoder
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming._
import org.apache.spark.{SparkContext, SparkConf}

object ClickStream {
  def main(args: Array[String]): Unit = {
    // Keep unnecessary log output off the terminal
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // Create the SparkConf and set the application name, which is shown in the monitoring UI
    val conf = new SparkConf().setAppName("ClickStream").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // The batch interval is the basic unit of time at which Spark Streaming generates jobs;
    // the window length and the slide interval must both be integer multiples of it
    val ssc = new StreamingContext(sc, Seconds(args().toLong))

    // Window functions reuse RDDs from earlier batches, so checkpointing is required;
    // note that the reused RDDs are otherwise independent of one another
    ssc.checkpoint(args())

    val topics = Set("clickstream")    // Kafka topic to read from
    val brokers = "yz4207.hadoop.data.sina.com.cn:19092,yz4203.hadoop.data.sina.com.cn:19092,yz4202.hadoop.data.sina.com.cn:19092,10.39.4.212:19092,10.39.4.201:19092"
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    //val offset = "largest"    // values: smallest, largest; controls whether the newest or the oldest data is read, default is largest

    // Since Spark 1.3, data can be read from Kafka efficiently with the direct approach below
    val kvsTemp = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    val kvs = kvsTemp.map(line => line._2)    // the first element is the key (null here); the second is the payload, a String

    // Clean and transform the incoming records as required
    val data = kvs.map(_.split("\\t")).filter(_() == "finance").map(_()).map(_.split("\\?")()).filter(! _.contains("iframe")).map((_, 1))

    // Window length 1 hour, slide interval 10 minutes: yields the url -> pv mapping for the past hour
    //val pvWindow = data.reduceByKeyAndWindow((v1: Int, v2: Int) => v1+v2, Minutes(60), Minutes(10))

    // Same window length (1 hour) and slide interval (10 minutes), and the same url -> pv mapping,
    // but computed incrementally ("add the new, subtract the old"): the first function adds the values
    // entering the window, the second subtracts the values that have left it.
    // The previous version recomputes the whole window on every slide; this incremental version is more efficient.
    val pvWindow = data.reduceByKeyAndWindow(_ + _, _ - _, Minutes(60), Minutes(10))

    // key is the url, value is the new value, and state holds the old value
    // (the state as it was before this batch). The state must be updated to the new value here.
    val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
      val currentPV = value.getOrElse(0)
      val output = (key, currentPV, state.getOption().getOrElse(0))
      state.update(currentPV)
      output
    }

    // StateSpec is only a wrapper; the actual work is done by the mappingFunc defined above
    val urlPvs = pvWindow.mapWithState(StateSpec.function(mappingFunc))    // (url, PV in the current window, PV in the previous window)
    urlPvs.print()

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
    ssc.stop(true, true)    // Stop gracefully
  }
}
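The program emits the current and previous PV side by side; the ratio mentioned earlier is one division away. The following plain-Scala sketch mimics what the mapping function does for each url, with the ratio added. `KeyedState` is a hypothetical in-memory stand-in for the per-key state that mapWithState manages, and returning `None` when there is no usable previous value is an assumption of this sketch, not something the original program does.

```scala
import scala.collection.mutable

object PvRatioSketch {
  // Hypothetical stand-in for the per-key state that mapWithState maintains.
  final class KeyedState {
    private val store = mutable.Map.empty[String, Int]
    def getOption(key: String): Option[Int] = store.get(key)
    def update(key: String, value: Int): Unit = store(key) = value
  }

  // Returns (url, current PV, previous PV, current/previous ratio).
  // The ratio is None when the previous PV is missing or zero.
  def mapping(state: KeyedState)(url: String, currentPV: Int): (String, Int, Int, Option[Double]) = {
    val previousPV = state.getOption(url).getOrElse(0)
    val ratio = if (previousPV > 0) Some(currentPV.toDouble / previousPV) else None
    state.update(url, currentPV) // remember the current PV for the next call
    (url, currentPV, previousPV, ratio)
  }
}
```

The order of operations is the important part: the old state is read first, the output is built from it, and only then is the state overwritten with the current value.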
