object SparkStreaming

Window queries

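import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreaming_StateFul {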

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    val conf = new SparkConf().setMaster("local[2]")
      .setAppName(this.getClass.getSimpleName)
      .set("spark.executor.memory", "2g")
      .set("spark.cores.max", "8")
      .setJars(Array("E:\\ScalaSpace\\Spark_Streaming\\out\\artifacts\\Spark_Streaming.jar"))
    val context = new SparkContext(conf)

    val updateFunc = (values : Seq[Int],state : Option[Int]) => {
      val currentCount  = values.foldLeft(0)(_+_)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    } // keeps the historical count: use the previous value if it exists, otherwise default to 0

    //step1 create streaming context
    val ssc = new StreamingContext(context, Seconds(5)) // batch interval: compute every 5 s
    ssc.checkpoint(".")

    //step2 create a socket input stream on the target ip:port and count the words in the input stream of \n delimited text
    val lines = ssc.socketTextStream("218.193.154.79",12345)

    val data = lines.flatMap(_.split(" "))
    val wordDstream = data.map(x => (x,1)).reduceByKeyAndWindow(_+_,_-_,Seconds(10),Seconds(15))

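    // Every 15 s, compute the result over the preceding 10 s (window = 10 s, slide = 15 s).
    // Both values must be multiples of the batch interval (5 s here).

    // use updateStateByKey to update the state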
    val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)

    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
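
To make the window parameters concrete, here is a small plain-Scala sketch (no Spark needed) that lists which 5 s batches fall inside each window when the window length is 10 s and the slide is 15 s. It assumes that each batch is labelled by the end of its interval and that windows fire at multiples of the slide. `batchesInWindow` is a hypothetical helper written for this illustration, not a Spark API.

```scala
object WindowCoverageSketch {
  val batchSec  = 5   // batch interval, as in StreamingContext(context, Seconds(5))
  val windowSec = 10  // window length passed to reduceByKeyAndWindow
  val slideSec  = 15  // slide interval passed to reduceByKeyAndWindow

  // Hypothetical helper: end times (in seconds) of the batches covered by the
  // window that fires at time t, i.e. batches ending in (t - windowSec, t].
  def batchesInWindow(t: Int): Seq[Int] =
    (batchSec to t by batchSec).filter(b => b > t - windowSec)

  def main(args: Array[String]): Unit = {
    // Under the assumptions above, windows fire at 15 s, 30 s and 45 s.
    for (t <- slideSec to 45 by slideSec)
      println(s"window at ${t}s covers batches ending at ${batchesInWindow(t).mkString(", ")}")
  }
}
```

In this model the slide is longer than the window, so one batch in every three (the ones ending at 5 s, 20 s and 35 s) is not covered by any window. If every batch should be counted, the slide would have to be no longer than the window.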

The output looks like the following; the counts are accumulated over all results seen so far.

-------------------------------------------
Time: 1459156160000 ms
-------------------------------------------
(B,1)
(F,1)
(D,4)
(G,1)
(A,1)
(C,5)
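
The totals keep growing because the update function adds each new batch of values to the previous state. The following plain-Scala sketch shows this for a single key, without Spark. The per-batch values are made up for illustration, and `accumulate` is a hypothetical helper that threads the state through the batches roughly the way updateStateByKey would for one key.

```scala
object UpdateStateSketch {
  // Same update function as in the listing above.
  val updateFunc = (values: Seq[Int], state: Option[Int]) => {
    val currentCount  = values.foldLeft(0)(_ + _)
    val previousCount = state.getOrElse(0)
    Some(currentCount + previousCount)
  }

  // Hypothetical helper: feed the batches to updateFunc one after another,
  // starting with no state, and return the final state.
  def accumulate(batches: Seq[Seq[Int]]): Option[Int] =
    batches.foldLeft(Option.empty[Int]) { (state, values) =>
      val next = updateFunc(values, state)
      println(s"values=$values previous=$state next=$next")
      next
    }

  def main(args: Array[String]): Unit = {
    // Made-up counts for one key: two hits, then an empty batch, then three hits.
    val batches = Seq(Seq(1, 1), Seq.empty[Int], Seq(3))
    println(accumulate(batches))
  }
}
```

An empty batch leaves the state unchanged, because the fold over no values contributes 0 and the previous count is carried forward.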

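With these counts we can now rank the hottest keywords; the code that does so, shown further below, relies on transform.

Why is a transform operation needed here? Let's look at the source comments for transform and for sortByKey: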

/**
 * Return a new DStream in which each RDD is generated by applying a function
 * on each RDD of 'this' DStream.
 */
def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U] = ssc.withScope {
  // because the DStream is reachable from the outer object here, and because
  // DStreams can't be serialized with closures, we can't proactively check
  // it for serializability and so we pass the optional false to SparkContext.clean
  val cleanedF = context.sparkContext.clean(transformFunc, false)
  transform((r: RDD[T], t: Time) => cleanedF(r))
}

/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope
{
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}

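From the comments above we can see that sortByKey sorts the data across all partitions of a single RDD, not across all RDDs. Because Spark Streaming operates on a sequence of RDDs, we need the transform operation to apply the sort to every RDD in the stream.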

stateDstream.map{
  case (char,count) => (count,char)
}.transform(_.sortByKey(false))
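
The effect of this pipeline can be modelled with plain Scala collections. The sketch below is not Spark code: it treats a stream as a sequence of batches, uses a hypothetical `transformEach` as a stand-in for DStream.transform, and uses a hypothetical `rank` to mirror the swap followed by sortByKey(false). The first batch reuses the sample output shown earlier; the second batch is made up.

```scala
object HotKeywordSketch {
  type Batch = Seq[(String, Int)]

  // Stand-in for DStream.transform: apply a function to every batch independently.
  def transformEach[A, B](stream: Seq[Seq[A]])(f: Seq[A] => Seq[B]): Seq[Seq[B]] =
    stream.map(f)

  // Swap to (count, word) and sort by count in descending order, mirroring
  // map { case (char, count) => (count, char) } followed by sortByKey(false).
  def rank(batch: Batch): Seq[(Int, String)] =
    batch.map { case (word, count) => (count, word) }
      .sortBy { case (count, _) => -count }

  def main(args: Array[String]): Unit = {
    val stream: Seq[Batch] = Seq(
      Seq("B" -> 1, "F" -> 1, "D" -> 4, "G" -> 1, "A" -> 1, "C" -> 5),
      Seq("A" -> 2, "C" -> 7)
    )
    // Each batch is ranked on its own; nothing is sorted across batches.
    transformEach(stream)(rank).foreach(b => println(b.mkString(" ")))
  }
}
```

The count is moved into the key position because sortByKey orders by key; sorting in descending order then puts the most frequent keyword first in every batch.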
