Spark Streaming - DStream
A DStream represents a continuous stream of data and can be processed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. A streaming application is set up in the following steps:
step1 Define the input sources by creating input DStreams.
step2 Define the streaming computations by applying transformation and output operations to DStreams.
step3 Start receiving data and processing it using streamingContext.start().
step4 Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
step5 The processing can be manually stopped using streamingContext.stop().
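A minimal sketch of these steps, assuming the classic socket word-count setup (the host, port, app name, and batch interval below are placeholder values):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// step1: create the context and an input DStream (a socket source is assumed here)
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

// step2: apply transformations and an output operation
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()

// step3 and step4: start receiving data, then block until stopped
ssc.start()
ssc.awaitTermination()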
Points to remember:
- Once a context has been started, no new streaming computations can be set up or added to it.
- Once a context has been stopped, it cannot be restarted.
- Only one StreamingContext can be active in a JVM at the same time.
- stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false.
- A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created, as shown in the sketch after this list.
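A minimal sketch of stopping only the StreamingContext and re-using its SparkContext (the app name and batch intervals are arbitrary placeholders):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("ReuseSparkContext"))
val ssc1 = new StreamingContext(sc, Seconds(1))
// ... define DStreams on ssc1, start it, and later decide to stop it ...
ssc1.stop(stopSparkContext = false) // stops only the streaming side; sc stays alive

// The stopped ssc1 cannot be restarted, but sc can back a fresh StreamingContext
val ssc2 = new StreamingContext(sc, Seconds(5))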
Points to remember:
When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. Either of these means that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then that single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).
Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it.
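For instance, with a single socket receiver, at least two local threads are needed (a sketch; the app name is a placeholder):
import org.apache.spark.SparkConf

// "local[2]": one thread runs the receiver, the other processes the received batches.
// "local" or "local[1]" would leave no thread for processing.
val conf = new SparkConf().setMaster("local[2]").setAppName("ReceiverApp")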
window length - The duration of the window (3 batch intervals in the figure).
sliding interval - The interval at which the window operation is performed (2 batch intervals in the figure).
These two parameters must be multiples of the batch interval of the source DStream.
// Reduce last 30 seconds of data, every 10 seconds
// (pairs is the DStream[(String, Int)] built in the word-count sketch above)
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
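The API also provides an incremental form of reduceByKeyAndWindow that subtracts the values of batches sliding out of the window instead of recomputing the whole window each time; it requires checkpointing to be enabled (the checkpoint path below is a placeholder):
// Incrementally maintain the windowed counts: add entering batches, subtract leaving ones
ssc.checkpoint("checkpoint") // required for the inverse-function variant
val windowedWordCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // merge counts for batches entering the window
  (a: Int, b: Int) => a - b, // remove counts for batches leaving the window
  Seconds(30), Seconds(10))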
4.1 Reducing the Batch Processing Times