1 Overview
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like mapreducejoin and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. 
Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.
input data stream->spark streaming->batches of input data->spark engine->batches of processed data
Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
 
2 A quick example
//start the data server
# nc -lk 9999
   
3 Basic concepts
 
3.1 Linking
 
3.2 Initializing StreamingContext
step1 Define the input sources by creating input DStreams.
step2 Define the streaming computations by applying transformation and output operations to DStreams.
step3 Start receiving data and processing it using streamingContext.start().
step4 Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
step5 The processing can be manually stopped using streamingContext.stop().
 
Points to remember:
  • Once a context has been started, no new streaming computations can be set up or added to it.
  • Once a context has been stopped, it cannot be restarted.
  • Only one StreamingContext can be active in a JVM at the same time.
  • stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() calledstopSparkContext to false.
  • A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next StreamingContext is created.
 
3.3 Discretized Streams (DStreams)
Discretized(离散化处理的) Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs.
 
3.4 Input DStreams and Receivers
Every input DStream (except file stream, discussed later in this section) is associated with a Receiver (Scala doc, Java doc) object which receives the data from a source and stores it in Spark’s memory for processing.
Points to remember
  • When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. Either of these means that only one thread will be used for running tasks locally. If you are using a input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).

  • Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it.

Basic Sources:
scc.fileStream()
scc.queueStream()
scc.socketTextStream()
scc.actorStream()
Advanced Sources:
Kafka
Flume
Kinesis
Twitter
Custom Sources:
Input DStreams can also be created out of custom data sources. All you have to do is implement a user-defined receiver (see next section to understand what that is) that can receive data from the custom sources and push it into Spark. See the Custom Receiver Guide for details.
 
3.5 Transformations on DStreams
transformations that  worth discussing in more detail:
UpdateStateByKey Operation
Transform Operation
Window Operation
     Any window operation needs to specify two parameters:
         window length - The duration of the window (3 in the figure).
         sliding interval - The interval at which the window operation is performed (2 in the figure).
     I want to extend the earlier example by generating word counts over the last 30 seconds of data, every 10 seconds. 
         // Reduce last 30 seconds of data, every 10 seconds
         val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10))
Join Operation : leftOuterJoin, rightOuterJoin, fullOuterJoin
 
3.6 Output Operations on DStreams
Output operations allow DStream’s data to be pushed out to external systems like a database or a file systems. Since the output operations actually allow the transformed data to be consumed by external systems, they trigger the actual execution of all the DStream transformations (similar to actions for RDDs).
 
3.7 DataFrame and SQL Operations
 
3.8 MLlib Operations
 
3.9 Caching / Persistence
 
3.10 Checkpointing
The default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
3.11 Deploying Applications
 
3.12 Monitoring Applications
 
4 Performance Tuning

4.1 Reducing the Batch Processing Times
 
4.2 Setting the Right Batch Interval
 
4.3 Memory Tuning

 

Spark Streaming - DStream的更多相关文章

  1. 58、Spark Streaming: DStream的output操作以及foreachRDD详解

    一.output操作 1.output操作 DStream中的所有计算,都是由output操作触发的,比如print().如果没有任何output操作,那么,压根儿就不会执行定义的计算逻辑. 此外,即 ...

  2. 54、Spark Streaming:DStream的transformation操作概览

    一. transformation操作概览 Transformation Meaning map 对传入的每个元素,返回一个新的元素 flatMap 对传入的每个元素,返回一个或多个元素 filter ...

  3. spark streaming(2) DAG静态定义及DStream,DStreamGraph

    DAG 中文名有向无环图.它不是spark独有技术.它是一种编程思想 ,甚至于hadoop阵营里也有运用DAG的技术,比如Tez,Oozie.有意思的是,Tez是从MapReduce的基础上深化而来的 ...

  4. 大数据技术之_19_Spark学习_04_Spark Streaming 应用解析 + Spark Streaming 概述、运行、解析 + DStream 的输入、转换、输出 + 优化

    第1章 Spark Streaming 概述1.1 什么是 Spark Streaming1.2 为什么要学习 Spark Streaming1.3 Spark 与 Storm 的对比第2章 运行 S ...

  5. Spark Streaming源码分析 – DStream

    A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous sequence o ...

  6. spark streaming 2: DStream

    DStream是类似于RDD概念,是对数据的抽象封装.它是一序列的RDD,事实上,它大部分的操作都是对RDD支持的操作的封装,不同的是,每次DStream都要遍历它内部所有的RDD执行这些操作.它可以 ...

  7. Spark Streaming消费Kafka Direct方式数据零丢失实现

    使用场景 Spark Streaming实时消费kafka数据的时候,程序停止或者Kafka节点挂掉会导致数据丢失,Spark Streaming也没有设置CheckPoint(据说比较鸡肋,虽然可以 ...

  8. Spark Streaming

    Spark Streaming Spark Streaming 是Spark为了用户实现流式计算的模型. 数据源包括Kafka,Flume,HDFS等. DStream 离散化流(discretize ...

  9. spark streaming kafka1.4.1中的低阶api createDirectStream使用总结

    转载:http://blog.csdn.net/ligt0610/article/details/47311771 由于目前每天需要从kafka中消费20亿条左右的消息,集群压力有点大,会导致job不 ...

随机推荐

  1. CentOS 6.8安装Ceph

    机器规划 IP 主机名 角色 10.101.0.1 ceph01 mon admin mds 10.101.0.2 ceph02 ods 10.101.0.3 ceph03 ods 10.101.0. ...

  2. Ubuntu 16.04LTS 更新清华源

    1 备份原来的更新源 cp /etc/apt/sources.list /etc/apt/sources.list.backup 如果提示权限不够就输入下面两行,先进入到超级用户,再备份 sudo - ...

  3. 纯js轮播图练习-2,js+css旋转木马层叠轮播

    基于css3的新属性,加上js的操作,让现在js轮播图花样越来越多. 而现在出现的旋转木马层叠轮播的轮播图样式,却是得到了很多人都喜爱和投入使用. 尤其是在各大软件中,频繁的出现在大家的眼里,在web ...

  4. Xshell配色方案推荐

    使用方法: 新建mycolor.xcs文件 复制粘贴如下代码,将文件导入,修改自己喜欢的字体即可 [mycolor] text=00ff80 cyan(bold)=00ffff text(bold)= ...

  5. Git Status 中文乱码解决

    现象: jb@H39:~/doc$ git statusOn branch masterYour branch is up-to-date with 'origin/master'. Untracke ...

  6. FPGA学习之路——PLL的使用

    锁相环(PLL)主要用于频率综合,使用一个 PLL 可以从一个输入时钟信号生成多个时钟信号. PLL 内部的功能框图如下图所示: 在ISE中新建一个PLL的IP核,设置四个输出时钟,分别为25MHz. ...

  7. linux如何制作程序桌面快捷方式

    1.生成通过apt或者dpkg安装的程序的桌面快捷方式 他们的快捷方式在/usr/share/applications中,比如我们要生成火狐的桌面快捷方式,执行下列命令 cp /usr/share/a ...

  8. 成都Uber优步司机奖励政策(2月17日)

    滴快车单单2.5倍,注册地址:http://www.udache.com/ 如何注册Uber司机(全国版最新最详细注册流程)/月入2万/不用抢单:http://www.cnblogs.com/mfry ...

  9. Chrome模拟平板调试

    1. 按F12,打开开发者工具,右上角,点击红圈中的标志.然后在弹出的面板中点击'Emulation'. 2. 会看到左侧的四个选项卡  Device 设备.Screen 屏幕.User Agent ...

  10. 下载Web微信视频

    1. 用浏览器(我用Chrome)登录web微信(wx.qq.com) 2. 这个时候如果有人发视频,可以点开播放.用F12打开chrome的调试平台,查看视频源的URL(绿色框的src内容) 3. ...