spark streaming 2: DStream


/**
* A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous
* sequence of RDDs (of the same type) representing a continuous stream of data (see
* org.apache.spark.rdd.RDD in the Spark core documentation for more details on RDDs).
* DStreams can either be created from live data (such as, data from TCP sockets, Kafka, Flume,
* etc.) using a [[org.apache.spark.streaming.StreamingContext]] or it can be generated by
* transforming existing DStreams using operations such as `map`,
* `window` and `reduceByKeyAndWindow`. While a Spark Streaming program is running, each DStream
* periodically generates a RDD, either from live data or by transforming the RDD generated by a
* parent DStream.
*
* This class contains the basic operations available on all DStreams, such as `map`, `filter` and
* `window`. In addition, [[org.apache.spark.streaming.dstream.PairDStreamFunctions]] contains
* operations available only on DStreams of key-value pairs, such as `groupByKeyAndWindow` and
* `join`. These operations are automatically available on any DStream of pairs
* (e.g., DStream[(Int, Int)] through implicit conversions when
* `org.apache.spark.streaming.StreamingContext._` is imported.
*
* DStreams internally is characterized by a few basic properties:
* - A list of other DStreams that the DStream depends on
* - A time interval at which the DStream generates an RDD
* - A function that is used to generate an RDD after each time interval
*/
abstract class DStream[T: ClassTag] (
@transient private[streaming] var ssc: StreamingContext
) extends Serializable with Logging {
// =======================================================================
// Methods that should be implemented by subclasses of DStream
// =======================================================================
/** Time interval after which the DStream generates a RDD */
def slideDuration: Duration
/** List of parent DStreams on which this DStream depends on */
def dependencies: List[DStream[_]]
/** Method that generates a RDD for the given time */
def compute (validTime: Time): Option[RDD[T]]
// =======================================================================
// Methods and fields available on all DStreams
// =======================================================================
// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]] ()
/**
* Get the RDD corresponding to the given time; either retrieve it from cache
* or compute-and-cache it.
*/
private[streaming] def getOrCompute(time: Time): Option[RDD[T]] = {
spark streaming 2: DStream的更多相关文章
- 53、Spark Streaming:输入DStream之Kafka数据源实战
一.基于Receiver的方式 1.概述 基于Receiver的方式: Receiver是使用Kafka的高层次Consumer API来实现的.receiver从Kafka中获取的数据都是存储在Sp ...
- Spark入门实战系列--7.Spark Streaming(上)--实时流计算Spark Streaming原理介绍
[注]该系列文章以及使用到安装包/测试数据 可以在<倾情大奉送--Spark入门实战系列>获取 .Spark Streaming简介 1.1 概述 Spark Streaming 是Spa ...
- Spark Streaming
Spark Streaming Spark Streaming 是Spark为了用户实现流式计算的模型. 数据源包括Kafka,Flume,HDFS等. DStream 离散化流(discretize ...
- Spark学习之Spark Streaming
一.简介 许多应用需要即时处理收到的数据,例如用来实时追踪页面访问统计的应用.训练机器学习模型的应用,还有自动检测异常的应用.Spark Streaming 是 Spark 为这些应用而设计的模型.它 ...
- Spark Streaming 实现思路与模块概述
一.基于 Spark 做 Spark Streaming 的思路 Spark Streaming 与 Spark Core 的关系可以用下面的经典部件图来表述: 在本节,我们先探讨一下基于 Spark ...
- .Spark Streaming(上)--实时流计算Spark Streaming原理介
Spark入门实战系列--7.Spark Streaming(上)--实时流计算Spark Streaming原理介绍 http://www.cnblogs.com/shishanyuan/p/474 ...
- spark streaming的理解和应用
1.Spark Streaming简介 官方网站解释:http://spark.apache.org/docs/latest/streaming-programming-guide.html 该博客转 ...
- 实时流计算Spark Streaming原理介绍
1.Spark Streaming简介 1.1 概述 Spark Streaming 是Spark核心API的一个扩展,可以实现高吞吐量的.具备容错机制的实时流数据的处理.支持从多种数据源获取数据,包 ...
- Spark Streaming之一:整体介绍
提到Spark Streaming,我们不得不说一下BDAS(Berkeley Data Analytics Stack),这个伯克利大学提出的关于数据分析的软件栈.从它的视角来看,目前的大数据处理可 ...
随机推荐
- 11 Mysql之配置双主热备+keeepalived.md
准备 1. 双主 master1 192.168.199.49 master2 192.168.199.50 VIP 192.168.199.52 //虚拟IP 2.环境 master:nginx + ...
- html/css中map和area的应用
一.使用方法: 因为map标签是与img标签绑定使用的,所以我们需要给map标签添加ID和name属性,让img标签中的usemap属性引用map标签中的id或者name属性(由于浏览器的不同,use ...
- html基础知识(总结自www.runoob.com)
HTML属性 属性 描述 class 为html元素定义一个或多个类名(classname)(类名从样式文件引入) id 定义元素的唯一id style 规定元素的行内样式(inline style) ...
- 一个页面两个div(一个柱状图或者折线图一个饼图)
需求是一个页面中两个图,一个饼图一个折线图,接口用的是一个接口,柱状图的图例要隐藏掉,X轴为月份,每月份都有两个数据,也就是图例是两个(进口和出口)的意思饼图需要显示最新月份数据,并且有一个下拉框可以 ...
- MySQL 中 EXISTS 的用法
在MySQL中 EXISTS 和 IN 的用法有什么关系和区别呢? 假定数据库中有两个表 分别为 表 a 和表 b create table a ( a_id int, a_name varchar( ...
- 【异常】Reason: Executor heartbeat timed out after 140927 ms
1 详细异常 ERROR scheduler.JobScheduler: Error running job streaming job ms. org.apache.spark.SparkExcep ...
- zencart根据configuration_id cID查找站点配置
admin/configuration.php?gID=6&cID=1075 zencart根据configuration_id cID查找站点配置 ; zencart根据configurat ...
- 重写 equals() 和 hashcode()
重写equals() @Override public boolean equals(Object o) { if (this == o) return true; if (o == null || ...
- echarts常用的配置项
最近使用echarts可视化的业务,但是有一些配置项需要修改,把这段时间的学习总结一下 1. 修改默认配置 a. 去掉分割线和网格线,在xAxis或者yAxis中设置 splitLine: { sho ...
- IView 使用Table组件时实现给某一列添加click事件
通过给 columns 数据的项,设置一个函数 render,可以自定义渲染当前列,包括渲染自定义组件,它基于 Vue 的 Render 函数. render 函数传入两个参数,第一个是 h,第二个是 ...