spark streaming 2: DStream

DStream是类似于RDD概念，是对数据的抽象封装。它是一序列的RDD，事实上，它大部分的操作都是对RDD支持的操作的封装，不同的是，每次DStream都要遍历它内部所有的RDD执行这些操作。它可以由StreamingContext通过流数据产生或者其他DStream使用map方法产生（与RDD一样）

time属性对DStream而言非常重要，DStream里面的RDD就是通过某个时间间隔产生的，而且以产生的时间为索引。所以在访问DStream的某个RDD时，实际上是访问它在某个时间点的RDD。

/**
 * A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous
 * sequence of RDDs (of the same type) representing a continuous stream of data (see
 * org.apache.spark.rdd.RDD in the Spark core documentation for more details on RDDs).
 * DStreams can either be created from live data (such as, data from TCP sockets, Kafka, Flume,
 * etc.) using a [[org.apache.spark.streaming.StreamingContext]] or it can be generated by
 * transforming existing DStreams using operations such as `map`,
 * `window` and `reduceByKeyAndWindow`. While a Spark Streaming program is running, each DStream
 * periodically generates a RDD, either from live data or by transforming the RDD generated by a
 * parent DStream.
 *
 * This class contains the basic operations available on all DStreams, such as `map`, `filter` and
 * `window`. In addition, [[org.apache.spark.streaming.dstream.PairDStreamFunctions]] contains
 * operations available only on DStreams of key-value pairs, such as `groupByKeyAndWindow` and
 * `join`. These operations are automatically available on any DStream of pairs
 * (e.g., DStream[(Int, Int)] through implicit conversions when
 * `org.apache.spark.streaming.StreamingContext._` is imported.
 *
 * DStreams internally is characterized by a few basic properties:
 *  - A list of other DStreams that the DStream depends on
 *  - A time interval at which the DStream generates an RDD
 *  - A function that is used to generate an RDD after each time interval
 */

abstract class DStream[T: ClassTag] (
    @transient private[streaming] var ssc: StreamingContext
  ) extends Serializable with Logging {

重要属性：

// =======================================================================
// Methods that should be implemented by subclasses of DStream
// =======================================================================
/** Time interval after which the DStream generates a RDD */
def slideDuration: Duration
/** List of parent DStreams on which this DStream depends on */
def dependencies: List[DStream[_]]
/** Method that generates a RDD for the given time */
def compute (validTime: Time): Option[RDD[T]]

当前已经产生了的RDD，以产生的时间为索引

// =======================================================================
// Methods and fields available on all DStreams
// =======================================================================

// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]] ()

为某个时间点产生一个RDD

/**
 * Get the RDD corresponding to the given time; either retrieve it from cache
 * or compute-and-cache it.
 */
private[streaming] def getOrCompute(time: Time): Option[RDD[T]] = {

From WizNote

spark streaming 2: DStream的更多相关文章

53、Spark Streaming:输入DStream之Kafka数据源实战
一.基于Receiver的方式 1.概述基于Receiver的方式: Receiver是使用Kafka的高层次Consumer API来实现的.receiver从Kafka中获取的数据都是存储在Sp ...
Spark入门实战系列--7.Spark Streaming（上）--实时流计算Spark Streaming原理介绍
[注]该系列文章以及使用到安装包/测试数据可以在<倾情大奉送--Spark入门实战系列>获取 .Spark Streaming简介 1.1 概述 Spark Streaming 是Spa ...
Spark Streaming
Spark Streaming Spark Streaming 是Spark为了用户实现流式计算的模型. 数据源包括Kafka,Flume,HDFS等. DStream 离散化流(discretize ...
Spark学习之Spark Streaming
一.简介许多应用需要即时处理收到的数据,例如用来实时追踪页面访问统计的应用.训练机器学习模型的应用,还有自动检测异常的应用.Spark Streaming 是 Spark 为这些应用而设计的模型.它 ...
Spark Streaming 实现思路与模块概述
一.基于 Spark 做 Spark Streaming 的思路 Spark Streaming 与 Spark Core 的关系可以用下面的经典部件图来表述: 在本节,我们先探讨一下基于 Spark ...
.Spark Streaming（上）--实时流计算Spark Streaming原理介
Spark入门实战系列--7.Spark Streaming(上)--实时流计算Spark Streaming原理介绍 http://www.cnblogs.com/shishanyuan/p/474 ...
spark streaming的理解和应用
1.Spark Streaming简介官方网站解释:http://spark.apache.org/docs/latest/streaming-programming-guide.html 该博客转 ...
实时流计算Spark Streaming原理介绍
1.Spark Streaming简介 1.1 概述 Spark Streaming 是Spark核心API的一个扩展,可以实现高吞吐量的.具备容错机制的实时流数据的处理.支持从多种数据源获取数据,包 ...
Spark Streaming之一：整体介绍
提到Spark Streaming,我们不得不说一下BDAS(Berkeley Data Analytics Stack),这个伯克利大学提出的关于数据分析的软件栈.从它的视角来看,目前的大数据处理可 ...

随机推荐

O012、Linux如何实现VLAN
参考https://www.cnblogs.com/CloudMan6/p/5313994.html LAN 表示 Local Area Network ,本地局域网,通常使用 Hub 或者 Sw ...
nginx（五）- linux下安装nginx与配置
linux系统为Centos 64位准备目录 [root@instance-3lm099to ~]# mkdir /usr/local/nginx [root@instance-3lm099to ~ ...
转载： utm坐标和经纬度相互转换
原文地址: https://blog.csdn.net/hanshuobest/article/details/77752279 //经纬度转utm坐标 int convert_lonlat_utm( ...
python cv2读取rtsp实时码流按时生成连续视频文件
代码实现 # coding: utf-8 import datetime import cv2 import os ip = '192.168.3.160'.replace("." ...
postgres外部表
在创建外部表的时候遇见: CREATE EXTENSION file_fdw;2018-12-21 17:32:23.822 CST [31237] ERROR: could not open ex ...
linux下mysql忘记密码解决方案
一.写随笔的原因:之前自己服务器上的mysql很久不用了,忘记了密码,所以写一下解决方案,以供以后参考二.具体的内容: 1. 检查mysql服务是否启动,如果启动,关闭mysql服务运行命令:ps ...
elastic 基本操作
官方参考文档: https://www.elastic.co/guide/cn/elasticsearch/guide/current/index-doc.html 1.查看有哪些索引: curl ...
Java 和JavaScript实现C#中的String.format效果
1.Java实现 /** * 需要引入com.alibaba.fastjson.1.2.8 * String result2=HuaatUtil.format(templa ...
svn提交报错解决方法
1.先clean 2.删除 .lock文件 3.update项目 4.先还原文件,然后update 接着commit
DeepFaceLab更新至2019.12.23
本次更新主要是增加了脸图样本生成器,一般来说我们提取脸图之后会放到aligned文件夹里面,训练的时候会加载这些脸图,若是图片少还行,一旦图片太多加载效率低不说,同样会影响了训练效率.现在好了,我们只 ...

spark streaming 2: DStream

spark streaming 2: DStream的更多相关文章

随机推荐

热门专题