The inheritance hierarchy of InputDStream: every input stream is operated on through the interface of the abstract class InputDStream. Pay special attention to ReceiverInputDStream, since most of the time it is the base class we extend: it is the one that (more easily) lets the work of receiving data be spread across the workers, which fits the distributed-computing model better.
All input streams save incoming data into Spark memory as blocks at a fixed interval, but unlike Spark Core, Spark Streaming by default serializes objects before storing them in memory.
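As a quick illustration of the serialized-by-default behaviour, a minimal sketch (local master; the localhost:9999 socket source is a placeholder): the built-in socket stream is a ReceiverInputDStream whose default storage level is StorageLevel.MEMORY_AND_DISK_SER_2, and the level can be overridden if deserialized storage is preferred.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object InputDStreamDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("InputDStreamDemo")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Default: received data is stored as serialized blocks
    // (StorageLevel.MEMORY_AND_DISK_SER_2 for socketTextStream).
    val lines = ssc.socketTextStream("localhost", 9999)

    // To keep deserialized objects in memory instead, pass a storage level explicitly:
    // ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY)

    lines.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}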



/**
 * This is the abstract base class for all input streams. This class provides methods
 * start() and stop() which is called by Spark Streaming system to start and stop receiving data.
 * Input streams that can generate RDDs from new data by running a service/thread only on
 * the driver node (that is, without running a receiver on worker nodes), can be
 * implemented by directly inheriting this InputDStream. For example,
 * FileInputDStream, a subclass of InputDStream, monitors a HDFS directory from the driver for
 * new files and generates RDDs with the new files. For implementing input streams
 * that requires running a receiver on the worker nodes, use
 * [[org.apache.spark.streaming.dstream.ReceiverInputDStream]] as the parent class.
 *
 * @param ssc_ Streaming context that will execute this input stream
 */
abstract class InputDStream[T: ClassTag] (@transient ssc_ : StreamingContext)
  extends DStream[T](ssc_) {

  private[streaming] var lastValidTime: Time = null

  ssc.graph.addInputStream(this)
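To make the "driver-only" case in the scaladoc concrete, here is a sketch of a hypothetical InputDStream subclass (the RangeInputDStream name and behaviour are invented for illustration): it needs no receiver, so start()/stop() are empty and compute() builds each batch's RDD directly on the driver. This is essentially the pattern FileInputDStream and QueueInputDStream follow.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

// Hypothetical driver-only input stream: every batch simply parallelizes the
// numbers 1..count on the driver. No receiver runs on any worker.
class RangeInputDStream(ssc_ : StreamingContext, count: Int)
  extends InputDStream[Int](ssc_) {

  override def start(): Unit = { }   // no background service to start
  override def stop(): Unit = { }    // nothing to clean up

  override def compute(validTime: Time): Option[RDD[Int]] = {
    // Build this batch's RDD directly from driver-side data.
    Some(context.sparkContext.parallelize(1 to count))
  }
}

It can then be used like any other stream, e.g. new RangeInputDStream(ssc, 100).count().print().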

/**
 * Abstract class for defining any [[org.apache.spark.streaming.dstream.InputDStream]]
 * that has to start a receiver on worker nodes to receive external data.
 * Specific implementations of NetworkInputDStream must
 * define the `getReceiver()` function that gets the receiver object of type
 * [[org.apache.spark.streaming.receiver.Receiver]] that will be sent
 * to the workers to receive data.
 * @param ssc_ Streaming context that will execute this input stream
 * @tparam T Class type of the object of this stream
 */
abstract class ReceiverInputDStream[T: ClassTag](@transient ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {

  /** Keeps all received blocks information */
  private lazy val receivedBlockInfo = new HashMap[Time, Array[ReceivedBlockInfo]]

  /** This is an unique identifier for the network input stream. */
  val id = ssc.getNewReceiverStreamId()

  /**
   * Gets the receiver object that will be sent to the worker nodes
   * to receive data. This method needs to defined by any specific implementation
   * of a NetworkInputDStream.
   */
  def getReceiver(): Receiver[T]
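By contrast with the driver-only case, a ReceiverInputDStream subclass only has to say which Receiver to ship to the workers. A minimal sketch, with the CountingReceiver/CountingInputDStream names and behaviour invented for illustration:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver that pushes one counter value per second from a worker.
// store() hands data to Spark, which writes it into blocks in executor memory.
class CountingReceiver extends Receiver[Long](StorageLevel.MEMORY_AND_DISK_SER_2) {

  override def onStart(): Unit = {
    new Thread("counting-receiver") {
      override def run(): Unit = {
        var i = 0L
        while (!isStopped()) {
          store(i)          // the stored items become blocks for the current batch
          i += 1
          Thread.sleep(1000)
        }
      }
    }.start()
  }

  override def onStop(): Unit = { }  // the receiving thread checks isStopped() itself
}

// The DStream side only has to say which receiver gets shipped to the workers.
class CountingInputDStream(ssc_ : StreamingContext)
  extends ReceiverInputDStream[Long](ssc_) {
  override def getReceiver(): Receiver[Long] = new CountingReceiver
}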
In the end, every batch of received data is returned as a BlockRDD:
  /** Ask ReceiverInputTracker for received data blocks and generates RDDs with them. */
  override def compute(validTime: Time): Option[RDD[T]] = {
    // If this is called for any time before the start time of the context,
    // then this returns an empty RDD. This may happen when recovering from a
    // master failure
    if (validTime >= graph.startTime) {
      val blockInfo = ssc.scheduler.receiverTracker.getReceivedBlockInfo(id)
      receivedBlockInfo(validTime) = blockInfo
      val blockIds = blockInfo.map(_.blockId.asInstanceOf[BlockId])
      Some(new BlockRDD[T](ssc.sc, blockIds))
    } else {
      Some(new BlockRDD[T](ssc.sc, Array[BlockId]()))
    }
  }
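The practical takeaway: for a receiver stream each batch is a BlockRDD with one partition per received block, so the number of blocks written during a batch (roughly the batch interval divided by spark.streaming.blockInterval, 200 ms by default, per receiver) determines the parallelism of the first stage. A small way to observe this from user code, assuming lines is the socket stream sketched earlier:

// Each batch of a receiver stream arrives as a BlockRDD; its partition count
// equals the number of blocks received during that batch interval.
lines.foreachRDD { (rdd, time) =>
  println(s"batch $time -> ${rdd.getClass.getSimpleName} with ${rdd.partitions.length} partition(s)")
}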