StreamingContext plays much the same role for Spark Streaming that SparkContext plays for core Spark: it is the entry point of a streaming application, carrying the configuration and providing the methods that create DStreams.
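Before going into the source, a minimal usage sketch may help (the master URL, app name, host, and port below are placeholders, not anything from the code under discussion): a StreamingContext is built from a SparkConf plus a batch Duration, DStreams are declared against it, and nothing actually runs until start() is called.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingEntryDemo {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one for the receiver, one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-entry-demo")

    // The StreamingContext wraps (or creates) a SparkContext and fixes the
    // batch interval that the DStreamGraph will use.
    val ssc = new StreamingContext(conf, Seconds(5))

    // DStream creation and transformations are only declarations at this point.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()             // kick off the JobScheduler
    ssc.awaitTermination()  // block until stop() or a failure
  }
}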

Overall, Spark Streaming consists of the following modules:


/**
* Main entry point for Spark Streaming functionality. It provides methods used to create
* [[org.apache.spark.streaming.dstream.DStream]]s from various input sources. It can be either
* created by providing a Spark master URL and an appName, or from a org.apache.spark.SparkConf
* configuration (see core Spark documentation), or from an existing org.apache.spark.SparkContext.
* The associated SparkContext can be accessed using `context.sparkContext`. After
* creating and transforming DStreams, the streaming computation can be started and stopped
* using `context.start()` and `context.stop()`, respectively.
* `context.awaitTermination()` allows the current thread to wait for the termination
* of the context by `stop()` or by an exception.
*/
class StreamingContext private[streaming] (
    sc_ : SparkContext,
    cp_ : Checkpoint,
    batchDur_ : Duration
  ) extends Logging {
An important property: DStreamGraph is something like a slimmed-down DAG scheduler. For each batch interval it is responsible for generating that interval's JobSet, and it also serializes the DStreams according to their dependency relationships when checkpointing. (A simplified, non-Spark sketch of the job-generation idea follows the snippet below.)
  private[streaming] val graph: DStreamGraph = {
    if (isCheckpointPresent) {
      cp_.graph.setContext(this)
      cp_.graph.restoreCheckpointData()
      cp_.graph
    } else {
      assert(batchDur_ != null, "Batch duration for streaming context cannot be null")
      val newGraph = new DStreamGraph()
      newGraph.setBatchDuration(batchDur_)
      newGraph
    }
  }
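To make the "simplified DAG scheduler" analogy concrete, here is a tiny, hypothetical model of the job-generation loop. This is an illustration only, not Spark's actual DStreamGraph code; every name in it is made up.

object GraphSketch {
  case class Time(millis: Long)
  case class Job(time: Time, body: () => Unit)

  // Stand-in for an output DStream that knows how to build a job for a given batch time.
  trait OutputStream {
    def generateJob(time: Time): Option[Job]
  }

  // Stand-in for DStreamGraph: for every batch time it collects one job per
  // registered output stream into the "JobSet" of that interval.
  class Graph(outputs: Seq[OutputStream]) {
    def generateJobs(time: Time): Seq[Job] = outputs.flatMap(_.generateJob(time))
  }

  def main(args: Array[String]): Unit = {
    val out = new OutputStream {
      def generateJob(time: Time) = Some(Job(time, () => println(s"batch at ${time.millis} ms")))
    }
    new Graph(Seq(out)).generateJobs(Time(5000L)).foreach(_.body())
  }
}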

private val nextReceiverInputStreamId = new AtomicInteger(0)

The JobScheduler, which generates and runs the streaming jobs:
private[streaming] val scheduler = new JobScheduler(this)

private[streaming] val waiter = new ContextWaiter

private[streaming] val progressListener = new StreamingJobProgressListener(this)

addStreamingListener registers a listener on the scheduler's listener bus. A StreamingListener receives system events about the streaming application (batch and receiver status events); it does not process the InputDStream data itself.
  /** Add a [[org.apache.spark.streaming.scheduler.StreamingListener]] object for
    * receiving system events related to streaming.
    */
  def addStreamingListener(streamingListener: StreamingListener) {
    scheduler.listenerBus.addListener(streamingListener)
  }
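For example, a listener that logs the processing time of every completed batch might look like the sketch below; the class name is made up, while the callback and the BatchInfo fields come from the StreamingListener API.

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs how long each completed batch took to process.
class BatchTimingListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"Batch ${info.batchTime} processed in ${info.processingDelay.getOrElse(-1L)} ms")
  }
}

// Usage (ssc is an existing StreamingContext):
// ssc.addStreamingListener(new BatchTimingListener)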
Starting the scheduler:
  /**
   * Start the execution of the streams.
   *
   * @throws SparkException if the context has already been started or stopped.
   */
  def start(): Unit = synchronized {
    if (state == Started) {
      throw new SparkException("StreamingContext has already been started")
    }
    if (state == Stopped) {
      throw new SparkException("StreamingContext has already been stopped")
    }
    validate()
    sparkContext.setCallSite(DStream.getCreationSite())
    scheduler.start()
    state = Started
  }
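From the caller's side, these state checks translate into the usual lifecycle shown below — a sketch that assumes ssc is a fully configured StreamingContext with at least one output operation declared.

ssc.start()                // Initialized -> Started; a second start() would throw SparkException
try {
  ssc.awaitTermination()   // block this thread until stop() is called or a job fails
} finally {
  // Stop the streaming context (and here the underlying SparkContext too),
  // letting in-flight batches finish first.
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}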

Next, the various InputDStream implementations; they will be examined in more detail later.
class PluggableInputDStream[T: ClassTag](
    @transient ssc_ : StreamingContext,
    receiver: Receiver[T]) extends ReceiverInputDStream[T](ssc_) {

  def getReceiver(): Receiver[T] = {
    receiver
  }
}
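PluggableInputDStream is what StreamingContext.receiverStream(...) returns: it simply hands back whatever Receiver the user supplies. A toy custom receiver could look like the sketch below; the class name and the constant "data source" are made up for illustration.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A receiver that just emits the same string once a second.
class ConstantReceiver(value: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  override def onStart(): Unit = {
    // Receivers do their work on their own thread and push records in via store().
    new Thread("constant-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store(value)
          Thread.sleep(1000)
        }
      }
    }.start()
  }

  override def onStop(): Unit = { /* the loop above checks isStopped() */ }
}

// Usage: ssc.receiverStream(new ConstantReceiver("ping")) yields a PluggableInputDStream[String].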

class SocketInputDStream[T: ClassTag](
    @transient ssc_ : StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](ssc_) {

  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}
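In user code SocketInputDStream is normally created through socketTextStream, which plugs in a bytes-to-lines converter as the bytesToObjects function. A small word-count sketch, assuming ssc is an existing StreamingContext and the host/port are placeholders:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations (pre-1.3 implicits)

val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
val counts = lines.flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()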

/**
* An input stream that reads blocks of serialized objects from a given network address.
* The blocks will be inserted directly into the block store. This is the fastest way to get
* data into Spark Streaming, though it requires the sender to batch data and serialize it
* in the format that the system is configured with.
*/
private[streaming]
class RawInputDStream[T: ClassTag](
    @transient ssc_ : StreamingContext,
    host: String,
    port: Int,
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](ssc_) with Logging {

  def getReceiver(): Receiver[T] = {
    new RawNetworkReceiver(host, port, storageLevel).asInstanceOf[Receiver[T]]
  }
}
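The public entry point for this class is StreamingContext.rawSocketStream; as the doc comment above says, the sender must push data already batched and serialized in the format Spark is configured with. A usage sketch, with placeholder host and port:

import org.apache.spark.storage.StorageLevel

// Blocks received here go straight into the block store without per-record deserialization on receipt.
val raw = ssc.rawSocketStream[String]("localhost", 9998, StorageLevel.MEMORY_ONLY_SER)
raw.count().print()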


/**
* Create a input stream that monitors a Hadoop-compatible filesystem
* for new files and reads them using the given key-value types and input format.
* Files must be written to the monitored directory by "moving" them from another
* location within the same file system. File names starting with . are ignored.
* @param directory HDFS directory to monitor for new file
* @tparam K Key type for reading HDFS file
* @tparam V Value type for reading HDFS file
* @tparam F Input format for reading HDFS file
*/
  def fileStream[
    K: ClassTag,
    V: ClassTag,
    F <: NewInputFormat[K, V]: ClassTag
  ] (directory: String): InputDStream[(K, V)] = {
    new FileInputDStream[K, V, F](this, directory)
  }
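Called with explicit Hadoop types it looks like the sketch below; textFileStream is the common shorthand for exactly this LongWritable/Text/TextInputFormat combination. The directory path is a placeholder and must live on a Hadoop-compatible filesystem the application can monitor.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Monitor a directory for new files and read them as (offset, line) pairs.
val files = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs:///data/incoming")
val lines = files.map { case (_, text) => text.toString }

// Equivalent shorthand:
val sameLines = ssc.textFileStream("hdfs:///data/incoming")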


private[streaming]
class TransformedDStream[U: ClassTag] (
    parents: Seq[DStream[_]],
    transformFunc: (Seq[RDD[_]], Time) => RDD[U]
  ) extends DStream[U](parents.head.ssc) {

  require(parents.length > 0, "List of DStreams to transform is empty")
  require(parents.map(_.ssc).distinct.size == 1, "Some of the DStreams have different contexts")
  require(parents.map(_.slideDuration).distinct.size == 1,
    "Some of the DStreams have different slide durations")

  override def dependencies = parents.toList

  override def slideDuration: Duration = parents.head.slideDuration

  override def compute(validTime: Time): Option[RDD[U]] = {
    val parentRDDs = parents.map(_.getOrCompute(validTime).orNull).toSeq
    Some(transformFunc(parentRDDs, validTime))
  }
}
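TransformedDStream is what DStream.transform / transformWith (and StreamingContext.transform, for several parent streams) build under the hood: the user-supplied function sees each batch's RDD plus the batch Time and returns a new RDD. A small sketch, assuming lines is an existing DStream[String]:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// Tag every record with the timestamp of the batch it arrived in.
val tagged = lines.transform { (rdd: RDD[String], time: Time) =>
  rdd.map(line => s"[batch ${time.milliseconds}] $line")
}
tagged.print()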







