Spark Streaming, Part 3: DStream Explained
DStream
1.1 Basics
1.1.1 Duration
Duration is Spark Streaming's time type; its unit is milliseconds.
It can be created in the following ways:
1) new Duration(milliseconds)
creates a Duration from a value in milliseconds;
2) Seconds(seconds)
creates a Duration from a value in seconds;
3) Minutes(minutes)
creates a Duration from a value in minutes;
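For reference, a minimal sketch of creating Durations and passing one to a StreamingContext as the batch interval (the master URL and app name are placeholder values):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Duration, Minutes, Seconds, StreamingContext}

val batchInterval = new Duration(5000) // 5000 ms
val windowLength  = Seconds(30)        // 30 s, same as Duration(30000)
val slideInterval = Minutes(1)         // 60 s, same as Duration(60000)

// the batch interval passed to StreamingContext is itself a Duration
val conf = new SparkConf().setMaster("local[2]").setAppName("DurationDemo")
val ssc  = new StreamingContext(conf, batchInterval)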
1.1.2 slideDuration
/** Time interval after which the DStream generates a RDD */
def slideDuration: Duration
slideDuration is the slide interval of the DStream: the time interval after which it generates a new RDD.
1.1.3 dependencies
/** List of parent DStreams on which this DStream depends on */
def dependencies: List[DStream[_]]
dependencies is the list of parent DStreams that this DStream depends on.
1.1.4 compute
/** Method that generates a RDD for the given time */
def compute(validTime: Time): Option[RDD[T]]
compute generates the RDD for a given time.
1.1.5 zeroTime
// Time zero for the DStream
private[streaming] var zeroTime: Time = null
zeroTime is the start ("zero") time of the DStream.
1.1.6 rememberDuration
// Duration for which the DStream will remember each RDD created
private[streaming] var rememberDuration: Duration = null
rememberDuration is the duration for which the DStream remembers (retains) each RDD it creates.
1.1.7 storageLevel
// Storage level of the RDDs in the stream
private[streaming] var storageLevel: StorageLevel = StorageLevel.NONE
storageLevel is the storage level of the RDDs generated by this DStream (StorageLevel.NONE by default).
1.1.8 parentRememberDuration
// Duration for which the DStream requires its parent DStream to remember each RDD created
private[streaming] def parentRememberDuration = rememberDuration
parentRememberDuration is the duration for which this DStream requires its parent DStreams to remember each RDD they create; by default it equals rememberDuration.
1.1.9 persist
/** Persist the RDDs of this DStream with the given storage level */
def persist(level: StorageLevel): DStream[T] = {
if (this.isInitialized) {
throw new UnsupportedOperationException(
"Cannot change storage level of a DStream after streaming context has started")
}
this.storageLevel = level
this
}
persist sets the storage level of the RDDs generated by this DStream; it can only be called before the streaming context has started.
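A usage sketch, assuming the ssc StreamingContext from the Duration example above; the socket host and port are placeholders:
import org.apache.spark.storage.StorageLevel

val lines = ssc.socketTextStream("localhost", 9999)
lines.persist(StorageLevel.MEMORY_ONLY_SER) // must be called before ssc.start()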
1.1.10 checkpoint
/**
* Enable periodic checkpointing of RDDs of this DStream
* @param interval Time interval after which generated RDD will be checkpointed
*/
def checkpoint(interval: Duration): DStream[T] = {
if (isInitialized) {
throw new UnsupportedOperationException(
"Cannot change checkpoint interval of a DStream after streaming context has started")
}
persist()
checkpointDuration = interval
this
}
checkpoint enables periodic checkpointing of this DStream's RDDs at the given interval; like persist, it must be called before the streaming context has started.
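A usage sketch, again assuming the imports, ssc and lines from the examples above; the checkpoint directory is a placeholder:
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint") // directory for checkpoint data
val words = lines.flatMap(_.split(" "))
words.checkpoint(Seconds(50)) // a multiple of the batch interval; 5-10x the slide interval is a common choice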
1.1.11 initialize
/**
* Initialize the DStream by setting the "zero" time, based on which
* the validity of future times is calculated. This method also recursively initializes
* its parent DStreams.
*/
private[streaming] def initialize(time: Time) {
if (zeroTime != null && zeroTime != time) {
throw new SparkException(s"ZeroTime is already initialized to $zeroTime"
+ s", cannot initialize it again to $time")
}
zeroTime = time

// Set the checkpoint interval to be slideDuration or 10 seconds, which ever is larger
if (mustCheckpoint && checkpointDuration == null) {
checkpointDuration = slideDuration * math.ceil(Seconds(10) / slideDuration).toInt
logInfo(s"Checkpoint interval automatically set to $checkpointDuration")
}

// Set the minimum value of the rememberDuration if not already set
var minRememberDuration = slideDuration
if (checkpointDuration != null && minRememberDuration <= checkpointDuration) {
// times 2 just to be sure that the latest checkpoint is not forgotten (#paranoia)
minRememberDuration = checkpointDuration * 2
}
if (rememberDuration == null || rememberDuration < minRememberDuration) {
rememberDuration = minRememberDuration
}

// Initialize the dependencies
dependencies.foreach(_.initialize(zeroTime))
}
initialize sets the DStream's "zero" time, against which the validity of later batch times is checked, and recursively initializes its parent DStreams. If checkpointing is required but no interval was set, the interval becomes the smallest multiple of slideDuration that is at least 10 seconds; for example, with a 3-second slide interval the checkpoint interval is set to 12 seconds. rememberDuration is also raised to at least twice the checkpoint interval.
1.1.12 getOrCompute
/**
* Get the RDD corresponding to the given time; either retrieve it from cache
* or compute-and-cache it.
*/
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
getOrCompute returns the RDD corresponding to the given time, either retrieving it from the cache of generated RDDs or computing and caching it.
1.1.13 generateJob
/**
* Generate a SparkStreaming job for the given time. This is an internal method that
* should not be called directly. This default implementation creates a job
* that materializes the corresponding RDD. Subclasses of DStream may override this
* to generate their own jobs.
*/
private[streaming] def generateJob(time: Time): Option[Job] = {
getOrCompute(time) match {
case Some(rdd) =>
val jobFunc = () => {
val emptyFunc = { (iterator: Iterator[T]) => {} }
context.sparkContext.runJob(rdd, emptyFunc)
}
Some(new Job(time, jobFunc))
case None => None
}
}
generateJob is an internal method that generates a Spark Streaming job for the given time; the default implementation simply materializes the corresponding RDD.
1.1.14 clearMetadata
/**
 * Clear metadata that are older than `rememberDuration` of this DStream.
 * This is an internal method that should not be called directly. This default
 * implementation clears the old generated RDDs. Subclasses of DStream may override
 * this to clear their own metadata along with the generated RDDs.
 */
private[streaming] def clearMetadata(time: Time) {
clearMetadata is an internal method that clears metadata (by default, the old generated RDDs) older than this DStream's rememberDuration.
1.1.15 updateCheckpointData
/**
 * Refresh the list of checkpointed RDDs that will be saved along with checkpoint of
 * this stream. This is an internal method that should not be called directly. This is
 * a default implementation that saves only the file names of the checkpointed RDDs to
 * checkpointData. Subclasses of DStream (especially those of InputDStream) may override
 * this method to save custom checkpoint data.
 */
private[streaming] def updateCheckpointData(currentTime: Time) {
updateCheckpointData is an internal method that refreshes the list of checkpointed RDDs to be saved along with this stream's checkpoint.
1.2 Basic DStream Operations
1.2.1 map
/** Return a new DStream by applying a function to all elements of this DStream. */
def map[U: ClassTag](mapFunc: T => U): DStream[U] = {
  new MappedDStream(this, context.sparkContext.clean(mapFunc))
}
map applies a function to every element of the DStream, just like RDD.map.
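A usage sketch, assuming lines is a DStream[String] such as the one created from socketTextStream above:
val upper   = lines.map(_.toUpperCase) // DStream[String] -> DStream[String]
val lengths = lines.map(_.length)      // DStream[String] -> DStream[Int]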
1.2.2 flatMap
/**
 * Return a new DStream by applying a function to all elements of this DStream,
 * and then flattening the results
 */
def flatMap[U: ClassTag](flatMapFunc: T => Traversable[U]): DStream[U] = {
  new FlatMappedDStream(this, context.sparkContext.clean(flatMapFunc))
}
flatMap applies a function to every element of the DStream and then flattens the results, just like RDD.flatMap.
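A usage sketch with the same lines stream; each input line expands to zero or more words:
val words = lines.flatMap(_.split(" ")) // DStream[String] -> DStream[String]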
1.2.3 filter
/** Return a new DStream containing only the elements that satisfy a predicate. */
def filter(filterFunc: T => Boolean): DStream[T] = new FilteredDStream(this, filterFunc)
filter keeps only the elements that satisfy a predicate, just like RDD.filter.
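A usage sketch with the same lines stream:
val errors = lines.filter(_.contains("ERROR")) // keep only lines mentioning ERROR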
1.2.4 glom
/**
 * Return a new DStream in which each RDD is generated by applying glom() to each RDD of
 * this DStream. Applying glom() to an RDD coalesces all elements within each partition into
 * an array.
 */
def glom(): DStream[Array[T]] = new GlommedDStream(this)
glom coalesces all elements within each partition of each RDD into a single array, just like RDD.glom.
1.2.5 repartition
/**
 * Return a new DStream with an increased or decreased level of parallelism. Each RDD in the
 * returned DStream has exactly numPartitions partitions.
 */
def repartition(numPartitions: Int): DStream[T] = this.transform(_.repartition(numPartitions))
repartition changes the number of partitions of each RDD in the DStream, just like RDD.repartition.
1.2.6 mapPartitions
/**
 * Return a new DStream in which each RDD is generated by applying mapPartitions() to each RDD
 * of this DStream. Applying mapPartitions() to an RDD applies a function to each partition
 * of the RDD.
 */
def mapPartitions[U: ClassTag](
    mapPartFunc: Iterator[T] => Iterator[U],
    preservePartitioning: Boolean = false
  ): DStream[U] = {
  new MapPartitionedDStream(this, context.sparkContext.clean(mapPartFunc), preservePartitioning)
}
mapPartitions applies a function to each partition of each RDD in the DStream, just like RDD.mapPartitions.
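A usage sketch with the same lines stream; the per-partition setup shown here (resolving the local host name) just stands in for any expensive resource you would rather create once per partition than once per element:
val tagged = lines.mapPartitions { iter =>
  val host = java.net.InetAddress.getLocalHost.getHostName // per-partition setup
  iter.map(line => s"$host: $line")
}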
1.2.7 reduce
/**
 * Return a new DStream in which each RDD has a single element generated by reducing each RDD
 * of this DStream.
 */
def reduce(reduceFunc: (T, T) => T): DStream[T] =
  this.map(x => (null, x)).reduceByKey(reduceFunc, 1).map(_._2)
reduce aggregates each RDD of the DStream into a single element with the given function, just like RDD.reduce.
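A usage sketch with the same lines stream; each batch is reduced to its longest line:
val longest = lines.reduce((a, b) => if (a.length >= b.length) a else b)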
1.2.8 count
/**
 * Return a new DStream in which each RDD has a single element generated by counting each RDD
 * of this DStream.
 */
def count(): DStream[Long] = {
  this.map(_ => (null, 1L))
      .transform(_.union(context.sparkContext.makeRDD(Seq((null, 0L)), 1)))
      .reduceByKey(_ + _)
      .map(_._2)
}
count returns, for each RDD of the DStream, a single-element RDD holding the number of elements, just like RDD.count.
1.2.9 countByValue
/**
 * Return a new DStream in which each RDD contains the counts of each distinct value in
 * each RDD of this DStream. Hash partitioning is used to generate
 * the RDDs with `numPartitions` partitions (Spark's default number of partitions if
 * `numPartitions` not specified).
 */
def countByValue(numPartitions: Int = ssc.sc.defaultParallelism)(implicit ord: Ordering[T] = null)
    : DStream[(T, Long)] =
  this.map(x => (x, 1L)).reduceByKey((x: Long, y: Long) => x + y, numPartitions)
countByValue counts the occurrences of each distinct value in each RDD, just like RDD.countByValue; hash partitioning is used to generate the resulting RDDs.
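A usage sketch building on the words stream from the flatMap example; each batch yields (word, count) pairs:
val wordCounts = words.countByValue() // DStream[(String, Long)]
wordCounts.print()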
1.2.10 foreachRDD
/**
 * Apply a function to each RDD in this DStream. This is an output operator, so
 * 'this' DStream will be registered as an output stream and therefore materialized.
 */
def foreachRDD(foreachFunc: (RDD[T], Time) => Unit) {
  // because the DStream is reachable from the outer object here, and because
  // DStreams can't be serialized with closures, we can't proactively check
  // it for serializability and so we pass the optional false to SparkContext.clean
  new ForEachDStream(this, context.sparkContext.clean(foreachFunc, false)).register()
}
foreachRDD applies a function to every RDD in the DStream; it is an output operation, so the DStream is registered as an output stream and will be materialized.
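A usage sketch with the words stream; the println stands in for a real write to an external system, which is normally done per partition so that connections are not created per element:
words.foreachRDD { (rdd, time) =>
  rdd.foreachPartition { partition =>
    // open a connection here, once per partition, then write each record
    partition.foreach(record => println(s"[$time] $record"))
  }
}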
1.2.11 transform
/**
 * Return a new DStream in which each RDD is generated by applying a function
 * on each RDD of 'this' DStream.
 */
def transform[U: ClassTag](transformFunc: RDD[T] => RDD[U]): DStream[U] = {
  // because the DStream is reachable from the outer object here, and because
  // DStreams can't be serialized with closures, we can't proactively check
  // it for serializability and so we pass the optional false to SparkContext.clean
  transform((r: RDD[T], t: Time) => context.sparkContext.clean(transformFunc(r), false))
}
transform returns a new DStream by applying an arbitrary RDD-to-RDD function to each RDD of this DStream.
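A usage sketch with the words stream; the blacklist RDD is a made-up example of static reference data that each batch is checked against:
val blacklist = ssc.sparkContext.parallelize(Seq("spam", "ads"))
val cleaned = words.transform { rdd =>
  rdd.subtract(blacklist) // arbitrary RDD-to-RDD code, re-evaluated every batch
}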
1.2.12 transformWith
/**
 * Return a new DStream in which each RDD is generated by applying a function
 * on each RDD of 'this' DStream and 'other' DStream.
 */
def transformWith[U: ClassTag, V: ClassTag](
    other: DStream[U], transformFunc: (RDD[T], RDD[U]) => RDD[V]
  ): DStream[V] = {
  // because the DStream is reachable from the outer object here, and because
  // DStreams can't be serialized with closures, we can't proactively check
  // it for serializability and so we pass the optional false to SparkContext.clean
  val cleanedF = ssc.sparkContext.clean(transformFunc, false)
  transformWith(other, (rdd1: RDD[T], rdd2: RDD[U], time: Time) => cleanedF(rdd1, rdd2))
}
transformWith returns a new DStream by applying a function to the corresponding RDDs of this DStream and another DStream.
1.2.13 print
/**
 * Print the first ten elements of each RDD generated in this DStream. This is an output
 * operator, so this DStream will be registered as an output stream and therefore materialized.
 */
def print() {
  def foreachFunc = (rdd: RDD[T], time: Time) => {
    val first11 = rdd.take(11)
    println("-------------------------------------------")
    println("Time: " + time)
    println("-------------------------------------------")
    first11.take(10).foreach(println)
    if (first11.size > 10) println("...")
    println()
  }
  new ForEachDStream(this, context.sparkContext.clean(foreachFunc)).register()
}
print outputs the first ten elements of each RDD in the DStream; it is an output operation.
1.2.14 window
/**
 * Return a new DStream in which each RDD contains all the elements seen in a
 * sliding window of time over this DStream. The new DStream generates RDDs with
 * the same interval as this DStream.
 * @param windowDuration width of the window; must be a multiple of this DStream's interval.
 */
def window(windowDuration: Duration): DStream[T] = window(windowDuration, this.slideDuration)
/**
 * Return a new DStream in which each RDD contains all the elements seen in a
 * sliding window of time over this DStream.
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 */
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = {
  new WindowedDStream(this, windowDuration, slideDuration)
}
window returns a windowed DStream over this DStream, given a window duration and a slide duration; both must be multiples of the batch interval.
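A usage sketch with the lines stream and the 5-second batch interval assumed above; both durations are multiples of the batch interval:
val windowed = lines.window(Seconds(30), Seconds(10)) // last 30 s, recomputed every 10 s
windowed.count().print()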
1.2.15 reduceByWindow
/**
 * Return a new DStream in which each RDD has a single element generated by reducing all
 * elements in a sliding window over this DStream.
 * @param reduceFunc     associative reduce function
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 */
def reduceByWindow(
    reduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = {
  this.reduce(reduceFunc).window(windowDuration, slideDuration).reduce(reduceFunc)
}
/**
 * Return a new DStream in which each RDD has a single element generated by reducing all
 * elements in a sliding window over this DStream. However, the reduction is done incrementally
 * using the old window's reduced value:
 * 1. reduce the new values that entered the window (e.g., adding new counts)
 * 2. "inverse reduce" the old values that left the window (e.g., subtracting old counts)
 * This is more efficient than reduceByWindow without "inverse reduce" function.
 * However, it is applicable to only "invertible reduce functions".
 * @param reduceFunc     associative reduce function
 * @param invReduceFunc  inverse reduce function
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 */
def reduceByWindow(
    reduceFunc: (T, T) => T,
    invReduceFunc: (T, T) => T,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[T] = {
  this.map(x => (1, x))
      .reduceByKeyAndWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration, 1)
      .map(_._2)
}
reduceByWindow reduces all elements in each sliding window with reduceFunc; the second overload also takes an inverse function, so the window result can be updated incrementally as elements enter and leave the window.
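A usage sketch with the lines stream: total characters seen in the last 60 seconds, updated every 10 seconds. The inverse-function overload used here maintains the window incrementally and requires checkpointing to be enabled (ssc.checkpoint(...), as set earlier):
val chars = lines.map(_.length.toLong)
val charsPerWindow = chars.reduceByWindow(_ + _, _ - _, Seconds(60), Seconds(10))
charsPerWindow.print()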
1.2.16 countByWindow
/**
 * Return a new DStream in which each RDD has a single element generated by counting the number
 * of elements in a sliding window over this DStream. Hash partitioning is used to generate
 * the RDDs with Spark's default number of partitions.
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 */
def countByWindow(windowDuration: Duration, slideDuration: Duration): DStream[Long] = {
  this.map(_ => 1L).reduceByWindow(_ + _, _ - _, windowDuration, slideDuration)
}
countByWindow counts the number of elements in each sliding window.
1.2.17 countByValueAndWindow
/**
 * Return a new DStream in which each RDD contains the count of distinct elements in
 * RDDs in a sliding window over this DStream. Hash partitioning is used to generate
 * the RDDs with `numPartitions` partitions (Spark's default number of partitions if
 * `numPartitions` not specified).
 * @param windowDuration width of the window; must be a multiple of this DStream's
 *                       batching interval
 * @param slideDuration  sliding interval of the window (i.e., the interval after which
 *                       the new DStream will generate RDDs); must be a multiple of this
 *                       DStream's batching interval
 * @param numPartitions  number of partitions of each RDD in the new DStream.
 */
def countByValueAndWindow(
    windowDuration: Duration,
    slideDuration: Duration,
    numPartitions: Int = ssc.sc.defaultParallelism)
    (implicit ord: Ordering[T] = null)
    : DStream[(T, Long)] =
{
  this.map(x => (x, 1L)).reduceByKeyAndWindow(
    (x: Long, y: Long) => x + y,
    (x: Long, y: Long) => x - y,
    windowDuration,
    slideDuration,
    numPartitions,
    (x: (T, Long)) => x._2 != 0L
  )
}
countByValueAndWindow counts the occurrences of each distinct value within each sliding window.
1.2.18 union
/**
 * Return a new DStream by unifying data of another DStream with this DStream.
 * @param that Another DStream having the same slideDuration as this DStream.
 */
def union(that: DStream[T]): DStream[T] = new UnionDStream[T](Array(this, that))
/**
 * Return all the RDDs defined by the Interval object (both end times included)
 */
def slice(interval: Interval): Seq[RDD[T]] = {
  slice(interval.beginTime, interval.endTime)
}
union merges this DStream with another DStream that has the same slide duration. (The slice(interval) overload shown above simply delegates to the slice method described next.)
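A usage sketch merging two sources with the same batch interval; the host names and port are placeholders:
val streamA = ssc.socketTextStream("host-a", 9999)
val streamB = ssc.socketTextStream("host-b", 9999)
val merged  = streamA.union(streamB)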
1.2.19 slice
/**
 * Return all the RDDs between 'fromTime' to 'toTime' (both included)
 */
def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = {
  if (!isInitialized) {
    throw new SparkException(this + " has not been initialized")
  }
  if (!(fromTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("fromTime (" + fromTime + ") is not a multiple of slideDuration ("
      + slideDuration + ")")
  }
  if (!(toTime - zeroTime).isMultipleOf(slideDuration)) {
    logWarning("toTime (" + fromTime + ") is not a multiple of slideDuration ("
      + slideDuration + ")")
  }
  val alignedToTime = toTime.floor(slideDuration)
  val alignedFromTime = fromTime.floor(slideDuration)
  logInfo("Slicing from " + fromTime + " to " + toTime +
    " (aligned to " + alignedFromTime + " and " + alignedToTime + ")")
  alignedFromTime.to(alignedToTime, slideDuration).flatMap(time => {
    if (time >= zeroTime) getOrCompute(time) else None
  })
}
slice returns the sequence of RDDs generated between fromTime and toTime (both included), with the boundaries aligned down to the slide duration.
1.2.20 saveAsObjectFiles
/**
 * Save each RDD in this DStream as a Sequence file of serialized objects.
 * The file name at each batch interval is generated based on `prefix` and
 * `suffix`: "prefix-TIME_IN_MS.suffix".
 */
def saveAsObjectFiles(prefix: String, suffix: String = "") {
  val saveFunc = (rdd: RDD[T], time: Time) => {
    val file = rddToFileName(prefix, suffix, time)
    rdd.saveAsObjectFile(file)
  }
  this.foreachRDD(saveFunc)
}
saveAsObjectFiles is an output operation that saves each RDD of the DStream as a SequenceFile of serialized objects; the path for each batch is generated from prefix and suffix as "prefix-TIME_IN_MS.suffix".
1.2.21 saveAsTextFiles
/**
 * Save each RDD in this DStream as a text file, using string representation
 * of elements. The file name at each batch interval is generated based on
 * `prefix` and `suffix`: "prefix-TIME_IN_MS.suffix".
 */
def saveAsTextFiles(prefix: String, suffix: String = "") {
  val saveFunc = (rdd: RDD[T], time: Time) => {
    val file = rddToFileName(prefix, suffix, time)
    rdd.saveAsTextFile(file)
  }
  this.foreachRDD(saveFunc)
}
/**
 * Register this streaming as an output stream. This would ensure that RDDs of this
 * DStream will be generated.
 */
private[streaming] def register(): DStream[T] = {
  ssc.graph.addOutputStream(this)
  this
}
}
saveAsTextFiles is an output operation that saves each RDD of the DStream as a text file; the path for each batch is generated from prefix and suffix as "prefix-TIME_IN_MS.suffix". The register method shown above registers a DStream as an output stream so that its RDDs are actually generated.
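A usage sketch with the words stream; the output path prefix is a placeholder, and each batch writes its own directory named prefix-TIME_IN_MS.txt:
words.saveAsTextFiles("hdfs:///tmp/words", "txt")
ssc.start()
ssc.awaitTermination()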
Please credit the original source when reposting:
http://blog.csdn.net/sunbow0/article/details/43091247
我们的EasyDarwin目前部署在阿里云的服务器上面,运行的效果是非常好的,而且无论是以TCP方式.还是UDP的方式推送,都可以非常好地进行直播转发: 但并不是所有的用户服务器都是阿里云的形式,有很 ...