Spark Streaming Source Code Analysis
Spark Streaming analysis (based on the 1.5 source code)
Introduction to Spark Streaming
Spark Streaming is a micro-batch stream processing framework whose core execution engine is Spark. It is well suited to workloads that mix real-time data with historical data. Its processing flow is:
1. Receive the live input stream and persist it.
2. Slice the stream into batches by time interval.
3. Treat each batch of data as an RDD and process it with RDD operations.
4. Generate a Spark job for each batch, submit it to Spark, and return the result.

Introduction to DStream
DStream is the key programming abstraction in Spark Streaming. It represents a continuous stream of data, either received from a source or produced by transforming another stream. A DStream is backed by a continuous sequence of RDDs.

Two kinds of operations can be applied to a DStream (as with RDDs): transformations and output operations.
The dependencies formed by transformations between DStreams are kept in the DStreamGraph (initialized when the StreamingContext is created), and the DStreamGraph periodically generates RDD DAGs from them.
A Spark Streaming Application
Taking WordCount as an example, a Spark Streaming application looks like this:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("wordCount").setMaster("local[4]")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(10))   // 10-second batch interval
val lines = ssc.socketTextStream("localhost", 8585, StorageLevel.MEMORY_ONLY)
val words = lines.flatMap(_.split(" ")).map(w => (w, 1))
val wordCount = words.reduceByKey(_ + _)
wordCount.print()
ssc.start()
ssc.awaitTermination()   // keep the application running until it is stopped
Analysis of the Spark Streaming Execution Process
The execution process is analyzed below by following the WordCount application from the first part.
StreamingContext Initialization
StreamingContext is the entry point to most streaming functionality; for example, it provides the methods that create DStreams from various data sources. As the WordCount application shows, a streaming application first creates a StreamingContext.
Creating the StreamingContext creates the following main components:
1. The DStreamGraph is created and given the batch interval at which DStreams are turned into RDD graphs.
private[streaming] val graph: DStreamGraph = {
  if (isCheckpointPresent) {
    cp_.graph.setContext(this)
    cp_.graph.restoreCheckpointData()
    cp_.graph
  } else {
    require(batchDur_ != null, "Batch duration for StreamingContext cannot be null")
    val newGraph = new DStreamGraph()
    newGraph.setBatchDuration(batchDur_)
    newGraph
  }
}
2. The JobScheduler is created:
private[streaming] val scheduler = new JobScheduler(this)
DStream Creation and Transformation
The newly created StreamingContext is used to create a SocketInputDStream (a ReceiverInputDStream) by calling its socketTextStream method.
The InputDStream hierarchy (taking SocketInputDStream and KafkaInputDStream as examples) is: DStream <- InputDStream <- ReceiverInputDStream <- SocketInputDStream / KafkaInputDStream.

On the JVM, constructing a subclass first initializes its parent class, so creating a SocketInputDStream first initializes InputDStream, whose constructor adds the stream itself to the DStreamGraph.
Every InputDStream subclass provides a getReceiver method that returns the Receiver to use. SocketInputDStream, for example, creates a SocketReceiver to receive the data.
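For reference, SocketInputDStream looks roughly like this in the 1.5 source (reproduced slightly abridged); note how getReceiver simply builds a SocketReceiver from the constructor arguments:

private[streaming]
class SocketInputDStream[T: ClassTag](
    ssc_ : StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](ssc_) {

  // Called on the driver by the ReceiverTracker to obtain the receiver to run on an executor
  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}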
DStream operators, like RDD operators, are lazily evaluated. When the application above reaches print, an output operator, the DStream is wrapped in a ForEachDStream whose register method adds it to the outputStreams list of the DStreamGraph.
/**
 * Print the first num elements of each RDD generated in this DStream. This is an output
 * operator, so this DStream will be registered as an output stream and there materialized.
 */
def print(num: Int): Unit = ssc.withScope {
  def foreachFunc: (RDD[T], Time) => Unit = {
    (rdd: RDD[T], time: Time) => {
      val firstNum = rdd.take(num + 1)
      // scalastyle:off println
      println("-------------------------------------------")
      println("Time: " + time)
      println("-------------------------------------------")
      firstNum.take(num).foreach(println)
      if (firstNum.length > num) println("...")
      println()
      // scalastyle:on println
    }
  }
  new ForEachDStream(this, context.sparkContext.clean(foreachFunc)).register()
}
What distinguishes ForEachDStream from other DStreams is that it overrides the generateJob method.
The transformation relationships between DStreams are expressed with dependencies, much like RDD lineage; a concrete example follows below.
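As an example of such a dependency, MappedDStream (produced by the map call in the WordCount application) looks roughly like this in the 1.5 source: it keeps its parent as its only dependency and applies mapFunc to the parent's RDD for each batch.

private[streaming]
class MappedDStream[T: ClassTag, U: ClassTag](
    parent: DStream[T],
    mapFunc: T => U
  ) extends DStream[U](parent.ssc) {

  // The parent DStream is the only dependency, analogous to an RDD's lineage
  override def dependencies: List[DStream[_]] = List(parent)

  override def slideDuration: Duration = parent.slideDuration

  // For each batch, ask the parent for its RDD and apply mapFunc to it
  override def compute(validTime: Time): Option[RDD[U]] = {
    parent.getOrCompute(validTime).map(_.map[U](mapFunc))
  }
}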
Startup Process
The application starts the streaming computation by calling ssc.start(). The start method is implemented as follows:
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()
          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}
The core of this method is scheduler.start(), where scheduler is the JobScheduler object created when the StreamingContext was instantiated, as described above. (The analysis below follows the call flow.)
JobScheduler Creation and Execution
The JobScheduler schedules jobs to run on Spark. It uses a JobGenerator to generate the jobs and then runs them in parallel by submitting them to a thread pool.
1. JobScheduler creation
The JobScheduler is created by the StreamingContext, which later calls its start method.
When the JobScheduler is initialized, it creates a thread pool (jobExecutor) and a jobGenerator, where:
jobExecutor is used to run submitted jobs. The number of threads in the pool is the job concurrency, configured by "spark.streaming.concurrentJobs" (default 1).
jobGenerator is a JobGenerator instance used to generate jobs from the DStreams. (A sketch of how both are created follows below.)
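Roughly, the two members are declared like this in the 1.5 JobScheduler (reproduced from memory, slightly abridged):

private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
private val jobExecutor =
  ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")
private val jobGenerator = new JobGenerator(this)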
2. JobScheduler execution
When start is called, it creates and starts the following services:
eventLoop: an EventLoop[JobSchedulerEvent] used to receive and process events. Callers post events to its queue via the post method. When the EventLoop starts, it launches a daemon thread that processes the queued events. EventLoop is an abstract class; when the JobScheduler instantiates it, it implements onReceive so that received events are handled by processEvent(event). (A simplified sketch of EventLoop appears after this list.)
receiverTracker: a ReceiverTracker that manages the execution of the receivers of the ReceiverInputDStreams. It must be created after all input streams have been added to the DStreamGraph, because its constructor extracts the InputDStreams from the graph so that their receivers can be obtained when it starts.
jobGenerator: created when the JobScheduler was instantiated, and started here.
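EventLoop itself lives in Spark core; the following is a minimal sketch of the idea only (simplified, not the actual implementation): events are queued by post() and consumed by a single daemon thread that dispatches each one to onReceive().

import java.util.concurrent.LinkedBlockingDeque

abstract class SimpleEventLoop[E](name: String) {
  private val eventQueue = new LinkedBlockingDeque[E]()
  @volatile private var stopped = false

  private val eventThread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      while (!stopped) {
        val event = eventQueue.take()
        try {
          onReceive(event)
        } catch {
          case t: Throwable => onError(t)
        }
      }
    }
  }

  def start(): Unit = eventThread.start()
  def post(event: E): Unit = eventQueue.put(event)   // callers enqueue events here
  def stop(): Unit = { stopped = true; eventThread.interrupt() }

  protected def onReceive(event: E): Unit            // subclasses decide how an event is handled
  protected def onError(e: Throwable): Unit
}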
This part of the code is shown below:
def start(): Unit = synchronized {
  if (eventLoop != null) return // scheduler has already been started
  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)
    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()
  // attach rate controllers of input streams to receive batch completion updates
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)
  listenerBus.start(ssc.sparkContext)
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)
  receiverTracker.start()
  jobGenerator.start()
  logInfo("Started JobScheduler")
}
When the JobScheduler starts, it mainly instantiates and starts the components above. The rest of the process is split into two parts and described component by component.
1. The data-receiving path (starting receivers, receiving data, generating blocks)
ReceiverTracker
Created in the JobScheduler, which also calls its start method.
1. ReceiverTracker creation
When the ReceiverTracker is created, it sets up the following main attributes:
receiverInputStreams: the InputDStream instances that receive data, obtained via ssc.graph.getReceiverInputStreams; it holds the ReceiverInputDStream instances and their subclasses (SocketInputDStream, RawInputDStream, FlumePollingInputDStream, KafkaInputDStream, MQTTInputDStream, and so on).
receivedBlockTracker: a ReceivedBlockTracker instance that records the blocks received by the receivers. Operations that go through this class can be written to a write-ahead log (WAL) so that they can be recovered after a failure.
schedulingPolicy: a ReceiverSchedulingPolicy instance used to schedule the receivers.
receiverTrackingInfos: a HashMap holding receiver information keyed by receiverId; it may only be accessed from the ReceiverTrackerEndpoint.
2. Executing the start method
When start is called, it first checks whether receiverInputStreams is empty. If it is, there are no receivers and nothing is done. Otherwise it creates a ReceiverTrackerEndpoint instance, endpoint.
ReceiverTrackerEndpoint is an RPC endpoint that receives the messages sent by the receivers, processes them, and replies when necessary.
On the first start, launching the receivers is not skipped (skipReceiverLaunch = false by default, set when the ReceiverTracker is created), so launchReceivers() is executed.
This part of the code is shown below:
/** Start the endpoint and receiver execution thread. */
def start(): Unit = synchronized {
  if (isTrackerStarted) {
    throw new SparkException("ReceiverTracker already started")
  }
  if (!receiverInputStreams.isEmpty) {
    endpoint = ssc.env.rpcEnv.setupEndpoint(
      "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
    if (!skipReceiverLaunch) launchReceivers()
    logInfo("ReceiverTracker started")
    trackerState = Started
  }
}
The launchReceivers method calls runDummySparkJob() and then sends a StartAllReceivers message to the ReceiverTrackerEndpoint.
/**
 * Get the receivers from the ReceiverInputDStreams, distributes them to the
 * worker nodes as a parallel collection, and runs them.
 */
private def launchReceivers(): Unit = {
  val receivers = receiverInputStreams.map(nis => {
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  })
  runDummySparkJob()
  logInfo("Starting " + receivers.length + " receivers")
  endpoint.send(StartAllReceivers(receivers))
}
The runDummySparkJob() method checks that slave nodes have registered, to avoid scheduling all the receivers on the same node. It runs a trivial job through Spark core and then uses the resulting executor information to determine whether any executors other than the driver exist. The runDummySparkJob code is:
/**
 * Run the dummy Spark job to ensure that all slaves have registered. This avoids all the
 * receivers to be scheduled on the same node.
 *
 * TODO Should poll the executor number and wait for executors according to
 * "spark.scheduler.minRegisteredResourcesRatio" and
 * "spark.scheduler.maxRegisteredResourcesWaitingTime" rather than running a dummy job.
 */
private def runDummySparkJob(): Unit = {
  if (!ssc.sparkContext.isLocal) {
    ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
  }
  assert(getExecutors.nonEmpty)
}
After checking slave registration, it extracts all the receivers from the receiverInputStreams list and sends them to the endpoint in a StartAllReceivers message.
ReceiverTrackerEndpoint
It receives and handles messages sent by the receivers as well as by itself, replying when necessary. The messages it handles are:
StartAllReceivers, RestartReceiver, CleanupOldBlocks, UpdateReceiverRateLimit, ReportError, as well as RegisterReceiver, AddBlock, DeregisterReceiver, AllReceiverIds, and StopAllReceivers.
When the ReceiverTrackerEndpoint receives StartAllReceivers, it uses the scheduling policy schedulingPolicy (described above) to compute a receiver placement (picking the executors that may run each receiver) and then starts each receiver by calling startReceiver.
case StartAllReceivers(receivers) =>
  val scheduledExecutors = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
  for (receiver <- receivers) {
    val executors = scheduledExecutors(receiver.streamId)
    updateReceiverScheduledExecutors(receiver.streamId, executors)
    receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
    startReceiver(receiver, executors)
  }
The code for this is shown above. schedulingPolicy and startReceiver are explained in the next subsections.
ReceiverSchedulingPolicy: the scheduling policy
This class schedules receivers and tries to distribute them evenly. Receiver scheduling happens in two phases: global scheduling, when the ReceiverTracker starts and all receivers are scheduled at once via scheduleReceivers; and local scheduling, which happens when a single receiver is restarted and goes through scheduleReceiver. Continuing from the previous subsection, we look at scheduleReceivers.
The ReceiverTrackerEndpoint triggers receiver scheduling by calling schedulingPolicy.scheduleReceivers(receivers, getExecutors), which takes two arguments: the receivers, carried by the message, and the list of executors available for scheduling, obtained from the getExecutors method. The returned executors exclude any running on the same host as the driver, and each executor is identified as "host:port".
When scheduleReceivers runs, it first converts the executors list into a map from host to the "host:port" entries on that host. It then iterates over the receivers and assigns executors to them one by one, as follows:
It takes a receiver from the list and checks whether it has a preferredLocation (a preferred host). preferredLocation is declared in the Receiver base class for subclasses to implement. The SocketReceiver in our example does not override it, so it has no preferredLocation and the executor on which to run it can be chosen freely.
This part of the code follows:
/**
 * Try our best to schedule receivers with evenly distributed. However, if the
 * `preferredLocation`s of receivers are not even, we may not be able to schedule them evenly
 * because we have to respect them.
 *
 * Here is the approach to schedule executors:
 * <ol>
 *   <li>First, schedule all the receivers with preferred locations (hosts), evenly among the
 *       executors running on those hosts.</li>
 *   <li>Then, schedule all other receivers evenly among all the executors such that overall
 *       distribution over all the receivers is even.</li>
 * </ol>
 *
 * This method is called when we start to launch receivers at the first time.
 */
def scheduleReceivers(
    receivers: Seq[Receiver[_]], executors: Seq[String]): Map[Int, Seq[String]] = {
  if (receivers.isEmpty) {
    return Map.empty
  }
  if (executors.isEmpty) {
    return receivers.map(_.streamId -> Seq.empty).toMap
  }
  val hostToExecutors = executors.groupBy(_.split(":")(0))
  val scheduledExecutors = Array.fill(receivers.length)(new mutable.ArrayBuffer[String])
  val numReceiversOnExecutor = mutable.HashMap[String, Int]()
  // Set the initial value to 0
  executors.foreach(e => numReceiversOnExecutor(e) = 0)

  // Firstly, we need to respect "preferredLocation". So if a receiver has "preferredLocation",
  // we need to make sure the "preferredLocation" is in the candidate scheduled executor list.
  for (i <- 0 until receivers.length) {
    // Note: preferredLocation is host but executors are host:port
    receivers(i).preferredLocation.foreach { host =>
      hostToExecutors.get(host) match {
        case Some(executorsOnHost) =>
          // preferredLocation is a known host. Select an executor that has the least receivers in
          // this host
          val leastScheduledExecutor =
            executorsOnHost.minBy(executor => numReceiversOnExecutor(executor))
          scheduledExecutors(i) += leastScheduledExecutor
          numReceiversOnExecutor(leastScheduledExecutor) =
            numReceiversOnExecutor(leastScheduledExecutor) + 1
        case None =>
          // preferredLocation is an unknown host.
          // Note: There are two cases:
          // 1. This executor is not up. But it may be up later.
          // 2. This executor is dead, or it's not a host in the cluster.
          // Currently, simply add host to the scheduled executors.
          scheduledExecutors(i) += host
      }
    }
  }

  // For those receivers that don't have preferredLocation, make sure we assign at least one
  // executor to them.
  for (scheduledExecutorsForOneReceiver <- scheduledExecutors.filter(_.isEmpty)) {
    // Select the executor that has the least receivers
    val (leastScheduledExecutor, numReceivers) = numReceiversOnExecutor.minBy(_._2)
    scheduledExecutorsForOneReceiver += leastScheduledExecutor
    numReceiversOnExecutor(leastScheduledExecutor) = numReceivers + 1
  }

  // Assign idle executors to receivers that have less executors
  val idleExecutors = numReceiversOnExecutor.filter(_._2 == 0).map(_._1)
  for (executor <- idleExecutors) {
    // Assign an idle executor to the receiver that has least candidate executors.
    val leastScheduledExecutors = scheduledExecutors.minBy(_.size)
    leastScheduledExecutors += executor
  }

  receivers.map(_.streamId).zip(scheduledExecutors).toMap
}
Once the executors have been chosen, startReceiver(receiver, executors) is called to start the receiver on the designated executor.
The startReceiver method
It wraps the receiver in an RDD and submits it with SparkContext.submitJob, so that the receiver is distributed to the chosen executor and run just like an ordinary RDD job; the job's logic is the startReceiverFunc function. When startReceiverFunc is invoked, it creates a ReceiverSupervisorImpl (supervisor) for the receiver shipped to that node and then calls the supervisor's start method, so that it supervises the receiver running on the worker. ReceiverSupervisor provides the interfaces needed to handle received data. The concrete implementation is shown below:
// Function to start the receiver on the worker node
val startReceiverFunc: Iterator[Receiver[_]] => Unit =
  (iterator: Iterator[Receiver[_]]) => {
    if (!iterator.hasNext) {
      throw new SparkException(
        "Could not start receiver as object not found.")
    }
    if (TaskContext.get().attemptNumber() == 0) {
      val receiver = iterator.next()
      assert(iterator.hasNext == false)
      val supervisor = new ReceiverSupervisorImpl(
        receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
      supervisor.start()
      supervisor.awaitTermination()
    } else {
      // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
    }
  }

// Create the RDD using the scheduledExecutors to run the receiver in a Spark job
val receiverRDD: RDD[Receiver[_]] =
  if (scheduledExecutors.isEmpty) {
    ssc.sc.makeRDD(Seq(receiver), 1)
  } else {
    ssc.sc.makeRDD(Seq(receiver -> scheduledExecutors))
  }
receiverRDD.setName(s"Receiver $receiverId")
ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
  receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
// We will keep restarting the receiver job until ReceiverTracker is stopped
future.onComplete {
  case Success(_) =>
    if (!shouldStartReceiver) {
      onReceiverJobFinish(receiverId)
    } else {
      logInfo(s"Restarting Receiver $receiverId")
      self.send(RestartReceiver(receiver))
    }
  case Failure(e) =>
    if (!shouldStartReceiver) {
      onReceiverJobFinish(receiverId)
    } else {
      logError("Receiver has been stopped. Try to restart it.", e)
      logInfo(s"Restarting Receiver $receiverId")
      self.send(RestartReceiver(receiver))
    }
}(submitJobThreadPool)
logInfo(s"Receiver ${receiver.streamId} started")
Once the receivers are running, the ReceiverTracker waits for the application to finish. Next we look at the supervisor.
ReceiverSupervisor
It supervises the receiver running on the worker and provides the interfaces needed to handle received data. Its start method is:
/** Start the supervisor */
def start() {
  onStart()
  startReceiver()
}
As the code shows, it executes two methods, onStart() and startReceiver(); onStart is implemented by the concrete subclass ReceiverSupervisorImpl.
The onStart method starts all registered BlockGenerators. A BlockGenerator periodically turns the received data into blocks and appends the blocks to a queue; BlockGenerators are covered in detail in a later subsection.
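In ReceiverSupervisorImpl (1.5) onStart is roughly just:

override protected def onStart() {
  // Start all registered BlockGenerators so they begin batching received records into blocks
  registeredBlockGenerators.foreach { _.start() }
}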
The startReceiver method starts the receiver; its logic is as follows:
/** Start receiver */
def startReceiver(): Unit = synchronized {
  try {
    if (onReceiverStart()) {
      logInfo("Starting receiver")
      receiverState = Started
      receiver.onStart()
      logInfo("Called receiver onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}
onReceiverStart is implemented by ReceiverSupervisorImpl; it sends a registration message (RegisterReceiver) to the ReceiverTracker and waits for the response (true on success, false otherwise).
The method then calls onStart on the Receiver (SocketReceiver here), which creates and starts a daemon thread that actually receives the data. The onStart method is implemented as follows:
def onStart() {
  // Start the thread that receives data over a connection
  new Thread("Socket Receiver") {
    setDaemon(true)
    override def run() { receive() }
  }.start()
}
The receive method is the part that actually receives the data: it creates a Socket, reads from it, converts the bytes into strings wrapped in an Iterator, and calls store to save each record. Its implementation is:
/** Create a socket connection and receive data until receiver is stopped */
def receive() {
  var socket: Socket = null
  try {
    logInfo("Connecting to " + host + ":" + port)
    socket = new Socket(host, port)
    logInfo("Connected to " + host + ":" + port)
    val iterator = bytesToObjects(socket.getInputStream())
    while(!isStopped && iterator.hasNext) {
      store(iterator.next)
    }
    if (!isStopped()) {
      restart("Socket data stream had no more data")
    } else {
      logInfo("Stopped receiving")
    }
  } catch {
    case e: java.net.ConnectException =>
      restart("Error connecting to " + host + ":" + port, e)
    case NonFatal(e) =>
      logWarning("Error receiving data", e)
      restart("Error receiving data", e)
  } finally {
    if (socket != null) {
      socket.close()
      logInfo("Closed socket to " + host + ":" + port)
    }
  }
}
The store method is defined in the Receiver base class; it calls pushSingle on the ReceiverSupervisor, handing a single received record to the BlockGenerator, which appends it to its buffer via addData. A sketch of this path follows below.
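A simplified sketch of the path (condensed from the 1.5 source; error handling and state checks omitted):

// Receiver.store: hand one record to the supervisor
def store(dataItem: T) {
  supervisor.pushSingle(dataItem)
}

// ReceiverSupervisorImpl.pushSingle: forward the record to the default BlockGenerator
def pushSingle(data: Any) {
  defaultBlockGenerator.addData(data)
}

// BlockGenerator.addData (simplified): append the record to the current buffer
def addData(data: Any): Unit = {
  waitToPush()            // rate limiting (spark.streaming.receiver.maxRate)
  synchronized {
    currentBuffer += data
  }
}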
BlockGenerator
As described in the previous section, when ReceiverSupervisor.start is called it executes onStart, which starts all registered BlockGenerators (by calling their start methods). A BlockGenerator periodically turns the received data into blocks and appends them to a queue. Let's look at it in detail.
Important BlockGenerator attributes:
blockIntervalTimer: a RecurringTimer that creates a daemon thread and periodically executes the callback passed as its third argument, here the updateCurrentBuffer function.
blockPushingThread: the thread that runs keepPushingBlocks; blocksForPushing is an ArrayBlockingQueue.
When the BlockGenerator's start() method is called, it starts the blockIntervalTimer (via its start method) and the blockPushingThread.
When blockIntervalTimer.start is called, it launches a background thread that periodically runs updateCurrentBuffer, which wraps the buffered data into a block and adds it to blocksForPushing, roughly as in the sketch below.
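A simplified sketch of updateCurrentBuffer (condensed from the 1.5 BlockGenerator; error handling omitted):

private def updateCurrentBuffer(time: Long): Unit = {
  var newBlock: Block = null
  synchronized {
    if (currentBuffer.nonEmpty) {
      val newBlockBuffer = currentBuffer
      currentBuffer = new ArrayBuffer[Any]       // start a fresh buffer for new records
      val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
      listener.onGenerateBlock(blockId)
      newBlock = new Block(blockId, newBlockBuffer)
    }
  }
  if (newBlock != null) {
    blocksForPushing.put(newBlock)               // blocks if the queue is full
  }
}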
Once blockPushingThread is running, it executes keepPushingBlocks(), which takes blocks from blocksForPushing and calls pushBlock on them; internally this fires the onPushBlock event of the BlockGeneratorListener. That listener is passed in as a parameter when ReceiverSupervisorImpl creates the BlockGenerator.
When onPushBlock fires, the listener calls pushArrayBuffer, which in turn calls pushAndReportBlock. pushAndReportBlock is implemented as follows:
/** Store block and report it to driver */
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val numRecords = blockStoreResult.numRecords
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}
It stores the block data via receivedBlockHandler.storeBlock and sends an AddBlock message to the trackerEndpoint.
When the ReceiverTracker receives the AddBlock message, it wraps the information in a BlockAdditionEvent, writes it to the write-ahead log, and appends the block info to the stream's received-block queue (getReceivedBlockQueue).
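The corresponding ReceivedBlockTracker.addBlock is roughly (condensed from the 1.5 source):

def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
  try {
    // Record the addition in the write-ahead log (if enabled), then queue the block for its stream
    writeToLog(BlockAdditionEvent(receivedBlockInfo))
    getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
    true
  } catch {
    case e: Exception =>
      logError(s"Error adding block $receivedBlockInfo", e)
      false
  }
}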
Summary: the sections above describe how data is received and stored. As mentioned earlier, when the JobScheduler's start method runs it also starts the jobGenerator component, i.e., the data-processing side, which we analyze next.
2. The data-processing path (turning DStreams into RDDs, generating jobs, and submitting them)
JobGenerator
Important attributes:
timer: a RecurringTimer (like the one used by the BlockGenerator) that periodically triggers the GenerateJobs event.
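In the 1.5 source the timer is created roughly like this; every batchDuration it posts a GenerateJobs event to the JobGenerator's eventLoop:

private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")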
When JobGenerator.start() is called, it creates and starts an eventLoop and then calls startFirstTime().
The eventLoop works the same way as the one in the JobScheduler above: received events are delegated to processEvent(event), which is implemented here as:
/** Processes all events */
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}
startFirstTime() calls graph.start and starts the timer (which begins triggering GenerateJobs events), roughly as in the sketch below.
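startFirstTime looks roughly like this in the 1.5 source:

private def startFirstTime() {
  val startTime = new Time(timer.getStartTime())
  graph.start(startTime - graph.batchDuration)   // start the DStreamGraph
  timer.start(startTime.milliseconds)            // start posting GenerateJobs events
  logInfo("Started JobGenerator at " + startTime)
}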
DStreamGraph.start is implemented as follows:
def start(time: Time) {
  this.synchronized {
    require(zeroTime == null, "DStream graph computation already started")
    zeroTime = time
    startTime = time
    outputStreams.foreach(_.initialize(zeroTime))
    outputStreams.foreach(_.remember(rememberDuration))
    outputStreams.foreach(_.validateAtStart)
    inputStreams.par.foreach(_.start())
  }
}
It initializes, sets the remember duration for, and validates the output streams, and calls start on the input streams.
When a GenerateJobs event fires, processEvent runs generateJobs(time). This eventually calls ReceivedBlockTracker.allocateBlocksToBatch(time), which takes the unallocated blocks from the received-block queues, assigns them to the batch, stores the result in timeToAllocatedBlocks (a HashMap from batch time to allocated blocks), and writes a log record. A sketch of generateJobs follows below.
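generateJobs itself is roughly (condensed from the 1.5 source):

private def generateJobs(time: Time) {
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time)   // allocate received blocks to this batch
    graph.generateJobs(time)                                   // generate jobs using the allocated blocks
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}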
It then calls graph.generateJobs(time), which invokes generateJob(time) on every output stream. Because the output stream (ForEachDStream in our example) overrides DStream's generateJob, the ForEachDStream version runs here, not the DStream one. Its implementation is:
override def generateJob(time: Time): Option[Job] = {
  parent.getOrCompute(time) match {
    case Some(rdd) =>
      val jobFunc = () => createRDDWithLocalProperties(time) {
        ssc.sparkContext.setCallSite(creationSite)
        foreachFunc(rdd, time)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}
parent.getOrCompute retrieves the RDD for the given batch. Before following the call, recall the chain of DStreams built by the WordCount application:
SocketInputDStream -> FlatMappedDStream (flatMap) -> MappedDStream (map) -> ShuffledDStream (reduceByKey) -> ForEachDStream (print)
getOrCompute (compute behaves similarly) is defined in the DStream base class; if a subclass overrides it, the subclass version runs, otherwise the base-class version does. getOrCompute recurses up the chain until it reaches the SocketInputDStream. SocketInputDStream does not override it, so the base-class method runs, shown below:
/**
 * Get the RDD corresponding to the given time; either retrieve it from cache
 * or compute-and-cache it.
 */
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  // If RDD was already generated, then retrieve it from HashMap,
  // or else compute the RDD
  generatedRDDs.get(time).orElse {
    // Compute the RDD if time is valid (e.g. correct time in a sliding window)
    // of RDD generation, else generate nothing.
    if (isTimeValid(time)) {
      val rddOption = createRDDWithLocalProperties(time) {
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details. We need to have this call here because
        // compute() might cause Spark jobs to be launched.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          compute(time)
        }
      }
      rddOption.foreach { case newRDD =>
        // Register the generated RDD for caching and checkpointing
        if (storageLevel != StorageLevel.NONE) {
          newRDD.persist(storageLevel)
          logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
        }
        if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
          newRDD.checkpoint()
          logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
        }
        generatedRDDs.put(time, newRDD)
      }
      rddOption
    } else {
      None
    }
  }
}
The compute method generates the initial RDD for the given batch (a BlockRDD built from the received data). Its code is:
/** Generates RDDs with blocks received by the receiver of this stream. */
override def compute(validTime: Time): Option[RDD[T]] = {
  val blockRDD = {
    if (validTime < graph.startTime) {
      // If this is called for any time before the start time of the context,
      // then this returns an empty RDD. This may happen when recovering from a
      // driver failure without any write ahead log to recover pre-failure data.
      new BlockRDD[T](ssc.sc, Array.empty)
    } else {
      // Otherwise, ask the tracker for all the blocks that have been allocated to this stream
      // for this batch
      val receiverTracker = ssc.scheduler.receiverTracker
      val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
      // Register the input blocks information into InputInfoTracker
      val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
      ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)
      // Create the BlockRDD
      createBlockRDD(validTime, blockInfos)
    }
  }
  Some(blockRDD)
}
After the BlockRDD is generated, control returns to the next level of the recursion (FlatMappedDStream) and continues; that level's code is:
override def compute(validTime: Time): Option[RDD[U]] = {
  parent.getOrCompute(validTime).map(_.flatMap(flatMapFunc))
}
After returning from parent.getOrCompute (SocketInputDStream.getOrCompute), it applies its own RDD transformation (this is how the RDD graph is built), then returns to the level above for its transformation, and so on until control is back at the entry point, the ForEachDStream (the output stream).
outputStream.generateJob(time), implemented in ForEachDStream, then creates a Job from the current batch and the foreachFunc defined in DStream.print.
Once the jobs are created, JobGenerator.generateJobs submits them with jobScheduler.submitJobSet.
submitJobSet is implemented as follows:
def submitJobSet(jobSet: JobSet) {
  if (jobSet.jobs.isEmpty) {
    logInfo("No jobs added for time " + jobSet.time)
  } else {
    listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
    jobSets.put(jobSet.time, jobSet)
    jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
    logInfo("Added jobs for time " + jobSet.time)
  }
}
Finally, each job in the JobSet is wrapped in a JobHandler and executed concurrently by the jobExecutor thread pool.
private class JobHandler(job: Job) extends Runnable with Logging {
  import JobScheduler._

  def run() {
    try {
      val formattedTime = UIUtils.formatBatchTime(
        job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
      val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
      val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"
      ssc.sc.setJobDescription(
        s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
      ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
      ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)
      // We need to assign `eventLoop` to a temp variable. Otherwise, because
      // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
      // it's possible that when `post` is called, `eventLoop` happens to null.
      var _eventLoop = eventLoop
      if (_eventLoop != null) {
        _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          job.run()
        }
        _eventLoop = eventLoop
        if (_eventLoop != null) {
          _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
        }
      } else {
        // JobScheduler has been stopped.
      }
    } finally {
      ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
      ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
    }
  }
}
The code above is the core of JobHandler: it posts a JobStarted event to the eventLoop and calls Job.run().
def run() {
  _result = Try(func())
}
job.run() executes foreachFunc, the function captured when the job was generated. The take operation inside foreachFunc is an action, so it triggers the actual Spark job submission and completes the processing of the batch.
Original article: http://www.cnblogs.com/barrenlake/p/4889190.html