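Spark Streaming Analysis (Based on the 1.5 Source Code)

Introduction to Spark Streaming

Spark Streaming is a micro-batch stream processing framework whose core execution engine is Spark. It suits scenarios that process real-time data together with historical data. Its processing flow is as follows:

1. Receive the real-time stream data and persist it.

2. Slice the real-time stream into batches by time interval.

3. Treat each slice (one batch) of data as an RDD and process it with RDD operations.

4. Generate a Spark job for each batch, submit it to Spark for processing, and return the result.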
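The slicing step can be illustrated without Spark. The following is only a sketch under stated assumptions: `Event`, `sliceIntoBatches` and `wordCount` are hypothetical names, a batch is modelled as a plain `Seq` rather than an RDD, and each batch is keyed by the end time of its interval.

```scala
object BatchSlicing {
  // Hypothetical model of a received record: arrival time plus payload.
  final case class Event(timestampMs: Long, payload: String)

  /** Groups events into batches keyed by the end time of their batch interval. */
  def sliceIntoBatches(events: Seq[Event], batchMs: Long): Map[Long, Seq[String]] =
    events
      .groupBy(e => (e.timestampMs / batchMs + 1) * batchMs)
      .map { case (batchEnd, es) => batchEnd -> es.sortBy(_.timestampMs).map(_.payload) }

  /** Each batch plays the role of one RDD; here a word count is run on it. */
  def wordCount(batch: Seq[String]): Map[String, Int] =
    batch.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => w -> ws.size }
}
```

With a 10-second interval, records arriving at 1 s and 9 s fall into the batch ending at 10 s, and a record arriving at 12 s falls into the batch ending at 20 s.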

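Introduction to DStream

DStream is a key programming abstraction in Spark Streaming. It represents a continuous stream of data, either obtained from a data source or produced by transforming another stream. A DStream consists of a continuous sequence of RDDs.

As with RDDs, there are two kinds of operations on a DStream: transformations and output operations.

The dependencies formed by transformations between DStreams are stored in the DStreamGraph (which is initialized when the StreamingContext is created). The DStreamGraph periodically generates an RDD DAG.

A Spark Streaming Application

Taking WordCount as an example, a Spark Streaming application is implemented as follows: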

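    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("wordCount").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 8585, StorageLevel.MEMORY_ONLY)
    val words = lines.flatMap(_.split(" ")).map(w => (w, 1))
    val wordCount = words.reduceByKey(_ + _)
    wordCount.print()
    ssc.start()
    // Block the main thread so the application keeps running after start()
    ssc.awaitTermination()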

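Analysis of the Spark Streaming Execution Process

The execution process is analyzed below using the WordCount application from the previous section.

StreamingContext Initialization

StreamingContext is the entry point for much of the streaming functionality; for example, it provides the methods that create DStreams from various data sources. As the WordCount application shows, a streaming application first creates a StreamingContext.

Creating a StreamingContext creates the following main components:

1. A DStreamGraph, on which the batch interval for converting it into an RDD graph is set.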

private[streaming] val graph: DStreamGraph = {
    if (isCheckpointPresent) {
      cp_.graph.setContext(this)
      cp_.graph.restoreCheckpointData()
      cp_.graph
    } else {
      require(batchDur_ != null, "Batch duration for StreamingContext cannot be null")
      val newGraph = new DStreamGraph()
      newGraph.setBatchDuration(batchDur_)
      newGraph
    }
  }

2. A JobScheduler.

private[streaming] val scheduler = new JobScheduler(this)

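DStream Creation and Transformation

Calling socketTextStream on the newly created StreamingContext creates a SocketInputDStream (a ReceiverInputDStream).

The InputDStream inheritance hierarchy is as follows (taking SocketInputDStream and KafkaInputDStream as examples).

On the JVM, a superclass is initialized before its subclass. So when a SocketInputDStream is created, InputDStream is initialized first, and the InputDStream constructor adds the stream itself to the DStreamGraph (see the figure above).

Every ReceiverInputDStream subclass has a getReceiver method, which returns the Receiver object. SocketInputDStream, for example, creates a SocketReceiver to receive data, as the figure above shows.

Transformations on a DStream, like those on an RDD, are evaluated lazily. When the application reaches print, an output operator, the DStream is wrapped in a ForEachDStream, whose register method registers it as an output stream in the outputStreams list of the DStreamGraph.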

/**
   * Print the first num elements of each RDD generated in this DStream. This is an output
   * operator, so this DStream will be registered as an output stream and there materialized.
   */
  def print(num: Int): Unit = ssc.withScope {
    def foreachFunc: (RDD[T], Time) => Unit = {
      (rdd: RDD[T], time: Time) => {
        val firstNum = rdd.take(num + 1)
        // scalastyle:off println
        println("-------------------------------------------")
        println("Time: " + time)
        println("-------------------------------------------")
        firstNum.take(num).foreach(println)
        if (firstNum.length > num) println("...")
        println()
        // scalastyle:on println
      }
    }
    new ForEachDStream(this, context.sparkContext.clean(foreachFunc)).register()
  }

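ForEachDStream differs from other DStreams in that it overrides the generateJob method.

All transformation relationships between DStreams are expressed as dependencies, similar to those of RDDs.

Startup Process

The application starts executing the streaming job by calling ssc.start(). The start method is implemented as follows: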

def start(): Unit = synchronized {
    state match {
      case INITIALIZED =>
        startSite.set(DStream.getCreationSite())
        StreamingContext.ACTIVATION_LOCK.synchronized {
          StreamingContext.assertNoOtherContextIsActive()
          try {
            validate()

            // Start the streaming scheduler in a new thread, so that thread local properties
            // like call sites and job groups can be reset without affecting those of the
            // current thread.
            ThreadUtils.runInNewThread("streaming-start") {
              sparkContext.setCallSite(startSite.get)
              sparkContext.clearJobGroup()
              sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
              scheduler.start()
            }
            state = StreamingContextState.ACTIVE
          } catch {
            case NonFatal(e) =>
              logError("Error starting the context, marking it as stopped", e)
              scheduler.stop(false)
              state = StreamingContextState.STOPPED
              throw e
          }
          StreamingContext.setActiveContext(this)
        }
        shutdownHookRef = ShutdownHookManager.addShutdownHook(
          StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
        // Registering Streaming Metrics at the start of the StreamingContext
        assert(env.metricsSystem != null)
        env.metricsSystem.registerSource(streamingSource)
        uiTab.foreach(_.attach())
        logInfo("StreamingContext started")
      case ACTIVE =>
        logWarning("StreamingContext has already been started")
      case STOPPED =>
        throw new IllegalStateException("StreamingContext has already been stopped")
    }
  }

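The core of this code is scheduler.start(), where scheduler is the JobScheduler object created when the StreamingContext was instantiated, as described above. (The analysis below follows the function call flow.)

JobScheduler Creation and Execution

JobScheduler schedules the jobs that run on Spark. It uses JobGenerator to generate jobs and a thread pool to run the submitted jobs in parallel.

I. JobScheduler creation:

JobScheduler is created by StreamingContext, which also triggers its start call.

When a JobScheduler is initialized, it creates a thread pool (jobExecutor) and a jobGenerator.

Where:

jobExecutor is used to submit jobs. The number of threads in the pool is the job concurrency, set by "spark.streaming.concurrentJobs", which defaults to 1.

jobGenerator is an instance of the JobGenerator class. It generates jobs from the DStreams.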
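The effect of the pool size can be sketched without Spark. This is a minimal illustration, not the JobScheduler code: `JobExecutorSketch` and `runJobs` are hypothetical names, and a job is reduced to a named task that records when it finished. With a single thread (the default concurrency of 1), jobs complete in submission order, so batches are processed one after another.

```scala
import java.util.concurrent.{CopyOnWriteArrayList, Executors, TimeUnit}

object JobExecutorSketch {
  /** Runs the named jobs on a fixed pool of `concurrentJobs` threads and returns their completion order. */
  def runJobs(jobNames: Seq[String], concurrentJobs: Int): Seq[String] = {
    val pool = Executors.newFixedThreadPool(concurrentJobs)
    val finished = new CopyOnWriteArrayList[String]()
    jobNames.foreach { name =>
      // Each job only records its own name when it runs.
      pool.execute(new Runnable { def run(): Unit = finished.add(name) })
    }
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    (0 until finished.size).map(i => finished.get(i))
  }
}
```

With more than one thread the completion order is no longer guaranteed, which is the trade-off behind raising the concurrency setting.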

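II. JobScheduler execution

When the start method runs, it creates and starts the following services:

eventLoop: an EventLoop[JobSchedulerEvent] object that receives and processes events. Callers add events to its event queue by calling its post method. When the EventLoop starts, it launches a daemon thread that processes the events in the queue. EventLoop is an abstract class; when JobScheduler initializes it, it implements the onReceive method, which hands each received event to processEvent(event).

receiverTracker: a ReceiverTracker object that manages the execution of the receivers of the ReceiverInputDStreams. This object must be created after all input streams have been added to the DStreamGraph, because on instantiation it extracts the ReceiverInputDStreams from the DStreamGraph so that their Receivers can be obtained when it starts.

jobGenerator: created when the JobScheduler is instantiated, and started here.

This part of the code is shown below: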

def start(): Unit = synchronized {
    if (eventLoop != null) return // scheduler has already been started

    logDebug("Starting JobScheduler")
    eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
      override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
    }
    eventLoop.start()

    // attach rate controllers of input streams to receive batch completion updates
    for {
      inputDStream <- ssc.graph.getInputStreams
      rateController <- inputDStream.rateController
    } ssc.addStreamingListener(rateController)

    listenerBus.start(ssc.sparkContext)
    receiverTracker = new ReceiverTracker(ssc)
    inputInfoTracker = new InputInfoTracker(ssc)
    receiverTracker.start()
    jobGenerator.start()
    logInfo("Started JobScheduler")
  }
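The event loop pattern used by start() can be sketched in plain Scala. This is a simplified stand-in rather than Spark's EventLoop class: `SimpleEventLoop`, the two event types and `demo` are hypothetical, and the loop is assumed to be a blocking queue drained by one daemon thread.

```scala
import java.util.concurrent.{CopyOnWriteArrayList, CountDownLatch, LinkedBlockingDeque, TimeUnit}
import scala.util.control.NonFatal

object EventLoopSketch {
  /** Minimal event loop: a queue drained by a single daemon thread. */
  abstract class SimpleEventLoop[E](name: String) {
    private val queue = new LinkedBlockingDeque[E]()
    @volatile private var stopped = false
    private val thread = new Thread(name) {
      setDaemon(true)
      override def run(): Unit =
        try {
          while (!stopped) {
            val event = queue.take()
            try onReceive(event) catch { case NonFatal(e) => onError(e) }
          }
        } catch { case _: InterruptedException => () } // stop() interrupts the blocked take()
    }
    def start(): Unit = thread.start()
    def stop(): Unit = { stopped = true; thread.interrupt() }
    def post(event: E): Unit = queue.put(event)
    protected def onReceive(event: E): Unit
    protected def onError(e: Throwable): Unit
  }

  sealed trait SchedulerEvent
  final case class JobStarted(id: Int) extends SchedulerEvent
  final case class JobCompleted(id: Int) extends SchedulerEvent

  /** Posts two events and returns them in the order the loop processed them. */
  def demo(): Seq[String] = {
    val seen = new CopyOnWriteArrayList[String]()
    val done = new CountDownLatch(2)
    val loop = new SimpleEventLoop[SchedulerEvent]("sketch-loop") {
      protected def onReceive(event: SchedulerEvent): Unit = {
        event match {
          case JobStarted(id)   => seen.add(s"started $id")
          case JobCompleted(id) => seen.add(s"completed $id")
        }
        done.countDown()
      }
      protected def onError(e: Throwable): Unit = ()
    }
    loop.start()
    loop.post(JobStarted(1))
    loop.post(JobCompleted(1))
    done.await(5, TimeUnit.SECONDS)
    loop.stop()
    (0 until seen.size).map(i => seen.get(i))
  }
}
```

Because a single thread drains the queue, events are handled strictly in the order they were posted, and the posting thread never blocks on the handler.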

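When JobScheduler starts, it mainly instantiates and starts the components above. The whole process is divided into two major parts below and described component by component.

1. Data Reception (Starting Receivers, Receiving Data, Generating Blocks)

ReceiverTracker

It is created in JobScheduler, which then calls its start method.

I. ReceiverTracker creation

When a ReceiverTracker is created, it creates the following main attributes:

receiverInputStreams: the InputDStream instances that receive data, obtained through ssc.graph.getReceiverInputStreams. It holds instances of ReceiverInputDStream and its subclasses (including SocketInputDStream, RawInputDStream, FlumePollingInputDStream, KafkaInputDStream, and MQTTInputDStream).

receivedBlockTracker: a ReceivedBlockTracker instance that records the blocks received by the Receivers. Operations performed through this class can be saved to a WAL (write ahead log) so that they can be recovered after a failure.

schedulingPolicy: a ReceiverSchedulingPolicy instance used to schedule Receivers.

receiverTrackingInfos: a HashMap that maintains receiver information (key: receiverId, value: receiver info). It may only be accessed by ReceiverTrackerEndpoint.

II. Execution of the start method

When start is called, it first checks whether receiverInputStreams is empty. If it is empty, that is, there are no Receivers, it does nothing. Otherwise it creates a ReceiverTrackerEndpoint instance, endpoint.

ReceiverTrackerEndpoint is an RPC endpoint that receives the messages sent by the Receivers, handles them, and replies when necessary.

On a first start, receiver launch is not skipped by default (skipReceiverLaunch = false, the default specified when the ReceiverTracker is created), so the launchReceivers() method is executed.

This part of the code is shown below: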

/** Start the endpoint and receiver execution thread. */
  def start(): Unit = synchronized {
    if (isTrackerStarted) {
      throw new SparkException("ReceiverTracker already started")
    }

    if (!receiverInputStreams.isEmpty) {
      endpoint = ssc.env.rpcEnv.setupEndpoint(
        "ReceiverTracker", new ReceiverTrackerEndpoint(ssc.env.rpcEnv))
      if (!skipReceiverLaunch) launchReceivers()
      logInfo("ReceiverTracker started")
      trackerState = Started
    }
  }

The launchReceivers method calls runDummySparkJob() and sends a StartAllReceivers message to the ReceiverTrackerEndpoint.

/**
   * Get the receivers from the ReceiverInputDStreams, distributes them to the
   * worker nodes as a parallel collection, and runs them.
   */
  private def launchReceivers(): Unit = {
    val receivers = receiverInputStreams.map(nis => {
      val rcvr = nis.getReceiver()
      rcvr.setReceiverId(nis.id)
      rcvr
    })

    runDummySparkJob()

    logInfo("Starting " + receivers.length + " receivers")
    endpoint.send(StartAllReceivers(receivers))
  }

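runDummySparkJob() makes sure that the slave nodes have registered, to avoid scheduling all receivers on the same node. Unless Spark runs in local mode, it has Spark Core run a very simple job, and then asserts through getExecutors that at least one executor is available. The code of runDummySparkJob is as follows: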

/**
   * Run the dummy Spark job to ensure that all slaves have registered. This avoids all the
   * receivers to be scheduled on the same node.
   *
   * TODO Should poll the executor number and wait for executors according to
   * "spark.scheduler.minRegisteredResourcesRatio" and
   * "spark.scheduler.maxRegisteredResourcesWaitingTime" rather than running a dummy job.
   */
  private def runDummySparkJob(): Unit = {
    if (!ssc.sparkContext.isLocal) {
      ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
    }
    assert(getExecutors.nonEmpty)
  }

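launchReceivers obtains a Receiver from every stream in the receiverInputStreams list and, once slave registration has been checked, sends them to endpoint in a StartAllReceivers message.

ReceiverTrackerEndpoint

It receives and handles the messages sent by the Receivers and by itself, replying when necessary. The messages it can handle are:

StartAllReceivers, RestartReceiver, CleanupOldBlocks, UpdateReceiverRateLimit, ReportError, RegisterReceiver, AddBlock, DeregisterReceiver, AllReceiverIds, and StopAllReceivers.

When ReceiverTrackerEndpoint receives StartAllReceivers, it uses the scheduling policy schedulingPolicy (described above) to produce a receiver placement (choosing the executors on which each Receiver may run), and then calls startReceiver to launch each one.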

case StartAllReceivers(receivers) =>
        val scheduledExecutors = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
        for (receiver <- receivers) {
          val executors = scheduledExecutors(receiver.streamId)
          updateReceiverScheduledExecutors(receiver.streamId, executors)
          receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
          startReceiver(receiver, executors)
        }

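This part of the code is shown above. schedulingPolicy and startReceiver are explained in the following subsections.

The ReceiverSchedulingPolicy Scheduling Policy

This class schedules receivers and tries to distribute them evenly. Receiver scheduling has two phases. The first is global scheduling: when the ReceiverTracker starts, scheduleReceivers schedules all receivers at once. The second is local scheduling, which happens when a single Receiver restarts and is done through scheduleReceiver. Continuing from the previous subsection, scheduleReceivers is analyzed here:

ReceiverTrackerEndpoint triggers receiver scheduling by calling schedulingPolicy.scheduleReceivers(receivers, getExecutors), which takes two arguments: receivers and getExecutors. The first, receivers, arrives in the message. The second is the list of executors available for scheduling, obtained from the getExecutors method. The list does not include the driver. Each executor is described in the form "host:port".

scheduleReceivers first converts the executors list into a Map whose entries have the form host -> list of "host:port" strings. It then iterates over the receivers list and assigns executors to the receivers one by one, as follows:

It takes a receiver from the list and checks whether it has a preferredLocation (a host on which it would prefer to run). This method is declared in the Receiver base class, and subclasses may override it. The SocketReceiver in the example above does not override it and therefore has no preferredLocation, so the executor that runs it can be chosen freely.

This part of the code is as follows: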

 /**
   * Try our best to schedule receivers with evenly distributed. However, if the
   * `preferredLocation`s of receivers are not even, we may not be able to schedule them evenly
   * because we have to respect them.
   *
   * Here is the approach to schedule executors:
   * <ol>
   *   <li>First, schedule all the receivers with preferred locations (hosts), evenly among the
   *       executors running on those host.</li>
   *   <li>Then, schedule all other receivers evenly among all the executors such that overall
   *       distribution over all the receivers is even.</li>
   * </ol>
   *
   * This method is called when we start to launch receivers at the first time.
   */
  def scheduleReceivers(
      receivers: Seq[Receiver[_]], executors: Seq[String]): Map[Int, Seq[String]] = {
    if (receivers.isEmpty) {
      return Map.empty
    }

    if (executors.isEmpty) {
      return receivers.map(_.streamId -> Seq.empty).toMap
    }

    val hostToExecutors = executors.groupBy(_.split(":")(0))
    val scheduledExecutors = Array.fill(receivers.length)(new mutable.ArrayBuffer[String])
    val numReceiversOnExecutor = mutable.HashMap[String, Int]()
    // Set the initial value to 0
    executors.foreach(e => numReceiversOnExecutor(e) = 0)

    // Firstly, we need to respect "preferredLocation". So if a receiver has "preferredLocation",
    // we need to make sure the "preferredLocation" is in the candidate scheduled executor list.
    for (i <- 0 until receivers.length) {
      // Note: preferredLocation is host but executors are host:port
      receivers(i).preferredLocation.foreach { host =>
        hostToExecutors.get(host) match {
          case Some(executorsOnHost) =>
            // preferredLocation is a known host. Select an executor that has the least receivers in
            // this host
            val leastScheduledExecutor =
              executorsOnHost.minBy(executor => numReceiversOnExecutor(executor))
            scheduledExecutors(i) += leastScheduledExecutor
            numReceiversOnExecutor(leastScheduledExecutor) =
              numReceiversOnExecutor(leastScheduledExecutor) + 1
          case None =>
            // preferredLocation is an unknown host.
            // Note: There are two cases:
            // 1. This executor is not up. But it may be up later.
            // 2. This executor is dead, or it's not a host in the cluster.
            // Currently, simply add host to the scheduled executors.
            scheduledExecutors(i) += host
        }
      }
    }

    // For those receivers that don't have preferredLocation, make sure we assign at least one
    // executor to them.
    for (scheduledExecutorsForOneReceiver <- scheduledExecutors.filter(_.isEmpty)) {
      // Select the executor that has the least receivers
      val (leastScheduledExecutor, numReceivers) = numReceiversOnExecutor.minBy(_._2)
      scheduledExecutorsForOneReceiver += leastScheduledExecutor
      numReceiversOnExecutor(leastScheduledExecutor) = numReceivers + 1
    }

    // Assign idle executors to receivers that have less executors
    val idleExecutors = numReceiversOnExecutor.filter(_._2 == 0).map(_._1)
    for (executor <- idleExecutors) {
      // Assign an idle executor to the receiver that has least candidate executors.
      val leastScheduledExecutors = scheduledExecutors.minBy(_.size)
      leastScheduledExecutors += executor
    }

    receivers.map(_.streamId).zip(scheduledExecutors).toMap
  }
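Two steps of this method are easy to try in isolation: grouping the "host:port" strings by host, and giving a receiver without a preferred location the least-loaded executor. The following is a simplified sketch, not the Spark implementation: `LeastLoadedScheduling` and `assign` are hypothetical names, each receiver gets exactly one executor, and ties are broken by executor name so that the result is deterministic (an assumption made only for this sketch).

```scala
import scala.collection.mutable

object LeastLoadedScheduling {
  /** Groups "host:port" executor strings by host, like the first step of scheduleReceivers. */
  def hostToExecutors(executors: Seq[String]): Map[String, Seq[String]] =
    executors.groupBy(_.split(":")(0))

  /**
   * Simplified assignment for receivers WITHOUT a preferred location:
   * each receiver gets the executor that currently hosts the fewest receivers.
   */
  def assign(receiverIds: Seq[Int], executors: Seq[String]): Map[Int, String] = {
    require(executors.nonEmpty, "at least one executor is needed")
    val load = mutable.Map.empty[String, Int]
    executors.foreach(e => load(e) = 0) // every executor starts with no receivers
    receiverIds.map { id =>
      // Sort by name first so that minBy picks a deterministic executor on ties.
      val (executor, n) = load.toSeq.sortBy(_._1).minBy(_._2)
      load(executor) = n + 1
      id -> executor
    }.toMap
  }
}
```

With three executors and four receivers, the first three receivers land on distinct executors and the fourth wraps around to the first one again.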

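Once the executors have been chosen, startReceiver(receiver, executors) is called to start the receiver on the specified executors.

The startReceiver Method

It wraps the Receiver in an RDD and submits it with SparkContext.submitJob, so that the Receiver is neatly distributed to the chosen executor and run there as an ordinary RDD job, whose logic is the startReceiverFunc function. When startReceiverFunc is called, it creates a ReceiverSupervisorImpl object, supervisor, for the receiver distributed to that node, and then calls the supervisor's start method so that it supervises the receiver running on the worker. ReceiverSupervisor provides the interfaces needed to handle the received data. The implementation is shown in the following code: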

      // Function to start the receiver on the worker node
      val startReceiverFunc: Iterator[Receiver[_]] => Unit =
        (iterator: Iterator[Receiver[_]]) => {
          if (!iterator.hasNext) {
            throw new SparkException(
              "Could not start receiver as object not found.")
          }
          if (TaskContext.get().attemptNumber() == 0) {
            val receiver = iterator.next()
            assert(iterator.hasNext == false)
            val supervisor = new ReceiverSupervisorImpl(
              receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
            supervisor.start()
            supervisor.awaitTermination()
          } else {
            // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
          }
        }

      // Create the RDD using the scheduledExecutors to run the receiver in a Spark job
      val receiverRDD: RDD[Receiver[_]] =
        if (scheduledExecutors.isEmpty) {
          ssc.sc.makeRDD(Seq(receiver), 1)
        } else {
          ssc.sc.makeRDD(Seq(receiver -> scheduledExecutors))
        }
      receiverRDD.setName(s"Receiver $receiverId")
      ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
      ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))

      val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
        receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
      // We will keep restarting the receiver job until ReceiverTracker is stopped
      future.onComplete {
        case Success(_) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logInfo(s"Restarting Receiver $receiverId")
            self.send(RestartReceiver(receiver))
          }
        case Failure(e) =>
          if (!shouldStartReceiver) {
            onReceiverJobFinish(receiverId)
          } else {
            logError("Receiver has been stopped. Try to restart it.", e)
            logInfo(s"Restarting Receiver $receiverId")
            self.send(RestartReceiver(receiver))
          }
      }(submitJobThreadPool)
      logInfo(s"Receiver ${receiver.streamId} started")
    }

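Once the receiver is running, the ReceiverTracker waits until the program ends. ReceiverSupervisorImpl is described next.

ReceiverSupervisor

It supervises the receiver running on the worker and provides the interfaces needed to handle the received data. Its start method is as follows: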

/** Start the supervisor */
  def start() {
    onStart()
    startReceiver()
  }

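As the code above shows, it runs two methods, onStart() and startReceiver(). onStart is implemented by its implementation class, ReceiverSupervisorImpl.

The onStart method starts all registered BlockGenerators. A BlockGenerator periodically turns the received data into a block and adds the generated block to a queue. BlockGenerator is described in detail in the next subsection.

The startReceiver method starts the Receiver. Its logic is as follows: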

  /** Start receiver */
  def startReceiver(): Unit = synchronized {
    try {
      if (onReceiverStart()) {
        logInfo("Starting receiver")
        receiverState = Started
        receiver.onStart()
        logInfo("Called receiver onStart")
      } else {
        // The driver refused us
        stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
      }
    } catch {
      case NonFatal(t) =>
        stop("Error starting receiver " + streamId, Some(t))
    }
  }

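onReceiverStart is implemented by ReceiverSupervisorImpl. Its main job is to send a registration message (RegisterReceiver) to the ReceiverTracker and wait for the reply (true on success, false otherwise).
startReceiver then calls onStart on the Receiver (here SocketReceiver), which creates and starts a daemon thread that actually receives the data. The onStart method is implemented as follows: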

 def onStart() {
    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run() { receive() }
    }.start()
  }

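The receive method is where the data is actually received. It creates a Socket object and receives data through it, converts the data into Strings wrapped in an Iterator, and then calls the store method to store each record. The receive method is implemented as follows: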

  /** Create a socket connection and receive data until receiver is stopped */
  def receive() {
    var socket: Socket = null
    try {
      logInfo("Connecting to " + host + ":" + port)
      socket = new Socket(host, port)
      logInfo("Connected to " + host + ":" + port)
      val iterator = bytesToObjects(socket.getInputStream())
      while(!isStopped && iterator.hasNext) {
        store(iterator.next)
      }
      if (!isStopped()) {
        restart("Socket data stream had no more data")
      } else {
        logInfo("Stopped receiving")
      }
    } catch {
      case e: java.net.ConnectException =>
        restart("Error connecting to " + host + ":" + port, e)
      case NonFatal(e) =>
        logWarning("Error receiving data", e)
        restart("Error receiving data", e)
    } finally {
      if (socket != null) {
        socket.close()
        logInfo("Closed socket to " + host + ":" + port)
      }
    }
  }
}

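The store method is defined in the Receiver base class. It calls the pushSingle method of ReceiverSupervisor, which passes a single received record to the BlockGenerator. The BlockGenerator then uses its addData method to append the record to a buffer.

BlockGenerator

As described in the previous subsection, ReceiverSupervisor runs onStart when start is called, and onStart starts all registered BlockGenerators (by calling their start method). A BlockGenerator periodically turns the received data into a block and adds the generated block to a queue. It is analyzed below:

Important attributes of BlockGenerator:

blockIntervalTimer: a RecurringTimer object. It creates a daemon thread that periodically runs the function passed as its third argument, callback; here that is the updateCurrentBuffer function.

blockPushingThread: the thread that runs keepPushingBlocks. The blocksForPushing queue it drains is an ArrayBlockingQueue.

When its start() method is called, it calls the start method of blockIntervalTimer and starts the blockPushingThread thread.

When blockIntervalTimer's start method is called, it starts a background thread that periodically runs updateCurrentBuffer, which wraps the buffered data into a block and adds it to blocksForPushing.

Once the blockPushingThread thread is started, it runs keepPushingBlocks(), which takes the blocks from blocksForPushing and calls pushBlock on each. That method triggers the onPushBlock event of the BlockGeneratorListener. The listener is passed in as an argument when ReceiverSupervisorImpl creates the BlockGenerator.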
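The buffer-swap idea behind updateCurrentBuffer can be sketched as follows. This is a toy model under stated assumptions, not the Spark class: `SimpleBlockGenerator` and `Block` are hypothetical, records are Strings, and the timer tick is replaced by calling updateCurrentBuffer by hand so that the behaviour is deterministic.

```scala
import java.util.concurrent.ArrayBlockingQueue
import scala.collection.mutable.ArrayBuffer

object BlockGeneratorSketch {
  final case class Block(id: Int, records: Seq[String])

  class SimpleBlockGenerator(queueSize: Int) {
    private var currentBuffer = new ArrayBuffer[String]
    private var nextId = 0
    val blocksForPushing = new ArrayBlockingQueue[Block](queueSize)

    /** Called for every received record (the role of addData). */
    def addData(record: String): Unit = synchronized { currentBuffer += record }

    /** Called on every timer tick: swap the buffer and enqueue it as a block if it is non-empty. */
    def updateCurrentBuffer(): Unit = {
      val toPush = synchronized {
        val full = currentBuffer
        currentBuffer = new ArrayBuffer[String]
        if (full.nonEmpty) { nextId += 1; Some(Block(nextId, full.toList)) } else None
      }
      // put() blocks while the queue is full, which slows the producer down.
      toPush.foreach(b => blocksForPushing.put(b))
    }
  }
}
```

Swapping the buffer inside the lock and enqueueing outside it keeps addData from waiting on a full queue for longer than necessary. A separate pushing thread would then drain blocksForPushing.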

When the onPushBlock event of BlockGeneratorListener fires, it calls pushArrayBuffer, which in turn calls pushAndReportBlock. pushAndReportBlock is implemented as follows:

  /** Store block and report it to driver */
  def pushAndReportBlock(
      receivedBlock: ReceivedBlock,
      metadataOption: Option[Any],
      blockIdOption: Option[StreamBlockId]
    ) {
    val blockId = blockIdOption.getOrElse(nextBlockId)
    val time = System.currentTimeMillis
    val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
    logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
    val numRecords = blockStoreResult.numRecords
    val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
    trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
    logDebug(s"Reported block $blockId")
  }

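It stores the block data with receivedBlockHandler.storeBlock and sends an AddBlock message to trackerEndpoint.

When ReceiverTracker receives the AddBlock message, it wraps the information in a BlockAdditionEvent and writes it to the log, and adds the block information to the queue returned by getReceivedBlockQueue.

Summary: the part above described how data is received and stored. As mentioned earlier, when JobScheduler runs its start method it also starts the jobGenerator component, which starts the data processing. jobGenerator is analyzed below.

2. Data Processing (Converting DStreams into RDDs, Generating and Submitting Jobs)

JobGenerator

Important attribute:

timer: a RecurringTimer object, similar to the one in BlockGenerator. It periodically posts a GenerateJobs event, which triggers the generateJobs method.

When JobGenerator.start() is called, it creates and starts the eventLoop object, and calls the startFirstTime() method.

This eventLoop works like the one in JobScheduler described above: its events are handed to processEvent(event). Here processEvent is implemented as follows: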

  /** Processes all events */
  private def processEvent(event: JobGeneratorEvent) {
    logDebug("Got event " + event)
    event match {
      case GenerateJobs(time) => generateJobs(time)
      case ClearMetadata(time) => clearMetadata(time)
      case DoCheckpoint(time, clearCheckpointDataLater) =>
        doCheckpoint(time, clearCheckpointDataLater)
      case ClearCheckpointData(time) => clearCheckpointData(time)
    }
  }

startFirstTime() calls graph.start and starts the timer (which begins triggering GenerateJobs events).
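How batch times line up can be sketched with a little arithmetic. The sketch assumes that the timer's first tick is the next multiple of the batch interval after the current time; that alignment rule, and the names `firstTick` and `batchTimes`, are assumptions of this example rather than something quoted from the source above.

```scala
object BatchTimerSketch {
  /** First tick: the next multiple of `periodMs` strictly after `nowMs` (assumed alignment rule). */
  def firstTick(nowMs: Long, periodMs: Long): Long = (nowMs / periodMs + 1) * periodMs

  /** The first `n` batch times produced by a timer started at `nowMs`. */
  def batchTimes(nowMs: Long, periodMs: Long, n: Int): Seq[Long] = {
    val start = firstTick(nowMs, periodMs)
    (0 until n).map(i => start + i * periodMs)
  }
}
```

Under this assumption a 10-second batch interval started at 12.345 s produces batches at 20 s, 30 s, 40 s and so on, so batch boundaries do not depend on the exact moment the application was started.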

DStreamGraph.start is implemented as follows:

 def start(time: Time) {
    this.synchronized {
      require(zeroTime == null, "DStream graph computation already started")
      zeroTime = time
      startTime = time
      outputStreams.foreach(_.initialize(zeroTime))
      outputStreams.foreach(_.remember(rememberDuration))
      outputStreams.foreach(_.validateAtStart)
      inputStreams.par.foreach(_.start())
    }
  }

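It initializes and validates the output streams, and calls the start method of every input stream.

When a GenerateJobs event fires, processEvent runs generateJobs(time). This method ends up calling ReceivedBlockTracker.allocateBlocksToBatch(time), which groups the not-yet-allocated blocks in the received block queues into one batch and saves the result in timeToAllocatedBlocks (a HashMap from batch time to allocated blocks), while also writing it to the log.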
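The allocation step can be modelled with two maps. This is a toy sketch, not the ReceivedBlockTracker code: `SimpleBlockTracker` is a hypothetical class, blocks are represented by their ids as Strings, and the write ahead log is left out.

```scala
import scala.collection.mutable

object BlockAllocationSketch {
  class SimpleBlockTracker(streamIds: Seq[Int]) {
    // Unallocated blocks per stream (the role of the received block queues).
    private val receivedQueues = mutable.Map[Int, mutable.Queue[String]]()
    // batchTime -> (streamId -> blocks), the role of timeToAllocatedBlocks.
    private val timeToAllocatedBlocks = mutable.Map[Long, Map[Int, Seq[String]]]()

    def addBlock(streamId: Int, blockId: String): Unit = synchronized {
      receivedQueues.getOrElseUpdate(streamId, mutable.Queue[String]()) += blockId
    }

    /** Moves every not-yet-allocated block into the batch identified by `batchTime`. */
    def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
      val allocated = streamIds.map { id =>
        val queue = receivedQueues.getOrElseUpdate(id, mutable.Queue[String]())
        val blocks = queue.toList
        queue.clear() // these blocks now belong to this batch only
        id -> blocks
      }.toMap
      timeToAllocatedBlocks(batchTime) = allocated
    }

    def getBlocksOfBatch(batchTime: Long): Map[Int, Seq[String]] = synchronized {
      timeToAllocatedBlocks.getOrElse(batchTime, Map.empty)
    }
  }
}
```

Each block is assigned to exactly one batch: whatever arrived before a call to allocateBlocksToBatch goes into that batch, and later arrivals wait for the next one.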

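It then calls graph.generateJobs(time), which calls generateJob(time) on every output stream. Because an output stream (ForEachDStream in this example) overrides the generateJob method of DStream, the generateJob in ForEachDStream is the one called here (do not mistake it for the one in DStream). It is implemented as follows: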

  override def generateJob(time: Time): Option[Job] = {
    parent.getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => createRDDWithLocalProperties(time) {
          ssc.sparkContext.setCallSite(creationSite)
          foreachFunc(rdd, time)
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }

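parent.getOrCompute obtains the RDD of the specified batch. Since the call goes through parent.getOrCompute, first look at how the DStreams of the WordCount application are chained:

SocketInputDStream -> FlatMappedDStream -> MappedDStream -> ShuffledDStream -> ForEachDStream

getOrCompute is defined in the DStream base class and is final, so every DStream runs the base implementation; compute, by contrast, is implemented by each subclass. The call recurses up the parent chain until it reaches SocketInputDStream. The base class code is as follows: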

/**
   * Get the RDD corresponding to the given time; either retrieve it from cache
   * or compute-and-cache it.
   */
  private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
    // If RDD was already generated, then retrieve it from HashMap,
    // or else compute the RDD
    generatedRDDs.get(time).orElse {
      // Compute the RDD if time is valid (e.g. correct time in a sliding window)
      // of RDD generation, else generate nothing.
      if (isTimeValid(time)) {

        val rddOption = createRDDWithLocalProperties(time) {
          // Disable checks for existing output directories in jobs launched by the streaming
          // scheduler, since we may need to write output to an existing directory during checkpoint
          // recovery; see SPARK-4835 for more details. We need to have this call here because
          // compute() might cause Spark jobs to be launched.
          PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
            compute(time)
          }
        }

        rddOption.foreach { case newRDD =>
          // Register the generated RDD for caching and checkpointing
          if (storageLevel != StorageLevel.NONE) {
            newRDD.persist(storageLevel)
            logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
          }
          if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
            newRDD.checkpoint()
            logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
          }
          generatedRDDs.put(time, newRDD)
        }
        rddOption
      } else {
        None
      }
    }
  }

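It calls compute to generate the initial RDD of the specified batch (a BlockRDD built from the received data). For SocketInputDStream, compute is inherited from ReceiverInputDStream, and its code is as follows: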

/**
   * Generates RDDs with blocks received by the receiver of this stream. */
  override def compute(validTime: Time): Option[RDD[T]] = {
    val blockRDD = {

      if (validTime < graph.startTime) {
        // If this is called for any time before the start time of the context,
        // then this returns an empty RDD. This may happen when recovering from a
        // driver failure without any write ahead log to recover pre-failure data.
        new BlockRDD[T](ssc.sc, Array.empty)
      } else {
        // Otherwise, ask the tracker for all the blocks that have been allocated to this stream
        // for this batch
        val receiverTracker = ssc.scheduler.receiverTracker
        val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)

        // Register the input blocks information into InputInfoTracker
        val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
        ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

        // Create the BlockRDD
        createBlockRDD(validTime, blockInfos)
      }
    }
    Some(blockRDD)
  }

After the BlockRDD is generated, control returns to the caller one level up in the recursion (FlatMappedDStream), whose code is as follows:

override def compute(validTime: Time): Option[RDD[U]] = {
    parent.getOrCompute(validTime).map(_.flatMap(flatMapFunc))
  }

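After parent.getOrCompute (SocketInputDStream.getOrCompute) returns, it applies its RDD transformation (this is how the RDD graph is built). When it finishes, control returns one more level up for the next RDD transformation, and so on until it is back at the entry point, the ForEachDStream (the output stream).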
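The recursion and the per-batch caching can be reproduced with a toy model. This is a sketch under stated assumptions, not Spark code: `ToyDStream` and its two subclasses are hypothetical, a batch is a plain `Seq` instead of an RDD, and `computeCalls` exists only to make the caching visible.

```scala
import scala.collection.mutable

object GetOrComputeSketch {
  /** Toy DStream: a "batch" is just a Seq[T] instead of an RDD[T]. */
  abstract class ToyDStream[T] {
    private val generated = mutable.HashMap[Long, Seq[T]]()
    var computeCalls = 0

    /** Subclass-specific: how to build the batch for `time`. */
    protected def compute(time: Long): Option[Seq[T]]

    /** Shared by all streams: return the cached batch, or compute and cache it. */
    final def getOrCompute(time: Long): Option[Seq[T]] =
      generated.get(time).orElse {
        computeCalls += 1
        val batch = compute(time)
        batch.foreach(b => generated.put(time, b))
        batch
      }
  }

  /** Source stream: looks the batch up in pre-recorded data. */
  class ToyInputDStream(data: Map[Long, Seq[String]]) extends ToyDStream[String] {
    protected def compute(time: Long): Option[Seq[String]] =
      Some(data.getOrElse(time, Seq.empty))
  }

  /** Derived stream: asks its parent for the batch, then transforms it. */
  class ToyFlatMappedDStream[T, U](parent: ToyDStream[T], f: T => Seq[U]) extends ToyDStream[U] {
    protected def compute(time: Long): Option[Seq[U]] =
      parent.getOrCompute(time).map(_.flatMap(f))
  }
}
```

Asking the derived stream for a batch walks up to the source once; asking again for the same time is answered from the cache, so neither compute method runs a second time.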

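outputStream.generateJob(time) (implemented in ForEachDStream) uses the foreachFunc function (defined in DStream.print) and the current batch to create a Job.

Once the jobs are created, JobGenerator.generateJobs submits them with jobScheduler.submitJobSet.

submitJobSet is implemented as follows: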

def submitJobSet(jobSet: JobSet) {
    if (jobSet.jobs.isEmpty) {
      logInfo("No jobs added for time " + jobSet.time)
    } else {
      listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
      jobSets.put(jobSet.time, jobSet)
      jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
      logInfo("Added jobs for time " + jobSet.time)
    }
  }

Finally, each job in the JobSet is wrapped in a JobHandler and handed to the jobExecutor thread pool, which processes it on one of its threads.

  private class JobHandler(job: Job) extends Runnable with Logging {

    import JobScheduler._

    def run() {
      try {
        val formattedTime = UIUtils.formatBatchTime(
          job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
        val batchUrl = s"/streaming/batchid=${job.time.milliseconds}"
        val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

        ssc.sc.setJobDescription(
          s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
        ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
        ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)

        // We need to assign `eventLoop` to a temp variable. Otherwise, because
        // `JobScheduler.stop(false)` may set `eventLoop` to null when this method is running, then
        // it's possible that when `post` is called, `eventLoop` happens to null.
        var _eventLoop = eventLoop
        if (_eventLoop != null) {
          _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
          // Disable checks for existing output directories in jobs launched by the streaming
          // scheduler, since we may need to write output to an existing directory during checkpoint
          // recovery; see SPARK-4835 for more details.
          PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
            job.run()
          }
          _eventLoop = eventLoop
          if (_eventLoop != null) {
            _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
          }
        } else {
          // JobScheduler has been stopped.
        }
      } finally {
        ssc.sc.setLocalProperty(JobScheduler.BATCH_TIME_PROPERTY_KEY, null)
        ssc.sc.setLocalProperty(JobScheduler.OUTPUT_OP_ID_PROPERTY_KEY, null)
      }
    }
  }

}

The code above is the core of JobHandler. It posts a JobStarted event to the EventLoop, calls the job.run() method, and then posts a JobCompleted event.

 def run() {
    _result = Try(func())
  }

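job.run() executes the function with which the job was created, foreachFunc. The take operation in foreachFunc is an action, so it triggers job submission to Spark, which completes the data processing.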
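Wrapping the call in Try means that an exception thrown by the job body is stored as the job's result instead of escaping from run(). A minimal sketch of that behaviour follows; `ToyJob` and `describe` are hypothetical names used only for illustration.

```scala
import scala.util.{Failure, Success, Try}

object JobRunSketch {
  /** Toy job: wraps a function and records its outcome instead of letting exceptions escape. */
  class ToyJob(val name: String, func: () => Any) {
    private var _result: Try[Any] = null
    def run(): Unit = { _result = Try(func()) }
    def result: Try[Any] = {
      require(_result != null, s"job $name has not run yet")
      _result
    }
  }

  /** Describes the outcome the way a scheduler might log it. */
  def describe(job: ToyJob): String = job.result match {
    case Success(v) => s"${job.name} succeeded with $v"
    case Failure(e) => s"${job.name} failed: ${e.getMessage}"
  }
}
```

The thread that ran the job keeps going after a failure, and whoever inspects the result later can decide how to report the error.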

Article URL: http://www.cnblogs.com/barrenlake/p/4889190.html
