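How large batch jobs affect Structured Streaming jobs
You can do very little with faith, but you can do nothing without it. (Samuel Butler)
Preface
In Spark Streaming, a large batch causes the batches after it to pile up. This article briefly explains what a large batch does to a Structured Streaming job.
The full title of this article would be: once a trigger is set, how a long-running query affects the submit time of the queries that follow it.
Trigger types
There are three trigger types: OneTimeTrigger, ProcessingTime, and ContinuousTrigger. For an explanation of each, see the description of triggers in the article "Spark cluster optimization".
Effect of a long-running query on the submit time of subsequent queries with OneTimeTrigger
OneTimeTrigger runs the query exactly once and then stops, so there are no subsequent batches to affect.
Effect of a long-running query on the submit time of subsequent queries with the ProcessingTime trigger
Setting a sleep longer than the trigger interval
The code is shown in the screenshot below; it adds a sleep to the task running on each partition: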

Result
Screenshots of the run are shown below.
The Jobs tab of the UI:

The SQL tab of the UI:

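The Submitted column in both tabs shows that the submit time of each batch's query is determined by the end time of the query before it.
Effect of a long-running query on the submit time of subsequent queries with ContinuousTrigger
Source code analysis
The rest of this article looks at the behaviour from the source code.
Responsibilities of StreamExecution
StreamExecution is the handle to a query that executes continuously in the background as new data arrives. It manages the execution of a streaming Spark SQL query that runs in a separate thread. Unlike a standard query, a streaming query is executed repeatedly, each time new data arrives at any Source present in the query plan. Whenever new data arrives, a QueryExecution is created and its results are committed transactionally to the given Sink.
It has two subclasses, MicroBatchExecution and ContinuousExecution, as shown in the screenshot below: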

Mapping between Trigger and StreamExecution
The method org.apache.spark.sql.streaming.StreamingQueryManager#createQuery contains the following snippet:
(sink, trigger) match {
case (v2Sink: StreamWriteSupport, trigger: ContinuousTrigger) =>
if (sparkSession.sessionState.conf.isUnsupportedOperationCheckEnabled) {
UnsupportedOperationChecker.checkForContinuous(analyzedPlan, outputMode)
}
// A ContinuousTrigger (with a StreamWriteSupport sink) yields a ContinuousExecution
new StreamingQueryWrapper(new ContinuousExecution(
sparkSession,
userSpecifiedName.orNull,
checkpointLocation,
analyzedPlan,
v2Sink,
trigger,
triggerClock,
outputMode,
extraOptions,
deleteCheckpointOnStop))
case _ =>
// Everything else (ProcessingTime or OneTimeTrigger) yields a MicroBatchExecution
new StreamingQueryWrapper(new MicroBatchExecution(
sparkSession,
userSpecifiedName.orNull,
checkpointLocation,
analyzedPlan,
sink,
trigger,
triggerClock,
outputMode,
extraOptions,
deleteCheckpointOnStop))
}
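As the snippet shows, each Trigger maps to a StreamExecution as follows:
| Trigger | StreamExecution |
|---|---|
| OneTimeTrigger | MicroBatchExecution |
| ProcessingTime | MicroBatchExecution |
| ContinuousTrigger | ContinuousExecution |
In addition, the analyzedPlan constructor argument of StreamExecution is a LogicalPlan, which means the LogicalPlan already exists before the first query starts. As the parameter name suggests, this plan has been analyzed but not yet optimized: the concrete source data that each leaf node of the plan will read is still unknown at this point, so the LogicalPlan cannot be optimized yet.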
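Note
ContinuousExecution currently supports only a limited set of sinks, namely the implementations of StreamWriteSupport:
| sink | class full name |
|---|---|
| console | org.apache.spark.sql.execution.streaming.ConsoleSinkProvider |
| kafka | org.apache.spark.sql.kafka010.KafkaSourceProvider |
| ForeachSink | org.apache.spark.sql.execution.streaming.sources.ForeachWriterProvider |
| MemorySinkV2 | org.apache.spark.sql.execution.streaming.sources.MemorySinkV2 |
With any other sink the match falls through to MicroBatchExecution. When MicroBatchExecution initializes its triggerExecutor member, however, it accepts only ProcessingTime and OneTimeTrigger. ContinuousTrigger is not supported there, so an IllegalStateException ("Unknown type of trigger: ...") is thrown, as shown below: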
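The match in createQuery and the trigger check inside MicroBatchExecution can be condensed into a small stand-alone model. This is only a sketch: the types below (including LegacySink and the string results) are hypothetical stand-ins rather than Spark classes, and the two steps are merged into a single function for brevity.

```scala
object TriggerDispatch {
  // Hypothetical stand-ins for Spark's trigger types.
  sealed trait Trigger
  case object OneTimeTrigger extends Trigger
  final case class ProcessingTime(intervalMs: Long) extends Trigger
  final case class ContinuousTrigger(intervalMs: Long) extends Trigger

  // Hypothetical stand-ins for sinks; StreamWriteSupport marks sinks usable in continuous mode.
  trait Sink
  trait StreamWriteSupport extends Sink
  object ConsoleSink extends StreamWriteSupport
  object LegacySink extends Sink // assumed sink that does not implement StreamWriteSupport

  // First case: the ContinuousExecution branch of createQuery.
  // Second case: the fall-through to MicroBatchExecution, whose triggerExecutor
  // rejects ContinuousTrigger with an IllegalStateException.
  def chooseExecution(sink: Sink, trigger: Trigger): String = (sink, trigger) match {
    case (_: StreamWriteSupport, _: ContinuousTrigger) => "ContinuousExecution"
    case (_, _: ContinuousTrigger) =>
      throw new IllegalStateException(s"Unknown type of trigger: $trigger")
    case _ => "MicroBatchExecution"
  }
}
```

In this model only the combination of a StreamWriteSupport sink and a ContinuousTrigger reaches ContinuousExecution; a ContinuousTrigger paired with any other sink fails instead of silently running in micro-batch mode.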

Execution of StreamExecution
The method org.apache.spark.sql.streaming.StreamingQueryManager#startQuery contains the following snippet:
try {
// When starting a query, it will call `StreamingQueryListener.onQueryStarted` synchronously.
// As it's provided by the user and can run arbitrary codes, we must not hold any lock here.
// Otherwise, it's easy to cause dead-lock, or block too long if the user codes take a long
// time to finish.
query.streamingQuery.start()
} catch {
case e: Throwable =>
activeQueriesLock.synchronized {
activeQueries -= query.id
}
throw e
}
Here query.streamingQuery is the StreamExecution, that is, either a MicroBatchExecution or a ContinuousExecution.
The start method of StreamExecution is as follows:
/**
* Starts the execution. This returns only after the thread has started and [[QueryStartedEvent]]
* has been posted to all the listeners.
*/
def start(): Unit = {
logInfo(s"Starting $prettyIdString. Use $resolvedCheckpointRoot to store the query checkpoint.")
queryExecutionThread.setDaemon(true)
queryExecutionThread.start()
startLatch.await() // Wait until thread started and QueryStart event has been posted
}
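The latch handshake in start() can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical MiniExecution class and a boolean flag that stands in for posting QueryStartedEvent; it is not Spark code.

```scala
import java.util.concurrent.CountDownLatch

// Hypothetical miniature of the start() handshake; all names are illustrative.
class MiniExecution {
  private val startLatch = new CountDownLatch(1)
  @volatile var startedEventPosted = false

  private val worker = new Thread("mini stream execution thread") {
    override def run(): Unit = {
      startedEventPosted = true // stands in for posting QueryStartedEvent
      startLatch.countDown()    // releases the caller blocked in start()
      // the long-running stream loop would follow here
    }
  }

  def start(): Unit = {
    worker.setDaemon(true)
    worker.start()
    startLatch.await() // returns only after the worker thread has signalled
  }
}
```

Because the caller blocks on the latch, start() cannot return before the worker thread has run its start-up step, which matches the comment on the original method.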
The queryExecutionThread member is declared as follows:
/**
* The thread that runs the micro-batches of this stream. Note that this thread must be
* [[org.apache.spark.util.UninterruptibleThread]] to workaround KAFKA-1894: interrupting a
* running `KafkaConsumer` may cause endless loop.
*/
val queryExecutionThread: QueryExecutionThread =
new QueryExecutionThread(s"stream execution thread for $prettyIdString") {
override def run(): Unit = {
// To fix call site like "run at <unknown>:0", we bridge the call site from the caller
// thread to this micro batch thread
sparkSession.sparkContext.setCallSite(callSite)
runStream()
}
}
QueryExecutionThread is a subclass of UninterruptibleThread, which in turn is a subclass of Thread, so QueryExecutionThread is itself a thread class. It runs the runStream method, whose key code is as follows:
try {
// Preparation before the stream query runs: post QueryStartedEvent, count down the start latch, apply streaming configuration, and so on
runActivatedStream(sparkSessionForStream) // run the stream
} catch {
// exception handling
} finally {
// Cleanup after the stream query ends: stop the sources, post the query-terminated event, delete the checkpoint (if deleteCheckpointOnStop is enabled), and so on
}
runActivatedStream is documented as "Run the activated stream until stopped." It is an abstract method that the subclasses implement.
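The relationship between runStream and runActivatedStream is a template method: the base class owns the prepare/run/cleanup skeleton and each subclass supplies only the run step. A minimal sketch, assuming hypothetical MiniStreamExecution and MiniMicroBatch classes and a log buffer added purely for illustration:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical base class: fixes the order of the steps, leaves the run step abstract.
abstract class MiniStreamExecution {
  val log: ArrayBuffer[String] = ArrayBuffer.empty[String]

  protected def runActivatedStream(): Unit

  def runStream(): Unit = {
    try {
      log += "prepare" // stands in for posting the start event and releasing the latch
      runActivatedStream()
    } catch {
      case e: Exception => log += s"failed: ${e.getMessage}"
    } finally {
      log += "cleanup" // stands in for stopping sources and posting the end event
    }
  }
}

// Hypothetical subclass that runs a fixed number of batches and then stops.
class MiniMicroBatch(batches: Int) extends MiniStreamExecution {
  protected def runActivatedStream(): Unit =
    (1 to batches).foreach(i => log += s"batch $i")
}
```

Calling runStream() on a MiniMicroBatch(2) records prepare, batch 1, batch 2, cleanup in that order, regardless of how the subclass implements its run step.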
MicroBatchExecution's implementation of runActivatedStream
The logic of MicroBatchExecution's runActivatedStream can be summarized as follows:
triggerExecutor.execute(() =>{
// submit and run one query (one micro-batch) per invocation
})
triggerExecutor is defined as follows:
private val triggerExecutor = trigger match {
case t: ProcessingTime => ProcessingTimeExecutor(t, triggerClock)
case OneTimeTrigger => OneTimeExecutor()
case _ => throw new IllegalStateException(s"Unknown type of trigger: $trigger")
}
That is, with ProcessingTime a ProcessingTimeExecutor generates the batch queries periodically. Its execute method is as follows:
override def execute(triggerHandler: () => Boolean): Unit = {
while (true) {
val triggerTimeMs = clock.getTimeMillis
val nextTriggerTimeMs = nextBatchTime(triggerTimeMs)
val terminated = !triggerHandler()
if (intervalMs > 0) {
val batchElapsedTimeMs = clock.getTimeMillis - triggerTimeMs
if (batchElapsedTimeMs > intervalMs) {
notifyBatchFallingBehind(batchElapsedTimeMs)
}
if (terminated) {
return
}
clock.waitTillTime(nextTriggerTimeMs)
} else {
if (terminated) {
return
}
}
}
}
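In pseudocode:
def execute(triggerHandler: () => Boolean): Unit = {
while (true) {
read current_time
compute the next batch's start_time from current_time and interval
run the query and obtain the flag that says whether the stream has terminated
if (interval > 0) {
query elapsed time = newly read current_time - old current_time
if (query elapsed time > interval) {
notifyBatchFallingBehind // currently this only logs a warning
}
if (stream terminated flag is true) {
return // leave the while loop and exit the method
}
// Clock.waitTillTime: the SystemClock subclass implements it with while + sleep(ms), the other subclasses with while + wait(ms); the while loop guards against an external interrupt cutting the wait short
wait until the next batch's start_time (returns immediately if that time has already passed)
} else {
if (stream terminated flag is true) {
return // leave the while loop and exit the method
}
}
}
}
So, as long as the stream has not stopped, the next batch is submitted at max(end time of the current batch, start time of the current batch / interval * interval + interval), using integer division. If the current batch ends after the next trigger boundary, which is always the case when it runs longer than the interval, the next batch is submitted as soon as the current one ends; otherwise it is submitted at the next interval boundary.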
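The rule can be checked with a small simulation that replaces the real clock with a counter. This is a sketch under stated assumptions: TriggerTimeline and submitTimes are hypothetical names, nextBatchTime copies the arithmetic implied by the formula for interval > 0, and waitTillTime is modelled as max(now, target).

```scala
object TriggerTimeline {
  // Next interval boundary strictly after nowMs (integer division), for intervalMs > 0.
  def nextBatchTime(nowMs: Long, intervalMs: Long): Long =
    nowMs / intervalMs * intervalMs + intervalMs

  // Given how long each batch runs, return the time at which each batch is submitted.
  def submitTimes(startMs: Long, intervalMs: Long, durationsMs: Seq[Long]): Seq[Long] = {
    var now = startMs
    durationsMs.toVector.map { duration =>
      val triggerTime = now
      val nextTrigger = nextBatchTime(triggerTime, intervalMs)
      now += duration                  // the batch runs
      now = math.max(now, nextTrigger) // waitTillTime: no wait if we are already late
      triggerTime
    }
  }
}
```

With a 5000 ms interval and batch durations of 2000, 12000, 1000 and 1000 ms starting at time 0, the submit times are 0, 5000, 17000 and 20000. The 12-second batch pushes the third batch to the moment it ends (17000), the fourth batch snaps back to the next boundary (20000), and the triggers at 10000 and 15000 are simply skipped rather than queued.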
ContinuousExecution's implementation of runActivatedStream
The source is as follows:
override protected def runActivatedStream(sparkSessionForStream: SparkSession): Unit = {
val stateUpdate = new UnaryOperator[State] {
override def apply(s: State) = s match {
// If we ended the query to reconfigure, reset the state to active.
case RECONFIGURING => ACTIVE
case _ => s
}
}
do {
runContinuous(sparkSessionForStream)
} while (state.updateAndGet(stateUpdate) == ACTIVE)
}
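The restart loop above relies on AtomicReference.updateAndGet to turn RECONFIGURING back into ACTIVE and to leave every other state alone. A minimal stand-alone sketch, assuming a hypothetical ReconfigureLoop object with its own three-value State type and a callback in place of runContinuous:

```scala
import java.util.concurrent.atomic.AtomicReference
import java.util.function.UnaryOperator

object ReconfigureLoop {
  // Hypothetical reduced state set; the real StreamExecution has its own State type.
  sealed trait State
  case object ACTIVE extends State
  case object RECONFIGURING extends State
  case object TERMINATED extends State

  // Calls `body` once per continuous run and returns how many runs took place.
  def runUntilStopped(state: AtomicReference[State])(body: Int => Unit): Int = {
    val stateUpdate = new UnaryOperator[State] {
      override def apply(s: State): State = s match {
        case RECONFIGURING => ACTIVE // a run that ended for reconfiguration is restarted
        case other => other
      }
    }
    var runs = 0
    var keepRunning = true
    while (keepRunning) {
      runs += 1
      body(runs) // stands in for runContinuous(sparkSessionForStream)
      keepRunning = state.updateAndGet(stateUpdate) == ACTIVE
    }
    runs
  }
}
```

If the first run ends with the state set to RECONFIGURING and the second with TERMINATED, the loop performs exactly two runs: the first is restarted, the second ends the query.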
The source of runContinuous is as follows:
/**
* Do a continuous run.
* @param sparkSessionForQuery Isolated [[SparkSession]] to run the continuous query with.
*/
private def runContinuous(sparkSessionForQuery: SparkSession): Unit = {
// A list of attributes that will need to be updated.
val replacements = new ArrayBuffer[(Attribute, Attribute)]
// Translate from continuous relation to the underlying data source.
var nextSourceId = 0
continuousSources = logicalPlan.collect {
case ContinuousExecutionRelation(dataSource, extraReaderOptions, output) =>
val metadataPath = s"$resolvedCheckpointRoot/sources/$nextSourceId"
nextSourceId += 1
dataSource.createContinuousReader(
java.util.Optional.empty[StructType](),
metadataPath,
new DataSourceOptions(extraReaderOptions.asJava))
}
uniqueSources = continuousSources.distinct
val offsets = getStartOffsets(sparkSessionForQuery)
var insertedSourceId = 0
val withNewSources = logicalPlan transform {
case ContinuousExecutionRelation(source, options, output) =>
val reader = continuousSources(insertedSourceId)
insertedSourceId += 1
val newOutput = reader.readSchema().toAttributes
assert(output.size == newOutput.size,
s"Invalid reader: ${Utils.truncatedString(output, ",")} != " +
s"${Utils.truncatedString(newOutput, ",")}")
replacements ++= output.zip(newOutput)
val loggedOffset = offsets.offsets(0)
val realOffset = loggedOffset.map(off => reader.deserializeOffset(off.json))
reader.setStartOffset(java.util.Optional.ofNullable(realOffset.orNull))
StreamingDataSourceV2Relation(newOutput, source, options, reader)
}
// Rewire the plan to use the new attributes that were returned by the source.
val replacementMap = AttributeMap(replacements)
val triggerLogicalPlan = withNewSources transformAllExpressions {
case a: Attribute if replacementMap.contains(a) =>
replacementMap(a).withMetadata(a.metadata)
case (_: CurrentTimestamp | _: CurrentDate) =>
throw new IllegalStateException(
"CurrentTimestamp and CurrentDate not yet supported for continuous processing")
}
val writer = sink.createStreamWriter(
s"$runId",
triggerLogicalPlan.schema,
outputMode,
new DataSourceOptions(extraOptions.asJava))
val withSink = WriteToContinuousDataSource(writer, triggerLogicalPlan)
val reader = withSink.collect {
case StreamingDataSourceV2Relation(_, _, _, r: ContinuousReader) => r
}.head
reportTimeTaken("queryPlanning") {
lastExecution = new IncrementalExecution(
sparkSessionForQuery,
withSink,
outputMode,
checkpointFile("state"),
runId,
currentBatchId,
offsetSeqMetadata)
lastExecution.executedPlan // Force the lazy generation of execution plan
}
sparkSessionForQuery.sparkContext.setLocalProperty(
StreamExecution.IS_CONTINUOUS_PROCESSING, true.toString)
sparkSessionForQuery.sparkContext.setLocalProperty(
ContinuousExecution.START_EPOCH_KEY, currentBatchId.toString)
// Add another random ID on top of the run ID, to distinguish epoch coordinators across
// reconfigurations.
val epochCoordinatorId = s"$runId--${UUID.randomUUID}"
currentEpochCoordinatorId = epochCoordinatorId
sparkSessionForQuery.sparkContext.setLocalProperty(
ContinuousExecution.EPOCH_COORDINATOR_ID_KEY, epochCoordinatorId)
sparkSessionForQuery.sparkContext.setLocalProperty(
ContinuousExecution.EPOCH_INTERVAL_KEY,
trigger.asInstanceOf[ContinuousTrigger].intervalMs.toString)
// Use the parent Spark session for the endpoint since it's where this query ID is registered.
val epochEndpoint =
EpochCoordinatorRef.create(
writer, reader, this, epochCoordinatorId, currentBatchId, sparkSession, SparkEnv.get)
val epochUpdateThread = new Thread(new Runnable {
override def run: Unit = {
try {
triggerExecutor.execute(() => {
startTrigger()
if (reader.needsReconfiguration() && state.compareAndSet(ACTIVE, RECONFIGURING)) {
if (queryExecutionThread.isAlive) {
queryExecutionThread.interrupt()
}
false
} else if (isActive) {
currentBatchId = epochEndpoint.askSync[Long](IncrementAndGetEpoch)
logInfo(s"New epoch $currentBatchId is starting.")
true
} else {
false
}
})
} catch {
case _: InterruptedException =>
// Cleanly stop the query.
return
}
}
}, s"epoch update thread for $prettyIdString")
try {
epochUpdateThread.setDaemon(true)
epochUpdateThread.start()
reportTimeTaken("runContinuous") {
SQLExecution.withNewExecutionId(sparkSessionForQuery, lastExecution) {
lastExecution.executedPlan.execute()
}
}
} catch {
case t: Throwable
if StreamExecution.isInterruptionException(t) && state.get() == RECONFIGURING =>
logInfo(s"Query $id ignoring exception from reconfiguring: $t")
// interrupted by reconfiguration - swallow exception so we can restart the query
} finally {
// The above execution may finish before getting interrupted, for example, a Spark job having
// 0 partitions will complete immediately. Then the interrupted status will sneak here.
//
// To handle this case, we do the two things here:
//
// 1. Clean up the resources in `queryExecutionThread.runUninterruptibly`. This may increase
// the waiting time of `stop` but should be minor because the operations here are very fast
// (just sending an RPC message in the same process and stopping a very simple thread).
// 2. Clear the interrupted status at the end so that it won't impact the `runContinuous`
// call. We may clear the interrupted status set by `stop`, but it doesn't affect the query
// termination because `runActivatedStream` will check `state` and exit accordingly.
queryExecutionThread.runUninterruptibly {
try {
epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)
} finally {
SparkEnv.get.rpcEnv.stop(epochEndpoint)
epochUpdateThread.interrupt()
epochUpdateThread.join()
stopSources()
// The following line must be the last line because it may fail if SparkContext is stopped
sparkSession.sparkContext.cancelJobGroup(runId.toString)
}
}
Thread.interrupted()
}
}
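// TODO: pseudocode, to be written up later
Summary
In short, once a trigger is set, a long-running batch in Structured Streaming does not cause subsequent batches to pile up, but it does delay their submit time.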