Spark Streaming Performance Tuning Series: How to Obtain and Keep Using Enough Cluster Computing Resources

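I: The impact of data peaks

1. The input rate is genuinely unstable; for example, traffic is especially heavy in the evening.

2. Pauses during processing, such as GC, add to the delay.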

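II: Backpressure: the data backpressure mechanism

Basic idea: use information about the previously completed job to estimate the rate at which data should be received for the next job.

How is the rate at which Spark receives data limited?

When receiving data, Spark Streaming must finish handling the current record before it can accept the next one, so slowing down that step throttles ingestion.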
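From the application side, backpressure is normally switched on through configuration. The fragment below is a minimal sketch, assuming the spark-core dependency is on the classpath: `spark.streaming.backpressure.enabled` turns on the feedback loop, and `spark.streaming.receiver.maxRate` is the static upper bound that RateLimiter.updateRate checks later in this walkthrough. The object name, application name and the value 10000 are placeholders, not recommendations.

```scala
import org.apache.spark.SparkConf

object BackpressureConfSketch {
  // Placeholder values; tune them for your own cluster.
  val conf: SparkConf = new SparkConf()
    .setAppName("backpressure-sketch")                   // hypothetical application name
    .set("spark.streaming.backpressure.enabled", "true") // enable rate feedback
    .set("spark.streaming.receiver.maxRate", "10000")    // hard cap, records/sec per receiver
}
```

With both keys set, the dynamically estimated rate is used, but it never exceeds the configured maximum.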

Source code walkthrough

RateController:

1. RateController is a listener: it extends StreamingListener.

/**
 * A StreamingListener that receives batch completion updates, and maintains
 * an estimate of the speed at which this stream should ingest messages,
 * given an estimate computation from a `RateEstimator`
 */
private[streaming] abstract class RateController(val streamUID: Int, rateEstimator: RateEstimator)
    extends StreamingListener with Serializable {

So when is RateController invoked?

Backpressure estimates the receiving rate for the next job from information about the previous one, so the hook has to be in JobScheduler.

1. In JobScheduler's start method, each rateController is obtained from its input stream and registered as a streaming listener.

// attach rate controllers of input streams to receive batch completion updates
for {
  inputDStream <- ssc.graph.getInputStreams
  rateController <- inputDStream.rateController
} ssc.addStreamingListener(rateController)
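The second generator in that for-comprehension iterates over an Option, so input streams without a rate controller are skipped silently. A self-contained sketch of the same pattern follows; FakeStream, FakeController and controllersToRegister are hypothetical stand-ins, not Spark classes.

```scala
import scala.collection.mutable.ArrayBuffer

object RegisterControllersSketch {
  // Hypothetical stand-ins for RateController and InputDStream.
  final case class FakeController(streamId: Int)
  final case class FakeStream(id: Int, rateController: Option[FakeController])

  /** Collects the controllers that would be registered as listeners. */
  def controllersToRegister(streams: Seq[FakeStream]): Seq[FakeController] = {
    val registered = ArrayBuffer.empty[FakeController]
    for {
      stream <- streams
      controller <- stream.rateController // a None contributes nothing
    } registered += controller
    registered.toSeq
  }
}
```

Only the streams whose rateController is defined end up in the result, which mirrors why only some input streams take part in backpressure.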

2. The listener is then added to the listenerBus.

/** Add a [[org.apache.spark.streaming.scheduler.StreamingListener]] object for
  * receiving system events related to streaming.
  */
def addStreamingListener(streamingListener: StreamingListener) {
  scheduler.listenerBus.addListener(streamingListener)
}

3. The StreamingListenerBus source is as follows:

/** Asynchronously passes StreamingListenerEvents to registered StreamingListeners. */
private[spark] class StreamingListenerBus
  extends AsynchronousListenerBus[StreamingListener, StreamingListenerEvent]("StreamingListenerBus")
  with Logging {

  private val logDroppedEvent = new AtomicBoolean(false)

  override def onPostEvent(listener: StreamingListener, event: StreamingListenerEvent): Unit = {
    event match {
      case receiverStarted: StreamingListenerReceiverStarted =>
        listener.onReceiverStarted(receiverStarted)
      case receiverError: StreamingListenerReceiverError =>
        listener.onReceiverError(receiverError)
      case receiverStopped: StreamingListenerReceiverStopped =>
        listener.onReceiverStopped(receiverStopped)
      case batchSubmitted: StreamingListenerBatchSubmitted =>
        listener.onBatchSubmitted(batchSubmitted)
      case batchStarted: StreamingListenerBatchStarted =>
        listener.onBatchStarted(batchStarted)
      case batchCompleted: StreamingListenerBatchCompleted =>
        listener.onBatchCompleted(batchCompleted)
      // ... remaining cases omitted

4. RateController implements onBatchCompleted.

5. The implementation of onBatchCompleted in RateController is as follows:

override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) {
  val elements = batchCompleted.batchInfo.streamIdToInputInfo
  for {
    processingEnd <- batchCompleted.batchInfo.processingEndTime
    workDelay <- batchCompleted.batchInfo.processingDelay
    waitDelay <- batchCompleted.batchInfo.schedulingDelay
    elems <- elements.get(streamUID).map(_.numRecords)
  } computeAndPublish(processingEnd, elems, workDelay, waitDelay)
}
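Every generator in onBatchCompleted is an Option, so computeAndPublish runs only when all four values are present. The sketch below reproduces that behaviour with a hypothetical, trimmed-down FakeBatchInfo; the field names are assumptions chosen to echo the excerpt, and the map holds record counts directly rather than input-info objects.

```scala
object BatchInfoSketch {
  // Hypothetical, simplified stand-in for the batch info carried by the event.
  final case class FakeBatchInfo(
      processingEndTime: Option[Long],
      processingDelay: Option[Long],
      schedulingDelay: Option[Long],
      streamIdToNumRecords: Map[Int, Long])

  /** Returns the arguments that would reach computeAndPublish, if all are available. */
  def estimatorInputs(info: FakeBatchInfo, streamUID: Int): Option[(Long, Long, Long, Long)] =
    for {
      processingEnd <- info.processingEndTime
      workDelay <- info.processingDelay
      waitDelay <- info.schedulingDelay
      elems <- info.streamIdToNumRecords.get(streamUID)
    } yield (processingEnd, elems, workDelay, waitDelay)
}
```

If any field is missing, or the batch carries no record count for this stream, the result is None and no new rate is computed for that batch.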

6. The source of computeAndPublish in RateController is as follows:

/**
 * Compute the new rate limit and publish it asynchronously.
 */
private def computeAndPublish(time: Long, elems: Long, workDelay: Long, waitDelay: Long): Unit =
  Future[Unit] {
    // Estimate a new, more suitable rate.
    val newRate = rateEstimator.compute(time, elems, workDelay, waitDelay)
    newRate.foreach { s =>
      rateLimit.set(s.toLong)
      publish(getLatestRate())
    }
  }
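To make the estimation step concrete, here is a deliberately simplified estimator with the same parameter list as the rateEstimator.compute call above. It is an assumption-laden sketch, not Spark's estimator: it proposes the throughput the last batch achieved (records per second) and ignores the scheduling delay, whereas the estimator Spark ships is more elaborate. The object name and the update helper are hypothetical.

```scala
import java.util.concurrent.atomic.AtomicLong

object SimpleRateEstimatorSketch {
  // Latest published limit; -1 means "no estimate yet" in this sketch.
  val rateLimit = new AtomicLong(-1L)

  /** Proposes records/sec based on the last batch; None when inputs are unusable. */
  def compute(time: Long, elements: Long, processingDelay: Long, schedulingDelay: Long): Option[Double] =
    if (elements > 0 && processingDelay > 0) {
      // processingDelay is in milliseconds, so scale to records per second.
      Some(elements.toDouble / processingDelay * 1000.0)
    } else {
      None
    }

  /** Mirrors the newRate.foreach step: store the estimate only when one exists. */
  def update(time: Long, elements: Long, processingDelay: Long, schedulingDelay: Long): Unit =
    compute(time, elements, processingDelay, schedulingDelay).foreach(s => rateLimit.set(s.toLong))
}
```

A batch of 2000 records processed in 4000 ms yields a proposed limit of 500 records per second; an empty batch leaves the previous limit untouched.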

7. publish is implemented in ReceiverRateController.

8. publish sends the new rate to the ReceiverTracker.

/**
 * A RateController that sends the new rate to receivers, via the receiver tracker.
 */
private[streaming] class ReceiverRateController(id: Int, estimator: RateEstimator)
    extends RateController(id, estimator) {
  // There may be many RateControllers, so each update carries the id of its stream.
  override def publish(rate: Long): Unit =
    ssc.scheduler.receiverTracker.sendRateUpdate(id, rate)
}

9. The source of sendRateUpdate in ReceiverTracker is as follows. Here endpoint is the ReceiverTrackerEndpoint.

/** Update a receiver's maximum ingestion rate */
def sendRateUpdate(streamUID: Int, newRate: Long): Unit = synchronized {
  if (isTrackerStarted) {
    endpoint.send(UpdateReceiverRateLimit(streamUID, newRate))
  }
}

10. The receive method of ReceiverTrackerEndpoint then gets the message.

case UpdateReceiverRateLimit(streamUID, newRate) =>
  // Look up the tracking info in receiverTrackingInfos, then take the RPC handle from its endpoint field.
  // Here endpoint refers to the RPC endpoint of the ReceiverSupervisor.
  for (info <- receiverTrackingInfos.get(streamUID); eP <- info.endpoint) {
    eP.send(UpdateRateLimit(newRate))
  }

11. ReceiverSupervisorImpl therefore receives the message sent by the ReceiverTracker.

/** RpcEndpointRef for receiving messages from the ReceiverTracker in the driver */
private val endpoint = env.rpcEnv.setupEndpoint(
  "Receiver-" + streamId + "-" + System.currentTimeMillis(), new ThreadSafeRpcEndpoint {
    override val rpcEnv: RpcEnv = env.rpcEnv

    override def receive: PartialFunction[Any, Unit] = {
      case StopReceiver =>
        logInfo("Received stop signal")
        ReceiverSupervisorImpl.this.stop("Stopped by driver", None)
      case CleanupOldBlocks(threshTime) =>
        logDebug("Received delete old batch signal")
        cleanupOldBlocks(threshTime)
      case UpdateRateLimit(eps) =>
        logInfo(s"Received a new rate limit: $eps.")
        registeredBlockGenerators.foreach { bg =>
          bg.updateRate(eps)
        }
    }
  })

12. The source of updateRate in RateLimiter is as follows:

// There is an upper limit here because the processing capacity of your cluster is finite.
// Spark Streaming may be running on YARN; if several computing frameworks share the cluster,
// resources are even more limited.
/**
 * Set the rate limit to `newRate`. The new rate will not exceed the maximum rate configured by
 * {{{spark.streaming.receiver.maxRate}}}, even if `newRate` is higher than that.
 *
 * @param newRate A new rate in events per second. It has no effect if it's 0 or negative.
 */
private[receiver] def updateRate(newRate: Long): Unit =
  if (newRate > 0) {
    if (maxRateLimit > 0) {
      rateLimiter.setRate(newRate.min(maxRateLimit))
    } else {
      rateLimiter.setRate(newRate)
    }
  }
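The clamping rule in updateRate can be isolated as a pure function, which makes the three cases easy to check. This is a sketch of the decision logic only; effectiveRate and RateClampSketch are hypothetical names, and the actual limiter call is left out.

```scala
object RateClampSketch {
  /** Returns the rate that would be applied, or None when the update is ignored. */
  def effectiveRate(newRate: Long, maxRateLimit: Long): Option[Long] =
    if (newRate <= 0) {
      None // zero or negative rates have no effect
    } else if (maxRateLimit > 0) {
      Some(math.min(newRate, maxRateLimit)) // never exceed the configured maximum
    } else {
      Some(newRate) // no maximum configured
    }
}
```

For example, with a maximum of 1000, a proposed rate of 5000 is capped at 1000, a proposed rate of 500 is applied as is, and a proposed rate of 0 is ignored.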

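Overall flow:

JobScheduler.start registers each RateController on the StreamingListenerBus -> RateController.onBatchCompleted -> computeAndPublish -> ReceiverRateController.publish -> ReceiverTracker.sendRateUpdate -> ReceiverTrackerEndpoint -> ReceiverSupervisorImpl -> RateLimiter.updateRate (through each registered block generator)

Summary:

Each time the job for a batch duration finishes, batch-completion information (a StreamingListenerBatchCompleted event) is reported. A new rate is computed from that information and sent over RPC to the executors, where the receiver resets its rate limit to the new value.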
