15. Spark Streaming Source Code Walkthrough: A Thorough Look at No Receivers
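Earlier articles in this series walked through the source code of receiver-based Spark Streaming applications. Increasingly, however, Spark Streaming applications are written in the No Receivers (Direct Approach) style. Its advantages are:
1. Greater freedom of control
2. Semantic consistency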
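One way to picture the semantic-consistency point: because each batch has known start and end offsets, a sink can store its results together with the last offset it has applied, and then refuse a replayed batch. The following is a toy model in plain Scala under that assumption. `BatchRange` and `TransactionalSink` are hypothetical names for illustration and are not part of Spark or Kafka.

```scala
// Toy model only: BatchRange and TransactionalSink are hypothetical, not Spark/Kafka API.
final case class BatchRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

// A sink that keeps word counts together with the last applied offset per partition.
final class TransactionalSink {
  private var counts = Map.empty[String, Long]
  private var committed = Map.empty[(String, Int), Long]

  // Apply a batch only if it starts exactly where the previous one ended.
  // Returns true when the batch was applied, false when it was rejected (e.g. a replay).
  def write(range: BatchRange, words: Seq[String]): Boolean = synchronized {
    val key = (range.topic, range.partition)
    val expected = committed.getOrElse(key, 0L)
    if (range.fromOffset != expected) {
      false
    } else {
      words.foreach { w => counts = counts.updated(w, counts.getOrElse(w, 0L) + 1L) }
      committed = committed.updated(key, range.untilOffset)
      true
    }
  }

  def count(word: String): Long = synchronized(counts.getOrElse(word, 0L))

  def committedOffset(topic: String, partition: Int): Long =
    synchronized(committed.getOrElse((topic, partition), 0L))
}
```

If the same batch is delivered twice after a failure, the second `write` is rejected, so the counts reflect each message once. A real sink would need to make the result update and the offset update a single atomic transaction in its own storage.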
Spark's bundled DirectKafkaWordCount example shows the direct approach in use:

import kafka.serializer.StringDecoder

import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._

object DirectKafkaWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println(s"""
        |Usage: DirectKafkaWordCount <brokers> <topics>
        |  <brokers> is a list of one or more Kafka brokers
        |  <topics> is a list of one or more kafka topics to consume from
        |
        """.stripMargin)
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    val Array(brokers, topics) = args

    // Create context with 2 second batch interval
    val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create direct kafka stream with brokers and topics
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    // Get the lines, split them into words, count the words and print
    val lines = messages.map(_._2)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
    wordCounts.print()

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}
The direct approach is built on KafkaRDD:

/**
 * A batch-oriented interface for consuming from Kafka.
 * Starting and ending offsets are specified in advance,
 * so that you can control exactly-once semantics.
 * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
 * configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers" to be set
 * with Kafka broker(s) specified in host1:port1,host2:port2 form.
 * @param offsetRanges offset ranges that define the Kafka data belonging to this RDD
 * @param messageHandler function for translating each message into the desired type
 */
private[kafka]
class KafkaRDD[
  K: ClassTag,
  V: ClassTag,
  U <: Decoder[_]: ClassTag,
  T <: Decoder[_]: ClassTag,
  R: ClassTag] private[spark] (
    sc: SparkContext,
    kafkaParams: Map[String, String],
    val offsetRanges: Array[OffsetRange], // offset ranges of the data in this RDD
    leaders: Map[TopicAndPartition, (String, Int)],
    messageHandler: MessageAndMetadata[K, V] => R
  ) extends RDD[R](sc, Nil) with Logging with HasOffsetRanges
trait HasOffsetRanges {
  def offsetRanges: Array[OffsetRange]
}
final class OffsetRange private(
    val topic: String,
    val partition: Int,
    val fromOffset: Long,
    val untilOffset: Long) extends Serializable
override def getPartitions: Array[Partition] = {
  offsetRanges.zipWithIndex.map { case (o, i) =>
    val (host, port) = leaders(TopicAndPartition(o.topic, o.partition))
    new KafkaRDDPartition(i, o.topic, o.partition, o.fromOffset, o.untilOffset, host, port)
  }.toArray
}
private[kafka]
class KafkaRDDPartition(
  val index: Int,
  val topic: String,
  val partition: Int,
  val fromOffset: Long,
  val untilOffset: Long,
  val host: String,
  val port: Int
) extends Partition {
  /** Number of messages this partition refers to */
  def count(): Long = untilOffset - fromOffset
}
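The mapping from offset ranges to partitions shown in getPartitions can be tried out without a cluster. The sketch below mirrors that logic with plain Scala stand-ins. `SimpleOffsetRange`, `SimplePartition` and `PartitionPlanner` are hypothetical names, and topic-partitions are modelled as `(String, Int)` tuples rather than Kafka's `TopicAndPartition`.

```scala
// Plain-Scala stand-ins for OffsetRange and KafkaRDDPartition (hypothetical names).
final case class SimpleOffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

final case class SimplePartition(
    index: Int,
    topic: String,
    partition: Int,
    fromOffset: Long,
    untilOffset: Long,
    host: String,
    port: Int) {
  // Number of messages this partition refers to.
  def count: Long = untilOffset - fromOffset
}

object PartitionPlanner {
  // One partition per offset range, indexed by position, annotated with the leader's host and port.
  def plan(
      ranges: Seq[SimpleOffsetRange],
      leaders: Map[(String, Int), (String, Int)]): Seq[SimplePartition] =
    ranges.zipWithIndex.map { case (o, i) =>
      val (host, port) = leaders((o.topic, o.partition))
      SimplePartition(i, o.topic, o.partition, o.fromOffset, o.untilOffset, host, port)
    }
}
```

The point to notice is that the number of partitions equals the number of offset ranges, so each Kafka topic-partition in the batch corresponds to exactly one partition of the resulting RDD.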
KafkaRDDPartition describes precisely where the data lives. The data of each KafkaRDDPartition is handed to KafkaRDD's compute method:
override def compute(thePart: Partition, context: TaskContext): Iterator[R] = {
  val part = thePart.asInstanceOf[KafkaRDDPartition]
  assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
  if (part.fromOffset == part.untilOffset) {
    log.info(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
      s"skipping ${part.topic} ${part.partition}")
    Iterator.empty
  } else {
    new KafkaRDDIterator(part, context)
  }
}
private class KafkaRDDIterator(
    part: KafkaRDDPartition,
    context: TaskContext) extends NextIterator[R] {

  context.addTaskCompletionListener{ context => closeIfNeeded() }

  log.info(s"Computing topic ${part.topic}, partition ${part.partition} " +
    s"offsets ${part.fromOffset} -> ${part.untilOffset}")

  val kc = new KafkaCluster(kafkaParams)
  val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
    .newInstance(kc.config.props)
    .asInstanceOf[Decoder[K]]
  val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
    .newInstance(kc.config.props)
    .asInstanceOf[Decoder[V]]
  val consumer = connectLeader
  var requestOffset = part.fromOffset
  var iter: Iterator[MessageAndOffset] = null

  // ..................
}
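To make the idea of reading a fixed offset range concrete, here is a toy model that reads offsets from an in-memory log in bounded fetches. It is not the elided body of KafkaRDDIterator and does not talk to Kafka. `BoundedLogReader` and its `fetchSize` parameter are hypothetical, and the log is just an `IndexedSeq[String]` whose index plays the role of the offset.

```scala
// Toy model only: reads [fromOffset, untilOffset) from an in-memory log.
object BoundedLogReader {
  def read(
      log: IndexedSeq[String],
      fromOffset: Long,
      untilOffset: Long,
      fetchSize: Int): Iterator[(Long, String)] = {
    require(fromOffset <= untilOffset, "beginning offset must not be after ending offset")
    require(untilOffset <= log.length, "ending offset is past the end of the log")
    require(fetchSize > 0, "fetchSize must be positive")

    if (fromOffset == untilOffset) {
      // Same shortcut as compute: an empty range yields an empty iterator.
      Iterator.empty
    } else {
      new Iterator[(Long, String)] {
        private var requestOffset = fromOffset
        private var buffer: Iterator[(Long, String)] = Iterator.empty

        // Fetch at most fetchSize messages, never going past untilOffset.
        private def fetch(): Unit = {
          val end = math.min(requestOffset + fetchSize, untilOffset)
          buffer = (requestOffset until end).iterator.map(o => (o, log(o.toInt)))
        }

        def hasNext: Boolean = {
          if (!buffer.hasNext && requestOffset < untilOffset) fetch()
          buffer.hasNext
        }

        def next(): (Long, String) = {
          if (!hasNext) throw new NoSuchElementException("end of offset range")
          val item = buffer.next()
          requestOffset = item._1 + 1
          item
        }
      }
    }
  }
}
```

The iterator stops on its own once `requestOffset` reaches `untilOffset`, which is what lets a task process exactly the range its partition describes and nothing more.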
The example creates its input stream by calling KafkaUtils.createDirectStream:

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)

which is defined as:

def createDirectStream[
  K: ClassTag,
  V: ClassTag,
  KD <: Decoder[K]: ClassTag,
  VD <: Decoder[V]: ClassTag] (
    ssc: StreamingContext,
    kafkaParams: Map[String, String],
    topics: Set[String]
): InputDStream[(K, V)] = {
  val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
  // Create the KafkaCluster object
  val kc = new KafkaCluster(kafkaParams)
  // Use the information in kc to obtain the starting offsets
  val fromOffsets = getFromOffsets(kc, kafkaParams, topics)
  new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
    ssc, kafkaParams, fromOffsets, messageHandler)
}
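The body of getFromOffsets is not shown in this article. As far as I recall from the Spark 1.x source, it consults the `auto.offset.reset` Kafka parameter and starts from the earliest offsets when the value is `smallest`, otherwise from the latest. Treat that as an assumption to verify against your Spark version. The sketch below models only that decision, with hypothetical names (`StartingOffsets`, `choose`) and precomputed offset maps in place of calls to a broker.

```scala
// Sketch of the assumed starting-offset decision; not the real getFromOffsets.
object StartingOffsets {
  type TopicPartition = (String, Int)

  // earliest/latest stand in for what would be looked up from the Kafka leaders.
  def choose(
      kafkaParams: Map[String, String],
      earliest: Map[TopicPartition, Long],
      latest: Map[TopicPartition, Long]): Map[TopicPartition, Long] = {
    val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase)
    if (reset.contains("smallest")) earliest else latest
  }
}
```

Under this assumption, an application that does not set `auto.offset.reset` only sees messages that arrive after it starts.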
For each batch interval, the compute method of DirectKafkaInputDStream builds the KafkaRDD for that batch:

override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
  // Compute the latest ending offsets for this batch
  val untilOffsets = clamp(latestLeaderOffsets(maxRetries))
  // Create the KafkaRDD from the starting and ending offsets
  val rdd = KafkaRDD[K, V, U, T, R](
    context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)

  // Report the record number and metadata of this batch interval to InputInfoTracker.
  val offsetRanges = currentOffsets.map { case (tp, fo) =>
    val uo = untilOffsets(tp)
    OffsetRange(tp.topic, tp.partition, fo, uo.offset)
  }
  val description = offsetRanges.filter { offsetRange =>
    // Don't display empty ranges.
    offsetRange.fromOffset != offsetRange.untilOffset
  }.map { offsetRange =>
    s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" +
      s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
  }.mkString("\n")
  // Copy offsetRanges to immutable.List to prevent from being modified by the user
  val metadata = Map(
    "offsets" -> offsetRanges.toList,
    StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
  val inputInfo = StreamInputInfo(id, rdd.count, metadata)
  ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

  currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
  Some(rdd)
}
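The bookkeeping in compute can be modelled in a few lines: each batch covers the range from the current offsets up to the (possibly clamped) latest offsets, and the ending offsets become the next batch's starting offsets. The sketch below is a plain-Scala model with hypothetical names (`BatchPlanner`, `Planned`, `maxPerPartition`). The clamping rule, which caps each partition at its current offset plus a maximum message count, is my understanding of what clamp does for rate limiting and should be read as an assumption, since its body is not shown here.

```scala
// Plain-Scala model of per-batch offset bookkeeping; not the Spark implementation.
object BatchPlanner {
  type TopicPartition = (String, Int)

  final case class Planned(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

  // Assumed clamp rule: never read more than maxPerPartition messages per partition per batch.
  def clamp(
      current: Map[TopicPartition, Long],
      latest: Map[TopicPartition, Long],
      maxPerPartition: Option[Long]): Map[TopicPartition, Long] =
    maxPerPartition match {
      case Some(max) => latest.map { case (tp, lo) => tp -> math.min(current(tp) + max, lo) }
      case None => latest
    }

  // Returns the ranges for this batch and the starting offsets for the next batch.
  def nextBatch(
      current: Map[TopicPartition, Long],
      latest: Map[TopicPartition, Long],
      maxPerPartition: Option[Long]): (Seq[Planned], Map[TopicPartition, Long]) = {
    val until = clamp(current, latest, maxPerPartition)
    val ranges = current.toSeq.sortBy(_._1).map { case ((topic, partition), from) =>
      Planned(topic, partition, from, until((topic, partition)))
    }
    (ranges, until)
  }
}
```

Feeding the returned map back in as `current` for the next call reproduces the `currentOffsets = untilOffsets...` step at the end of compute: consecutive batches cover adjacent, non-overlapping offset ranges.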