Source code analysis of the spark-streaming-kafka package
Please credit the original when reposting: http://www.cnblogs.com/dongxiao-yang/p/5443789.html
Recently, colleagues using Spark Streaming needed to connect to our department's internal Kafka cluster. Because the official spark-streaming-kafka package cannot work with the permission system of the company's existing Kafka cluster, I had to study the original code of spark-streaming-kafka in order to adapt it. The code examined in this article is the v1.6.1 tag of Spark on GitHub.
In the official JavaKafkaWordCount and KafkaWordCount examples, the calls that create the Kafka streaming consumer DStream are, respectively:
// Java (JavaKafkaWordCount)
JavaPairReceiverInputDStream<String, String> messages =
    KafkaUtils.createStream(jssc, args[0], args[1], topicMap);

// Scala (KafkaWordCount)
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
As you can see, both the Java and the Scala examples call one of the overloaded createStream methods in KafkaUtils.
object KafkaUtils {
  /**
   * Create an input stream that pulls messages from Kafka Brokers.
   * @param ssc StreamingContext object
   * @param zkQuorum Zookeeper quorum (hostname:port,hostname:port,..)
   * @param groupId The group id for this consumer
   * @param topics Map of (topic_name -> numPartitions) to consume. Each partition is consumed
   *               in its own thread
   * @param storageLevel Storage level to use for storing the received objects
   *                     (default: StorageLevel.MEMORY_AND_DISK_SER_2)
   * @return DStream of (Kafka message key, Kafka message value)
   */
  def createStream(
      ssc: StreamingContext,
      zkQuorum: String,
      groupId: String,
      topics: Map[String, Int],
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
    ): ReceiverInputDStream[(String, String)] = {
    val kafkaParams = Map[String, String](
      "zookeeper.connect" -> zkQuorum, "group.id" -> groupId,
      "zookeeper.connection.timeout.ms" -> "10000")
    createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics, storageLevel)
  }

  /**
   * Create an input stream that pulls messages from Kafka Brokers.
   * @param ssc StreamingContext object
   * @param kafkaParams Map of kafka configuration parameters,
   *                    see http://kafka.apache.org/08/configuration.html
   * @param topics Map of (topic_name -> numPartitions) to consume. Each partition is consumed
   *               in its own thread.
   * @param storageLevel Storage level to use for storing the received objects
   * @tparam K type of Kafka message key
   * @tparam V type of Kafka message value
   * @tparam U type of Kafka message key decoder
   * @tparam T type of Kafka message value decoder
   * @return DStream of (Kafka message key, Kafka message value)
   */
  def createStream[K: ClassTag, V: ClassTag, U <: Decoder[_]: ClassTag, T <: Decoder[_]: ClassTag](
      ssc: StreamingContext,
      kafkaParams: Map[String, String],
      topics: Map[String, Int],
      storageLevel: StorageLevel
    ): ReceiverInputDStream[(K, V)] = {
    val walEnabled = WriteAheadLogUtils.enableReceiverLog(ssc.conf)
    new KafkaInputDStream[K, V, U, T](ssc, kafkaParams, topics, walEnabled, storageLevel)
  }

  /**
   * Create an input stream that pulls messages from Kafka Brokers.
   * Storage level of the data will be the default StorageLevel.MEMORY_AND_DISK_SER_2.
   * @param jssc JavaStreamingContext object
   * @param zkQuorum Zookeeper quorum (hostname:port,hostname:port,..)
   * @param groupId The group id for this consumer
   * @param topics Map of (topic_name -> numPartitions) to consume. Each partition is consumed
   *               in its own thread
   * @return DStream of (Kafka message key, Kafka message value)
   */
  def createStream(
      jssc: JavaStreamingContext,
      zkQuorum: String,
      groupId: String,
      topics: JMap[String, JInt]
    ): JavaPairReceiverInputDStream[String, String] = {
    createStream(jssc.ssc, zkQuorum, groupId, Map(topics.asScala.mapValues(_.intValue()).toSeq: _*))
  }

  /**
   * Create an input stream that pulls messages from Kafka Brokers.
   * @param jssc JavaStreamingContext object
   * @param zkQuorum Zookeeper quorum (hostname:port,hostname:port,..).
   * @param groupId The group id for this consumer.
   * @param topics Map of (topic_name -> numPartitions) to consume. Each partition is consumed
   *               in its own thread.
   * @param storageLevel RDD storage level.
   * @return DStream of (Kafka message key, Kafka message value)
   */
  def createStream(
      jssc: JavaStreamingContext,
      zkQuorum: String,
      groupId: String,
      topics: JMap[String, JInt],
      storageLevel: StorageLevel
    ): JavaPairReceiverInputDStream[String, String] = {
    createStream(jssc.ssc, zkQuorum, groupId, Map(topics.asScala.mapValues(_.intValue()).toSeq: _*),
      storageLevel)
  }

  /**
   * Create an input stream that pulls messages from Kafka Brokers.
   * @param jssc JavaStreamingContext object
   * @param keyTypeClass Key type of DStream
   * @param valueTypeClass value type of Dstream
   * @param keyDecoderClass Type of kafka key decoder
   * @param valueDecoderClass Type of kafka value decoder
   * @param kafkaParams Map of kafka configuration parameters,
   *                    see http://kafka.apache.org/08/configuration.html
   * @param topics Map of (topic_name -> numPartitions) to consume. Each partition is consumed
   *               in its own thread
   * @param storageLevel RDD storage level.
   * @tparam K type of Kafka message key
   * @tparam V type of Kafka message value
   * @tparam U type of Kafka message key decoder
   * @tparam T type of Kafka message value decoder
   * @return DStream of (Kafka message key, Kafka message value)
   */
  def createStream[K, V, U <: Decoder[_], T <: Decoder[_]](
      jssc: JavaStreamingContext,
      keyTypeClass: Class[K],
      valueTypeClass: Class[V],
      keyDecoderClass: Class[U],
      valueDecoderClass: Class[T],
      kafkaParams: JMap[String, String],
      topics: JMap[String, JInt],
      storageLevel: StorageLevel
    ): JavaPairReceiverInputDStream[K, V] = {
    implicit val keyCmt: ClassTag[K] = ClassTag(keyTypeClass)
    implicit val valueCmt: ClassTag[V] = ClassTag(valueTypeClass)
    implicit val keyCmd: ClassTag[U] = ClassTag(keyDecoderClass)
    implicit val valueCmd: ClassTag[T] = ClassTag(valueDecoderClass)
    createStream[K, V, U, T](
      jssc.ssc,
      kafkaParams.asScala.toMap,
      Map(topics.asScala.mapValues(_.intValue()).toSeq: _*),
      storageLevel)
  }
}
The Java-facing overloads funnel into the Scala ones: the third and fourth createStream call the first, the fifth calls the second directly, and the first itself delegates to the second. So every RDD data stream is ultimately produced by this single line:
new KafkaInputDStream[K, V, U, T](ssc, kafkaParams, topics, walEnabled, storageLevel)
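For reference, the receiving DStream's declaration (abridged from KafkaInputDStream.scala in the v1.6.1 source; the method bodies are omitted here) shows where the walEnabled flag ends up: it becomes the useReliableReceiver constructor parameter.

private[streaming]
class KafkaInputDStream[K: ClassTag, V: ClassTag, U <: Decoder[_]: ClassTag, T <: Decoder[_]: ClassTag](
    ssc_ : StreamingContext,
    kafkaParams: Map[String, String],
    topics: Map[String, Int],
    useReliableReceiver: Boolean,  // derived from spark.streaming.receiver.writeAheadLog.enable
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[(K, V)](ssc_) with Logging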
Looking at the KafkaInputDStream class definition, we find that getReceiver can return two kinds of receiver: KafkaReceiver and ReliableKafkaReceiver.
def getReceiver(): Receiver[(K, V)] = {
  if (!useReliableReceiver) {
    new KafkaReceiver[K, V, U, T](kafkaParams, topics, storageLevel)
  } else {
    new ReliableKafkaReceiver[K, V, U, T](kafkaParams, topics, storageLevel)
  }
}
KafkaReceiver is the simpler implementation: it calls Kafka's high-level consumer API to create the message streams, and each resulting stream is then handed to a thread pool so that a dedicated thread consumes it:

val topicMessageStreams = consumerConnector.createMessageStreams(
  topics, keyDecoder, valueDecoder)
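The hand-off to the thread pool happens immediately afterwards; abridged from KafkaReceiver.onStart in the same version, it looks roughly like this (each MessageHandler iterates its stream and calls store() on every message):

val executorPool =
  ThreadUtils.newDaemonFixedThreadPool(topics.values.sum, "KafkaMessageHandler")
try {
  // Submit one handler per stream, so each partition's stream is consumed by its own thread
  topicMessageStreams.values.foreach { streams =>
    streams.foreach { stream => executorPool.submit(new MessageHandler(stream)) }
  }
} finally {
  executorPool.shutdown() // Threads terminate once their work is done
}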
ReliableKafkaReceiver integrates Spark's write-ahead log (WAL) feature; enabling it requires setting the SparkConf property spark.streaming.receiver.writeAheadLog.enable to true (the default is false).
This receiver first persists the data it receives from Kafka to the log, and only then commits the offsets, which guarantees that no Kafka data is lost when the driver fails.
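A minimal sketch of turning the feature on follows; the application name and checkpoint path are hypothetical. Note that the WAL requires a checkpoint directory, since that is where the log segments are written:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ReliableKafkaWordCount")                         // hypothetical app name
  .set("spark.streaming.receiver.writeAheadLog.enable", "true") // off by default
val ssc = new StreamingContext(conf, Seconds(2))
ssc.checkpoint("hdfs:///tmp/receiver-wal-checkpoint")           // hypothetical path; the WAL lives under it

Since the WAL already persists a copy of the data to fault-tolerant storage, the extra in-memory replica is redundant, and StorageLevel.MEMORY_AND_DISK_SER can be used in place of the default MEMORY_AND_DISK_SER_2.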
Reference: "Improved Fault-tolerance and Zero Data Loss in Spark Streaming"