Kafka's createDirectStream

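Spark Streaming's Kafka integration (KafkaUtils) provides two kinds of interfaces for obtaining a stream: createStream and createDirectStream.

createStream is the simpler of the two: given only the topics, a group id, and the ZooKeeper quorum, it returns a stream. Brokers and offsets are a black box that you do not have to manage, but that also means you cannot control them, which is often a problem in real projects. Part of the source: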

/**
   * Create an input stream that pulls messages from Kafka Brokers.
   * @param ssc       StreamingContext object
   * @param zkQuorum  Zookeeper quorum (hostname:port,hostname:port,..)
   * @param groupId   The group id for this consumer
   * @param topics    Map of (topic_name -> numPartitions) to consume. Each partition is consumed
   *                  in its own thread
   * @param storageLevel  Storage level to use for storing the received objects
   *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
   * @return DStream of (Kafka message key, Kafka message value)
   */
  def createStream(
      ssc: StreamingContext,
      zkQuorum: String,
      groupId: String,
      topics: Map[String, Int],
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
    ): ReceiverInputDStream[(String, String)] = {
    val kafkaParams = Map[String, String](
      "zookeeper.connect" -> zkQuorum, "group.id" -> groupId,
      "zookeeper.connection.timeout.ms" -> "10000")
    createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics, storageLevel)
  }


createDirectStream talks to Kafka directly, and you have to save the offsets yourself. The method's doc comment explains this quite clearly. Part of the source:

/**
   * Create an input stream that directly pulls messages from Kafka Brokers
   * without using any receiver. This stream can guarantee that each message
   * from Kafka is included in transformations exactly once (see points below).
   *
   * Points to note:
   *  - No receivers: This stream does not use any receiver. It directly queries Kafka
   *  - Offsets: This does not use Zookeeper to store offsets. The consumed offsets are tracked
   *    by the stream itself. For interoperability with Kafka monitoring tools that depend on
   *    Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application.
   *    You can access the offsets used in each batch from the generated RDDs (see
   *    [[org.apache.spark.streaming.kafka.HasOffsetRanges]]).
   *  - Failure Recovery: To recover from driver failures, you have to enable checkpointing
   *    in the [[StreamingContext]]. The information on consumed offset can be
   *    recovered from the checkpoint. See the programming guide for details (constraints, etc.).
   *  - End-to-end semantics: This stream ensures that every records is effectively received and
   *    transformed exactly once, but gives no guarantees on whether the transformed data are
   *    outputted exactly once. For end-to-end exactly-once semantics, you have to either ensure
   *    that the output operation is idempotent, or use transactions to output records atomically.
   *    See the programming guide for more details.
   *
   * @param ssc StreamingContext object
   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
   *    configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers"
   *    to be set with Kafka broker(s) (NOT zookeeper servers) specified in
   *    host1:port1,host2:port2 form.
   * @param fromOffsets Per-topic/partition Kafka offsets defining the (inclusive)
   *    starting point of the stream
   * @param messageHandler Function for translating each message and metadata into the desired type
   * @tparam K type of Kafka message key
   * @tparam V type of Kafka message value
   * @tparam KD type of Kafka message key decoder
   * @tparam VD type of Kafka message value decoder
   * @tparam R type returned by messageHandler
   * @return DStream of R
   */
  def createDirectStream[
    K: ClassTag,
    V: ClassTag,
    KD <: Decoder[K]: ClassTag,
    VD <: Decoder[V]: ClassTag,
    R: ClassTag] (
      ssc: StreamingContext,
      kafkaParams: Map[String, String],
      fromOffsets: Map[TopicAndPartition, Long],
      messageHandler: MessageAndMetadata[K, V] => R
  ): InputDStream[R] = {
    val cleanedHandler = ssc.sc.clean(messageHandler)
    new DirectKafkaInputDStream[K, V, KD, VD, R](
      ssc, kafkaParams, fromOffsets, cleanedHandler)
  }


and

/**
   * Create an input stream that directly pulls messages from Kafka Brokers
   * without using any receiver. This stream can guarantee that each message
   * from Kafka is included in transformations exactly once (see points below).
   *
   * Points to note:
   *  - No receivers: This stream does not use any receiver. It directly queries Kafka
   *  - Offsets: This does not use Zookeeper to store offsets. The consumed offsets are tracked
   *    by the stream itself. For interoperability with Kafka monitoring tools that depend on
   *    Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application.
   *    You can access the offsets used in each batch from the generated RDDs (see
   *    [[org.apache.spark.streaming.kafka.HasOffsetRanges]]).
   *  - Failure Recovery: To recover from driver failures, you have to enable checkpointing
   *    in the [[StreamingContext]]. The information on consumed offset can be
   *    recovered from the checkpoint. See the programming guide for details (constraints, etc.).
   *  - End-to-end semantics: This stream ensures that every records is effectively received and
   *    transformed exactly once, but gives no guarantees on whether the transformed data are
   *    outputted exactly once. For end-to-end exactly-once semantics, you have to either ensure
   *    that the output operation is idempotent, or use transactions to output records atomically.
   *    See the programming guide for more details.
   *
   * @param ssc StreamingContext object
   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
   *   configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers"
   *   to be set with Kafka broker(s) (NOT zookeeper servers), specified in
   *   host1:port1,host2:port2 form.
   *   If not starting from a checkpoint, "auto.offset.reset" may be set to "largest" or "smallest"
   *   to determine where the stream starts (defaults to "largest")
   * @param topics Names of the topics to consume
   * @tparam K type of Kafka message key
   * @tparam V type of Kafka message value
   * @tparam KD type of Kafka message key decoder
   * @tparam VD type of Kafka message value decoder
   * @return DStream of (Kafka message key, Kafka message value)
   */
  def createDirectStream[
    K: ClassTag,
    V: ClassTag,
    KD <: Decoder[K]: ClassTag,
    VD <: Decoder[V]: ClassTag] (
      ssc: StreamingContext,
      kafkaParams: Map[String, String],
      topics: Set[String]
  ): InputDStream[(K, V)] = {
    val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
    val kc = new KafkaCluster(kafkaParams)
    val fromOffsets = getFromOffsets(kc, kafkaParams, topics)
    new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
      ssc, kafkaParams, fromOffsets, messageHandler)
  }
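
As a rough illustration of the kafkaParams described in the doc comment, the arguments for this overload might look like the sketch below. The broker addresses, the topic names, and the requireBrokers helper are hypothetical and not part of Spark; only the keys "metadata.broker.list", "bootstrap.servers" and "auto.offset.reset" come from the comment above.

```scala
// Hypothetical broker addresses: these must be Kafka brokers, not ZooKeeper servers.
val kafkaParams: Map[String, String] = Map(
  "metadata.broker.list" -> "host1:9092,host2:9092",
  // Only consulted when the stream does not start from a checkpoint.
  "auto.offset.reset" -> "smallest"
)

// Illustrative helper (not a Spark API): fail early if neither broker key is set.
def requireBrokers(params: Map[String, String]): String =
  params.get("metadata.broker.list")
    .orElse(params.get("bootstrap.servers"))
    .getOrElse(throw new IllegalArgumentException(
      "kafkaParams needs metadata.broker.list or bootstrap.servers"))

// Hypothetical topic names for the topics parameter.
val topics: Set[String] = Set("topicA", "topicB")
```

Checking the map before building the stream is a design choice of this sketch: a missing broker list then fails at startup instead of inside the first batch.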


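Our project needs to control the offsets manually, so we use the first overload, which takes two additional parameters: fromOffsets: Map[TopicAndPartition, Long] and messageHandler: MessageAndMetadata[K, V] => R.

The approach to obtaining fromOffsets is roughly:

1. Connect to ZooKeeper.

2. Get the topics and their partitions.

3. Iterate over each topic's partitions and read the offset of every partition (stored in ZooKeeper at /consumers/[group id]/offsets/[topic]/[0 ... N]).

4. The path may hold no data yet, in which case the offset has to be fetched from the partition leader.

The corresponding code follows (for reference, see the sources of kafka.utils.ZkUtils, org.apache.spark.streaming.kafka.KafkaUtils, kafka.tools.GetOffsetShell, and the classes they call):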

  import scala.collection.mutable
  import kafka.common.TopicAndPartition
  import kafka.utils.{ZKGroupTopicDirs, ZkUtils}

  private def getOffset: Map[TopicAndPartition, Long] = {
    val fromOffset: mutable.Map[TopicAndPartition, Long] = mutable.Map()
    // Step 1: connect to ZooKeeper
    val (zkClient, zkConnection) = ZkUtils.createZkClientAndConnection(kafkaZkQuorum, kafkaZkSessionTimeout, kafkaZkSessionTimeout)
    val zkUtil = new ZkUtils(zkClient, zkConnection, false)
    try {
      // Step 2: look up the partitions of every configured topic
      zkUtil.getPartitionsForTopics(kafkaTopic.split(",").toSeq)
        .foreach { case (topic, partitions) =>
          val topicDirs = new ZKGroupTopicDirs(groupId, topic)
          partitions.foreach { partition =>
            // Step 3: read /consumers/[group id]/offsets/[topic]/[partition]
            val zkPath = s"${topicDirs.consumerOffsetDir}/$partition"
            zkUtil.makeSurePersistentPathExists(zkPath)
            val untilOffset = zkUtil.zkClient.readData[String](zkPath)
            val tp = TopicAndPartition(topic, partition)
            // Step 4: nothing stored yet, so fall back to the leader's latest offset
            val offset =
              if (null == untilOffset) getLatestLeaderOffsets(tp, zkUtil)
              else untilOffset.toLong
            fromOffset += (tp -> offset)
          }
        }
    } finally {
      // Close the ZooKeeper connection even if a lookup fails
      zkUtil.close()
    }
    fromOffset.toMap
  }
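
One case the function above does not cover: an offset stored in ZooKeeper can fall outside the range the broker still holds (for example after log retention has deleted old segments), and the direct stream will then typically fail with an offset-out-of-range error. Below is a minimal sketch of a guard, assuming the earliest and latest offsets of the partition have already been fetched from the leader. The helper name and its parameters are hypothetical.

```scala
// Hypothetical helper: keep a stored offset inside [earliest, latest].
// earliest and latest are assumed to come from the partition leader
// (for example via OffsetRequest.EarliestTime and OffsetRequest.LatestTime lookups).
def clampOffset(stored: Long, earliest: Long, latest: Long): Long = {
  require(earliest <= latest, s"earliest ($earliest) must not exceed latest ($latest)")
  if (stored < earliest) earliest    // old data was deleted: restart at the oldest message
  else if (stored > latest) latest   // stored value is ahead of the log: restart at the end
  else stored
}
```

Whether to restart at the earliest or the latest offset in the out-of-range cases is a business decision (replaying data versus skipping it); the choices here are only one possibility.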


For messageHandler, simply do what the second overload does:

val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
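
Because the result type R is free, the handler can also keep the message's position next to its payload. The sketch below assumes a simplified stand-in for kafka.message.MessageAndMetadata (reduced to the members used here) so that it runs without Kafka on the classpath; the Record type and the handler name are hypothetical.

```scala
// Stand-in for kafka.message.MessageAndMetadata, reduced to the members used here.
final case class MessageAndMetadata[K, V](
  topic: String, partition: Int, offset: Long, key: K, message: V)

// The result type R chosen for this example.
final case class Record(topic: String, partition: Int, offset: Long, value: String)

// A handler that records where each message came from alongside its value.
val handlerWithMetadata: MessageAndMetadata[String, String] => Record =
  mmd => Record(mmd.topic, mmd.partition, mmd.offset, mmd.message)
```

With such a handler, the last type parameter passed to createDirectStream would be Record instead of (String, String).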


Next comes getLatestLeaderOffsets:

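  import scala.util.control.NonFatal
  import kafka.api.{OffsetRequest, PartitionOffsetRequestInfo}
  import kafka.consumer.SimpleConsumer
  import kafka.utils.Json

  private def getLatestLeaderOffsets(tp: TopicAndPartition, zkUtil: ZkUtils): Long = {
    try {
      // Find the leader broker of this partition and read its host and port from ZooKeeper
      val brokerId = zkUtil.getLeaderForPartition(tp.topic, tp.partition).get
      val brokerInfoString = zkUtil.readDataMaybeNull(s"${ZkUtils.BrokerIdsPath}/$brokerId")._1.get
      val brokerInfo = Json.parseFull(brokerInfoString).get.asInstanceOf[Map[String, Any]]
      val host = brokerInfo("host").asInstanceOf[String]
      val port = brokerInfo("port").asInstanceOf[Int]
      val consumer = new SimpleConsumer(host, port, 10000, 100000, "getLatestLeaderOffsets")
      try {
        // Ask the leader for its single most recent offset
        val request = OffsetRequest(Map(tp -> PartitionOffsetRequestInfo(OffsetRequest.LatestTime, 1)))
        val offsets = consumer.getOffsetsBefore(request).partitionErrorAndOffsets(tp).offsets
        offsets.head
      } finally {
        consumer.close()
      }
    } catch {
      // Report the failing partition (tp, not the TopicAndPartition companion) and keep the cause
      case NonFatal(e) => throw new Exception(s"Failed to fetch the latest offset for $tp", e)
    }
  }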


Finally, the call itself:

KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](ssc,
      kafkaParams, getOffset, (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))
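
Since getOffset reads its starting points from ZooKeeper, the offsets of each processed batch have to be written back to the same paths, otherwise a restart would resume from stale values. The sketch below covers only the pure part, namely which path receives which value. It assumes a stand-in for Spark's OffsetRange (same field names) so that it runs without Spark, and the offsetWrites helper is hypothetical.

```scala
// Stand-in for org.apache.spark.streaming.kafka.OffsetRange (same field names).
final case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

// For each range, compute the ZooKeeper path and the value to store there.
// untilOffset is exclusive, so it is exactly where the next run should start.
def offsetWrites(groupId: String, ranges: Seq[OffsetRange]): Seq[(String, String)] =
  ranges.map { r =>
    s"/consumers/$groupId/offsets/${r.topic}/${r.partition}" -> r.untilOffset.toString
  }

// In the real job this would be driven from the stream, roughly (assumed wiring):
//   stream.foreachRDD { rdd =>
//     val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
//     // ... process rdd ...
//     offsetWrites(groupId, ranges).foreach { case (path, data) =>
//       zkUtil.updatePersistentPath(path, data)
//     }
//   }
```

Writing the offsets only after the batch has been processed gives at-least-once behavior: a crash between processing and the write replays that batch, which is why the doc comment asks for idempotent or transactional output.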


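To recover from failures and run 24/7, checkpointing must be enabled, and the business logic has to live inside the function that creates the checkpointed context. The code: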

  def main(args: Array[String]): Unit = {
    val run = gatewayIsEnable || urlAnalysIsEnable
    if (run) {
      val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)
      ssc.start()
      ssc.awaitTermination()
    }
  }

  def createStreamingContext() = {
    val duration = SysConfig.duration(2)
    val sparkConf = new SparkConf().setAppName("cmhi")
    val ssc = new StreamingContext(sparkConf, Seconds(duration))
    ssc.checkpoint(checkpointDir)
    Osgi.init(ssc, debug)
    ssc
  }
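
The reason the logic must sit inside the creating function is the contract of getOrCreate: the function is called only when no checkpoint can be restored, and on recovery the context comes back from the checkpoint instead. The toy model below illustrates just that contract. It is not Spark's implementation, and every name in it is made up for the example.

```scala
// Toy model of the getOrCreate contract, NOT Spark's implementation:
// the creating function runs only when nothing can be restored.
def getOrCreate[A](restored: Option[A], create: () => A): A =
  restored.getOrElse(create())

// Count how often the creating function is actually invoked.
var createCalls = 0
def createContext(): String = {
  createCalls += 1
  "fresh context with pipeline defined"
}

// First start: no checkpoint exists, so the creating function runs.
val firstRun = getOrCreate[String](None, () => createContext())

// Restart after a crash: the checkpoint wins and createContext is skipped.
val afterCrash = getOrCreate[String](Some("context rebuilt from checkpoint"), () => createContext())
```

Anything set up outside the creating function would therefore be missing from a recovered context, which matches the requirement stated above.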

