kafka 的 createDirectStream
kafka api中给出2类直接获取流的接口:createStream和createDirectStream。
createStream比较简单,只需topic、groupid、zookeeper就可以直接获取流,brokers和offset都是黑盒无需进行控制,但在项目中往往不受控。以下是部分源码:
/**
* Create an input stream that pulls messages from Kafka Brokers.
* @param ssc StreamingContext object
* @param zkQuorum Zookeeper quorum (hostname:port,hostname:port,..)
* @param groupId The group id for this consumer
* @param topics Map of (topic_name -> numPartitions) to consume. Each partition is consumed
* in its own thread
* @param storageLevel Storage level to use for storing the received objects
* (default: StorageLevel.MEMORY_AND_DISK_SER_2)
* @return DStream of (Kafka message key, Kafka message value)
*/
def createStream(
ssc: StreamingContext,
zkQuorum: String,
groupId: String,
topics: Map[String, Int],
storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
): ReceiverInputDStream[(String, String)] = {
val kafkaParams = Map[String, String](
"zookeeper.connect" -> zkQuorum, "group.id" -> groupId,
"zookeeper.connection.timeout.ms" -> "10000")
createStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics, storageLevel)
}
KafkaUtils.createStream
createDirectStream直接去操作kafka,需要自己手动保存offset,方法的注释写的还是很明白的,以下是部分源码:
/**
* Create an input stream that directly pulls messages from Kafka Brokers
* without using any receiver. This stream can guarantee that each message
* from Kafka is included in transformations exactly once (see points below).
*
* Points to note:
* - No receivers: This stream does not use any receiver. It directly queries Kafka
* - Offsets: This does not use Zookeeper to store offsets. The consumed offsets are tracked
* by the stream itself. For interoperability with Kafka monitoring tools that depend on
* Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application.
* You can access the offsets used in each batch from the generated RDDs (see
* [[org.apache.spark.streaming.kafka.HasOffsetRanges]]).
* - Failure Recovery: To recover from driver failures, you have to enable checkpointing
* in the [[StreamingContext]]. The information on consumed offset can be
* recovered from the checkpoint. See the programming guide for details (constraints, etc.).
* - End-to-end semantics: This stream ensures that every records is effectively received and
* transformed exactly once, but gives no guarantees on whether the transformed data are
* outputted exactly once. For end-to-end exactly-once semantics, you have to either ensure
* that the output operation is idempotent, or use transactions to output records atomically.
* See the programming guide for more details.
*
* @param ssc StreamingContext object
* @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
* configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers"
* to be set with Kafka broker(s) (NOT zookeeper servers) specified in
* host1:port1,host2:port2 form.
* @param fromOffsets Per-topic/partition Kafka offsets defining the (inclusive)
* starting point of the stream
* @param messageHandler Function for translating each message and metadata into the desired type
* @tparam K type of Kafka message key
* @tparam V type of Kafka message value
* @tparam KD type of Kafka message key decoder
* @tparam VD type of Kafka message value decoder
* @tparam R type returned by messageHandler
* @return DStream of R
*/
def createDirectStream[
K: ClassTag,
V: ClassTag,
KD <: Decoder[K]: ClassTag,
VD <: Decoder[V]: ClassTag,
R: ClassTag] (
ssc: StreamingContext,
kafkaParams: Map[String, String],
fromOffsets: Map[TopicAndPartition, Long],
messageHandler: MessageAndMetadata[K, V] => R
): InputDStream[R] = {
val cleanedHandler = ssc.sc.clean(messageHandler)
new DirectKafkaInputDStream[K, V, KD, VD, R](
ssc, kafkaParams, fromOffsets, cleanedHandler)
}
KafkaUtils.createDirectStream
和
/**
* Create an input stream that directly pulls messages from Kafka Brokers
* without using any receiver. This stream can guarantee that each message
* from Kafka is included in transformations exactly once (see points below).
*
* Points to note:
* - No receivers: This stream does not use any receiver. It directly queries Kafka
* - Offsets: This does not use Zookeeper to store offsets. The consumed offsets are tracked
* by the stream itself. For interoperability with Kafka monitoring tools that depend on
* Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application.
* You can access the offsets used in each batch from the generated RDDs (see
* [[org.apache.spark.streaming.kafka.HasOffsetRanges]]).
* - Failure Recovery: To recover from driver failures, you have to enable checkpointing
* in the [[StreamingContext]]. The information on consumed offset can be
* recovered from the checkpoint. See the programming guide for details (constraints, etc.).
* - End-to-end semantics: This stream ensures that every records is effectively received and
* transformed exactly once, but gives no guarantees on whether the transformed data are
* outputted exactly once. For end-to-end exactly-once semantics, you have to either ensure
* that the output operation is idempotent, or use transactions to output records atomically.
* See the programming guide for more details.
*
* @param ssc StreamingContext object
* @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
* configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers"
* to be set with Kafka broker(s) (NOT zookeeper servers), specified in
* host1:port1,host2:port2 form.
* If not starting from a checkpoint, "auto.offset.reset" may be set to "largest" or "smallest"
* to determine where the stream starts (defaults to "largest")
* @param topics Names of the topics to consume
* @tparam K type of Kafka message key
* @tparam V type of Kafka message value
* @tparam KD type of Kafka message key decoder
* @tparam VD type of Kafka message value decoder
* @return DStream of (Kafka message key, Kafka message value)
*/
def createDirectStream[
K: ClassTag,
V: ClassTag,
KD <: Decoder[K]: ClassTag,
VD <: Decoder[V]: ClassTag] (
ssc: StreamingContext,
kafkaParams: Map[String, String],
topics: Set[String]
): InputDStream[(K, V)] = {
val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
val kc = new KafkaCluster(kafkaParams)
val fromOffsets = getFromOffsets(kc, kafkaParams, topics)
new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
ssc, kafkaParams, fromOffsets, messageHandler)
}
KafkaUtils.createDirectStream
项目中需要的是手动去控制这个偏移量,由此可以看到多了2个参数:fromOffsets: Map[TopicAndPartition, Long] 和 messageHandler: MessageAndMetadata[K, V] => R。
获取fromOffsets的思路应该就是:
1. 连接到zk
2. 获取topic和partitions
3. 遍历topic的partitions,读取每个partitions的offset(存在zk中的地址为:/consumers/[group id]/offsets/[topic]/[0 ... N])
4. 有可能读取的路径为空,那么得去取leader中的offset
因此,对应代码:(可以参考这些源码:kafka.utils.ZkUtils,org.apache.spark.streaming.kafka.KafkaUtils,kafka.tools.GetOffsetShell,及其对应的调用类)
private def getOffset = {
val fromOffset: mutable.Map[TopicAndPartition, Long] = mutable.Map() val (zkClient, zkConnection) = ZkUtils.createZkClientAndConnection(kafkaZkQuorum, kafkaZkSessionTimeout, kafkaZkSessionTimeout)
val zkUtil = new ZkUtils(zkClient, zkConnection, false) zkUtil.getPartitionsForTopics(kafkaTopic.split(",").toSeq)
.foreach({ topic2Partition =>
val topic = topic2Partition._1
val partitions = topic2Partition._2
val topicDirs = new ZKGroupTopicDirs(groupId, topic) partitions.foreach(partition => {
val zkPath = s"${topicDirs.consumerOffsetDir}/$partition"
zkUtil.makeSurePersistentPathExists(zkPath) val untilOffset = zkUtil.zkClient.readData[String](zkPath) val tp = TopicAndPartition(topic, partition)
val offset = {
if (null == untilOffset)
getLatestLeaderOffsets(tp, zkUtil)
else untilOffset.toLong
}
fromOffset += (tp -> offset)
}
)
})
zkUtil.close()
fromOffset.toMap
}
getOffset
获取messageHandler,就跟其第二个构造函数一样即可:
messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
messageHandler
接着就是getLatestLeaderOffsets:
private def getLatestLeaderOffsets(tp: TopicAndPartition, zkUtil: ZkUtils): Long = {
try {
val brokerId = zkUtil.getLeaderForPartition(tp.topic, tp.partition).get
val brokerInfoString = zkUtil.readDataMaybeNull(s"${ZkUtils.BrokerIdsPath}/$brokerId")._1.get
val brokerInfo = Json.parseFull(brokerInfoString).get.asInstanceOf[Map[String, Any]] val host = brokerInfo("host").asInstanceOf[String]
val port = brokerInfo("port").asInstanceOf[Int] val consumer = new SimpleConsumer(host, port, 10000, 100000, "getLatestLeaderOffsets")
val request = OffsetRequest(Map(tp -> PartitionOffsetRequestInfo(OffsetRequest.LatestTime, 1)))
val offsets = consumer.getOffsetsBefore(request).partitionErrorAndOffsets(tp).offsets offsets.head
} catch {
case _ => throw new Exception("获取最新offset异常:" + TopicAndPartition)
}
}
getLatestLeaderOffsets
最后就是调用的方式了:
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](ssc,
kafkaParams, getOffset, (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))
KafkaUtils.createDirectStream
由于要从灾难中还原,做到7*24,需要设置checkpoint,业务逻辑需要包含在checkpoint的方法里,代码如下:
def main(args: Array[String]): Unit = { val run = gatewayIsEnable || urlAnalysIsEnable if (run) { val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _) ssc.start()
ssc.awaitTermination()
}
} def createStreamingContext() = {
val duration = SysConfig.duration(2) val sparkConf = new SparkConf().setAppName("cmhi")
val ssc = new StreamingContext(sparkConf, Seconds(duration)) ssc.checkpoint(checkpointDir) Osgi.init(ssc, debug) ssc
}
main
kafka 的 createDirectStream的更多相关文章
- 【python】spark+kafka使用
网上用python写spark+kafka的资料好少啊 自己记录一点踩到的坑~ spark+kafka介绍的官方网址:http://spark.apache.org/docs/latest/strea ...
- 【Spark】SparkStreaming-Kafka-集成-终极参考资料
SparkStreaming-Kafka-集成-终极参考资料 Spark Streaming和Kafka整合开发指南(二) – 过往记忆 Streamingkafka零丢失 | 等英博客 spark- ...
- Spark createDirectStream 维护 Kafka offset(Scala)
createDirectStream方式需要自己维护offset,使程序可以实现中断后从中断处继续消费数据. KafkaManager.scala import kafka.common.TopicA ...
- pyspark kafka createDirectStream和createStream 区别
from pyspark.streaming.kafka import KafkaUtils kafkaStream = KafkaUtils.createStream(streamingContex ...
- spark读取kafka数据 createStream和createDirectStream的区别
1.KafkaUtils.createDstream 构造函数为KafkaUtils.createDstream(ssc, [zk], [consumer group id], [per-topic, ...
- Spark Streaming + Kafka 整合向导之createDirectStream
启动zk: zkServer.sh start 启动kafka:kafka-server-start.sh $KAFKA_HOME/config/server.properties 创建一个topic ...
- spark streaming 与 kafka 结合使用的一些概念理解
1. createStream会使用 Receiver:而createDirectStream不会,数据会通过driver接收. 2.createStream使用 Receiver 源源不断的接收数据 ...
- spark streaming kafka example
// scalastyle:off println package org.apache.spark.examples.streaming import kafka.serializer.String ...
- Spark Streaming消费Kafka Direct方式数据零丢失实现
使用场景 Spark Streaming实时消费kafka数据的时候,程序停止或者Kafka节点挂掉会导致数据丢失,Spark Streaming也没有设置CheckPoint(据说比较鸡肋,虽然可以 ...
随机推荐
- 自定义win8资源管理器左侧导航窗格的方法
Win8自定义资源管理器左侧导航窗格: 快捷键Win+R – 输入regedit: 删除“网络”项目 HKEY_CLASSES_ROOTCLSID{F02C1A0D-BE21-4350-88B0-73 ...
- 201521123064 《Java程序设计》第9周学习总结
1. 本章学习总结 1.1 以你喜欢的方式(思维导图或其他)归纳总结异常相关内容. 2. 书面作业 本次作业题集异常 Q1:常用异常 题目5-1 1.1 截图你的提交结果(出现学号) 1.2 自己以前 ...
- 201521123103 《java学习笔记》 第十四周学习总结
一.本周学习总结 1.1 以你喜欢的方式(思维导图或其他)归纳总结多数据库相关内容. 二.书面作业 1. MySQL数据库基本操作 1.1建立数据库,将自己的姓名.学号作为一条记录插入.(截图,需出现 ...
- 控制结构(4) 局部化(localization)
// 上一篇:状态机(state machine) // 下一篇:必经之地(using) 基于语言提供的基本控制结构,更好地组织和表达程序,需要良好的控制结构. 前情回顾 上一次,我们说到状态机结构( ...
- 内置open()函数对外部文件的操作
>>> file=open('c://333.csv','r') 一些基本打开关闭操作 >>> s=file.read() >>> print s ...
- [转]Xcode的快捷键及代码格式化
Xcode比较常用的快捷键,特别是红色标注的,很常用.1. 文件CMD + N: 新文件CMD + SHIFT + N: 新项目CMD + O: 打开CMD + S: 保存CMD+OPt+S:保存所有 ...
- 【个人笔记】《知了堂》MySQL中的数据类型
MySQL中的数据类型 1.整型 MySQL数据类型 含义(有符号) tinyint(m) 1个字节 范围(-128~127) smallint(m) 2个字节 范围(-32768~32767) ...
- 深入理解计算机系统chapter5
编写高效的程序需要:1.选择合适的数据结构和算法 2.编译器能够有效优化以转换为高效可执行代码的源代码 3.利用并行性 优化编译器的局限性 程序示例: combine3的汇编代码: load-> ...
- 【转】独立游戏如何对接STEAM SDK
独立开发者在对接STEAM SDK之前 首先得先登上青睐之光,也就是我们俗称的"绿光" 一般要先对接G胖家的SDK,然后提交版本,最后等待审核... 我本身是unity 开发,对C ...
- 【JVM命令系列】jstat
命令基本概述 Jstat是JDK自带的一个轻量级小工具.全称"Java Virtual Machine statistics monitoring tool",它位于java的bi ...