话不多说,可以看上篇博文,关于offset存储到zookeeper

https://www.cnblogs.com/niutao/p/10547718.html

本篇博文主要告诉你如何将offset写到Hbase做存储:

最后存储到Hbase的展现形式:

testDirect:co:1552667595000  column=info:0, timestamp=1552667594784, value=66
testDirect:co:1552667595000 column=info:1, timestamp=1552667594784, value=269
testDirect:co:1552667595000 column=info:2, timestamp=1552667594784, value=67
testDirect:co:1552667600000 column=info:0, timestamp=1552667599864, value=66
testDirect:co:1552667600000 column=info:1, timestamp=1552667599864, value=269
testDirect:co:1552667600000 column=info:2, timestamp=1552667599864, value=67
testDirect:co:1552667605000 column=info:0, timestamp=1552667604778, value=66
testDirect:co:1552667605000 column=info:1, timestamp=1552667604778, value=269
testDirect:co:1552667605000 column=info:2, timestamp=1552667604778, value=67
testDirect:co:1552667610000 column=info:0, timestamp=1552667609777, value=66
testDirect:co:1552667610000 column=info:1, timestamp=1552667609777, value=269
版本:
scala:2.11.8
spark:2.11
hbase:1.2.0-cdh5.14.0
遇到的问题:

`java.lang.IllegalStateException: Consumer is not subscribed to any topics or assigned any partitions`
分析原因:

从指定的主题或者分区获取数据,在poll之前,你没有订阅任何主题或分区是不行的,每一次poll,消费者都会尝试使用最后一次消费的offset作为接下来获取数据的start offset,最后一次消费的offset也可以通过seek(TopicPartition, long)设置或者自动设置
通过源码可以找到:
public ConsumerRecords<K, V> poll(long timeout) {
acquire();
try {
if (timeout < 0)
throw new IllegalArgumentException("Timeout must not be negative");
// 如果没有任何订阅,抛出异常
if (this.subscriptions.hasNoSubscriptionOrUserAssignment())
throw new IllegalStateException("Consumer is not subscribed to any topics or assigned any partitions"); // 一直poll新数据直到超时
long start = time.milliseconds();
// 距离超时还剩余多少时间
long remaining = timeout;
do {
// 获取数据,如果自动提交,则进行偏移量自动提交,如果设置offset重置,则进行offset重置
Map<TopicPartition, List<ConsumerRecord<K, V>>> records = pollOnce(remaining);
if (!records.isEmpty()) {
// 再返回结果之前,我们可以进行下一轮的fetch请求,避免阻塞等待
fetcher.sendFetches();
client.pollNoWakeup();
// 如果有拦截器进行拦截,没有直接返回
if (this.interceptors == null)
return new ConsumerRecords<>(records);
else
return this.interceptors.onConsume(new ConsumerRecords<>(records));
} long elapsed = time.milliseconds() - start;
remaining = timeout - elapsed;
} while (remaining > 0); return ConsumerRecords.empty();
} finally {
release();
}
}

解决:

因此,需要订阅当前的topic才能消费,我之前使用的api是:(适用于非新--已经被消费者消费过的)
因此,需要订阅当前的topic才能消费,我之前使用的api是:(适用于非新--已经被消费者消费过的)
`val inputDStream1 = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Assign[String, String](
fromOffsets.keys,kafkaParams,fromOffsets)
)` 修改:(全新的topic,没有被消费者消费过)
`val inputDStream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)`

完整代码:

package offsetInHbase
import kafka.utils.ZkUtils
import org.apache.hadoop.hbase.filter.PrefixFilter
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{TableName, HBaseConfiguration}
import org.apache.hadoop.hbase.client.{Scan, Put, ConnectionFactory}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies._
import org.apache.spark.streaming.kafka010.{OffsetRange, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.LocationStrategies._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkContext, SparkConf}
/**
* Created by angel
*/
object KafkaOffsetsBlogStreamingDriver { def main(args: Array[String]) { if (args.length < 6) {
System.err.println("Usage: KafkaDirectStreamTest " +
"<batch-duration-in-seconds> " +
"<kafka-bootstrap-servers> " +
"<kafka-topics> " +
"<kafka-consumer-group-id> " +
"<hbase-table-name> " +
"<kafka-zookeeper-quorum>")
System.exit(1)
}
//5 cdh1:9092,cdh2:2181,cdh3:2181 testDirect co testDirect cdh1:2181,cdh2:2181,cdh3:2181 val batchDuration = args(0)
val bootstrapServers = args(1).toString
val topicsSet = args(2).toString.split(",").toSet
val consumerGroupID = args(3)
val hbaseTableName = args(4)
val zkQuorum = args(5)
val zkKafkaRootDir = "kafka"
val zkSessionTimeOut = 10000
val zkConnectionTimeOut = 10000 val sparkConf = new SparkConf().setAppName("Kafka-Offset-Management-Blog")
.setMaster("local[4]")//Uncomment this line to test while developing on a workstation
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(batchDuration.toLong))
val topics = topicsSet.toArray
val topic = topics(0) val kafkaParams = Map[String, Object](
"bootstrap.servers" -> bootstrapServers,
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> consumerGroupID,
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
) /*
Create a dummy process that simply returns the message as is.
*/
def processMessage(message:ConsumerRecord[String,String]):ConsumerRecord[String,String]={
message
} /*
Save Offsets into HBase
*/
def saveOffsets(
TOPIC_NAME:String,
GROUP_ID:String,
offsetRanges:Array[OffsetRange],
hbaseTableName:String,
batchTime: org.apache.spark.streaming.Time
) ={
val hbaseConf = HBaseConfiguration.create()
hbaseConf.addResource("src/main/resources/hbase-site.xml")
val conn = ConnectionFactory.createConnection(hbaseConf)
val table = conn.getTable(TableName.valueOf(hbaseTableName))
val rowKey = TOPIC_NAME + ":" + GROUP_ID + ":" + String.valueOf(batchTime.milliseconds)
val put = new Put(rowKey.getBytes)
for(offset <- offsetRanges){
put.addColumn(Bytes.toBytes("info"),Bytes.toBytes(offset.partition.toString),
Bytes.toBytes(offset.untilOffset.toString))
}
table.put(put)
conn.close()
} /*
Returns last committed offsets for all the partitions of a given topic from HBase in following cases.
- CASE 1: SparkStreaming job is started for the first time. This function gets the number of topic partitions from
Zookeeper and for each partition returns the last committed offset as 0
- CASE 2: SparkStreaming is restarted and there are no changes to the number of partitions in a topic. Last
committed offsets for each topic-partition is returned as is from HBase.
- CASE 3: SparkStreaming is restarted and the number of partitions in a topic increased. For old partitions, last
committed offsets for each topic-partition is returned as is from HBase as is. For newly added partitions,
function returns last committed offsets as 0
*/
def getLastCommittedOffsets(
TOPIC_NAME:String,
GROUP_ID:String,
hbaseTableName:String,
zkQuorum:String,
zkRootDir:String,
sessionTimeout:Int,
connectionTimeOut:Int
):Map[TopicPartition,Long] ={ val hbaseConf = HBaseConfiguration.create()
hbaseConf.addResource("src/main/resources/hbase-site.xml")
val zkUrl = zkQuorum+"/"+zkRootDir
val zkClientAndConnection = ZkUtils.createZkClientAndConnection(zkUrl,sessionTimeout,connectionTimeOut)
val zkUtils = new ZkUtils(zkClientAndConnection._1, zkClientAndConnection._2,false)
val zKNumberOfPartitionsForTopic = zkUtils.getPartitionsForTopics(Seq(TOPIC_NAME)).get(TOPIC_NAME).toList.head.size //Connect to HBase to retrieve last committed offsets
val conn = ConnectionFactory.createConnection(hbaseConf)
val table = conn.getTable(TableName.valueOf(hbaseTableName))
val startRow = TOPIC_NAME + ":" + GROUP_ID + ":" + String.valueOf(System.currentTimeMillis())
val stopRow = TOPIC_NAME + ":" + GROUP_ID + ":" + 0
val scan = new Scan()
val scanner = table.getScanner(scan.setStartRow(startRow.getBytes).setStopRow(stopRow.getBytes).setReversed(true))
val result = scanner.next()
//Set the number of partitions discovered for a topic in HBase to 0
var hbaseNumberOfPartitionsForTopic = 0
if (result != null){
//If the result from hbase scanner is not null, set number of partitions from hbase to the number of cells
//listCells 获取列族下的列
hbaseNumberOfPartitionsForTopic = result.listCells().size()
} val fromOffsets = collection.mutable.Map[TopicPartition,Long]()
//初始化时候的hbase
if(hbaseNumberOfPartitionsForTopic == 0){
// initialize fromOffsets to beginning
for (partition <- 0 to zKNumberOfPartitionsForTopic-1){
fromOffsets += (new TopicPartition(TOPIC_NAME,partition) -> 0)}
//增加了topic的分区数
} else if(zKNumberOfPartitionsForTopic > hbaseNumberOfPartitionsForTopic){
// handle scenario where new partitions have been added to existing kafka topic
for (partition <- 0 to hbaseNumberOfPartitionsForTopic-1){
val fromOffset = Bytes.toString(result.getValue(Bytes.toBytes("info"),Bytes.toBytes(partition.toString)))
fromOffsets += (new TopicPartition(TOPIC_NAME,partition) -> fromOffset.toLong)}
//将新增的分区也添加上
for (partition <- hbaseNumberOfPartitionsForTopic to zKNumberOfPartitionsForTopic-1){
fromOffsets += (new TopicPartition(TOPIC_NAME,partition) -> 0)}
} else {
//initialize fromOffsets from last run
for (partition <- 0 to hbaseNumberOfPartitionsForTopic-1 ){
val fromOffset = Bytes.toString(result.getValue(Bytes.toBytes("info"),Bytes.toBytes(partition.toString)))
fromOffsets += (new TopicPartition(TOPIC_NAME,partition) -> fromOffset.toLong)}
}
scanner.close()
conn.close()
fromOffsets.toMap
} val fromOffsets= getLastCommittedOffsets(
topic,
consumerGroupID,
hbaseTableName,
zkQuorum,
zkKafkaRootDir,
zkSessionTimeOut,
zkConnectionTimeOut)
//刚开始时候启动,全新的topic会报错
val inputDStream = KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Assign[String, String](
fromOffsets.keys,kafkaParams,fromOffsets)
)
//如果报错,则使用下面的api
// val inputDStream = KafkaUtils.createDirectStream[String, String](
// ssc,
// PreferConsistent,
// Subscribe[String, String](topics, kafkaParams)
// ) /*
For each RDD in a DStream apply a map transformation that processes the message.
*/
inputDStream.foreachRDD((rdd,batchTime) => {
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
offsetRanges.foreach(offset => println(offset.topic, offset.partition, offset.fromOffset,offset.untilOffset))
val newRDD = rdd.map(message => processMessage(message))
newRDD.count()
saveOffsets(topic,consumerGroupID,offsetRanges,hbaseTableName,batchTime) //save the offsets to HBase
}) println("Number of messages processed " + inputDStream.count())
ssc.start()
ssc.awaitTermination()
}
}

sparkStreaming消费kafka-1.0.1方式:direct方式(存储offset到Hbase)的更多相关文章

  1. SparkStreaming消费kafka中数据的方式

    有两种:Direct直连方式.Receiver方式 1.Receiver方式: 使用kafka高层次的consumer API来实现,receiver从kafka中获取的数据都保存在spark exc ...

  2. SparkStreaming消费Kafka,手动维护Offset到Mysql

    目录 说明 整体逻辑 offset建表语句 代码实现 说明 当前处理只实现手动维护offset到mysql,只能保证数据不丢失,可能会重复 要想实现精准一次性,还需要将数据提交和offset提交维护在 ...

  3. sparkstreaming消费kafka后bulk到es

    不使用es-hadoop的saveToES,与scala版本冲突问题太多.不使用bulkprocessor,异步提交,es容易oom,速度反而不快.使用BulkRequestBuilder同步提交. ...

  4. Spark Streaming消费Kafka Direct保存offset到Redis,实现数据零丢失和exactly once

    一.概述 上次写这篇文章文章的时候,Spark还是1.x,kafka还是0.8x版本,转眼间spark到了2.x,kafka也到了2.x,存储offset的方式也发生了改变,笔者根据上篇文章和网上文章 ...

  5. sparkStreaming读取kafka的两种方式

    概述 Spark Streaming 支持多种实时输入源数据的读取,其中包括Kafka.flume.socket流等等.除了Kafka以外的实时输入源,由于我们的业务场景没有涉及,在此将不会讨论.本篇 ...

  6. spark-streaming集成Kafka处理实时数据

    在这篇文章里,我们模拟了一个场景,实时分析订单数据,统计实时收益. 场景模拟 我试图覆盖工程上最为常用的一个场景: 1)首先,向Kafka里实时的写入订单数据,JSON格式,包含订单ID-订单类型-订 ...

  7. sparkStreaming消费kafka-1.0.1方式:direct方式(存储offset到zookeeper)-- 2

    参考上篇博文:https://www.cnblogs.com/niutao/p/10547718.html 同样的逻辑,不同的封装 package offsetInZookeeper /** * Cr ...

  8. sparkStreaming消费kafka-1.0.1方式:direct方式(存储offset到zookeeper)

    版本声明: kafka:1.0.1 spark:2.1.0 注意:在使用过程中可能会出现servlet版本不兼容的问题,因此在导入maven的pom文件的时候,需要做适当的排除操作 <?xml ...

  9. sparkStreaming消费kafka-0.8方式:direct方式(存储offset到zookeeper)

    生产中,为了保证kafka的offset的安全性,并且防止丢失数据现象,会手动维护偏移量(offset) 版本:kafka:0.8 其中需要注意的点: 1:获取zookeeper记录的分区偏移量 2: ...

随机推荐

  1. 缓存系列之二:CDN与其他层面缓存

    缓存系列之二:CDN与其他层面缓存 一:内容分发网络(Content Delivery Network),通过将服务内容分发至全网加速节点,利用全球调度系统使用户能够就近获取,有效降低访问延迟,提升服 ...

  2. java使用RunTime调用windows命令行

    当Java需要调用windows系统进行交互时,可以使用Runtime进行操作. 例子: 1.调用window中获取关于java相关的进行信息 Runtime rt = Runtime.getRunt ...

  3. why should the parameter in copy construction be a reference

    if not, it will lead to an endless loop!!! # include<iostream> using namespace std; class A { ...

  4. WinCE平台的程序编译到Win32平台下运行

    最近做的项目中,有一个在WinCE平台上跑的程序,后来随着项目的发展,要求此程序在PC上也能跑.感谢VS 2005提供的多平台支持,只需要几分钟就可以解决这个问题,方法很简单,下面是我处理的过程. 1 ...

  5. Android apk动态加载机制

    参考链接:http://blog.csdn.net/singwhatiwanna/article/details/22597587

  6. Gradle缓存目录文件命名规则

    在打开Android Studio项目的时候,会下载项目对应版本的gradle,该版本是在项目根目录下\gradle\wrapper\gradle-wrapper.properties文件中指定的: ...

  7. pip的常用命令

    前言 pip作为Python的御用包管理工具有着强大的功能,但是许多命令需要我们使用的时候借助搜索引擎查找(尤其是我), 于是我想将我使用到的命令整合下来,以后不用麻烦去找了,也希望能给你带来帮助.文 ...

  8. ant 相关命令

    # jmeter-ant A Simple Ant project for JMeter Performance Test # Pre-Requisite* Java 1.7 or above* JM ...

  9. IOS 将状态栏改为白色

    1.将 View controller-based status bar appearance 删除(默认为 YES),或设置为YES  2.设置rootViewcontroller,如果为viewC ...

  10. ionic3 Injectable 引入NavController

    在service里 引入 navcontroller 报错 And I get error No provider for NavController. 一个比较容易解决的方法, import {Io ...