Spark Streaming consuming Kafka 1.0.1, direct approach (storing offsets in HBase)
Without further ado: the previous post covers storing offsets in ZooKeeper:
https://www.cnblogs.com/niutao/p/10547718.html
This post shows how to store the offsets in HBase instead.
What ends up in HBase (output of an hbase shell scan):
testDirect:co:1552667595000 column=info:0, timestamp=1552667594784, value=66
testDirect:co:1552667595000 column=info:1, timestamp=1552667594784, value=269
testDirect:co:1552667595000 column=info:2, timestamp=1552667594784, value=67
testDirect:co:1552667600000 column=info:0, timestamp=1552667599864, value=66
testDirect:co:1552667600000 column=info:1, timestamp=1552667599864, value=269
testDirect:co:1552667600000 column=info:2, timestamp=1552667599864, value=67
testDirect:co:1552667605000 column=info:0, timestamp=1552667604778, value=66
testDirect:co:1552667605000 column=info:1, timestamp=1552667604778, value=269
testDirect:co:1552667605000 column=info:2, timestamp=1552667604778, value=67
testDirect:co:1552667610000 column=info:0, timestamp=1552667609777, value=66
testDirect:co:1552667610000 column=info:1, timestamp=1552667609777, value=269
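Each row key is <topic>:<consumer group>:<batch time in ms>, and every partition's end offset for that batch is stored as a separate column in the info column family (column qualifier = partition number, value = offset). The job does not create the table itself, so it has to exist before the first run; below is a minimal sketch that creates it with the HBase 1.2 Admin API (the table name testDirect and the family name info are taken from the scan above, adjust them to your own <hbase-table-name> argument).

import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}

object CreateOffsetTable {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()              // picks up hbase-site.xml from the classpath
    val conn = ConnectionFactory.createConnection(conf)
    val admin = conn.getAdmin
    // "testDirect" / "info" are assumptions taken from the scan output above
    val desc = new HTableDescriptor(TableName.valueOf("testDirect"))
    desc.addFamily(new HColumnDescriptor("info"))
    if (!admin.tableExists(desc.getTableName)) {
      admin.createTable(desc)
    }
    conn.close()
  }
}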
Versions:
scala: 2.11.8
spark: 2.1.0
kafka: 1.0.1
hbase: 1.2.0-cdh5.14.0
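For completeness, here is a rough sketch of the build dependencies this setup assumes (sbt syntax; the Spark version and the Cloudera repository URL are assumptions, and as with the ZooKeeper post you may need to exclude conflicting servlet-api jars):

// build.sbt (sketch; versions are assumptions matching the environment above)
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"            % "2.1.0",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0",
  "org.apache.kafka" %% "kafka"                      % "1.0.1",          // kafka.utils.ZkUtils lives in kafka core
  "org.apache.hbase" %  "hbase-client"               % "1.2.0-cdh5.14.0" // resolved from the Cloudera repository
)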
Problem encountered: `java.lang.IllegalStateException: Consumer is not subscribed to any topics or assigned any partitions`
Cause: before poll() can fetch data from any topic or partition, the consumer must have subscribed to at least one topic or been assigned at least one partition. On every poll, the consumer uses the last consumed offset as the start offset for the next fetch; that offset can be set explicitly with seek(TopicPartition, long) or determined automatically.
This rule is visible in the KafkaConsumer source:
public ConsumerRecords<K, V> poll(long timeout) {
    acquire();
    try {
        if (timeout < 0)
            throw new IllegalArgumentException("Timeout must not be negative");

        // throw if nothing has been subscribed or assigned
        if (this.subscriptions.hasNoSubscriptionOrUserAssignment())
            throw new IllegalStateException("Consumer is not subscribed to any topics or assigned any partitions");

        // keep polling for new data until the timeout expires
        long start = time.milliseconds();
        // time remaining before the timeout
        long remaining = timeout;
        do {
            // fetch records; auto-commit offsets if auto-commit is enabled, and reset offsets if an offset reset is configured
            Map<TopicPartition, List<ConsumerRecord<K, V>>> records = pollOnce(remaining);
            if (!records.isEmpty()) {
                // before returning the results, send the next round of fetch requests
                // so the next poll does not block waiting for them
                fetcher.sendFetches();
                client.pollNoWakeup();

                // if interceptors are configured, let them process the records before returning; otherwise return directly
                if (this.interceptors == null)
                    return new ConsumerRecords<>(records);
                else
                    return this.interceptors.onConsume(new ConsumerRecords<>(records));
            }

            long elapsed = time.milliseconds() - start;
            remaining = timeout - elapsed;
        } while (remaining > 0);

        return ConsumerRecords.empty();
    } finally {
        release();
    }
}
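The same rule is easy to reproduce with a bare KafkaConsumer, independent of Spark. A minimal sketch (the broker, topic and group names are placeholders matching the example arguments used later):

import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

object PollRuleDemo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "cdh1:9092")         // placeholder broker
    props.put("group.id", "co")
    props.put("key.deserializer", classOf[StringDeserializer].getName)
    props.put("value.deserializer", classOf[StringDeserializer].getName)

    val consumer = new KafkaConsumer[String, String](props)
    // consumer.poll(1000)  // at this point poll() would throw the IllegalStateException above

    // either subscribe to the topic ...
    consumer.subscribe(Arrays.asList("testDirect"))
    // ... or assign specific partitions and seek to a stored offset:
    // consumer.assign(Arrays.asList(new TopicPartition("testDirect", 0)))
    // consumer.seek(new TopicPartition("testDirect", 0), 42L)

    val records = consumer.poll(1000)                   // legal now; starts from the committed/sought offset
    println(records.count())
    consumer.close()
  }
}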
Solution:
The consumer must be subscribed to (or assigned) the topic before it can consume. The API I was using originally only works for a topic that is not new, i.e. one that has already been consumed and already has offsets stored:
`val inputDStream1 = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Assign[String, String](fromOffsets.keys, kafkaParams, fromOffsets)
)`
For a brand-new topic that has never been consumed this fails (most likely because the offsets map built for such a topic ends up empty, so Assign assigns no partitions at all and poll() throws the exception above); in that case switch to:
`val inputDStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)`
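As an aside, ConsumerStrategies.Subscribe in spark-streaming-kafka-0-10 also has an overload that accepts an offsets map, so a single strategy can cover both a fresh topic and a restart from offsets stored in HBase. A minimal sketch, assuming the fromOffsets map built by getLastCommittedOffsets in the full code below:
`val inputDStream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams, fromOffsets)
)`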
Full code:
package offsetInHbase
import kafka.utils.ZkUtils
import org.apache.hadoop.hbase.filter.PrefixFilter
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{TableName, HBaseConfiguration}
import org.apache.hadoop.hbase.client.{Scan, Put, ConnectionFactory}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies._
import org.apache.spark.streaming.kafka010.{OffsetRange, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.LocationStrategies._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkContext, SparkConf}
/**
* Created by angel
*/
object KafkaOffsetsBlogStreamingDriver {

  def main(args: Array[String]) {
    if (args.length < 6) {
      System.err.println("Usage: KafkaDirectStreamTest " +
        "<batch-duration-in-seconds> " +
        "<kafka-bootstrap-servers> " +
        "<kafka-topics> " +
        "<kafka-consumer-group-id> " +
        "<hbase-table-name> " +
        "<kafka-zookeeper-quorum>")
      System.exit(1)
    }

    // example arguments: 5 cdh1:9092,cdh2:2181,cdh3:2181 testDirect co testDirect cdh1:2181,cdh2:2181,cdh3:2181
    val batchDuration = args(0)
    val bootstrapServers = args(1).toString
    val topicsSet = args(2).toString.split(",").toSet
    val consumerGroupID = args(3)
    val hbaseTableName = args(4)
    val zkQuorum = args(5)
    val zkKafkaRootDir = "kafka"
    val zkSessionTimeOut = 10000
    val zkConnectionTimeOut = 10000

    val sparkConf = new SparkConf().setAppName("Kafka-Offset-Management-Blog")
      .setMaster("local[4]") // local master for testing on a workstation; remove when submitting to a cluster
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(batchDuration.toLong))
    val topics = topicsSet.toArray
    val topic = topics(0)

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> bootstrapServers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> consumerGroupID,
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    /*
      Create a dummy process that simply returns the message as is.
     */
    def processMessage(message: ConsumerRecord[String, String]): ConsumerRecord[String, String] = {
      message
    }

    /*
      Save offsets into HBase.
     */
    def saveOffsets(TOPIC_NAME: String,
                    GROUP_ID: String,
                    offsetRanges: Array[OffsetRange],
                    hbaseTableName: String,
                    batchTime: org.apache.spark.streaming.Time) = {
      val hbaseConf = HBaseConfiguration.create()
      hbaseConf.addResource("src/main/resources/hbase-site.xml")
      val conn = ConnectionFactory.createConnection(hbaseConf)
      val table = conn.getTable(TableName.valueOf(hbaseTableName))
      // row key: <topic>:<group>:<batch time in ms>
      val rowKey = TOPIC_NAME + ":" + GROUP_ID + ":" + String.valueOf(batchTime.milliseconds)
      val put = new Put(rowKey.getBytes)
      for (offset <- offsetRanges) {
        // one column per partition, value = untilOffset of this batch
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes(offset.partition.toString),
          Bytes.toBytes(offset.untilOffset.toString))
      }
      table.put(put)
      conn.close()
    }

    /*
      Returns the last committed offsets for all partitions of a given topic from HBase in the following cases:
      - CASE 1: the Spark Streaming job is started for the first time. The function gets the number of topic
        partitions from ZooKeeper and, for each partition, returns the last committed offset as 0.
      - CASE 2: the Spark Streaming job is restarted and the number of partitions in the topic has not changed.
        The last committed offsets for each topic-partition are returned from HBase as is.
      - CASE 3: the Spark Streaming job is restarted and the number of partitions in the topic has increased.
        For the old partitions, the last committed offsets are returned from HBase as is; for the newly added
        partitions, the function returns 0.
     */
    def getLastCommittedOffsets(TOPIC_NAME: String,
                                GROUP_ID: String,
                                hbaseTableName: String,
                                zkQuorum: String,
                                zkRootDir: String,
                                sessionTimeout: Int,
                                connectionTimeOut: Int): Map[TopicPartition, Long] = {
      val hbaseConf = HBaseConfiguration.create()
      hbaseConf.addResource("src/main/resources/hbase-site.xml")

      // number of partitions for the topic, according to ZooKeeper
      val zkUrl = zkQuorum + "/" + zkRootDir
      val zkClientAndConnection = ZkUtils.createZkClientAndConnection(zkUrl, sessionTimeout, connectionTimeOut)
      val zkUtils = new ZkUtils(zkClientAndConnection._1, zkClientAndConnection._2, false)
      val zKNumberOfPartitionsForTopic = zkUtils.getPartitionsForTopics(Seq(TOPIC_NAME)).get(TOPIC_NAME).toList.head.size

      // connect to HBase to retrieve the last committed offsets
      val conn = ConnectionFactory.createConnection(hbaseConf)
      val table = conn.getTable(TableName.valueOf(hbaseTableName))
      val startRow = TOPIC_NAME + ":" + GROUP_ID + ":" + String.valueOf(System.currentTimeMillis())
      val stopRow = TOPIC_NAME + ":" + GROUP_ID + ":" + 0
      // reversed scan from "now" back to 0: the first row returned belongs to the most recent batch
      val scan = new Scan()
      val scanner = table.getScanner(scan.setStartRow(startRow.getBytes).setStopRow(stopRow.getBytes).setReversed(true))
      val result = scanner.next()

      // number of partitions recorded in HBase for this topic; stays 0 if no row was found
      var hbaseNumberOfPartitionsForTopic = 0
      if (result != null) {
        // listCells returns one cell per column (i.e. per partition) under the "info" family
        hbaseNumberOfPartitionsForTopic = result.listCells().size()
      }

      val fromOffsets = collection.mutable.Map[TopicPartition, Long]()
      if (hbaseNumberOfPartitionsForTopic == 0) {
        // CASE 1: nothing in HBase yet, initialize fromOffsets to the beginning
        for (partition <- 0 to zKNumberOfPartitionsForTopic - 1) {
          fromOffsets += (new TopicPartition(TOPIC_NAME, partition) -> 0)
        }
      } else if (zKNumberOfPartitionsForTopic > hbaseNumberOfPartitionsForTopic) {
        // CASE 3: new partitions have been added to the existing Kafka topic
        for (partition <- 0 to hbaseNumberOfPartitionsForTopic - 1) {
          val fromOffset = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes(partition.toString)))
          fromOffsets += (new TopicPartition(TOPIC_NAME, partition) -> fromOffset.toLong)
        }
        // start the newly added partitions from 0
        for (partition <- hbaseNumberOfPartitionsForTopic to zKNumberOfPartitionsForTopic - 1) {
          fromOffsets += (new TopicPartition(TOPIC_NAME, partition) -> 0)
        }
      } else {
        // CASE 2: initialize fromOffsets from the last run
        for (partition <- 0 to hbaseNumberOfPartitionsForTopic - 1) {
          val fromOffset = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes(partition.toString)))
          fromOffsets += (new TopicPartition(TOPIC_NAME, partition) -> fromOffset.toLong)
        }
      }
      scanner.close()
      conn.close()
      fromOffsets.toMap
    }

    val fromOffsets = getLastCommittedOffsets(topic, consumerGroupID, hbaseTableName, zkQuorum,
      zkKafkaRootDir, zkSessionTimeOut, zkConnectionTimeOut)

    // on the very first run with a brand-new topic, Assign will throw the error described above
    val inputDStream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Assign[String, String](fromOffsets.keys, kafkaParams, fromOffsets)
    )
    // if it does, use the API below instead
    // val inputDStream = KafkaUtils.createDirectStream[String, String](
    //   ssc,
    //   PreferConsistent,
    //   Subscribe[String, String](topics, kafkaParams)
    // )

    /*
      For each RDD in the DStream, apply a map transformation that processes the message,
      then persist this batch's offset ranges to HBase.
     */
    inputDStream.foreachRDD((rdd, batchTime) => {
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach(offset => println(offset.topic, offset.partition, offset.fromOffset, offset.untilOffset))
      val newRDD = rdd.map(message => processMessage(message))
      newRDD.count()
      saveOffsets(topic, consumerGroupID, offsetRanges, hbaseTableName, batchTime) // save the offsets to HBase
    })

    // print the number of messages processed in each batch
    inputDStream.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}