Receiving Kafka data with Spark Streaming and storing it in HBase

fly

spark

hbase

kafka

This post is based mainly on the article at https://yq.aliyun.com/articles/60712. That article appears to use Spark 1.x, so what I have done here is adapt it to the Spark 2.x API.

Producing data to Kafka

Without further ado, here is the code:

import java.util.{Properties, UUID}

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object KafkaProducerTest {
  def main(args: Array[String]): Unit = {
    val topic = "test"
    val brokers = "localhost:9092"

    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", classOf[StringSerializer])
    props.put("value.serializer", classOf[StringSerializer])
    // linger.ms should be between 0 and 100 ms
    props.put("linger.ms", "50")
    // Tune batch.size and buffer.memory to the message size and send rate
    props.put("batch.size", "16384")
    props.put("buffer.memory", "1638400")
    // Removed from the original settings: delete.topic.enable is a broker setting, and
    // queue.buffering.max.messages, queue.enqueue.timeout.ms and producer.type belong
    // to the old Scala producer. The Java KafkaProducer does not recognize any of them.

    val producer = new KafkaProducer[String, String](props)
    try {
      for (i <- 1001 to 2000) {
        // Key: the first five characters of a random UUID
        val key = UUID.randomUUID().toString.substring(0, 5)
        val value = "fly_" + i + "_" + key
        producer.send(new ProducerRecord[String, String](topic, key, value))
      }
      // Push any buffered records out before exiting
      producer.flush()
    } finally {
      producer.close()
    }
  }
}

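The produced records have the form (key, value) = (uuid, fly_i_key), where the key is the first five characters of a random UUID and i runs from 1001 to 2000.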
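To make the format concrete, here is a minimal sketch of how such a value splits back into its three fields, which is the same `split("_")` step the streaming job relies on. The `Record` case class and the `RecordParser.parse` helper are hypothetical names introduced for illustration; they are not part of the original code.

```scala
import scala.util.Try

// Hypothetical type, not from the original post: one parsed "fly_<i>_<key>" value.
final case class Record(name: String, index: Int, key: String)

object RecordParser {
  // Returns None when the value does not have exactly three fields
  // or when the middle field is not an integer.
  def parse(value: String): Option[Record] =
    value.split("_") match {
      case Array(name, index, key) =>
        Try(index.trim.toInt).toOption.map(i => Record(name, i, key))
      case _ => None
    }
}
```

Returning an `Option` rather than calling `toInt` directly means a malformed message can be skipped instead of failing the whole batch.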

Reading from Kafka with Spark Streaming and saving to HBase

Once Kafka holds some data, Spark Streaming reads it and stores it in HBase. Here is the code:

import java.util.UUID

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Mutation, Put}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.mapreduce.OutputFormat
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Spark Streaming job that writes to HBase:
 * reads Kafka messages, passes them through Spark SQL, and saves the result to HBase.
 */
object OBDSQL {
  case class Person(name: String, age: Int, key: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("sparkSql")
      .master("local[4]")
      .getOrCreate()
    val sc = spark.sparkContext
    val ssc = new StreamingContext(sc, Seconds(5))

    val topics = Array("test")
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "use_a_separate_group_id_for_each_stream_fly",
      // Other possible values: "latest", "none"
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val lines = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    lines.foreachRDD((rdd: RDD[ConsumerRecord[String, String]]) => {
      import spark.implicits._
      if (!rdd.isEmpty()) {
        // Register the batch as a temporary view
        rdd.map(_.value.split("_"))
          .map(p => Person(p(0), p(1).trim.toInt, p(2)))
          .toDF
          .createOrReplaceTempView("temp")
        // Query it with Spark SQL
        val rs = spark.sql("select * from temp")

        // Create the HBase configuration
        val hconf = HBaseConfiguration.create
        hconf.set("hbase.zookeeper.quorum", "localhost")
        hconf.set("hbase.zookeeper.property.clientPort", "2181")
        hconf.set("hbase.defaults.for.version.skip", "true")
        // t1 is the table name; the table has one column family, cf1
        hconf.set(TableOutputFormat.OUTPUT_TABLE, "t1")
        hconf.setClass("mapreduce.job.outputformat.class",
          classOf[TableOutputFormat[String]], classOf[OutputFormat[String, Mutation]])
        val jobConf = new JobConf(hconf)

        // Convert every row to an HBase Put; the row key is a random UUID prefix
        rs.rdd.map(line => {
          val put = new Put(Bytes.toBytes(UUID.randomUUID().toString.substring(0, 9)))
          put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes(line.get(0).toString))
          put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("age"), Bytes.toBytes(line.get(1).toString))
          put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("key"), Bytes.toBytes(line.get(2).toString))
          (new ImmutableBytesWritable, put)
        }).saveAsNewAPIHadoopDataset(jobConf)
      }
    })

    // Also print each batch as (name, age, key) tuples
    lines.map(record => record.value.split("_")).map(x => (x(0), x(1), x(2))).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
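
One thing worth noting about the job above: the row key is a random UUID prefix, `enable.auto.commit` is false, and no offsets are committed anywhere. With `auto.offset.reset` set to `earliest`, a restart would most likely re-read the topic and write every message again under fresh row keys. A possible way around this, sketched below under that assumption, is to derive the row key from the record's Kafka coordinates so that a replayed message overwrites its earlier row. The `RowKeys.rowKey` helper is a hypothetical name, not something from the original post.

```scala
object RowKeys {
  // Hypothetical helper: builds a deterministic row key from topic, partition and offset.
  // Zero-padding keeps keys for one partition in offset order when sorted as strings.
  def rowKey(topic: String, partition: Int, offset: Long): String =
    f"$topic-$partition%05d-$offset%019d"
}
```

To use it, the job would need to carry `record.topic`, `record.partition` and `record.offset` from each `ConsumerRecord` alongside the parsed fields and pass the result to `new Put(Bytes.toBytes(...))` in place of the UUID.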
