References first:

Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher): http://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html

Kafka Java client Producer API: http://kafka.apache.org/documentation/#producerapi

Versions:

Spark: 2.1.1
Scala: 2.11.12
Kafka (broker): 2.3.0
spark-streaming-kafka-0-10_2.11: 2.2.0

Development environment:

  Kafka is deployed on three VMs with hostnames coo1, coo2, and coo3 (versions as above); ZooKeeper is 3.4.7.

  Create a topic named xzrz on Kafka, with replication factor 3 and 4 partitions:

./kafka-topics.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --create --topic xzrz --replication-factor 3 --partitions 4

  Prepare the code:

  One part is the Java Kafka producer: KafkaSender.java

  The other is the Scala Spark Streaming Kafka consumer: KafkaStreaming.scala

Kafka producer:

  Maven configuration:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>kafkaTest</groupId>
    <artifactId>kafkaTest</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.3.0</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.13-beta-2</version>
            <scope>compile</scope>
        </dependency>
    </dependencies>
</project>
KafkaSender.java:
package gjm;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.junit.Test;

import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class KafkaSender {
    @Test
    public void producer() throws InterruptedException, ExecutionException {
        Properties props = new Properties();
        props.put("key.serializer", "org.apache.kafka.common.serialization.IntegerSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "coo1:9092,coo2:9092,coo3:9092");
        // props.put(ProducerConfig.BATCH_SIZE_CONFIG, "1024");
        // props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "0");
        // Configure the producer for exactly-once semantics
        props.put("acks", "all");
        props.put("enable.idempotence", "true");
        Producer<Integer, String> kafkaProducer = new KafkaProducer<Integer, String>(props);
        for (int j = 0; j < 1; j++)
            for (int i = 0; i < 100; i++) {
                ProducerRecord<Integer, String> message =
                        new ProducerRecord<Integer, String>("xzrz", "{wo|2019-12-12|1|2|0|5}");
                kafkaProducer.send(message);
            }
        // flush() and close() are mandatory here, much like with stream I/O.
        // The producer batches records into a local buffer; without these two calls
        // you not only waste resources, but messages still sitting in the local
        // batch may never reach Kafka at all.
        kafkaProducer.flush();
        kafkaProducer.close();
    }
}
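The flush/close point can be illustrated with a minimal in-memory sketch. This is not Kafka's actual implementation; the `BufferingProducer` class and its methods are made up for illustration. Records accumulate in a local batch and only reach the "broker" once the batch is flushed, which is why an unflushed producer can silently drop the tail of a run when the JVM exits.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a buffering producer: send() only appends to a local batch,
// and records count as delivered only after flush() pushes the batch out.
class BufferingProducer {
    private final List<String> batch = new ArrayList<>();
    private final List<String> delivered = new ArrayList<>();

    void send(String record) { batch.add(record); }  // buffered locally, not delivered

    void flush() {                                   // push the whole batch out
        delivered.addAll(batch);
        batch.clear();
    }

    int deliveredCount() { return delivered.size(); }
}

public class FlushDemo {
    public static void main(String[] args) {
        BufferingProducer p = new BufferingProducer();
        for (int i = 0; i < 100; i++) p.send("msg-" + i);
        System.out.println("before flush: " + p.deliveredCount()); // 0
        p.flush();
        System.out.println("after flush: " + p.deliveredCount());  // 100
    }
}
```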

Kafka consumer (Spark Streaming + Kafka): KafkaStreaming.scala

  Maven:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>sparkMVN</groupId>
    <artifactId>sparkMVN</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <spark.version>2.1.1</spark.version>
        <hadoop.version>2.7.3</hadoop.version>
        <hbase.version>0.98.17-hadoop2</hbase.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
            <!-- In local mode this scope must stay commented out; otherwise the
                 SparkContext classes will not be found at runtime -->
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>2.2.0</version>
        </dependency>
        <!-- hadoop -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <!-- hbase -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>${hbase.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>${hbase.version}</version>
        </dependency>
    </dependencies>
</project>
KafkaStreaming.scala:
package gjm.sparkDemos

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.slf4j.LoggerFactory

object KafkaStreaming {
  def main(args: Array[String]): Unit = {
    val LOG = LoggerFactory.getLogger(KafkaStreaming.getClass)
    LOG.info("Streaming start----->")
    // local[6]: run with 6 threads here, to see what effect (if any)
    // it has on Kafka consumption
    val conf = new SparkConf().setMaster("local[6]")
      .setAppName("KafkaStreaming")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(3))
    val topics = Array("xzrz")
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "coo1:9092,coo2:9092,coo3:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "fwjkcx",
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
      // "heartbeat.interval.ms" -> (90000: java.lang.Integer),
      // "session.timeout.ms" -> (120000: java.lang.Integer),
      // "group.max.session.timeout.ms" -> (120000: java.lang.Integer),
      // "request.timeout.ms" -> (130000: java.lang.Integer),
      // "fetch.max.wait.ms" -> (120000: java.lang.Integer)
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )
    LOG.info("Streaming had Created----->")
    LOG.info("Streaming Consuming msg----->")
    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition(recordIt => {
        for (record <- recordIt) {
          LOG.info("Message record info: topic-->{}, partition-->{}, checksum-->{}, offset-->{}, value-->{}",
            record.topic(), record.partition().toString, record.checksum().toString,
            record.offset().toString, record.value())
        }
      })
      // Commit offsets only after the outputs for this batch have completed
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
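Each record value in this experiment has the pipe-delimited shape `{wo|2019-12-12|1|2|0|5}`. Inside foreachPartition you would typically split the value into fields before doing anything further with it. Here is a minimal Java sketch of that parsing step; the source does not explain what the fields mean, so no field names are assumed:

```java
import java.util.Arrays;

public class PayloadParser {
    // Strips the surrounding braces and splits on '|'.
    // Example input: {wo|2019-12-12|1|2|0|5}
    static String[] parse(String value) {
        String body = value.substring(1, value.length() - 1); // drop { and }
        return body.split("\\|");
    }

    public static void main(String[] args) {
        String[] fields = parse("{wo|2019-12-12|1|2|0|5}");
        System.out.println(Arrays.toString(fields)); // [wo, 2019-12-12, 1, 2, 0, 5]
    }
}
```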

Verification:

1. Send 100 messages with the producer.
2. Start Kafka's built-in console consumer, with group id test:
sh kafka-console-consumer.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --topic xzrz --from-beginning --group test
3. Start the Spark Streaming job; the batch interval is set to 3 seconds in the code.
4. Use Kafka's consumer-groups tool to check the consumption state of each group:
./kafka-consumer-groups.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --describe --group test
./kafka-consumer-groups.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --describe --group fwjkcx
Results:
1. First, the console consumer group (test):
[root@coo3 bin]# ./kafka-consumer-groups.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --describe --group test

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                     HOST            CLIENT-ID
test            xzrz            0          25              25              0               consumer-1-4adfdb85-45ef-40a5-9127-7bb6239e0e29 /192.168.0.217  consumer-1
test            xzrz            1          25              25              0               consumer-1-4adfdb85-45ef-40a5-9127-7bb6239e0e29 /192.168.0.217  consumer-1
test            xzrz            2          25              25              0               consumer-1-4adfdb85-45ef-40a5-9127-7bb6239e0e29 /192.168.0.217  consumer-1
test            xzrz            3          25              25              0               consumer-1-4adfdb85-45ef-40a5-9127-7bb6239e0e29 /192.168.0.217  consumer-1
All four partitions were assigned to the same consumer instance, which consumed 100 messages in total.
2. The Spark Streaming consumer group (fwjkcx):
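The even 25/25/25/25 split is what you would expect here: the producer sends records without a key, and keyless records are spread across the partitions in turn. The sketch below shows the arithmetic behind that distribution; it illustrates the round-robin idea, not Kafka's actual partitioner code:

```java
public class RoundRobinSketch {
    public static void main(String[] args) {
        int numPartitions = 4;
        int[] counts = new int[numPartitions];
        // Each keyless record goes to the next partition in turn
        for (int i = 0; i < 100; i++) {
            counts[i % numPartitions]++;
        }
        for (int p = 0; p < numPartitions; p++) {
            System.out.println("partition " + p + ": " + counts[p]); // 25 each
        }
    }
}
```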
[root@coo3 bin]# ./kafka-consumer-groups.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --describe --group fwjkcx

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                     HOST            CLIENT-ID
fwjkcx          xzrz            0          25              25              0               consumer-1-0cca92be-5970-4030-abd1-b8552dea9718 /192.168.0.60   consumer-1
fwjkcx          xzrz            1          25              25              0               consumer-1-0cca92be-5970-4030-abd1-b8552dea9718 /192.168.0.60   consumer-1
fwjkcx          xzrz            2          25              25              0               consumer-1-0cca92be-5970-4030-abd1-b8552dea9718 /192.168.0.60   consumer-1
fwjkcx          xzrz            3          25              25              0               consumer-1-0cca92be-5970-4030-abd1-b8552dea9718 /192.168.0.60   consumer-1
Each of the four partitions consumed 25 messages, which matches expectations.
3. One more experiment: change the Spark Streaming master to local[3] and the consumer group to fwjkcx01, then observe:
[root@coo3 bin]# ./kafka-consumer-groups.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --describe --group fwjkcx01

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                     HOST            CLIENT-ID
fwjkcx01        xzrz            0          25              25              0               consumer-1-03542086-c95d-41c2-b199-24a158708b65 /192.168.0.60   consumer-1
fwjkcx01        xzrz            1          25              25              0               consumer-1-03542086-c95d-41c2-b199-24a158708b65 /192.168.0.60   consumer-1
fwjkcx01        xzrz            2          25              25              0               consumer-1-03542086-c95d-41c2-b199-24a158708b65 /192.168.0.60   consumer-1
fwjkcx01        xzrz            3          25              25              0               consumer-1-03542086-c95d-41c2-b199-24a158708b65 /192.168.0.60   consumer-1
The picture is unchanged: the same single consumer instance still reads all four partitions, so the thread count given in local[N] does not control Kafka consumption parallelism.
4. Now send another 1,000 messages and check the consumption state of group fwjkcx again:
[root@coo3 bin]# ./kafka-consumer-groups.sh --bootstrap-server coo3:9092,coo2:9092,coo1:9092 --describe --group fwjkcx

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                     HOST            CLIENT-ID
fwjkcx          xzrz            0          275             275             0               consumer-1-fc238353-1b55-4efa-9c4f-54580ed81b0e /192.168.0.60   consumer-1
fwjkcx          xzrz            1          275             275             0               consumer-1-fc238353-1b55-4efa-9c4f-54580ed81b0e /192.168.0.60   consumer-1
fwjkcx          xzrz            2          275             275             0               consumer-1-fc238353-1b55-4efa-9c4f-54580ed81b0e /192.168.0.60   consumer-1
fwjkcx          xzrz            3          275             275             0               consumer-1-fc238353-1b55-4efa-9c4f-54580ed81b0e /192.168.0.60   consumer-1
Everything looks normal.
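The end offset of 275 per partition checks out: 100 messages from the first run plus 1,000 from this one, spread evenly over the 4 partitions:

```java
public class OffsetCheck {
    public static void main(String[] args) {
        int firstRun = 100;
        int secondRun = 1000;
        int partitions = 4;
        // Total messages divided evenly across the partitions
        int perPartition = (firstRun + secondRun) / partitions;
        System.out.println("expected end offset per partition: " + perPartition); // 275
    }
}
```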
