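Kafka was originally developed at LinkedIn as a distributed messaging system and later became part of Apache. It is written in Scala and Java and is widely used for its horizontal scalability and high throughput. A growing number of open-source distributed processing systems, such as Cloudera, Apache Storm, and Spark, support integration with Kafka.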

Integrating Spark Streaming with Kafka is one of the most common scenarios in enterprise applications.

I. Installing Kafka

Reference documentation:

http://kafka.apache.org/quickstart#quickstart_createtopic

1. Install Java

2. Install a ZooKeeper cluster

Reference: http://www.cnblogs.com/wcwen1990/p/6652105.html

3. Install Scala

4. Install Kafka

Download the Kafka distribution:

https://archive.apache.org/dist/kafka/0.8.2.1/kafka_2.10-0.8.2.1.tgz

Extract the Kafka archive:

# tar -zxvf kafka_2.10-0.8.2.1.tgz -C /opt/cdh-5.3.6/

# chown -R hadoop:hadoop /opt/cdh-5.3.6/kafka_2.10-0.8.2.1/

Delete the ZooKeeper jar bundled in Kafka's libs directory, then copy the ZooKeeper jar from your own cluster installation into that directory:

$ rm -rf libs/zookeeper-3.4.6.jar

$ cp /opt/cdh-5.3.6/zookeeper-3.4.5-cdh5.3.6/zookeeper-3.4.5-cdh5.3.6.jar libs/

5. Configure Kafka

5.1) Edit server.properties:

host.name=chavin.king

log.dirs=/opt/cdh-5.3.6/kafka_2.10-0.8.2.1/kafka-logs

zookeeper.connect=chavin.king:2181

Edit producer.properties:

metadata.broker.list=chavin.king:9092

Edit consumer.properties:

zookeeper.connect=chavin.king:2181

5.2) Start the Kafka server

$ bin/kafka-server-start.sh config/server.properties

$ jps

14020 NameNode

57749 Jps

14776 QuorumPeerMain

57690 Kafka

14507 NodeManager

14235 ResourceManager

14093 DataNode

14686 JobHistoryServer

57663 ZooKeeperMain

[zk: localhost:2181(CONNECTED) 3] ls /

[controller, controller_epoch, brokers, zookeeper, admin, consumers, config, hbase]

5.3) Create a topic

$ bin/kafka-topics.sh --create --zookeeper chavin.king:2181 --replication-factor 1 --partitions 1 --topic test

$ bin/kafka-topics.sh --list --zookeeper chavin.king:2181

5.4) Start a producer and send data

$ bin/kafka-console-producer.sh --broker-list chavin.king:9092 --topic test

5.5) Start a consumer and read data

$ bin/kafka-console-consumer.sh --zookeeper chavin.king:2181 --topic test --from-beginning

Type data into the producer shell window; the same data is printed in the consumer window.

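II. Integrating Spark Streaming with Kafka

Reference documentation: http://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html

1) Preparation

1. Build Spark to obtain the Kafka integration jar:

Reference: http://www.cnblogs.com/wcwen1990/p/7688027.html

Note: integrating Spark Streaming with Flume or Kafka requires some supporting jars. Building Spark generates them automatically under the external directory, so Spark is built here to obtain them.

The main jar needed to integrate Spark Streaming with Kafka is spark-streaming-kafka_2.10-1.3.0.jar.

2. Copy the required jars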

$ cp external/kafka/target/spark-streaming-kafka_2.10-1.3.0.jar /opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/

$ cp libs/kafka_2.10-0.8.2.1.jar libs/kafka-clients-0.8.2.1.jar libs/zkclient-0.3.jar libs/metrics-core-2.2.0.jar /opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/

[externalLibs]$ ls

kafka_2.10-0.8.2.1.jar

kafka-clients-0.8.2.1.jar

metrics-core-2.2.0.jar

spark-streaming-kafka_2.10-1.3.0.jar

zkclient-0.3.jar
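
The five jars listed above are passed to spark-shell later in this post as one long comma-separated --jars value, which is easy to mistype. As a small convenience, the string can be assembled in Scala. This is only a sketch: the helper name jarsArg is made up for illustration, and the directory and file names are the ones used in this post.

```scala
// Directory and jar names as used in this post.
val libDir = "/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs"

val jarNames = Seq(
  "spark-streaming-kafka_2.10-1.3.0.jar",
  "kafka_2.10-0.8.2.1.jar",
  "kafka-clients-0.8.2.1.jar",
  "zkclient-0.3.jar",
  "metrics-core-2.2.0.jar"
)

// Hypothetical helper: prefix every jar with the directory and join with commas,
// which is the format spark-shell expects for --jars.
def jarsArg(dir: String, names: Seq[String]): String =
  names.map(name => s"$dir/$name").mkString(",")

// Print the full command line so it can be copied into a terminal.
println(s"bin/spark-shell --master local[2] --jars ${jarsArg(libDir, jarNames)}")
```

The printed command is equivalent to the hand-written one used in the steps that follow.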

2) Integration approach 1: Receiver-based Approach

1. Write a Spark Streaming word count that reads from Kafka

import java.util.HashMap

import org.apache.spark._

import org.apache.spark.streaming._

import org.apache.spark.streaming.StreamingContext._

import org.apache.spark.streaming.kafka._

val ssc = new StreamingContext(sc, Seconds(5))

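// Topic name -> number of consumer threads the receiver uses for that topic
val topicMap = Map("test" -> 1)

// Read (key, message) pairs from Kafka through ZooKeeper and keep only the message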

val lines = KafkaUtils.createStream(ssc, "chavin.king:2181", "testWordCountGroup", topicMap).map(_._2)

val words = lines.flatMap(_.split(" "))

val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

wordCounts.print()

ssc.start() // Start the computation

ssc.awaitTermination() // Wait for the computation to terminate

2. Start spark-shell in local mode and run the program from step 1

bin/spark-shell --master local[2] --jars \

/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/spark-streaming-kafka_2.10-1.3.0.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/kafka_2.10-0.8.2.1.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/kafka-clients-0.8.2.1.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/zkclient-0.3.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/metrics-core-2.2.0.jar

scala> import java.util.HashMap

import java.util.HashMap

scala> import org.apache.spark._

import org.apache.spark._

scala> import org.apache.spark.streaming._

import org.apache.spark.streaming._

scala> import org.apache.spark.streaming.StreamingContext._

import org.apache.spark.streaming.StreamingContext._

scala> import org.apache.spark.streaming.kafka._

import org.apache.spark.streaming.kafka._

scala> val ssc = new StreamingContext(sc, Seconds(5))

ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@1a28f9a0

scala> val topicMap = Map("test" -> 1)

topicMap: scala.collection.immutable.Map[String,Int] = Map(test -> 1)

scala> val lines = KafkaUtils.createStream(ssc, "chavin.king:2181", "testWordCountGroup", topicMap).map(_._2)

lines: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.MappedDStream@27267641

scala>

scala> val words = lines.flatMap(_.split(" "))

words: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.FlatMappedDStream@169b0639

scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

wordCounts: org.apache.spark.streaming.dstream.DStream[(String, Int)] = org.apache.spark.streaming.dstream.ShuffledDStream@14f2b1ba

scala> wordCounts.print()

scala> ssc.start()

scala> ssc.awaitTermination()

3. Test

Type the following in the Kafka producer shell:

hadoop oracle mysql mysql mysql

At this point the Kafka consumer shows the following output:

hadoop oracle mysql mysql mysql

The Spark Streaming side also shows the following output:

-------------------------------------------

Time: 1500021590000 ms

-------------------------------------------

(mysql,3)

(oracle,1)

(hadoop,1)
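
To make the per-batch computation concrete, the same flatMap / map / reduce steps can be reproduced on plain Scala collections, without Spark or Kafka. This is only a sketch: countWords is a made-up name, groupBy plus sum stands in for reduceByKey, and the batch is modeled as (key, message) pairs with a null key, assuming the console producer sends messages without a key.

```scala
// One micro-batch as (key, message) pairs; the key is assumed to be null here.
val batch: Seq[(String, String)] = Seq((null, "hadoop oracle mysql mysql mysql"))

// Hypothetical stand-in for the DStream pipeline applied to a single batch.
def countWords(records: Seq[(String, String)]): Map[String, Int] = {
  val lines = records.map(_._2)            // keep only the message, like .map(_._2)
  val words = lines.flatMap(_.split(" "))  // split every line into words
  words
    .map(w => (w, 1))                      // pair each word with a count of 1
    .groupBy(_._1)                         // group the pairs by word
    .map { case (w, pairs) => (w, pairs.map(_._2).sum) } // sum the counts per word
}

val counts = countWords(batch)
counts.foreach(println)
```

Each 5-second batch is counted on its own, which is why a word typed in one batch does not affect the counts printed for the next batch.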

3) Integration approach 2: Direct Approach (No Receivers)

1. Write a Spark Streaming word count that reads from Kafka

import kafka.serializer.StringDecoder

import org.apache.spark._

import org.apache.spark.streaming._

import org.apache.spark.streaming.StreamingContext._

import org.apache.spark.streaming.kafka._

val ssc = new StreamingContext(sc, Seconds(5))

val kafkaParams = Map[String, String]("metadata.broker.list" -> "chavin.king:9092")

val topicsSet = Set("test")

// read data

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

val lines = messages.map(_._2)

val words = lines.flatMap(_.split(" "))

val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

wordCounts.print()

ssc.start() // Start the computation

ssc.awaitTermination() // Wait for the computation to terminate

2. Start spark-shell in local mode and run the program from step 1

bin/spark-shell --master local[2] --jars \

/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/spark-streaming-kafka_2.10-1.3.0.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/kafka_2.10-0.8.2.1.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/kafka-clients-0.8.2.1.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/zkclient-0.3.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/metrics-core-2.2.0.jar

scala> import kafka.serializer.StringDecoder

import kafka.serializer.StringDecoder

scala> import org.apache.spark._

import org.apache.spark._

scala> import org.apache.spark.streaming._

import org.apache.spark.streaming._

scala> import org.apache.spark.streaming.StreamingContext._

import org.apache.spark.streaming.StreamingContext._

scala> import org.apache.spark.streaming.kafka._

import org.apache.spark.streaming.kafka._

scala>

scala> val ssc = new StreamingContext(sc, Seconds(5))

ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@2d05daca

scala>

scala> val kafkaParams = Map[String, String]("metadata.broker.list" -> "chavin.king:9092")

kafkaParams: scala.collection.immutable.Map[String,String] = Map(metadata.broker.list -> chavin.king:9092)

scala> val topicsSet = Set("test")

topicsSet: scala.collection.immutable.Set[String] = Set(test)

scala>

scala> // read data

scala> val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

17/07/14 16:59:31 INFO VerifiableProperties: Verifying properties

17/07/14 16:59:31 INFO VerifiableProperties: Property group.id is overridden to

17/07/14 16:59:31 INFO VerifiableProperties: Property zookeeper.connect is overridden to

messages: org.apache.spark.streaming.dstream.InputDStream[(String, String)] = org.apache.spark.streaming.kafka.DirectKafkaInputDStream@375c2870

scala>

scala> val lines = messages.map(_._2)

lines: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.MappedDStream@1dda179e

scala> val words = lines.flatMap(_.split(" "))

words: org.apache.spark.streaming.dstream.DStream[String] = org.apache.spark.streaming.dstream.FlatMappedDStream@996294c

scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

wordCounts: org.apache.spark.streaming.dstream.DStream[(String, Int)] = org.apache.spark.streaming.dstream.ShuffledDStream@19cd9e6a

scala> wordCounts.print()

scala> ssc.start()

scala> ssc.awaitTermination()

3. Test

Type the following in the Kafka producer shell:

hadoop oracle mysql mysql mysql

At this point the Kafka consumer shows the following output:

hadoop oracle mysql mysql mysql

The Spark Streaming side also shows the following output:

-------------------------------------------

Time: 1500021590000 ms

-------------------------------------------

(mysql,3)

(oracle,1)

(hadoop,1)
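
According to the Spark documentation referenced at the start of this part, the direct approach does not use a receiver: for every batch it asks Kafka for the latest offset of each topic partition and reads a defined offset range from it. The bookkeeping can be illustrated with a toy model in plain Scala. This is not Spark's API: PartitionRange and planBatch are made-up names, and starting a previously unseen partition at its latest offset is an assumption about the default behavior.

```scala
// Hypothetical model: the slice [fromOffset, untilOffset) of one partition read in one batch.
case class PartitionRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
  def count: Long = untilOffset - fromOffset
}

// consumed: next offset to read for each partition already seen.
// latest:   end offset of each partition at batch time.
def planBatch(topic: String, consumed: Map[Int, Long], latest: Map[Int, Long]): Seq[PartitionRange] =
  latest.toSeq.sortBy(_._1).map { case (partition, until) =>
    // Assumption: an unseen partition starts at its latest offset, giving an empty range.
    PartitionRange(topic, partition, consumed.getOrElse(partition, until), until)
  }

// Topic "test" has one partition; offsets 0-4 were read earlier and three new messages arrived.
val ranges = planBatch("test", consumed = Map(0 -> 5L), latest = Map(0 -> 8L))
ranges.foreach { r =>
  println(s"${r.topic}-${r.partition}: [${r.fromOffset}, ${r.untilOffset}) -> ${r.count} messages")
}
```

In this model one range is produced per Kafka partition, so the work for a batch is fully described by a short list of ranges rather than by data buffered in a receiver.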

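This completes the demonstration of both ways to integrate Spark Streaming with Kafka. The examples above show, however, that Spark Streaming so far computes a result for each batch of input independently, whereas enterprise applications often need to accumulate values across many batches. How can that be done? See the example below.

4) Using updateStateByKey to accumulate input across batches in Spark Streaming

1. Create a file named udsb.scala with the following content: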

$ cat udsb.scala

import kafka.serializer.StringDecoder

import org.apache.spark._

import org.apache.spark.streaming._

import org.apache.spark.streaming.StreamingContext._

import org.apache.spark.streaming.kafka._

val ssc = new StreamingContext(sc, Seconds(5))

// updateStateByKey requires a checkpoint directory; "." is the current working directory
ssc.checkpoint(".")

val kafkaParams = Map[String, String]("metadata.broker.list" -> "chavin.king:9092")

val topicsSet = Set("test")

// Add this batch's counts for a word to its running total (0 if the word is new)
val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  val currentCount = values.sum
  val previousCount = state.getOrElse(0)
  Some(currentCount + previousCount)
}

// read data

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

val lines = messages.map(_._2)

val words = lines.flatMap(_.split(" "))

val wordDstream = words.map(x => (x, 1))

val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)

stateDstream.print()

ssc.start()

ssc.awaitTermination()

2. Start spark-shell in local mode and run the program from step 1

bin/spark-shell --master local[2] --jars \

/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/spark-streaming-kafka_2.10-1.3.0.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/kafka_2.10-0.8.2.1.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/kafka-clients-0.8.2.1.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/zkclient-0.3.jar,/opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/externalLibs/metrics-core-2.2.0.jar

scala> :load /opt/cdh-5.3.6/spark-1.3.0-bin-2.5.0-cdh5.3.6/udsb.scala

3. Test

Type the following in the Kafka producer shell:

3.1) First input: hadoop oracle mysql

The Spark Streaming side shows the following output:

-------------------------------------------

Time: 1500023985000 ms

-------------------------------------------

(mysql,1)

(oracle,1)

(hadoop,1)

3.2) Second input: hadoop oracle mysql

The Spark Streaming side shows the following output:

-------------------------------------------

Time: 1500023985000 ms

-------------------------------------------

(mysql,2)

(oracle,2)

(hadoop,2)

3.3) Third input: hadoop oracle mysql

The Spark Streaming side shows the following output:

-------------------------------------------

Time: 1500023985000 ms

-------------------------------------------

(mysql,3)

(oracle,3)

(hadoop,3)
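
The way the counts grow from 1 to 2 to 3 can be replayed on plain Scala collections, using the same update function as udsb.scala. This is only a sketch: step is a made-up stand-in for one updateStateByKey pass, not Spark's implementation, and each batch is modeled as a sequence of words.

```scala
// Same update function as in udsb.scala.
val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  val currentCount = values.sum
  val previousCount = state.getOrElse(0)
  Some(currentCount + previousCount)
}

// Hypothetical stand-in for one updateStateByKey pass over a single batch of words.
def step(state: Map[String, Int], batchWords: Seq[String]): Map[String, Int] = {
  // New values per key in this batch: one 1 for every occurrence of the word.
  val newValues: Map[String, Seq[Int]] =
    batchWords.groupBy(identity).map { case (w, ws) => (w, ws.map(_ => 1)) }
  // The function is applied to every key that has state or new values.
  val keys = state.keySet ++ newValues.keySet
  keys
    .map(k => k -> updateFunc(newValues.getOrElse(k, Seq.empty), state.get(k)))
    .collect { case (k, Some(v)) => k -> v } // a None result would drop the key
    .toMap
}

// Three batches, each carrying the same line as in the test above.
val batches = Seq.fill(3)("hadoop oracle mysql".split(" ").toSeq)
val states = batches.scanLeft(Map.empty[String, Int])(step).tail
states.foreach(println)
```

The state map plays the role of the data Spark keeps in the checkpoint directory: each batch only contributes its own counts, and the update function folds them into the running totals.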
