Reposting is welcome; please credit the source: 徽沪一郎.

Overview

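Spark application development is very hands-on, and much of the time tends to go into setting up and running the environment; good guidance can shorten the development cycle considerably. Spark Streaming integrates with many third-party systems, yet documentation on how to actually get the examples in the source tree running is sparse and lacks detail.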

This post explains how to run KafkaWordCount. Doing so requires setting up a Kafka cluster, so the steps are spelled out in as much detail as possible.

Setting Up a Kafka Cluster

Step 1: Download and extract Kafka 0.8.1.1

wget https://archive.apache.org/dist/kafka/0.8.1.1/kafka_2.10-0.8.1.1.tgz

tar zvxf kafka_2.10-0.8.1.1.tgz

cd kafka_2.10-0.8.1.1
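If the tar step fails, one possible cause is that the download link returned an HTML mirror-selection page instead of the tarball. The following is a minimal sketch of a sanity check; the function name is_gzip_archive is hypothetical and not part of Kafka.

```shell
# is_gzip_archive FILE (hypothetical helper): succeed only if FILE exists
# and its contents form a valid gzip stream. Reading via stdin avoids
# gzip's file-suffix checks.
is_gzip_archive() {
  [ -f "$1" ] && gzip -t < "$1" 2>/dev/null
}

# Usage, with the file name from step 1:
#   is_gzip_archive kafka_2.10-0.8.1.1.tgz || echo "not a gzip archive; download again"
```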

Step 2: Start ZooKeeper

bin/zookeeper-server-start.sh config/zookeeper.properties

Step 3: Edit the configuration file config/server.properties and add the following

host.name=localhost

# Hostname the broker will advertise to producers and consumers. If not set, it uses the

# value for "host.name" if configured.  Otherwise, it will use the value returned from

# java.net.InetAddress.getCanonicalHostName().

advertised.host.name=localhost
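Editing the file by hand works fine; for repeatable setups, the same change could be scripted. The sketch below assumes a hypothetical helper, set_property, that replaces an existing uncommented KEY=VALUE line or appends one if none exists. It only handles lines written without spaces around the equals sign.

```shell
# set_property FILE KEY VALUE (hypothetical helper)
# Replaces every uncommented "KEY=..." line with "KEY=VALUE",
# or appends "KEY=VALUE" when no such line exists.
# Limitation: "KEY = VALUE" (with spaces) is not recognised.
set_property() {
  file=$1
  key=$2
  value=$3
  tmp="$file.tmp.$$"
  awk -v k="$key" -v v="$value" '
    BEGIN { FS = "="; done = 0 }
    $1 == k { print k "=" v; done = 1; next }   # replace existing setting
    { print }                                   # keep every other line
    END { if (!done) print k "=" v }            # append when absent
  ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Usage, run from the Kafka directory:
#   set_property config/server.properties host.name localhost
#   set_property config/server.properties advertised.host.name localhost
```

Running the helper twice leaves a single line per key, so it is safe to re-run.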

Step 4: Start the Kafka server

bin/kafka-server-start.sh config/server.properties

Step 5: Create a topic

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1  --topic test

Verify that the topic was created

bin/kafka-topics.sh --list --zookeeper localhost:2181

If everything is working, this prints test
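The visual check above can also be turned into a scripted one. This is a minimal sketch; topic_exists is a hypothetical helper, and it assumes the list command prints one topic name per line.

```shell
# topic_exists NAME (hypothetical helper): read a topic list on stdin,
# one name per line, and succeed only if NAME appears as a whole line.
topic_exists() {
  grep -qxF -- "$1"
}

# Usage, run from the Kafka directory:
#   bin/kafka-topics.sh --list --zookeeper localhost:2181 | topic_exists test \
#     && echo "topic test is present"
```

Matching whole lines with a fixed string avoids false positives from topics such as test2.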

Step 6: Start a producer and send messages

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

## Once the producer has started, type the following to test

This is a message

This is another message

Step 7: Start a consumer and receive messages

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

### Once the consumer has started, and if all is well, it displays what was typed into the producer

This is a message

This is another message

Running KafkaWordCount

The KafkaWordCount source file is located in the Spark source tree at examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala

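The file does include usage notes (shown below), but without some prior knowledge of Kafka it is far from obvious what the parameters mean or how to fill them in.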

/**

 * Consumes messages from one or more topics in Kafka and does wordcount.

 * Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>

 *   <zkQuorum> is a list of one or more zookeeper servers that make quorum

 *   <group> is the name of kafka consumer group

 *   <topics> is a list of one or more kafka topics to consume from

 *   <numThreads> is the number of threads the kafka consumer should use

 *

 * Example:

 *    `$ bin/run-example \

 *      org.apache.spark.examples.streaming.KafkaWordCount zoo01,zoo02,zoo03 \

 *      my-consumer-group topic1,topic2 1`

 */

object KafkaWordCount {

  def main(args: Array[String]) {

    if (args.length < 4) {

      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")

      System.exit(1)

    }

    StreamingExamples.setStreamingLogLevels()

    val Array(zkQuorum, group, topics, numThreads) = args

    val sparkConf = new SparkConf().setAppName("KafkaWordCount")

    val ssc =  new StreamingContext(sparkConf, Seconds(2))

    ssc.checkpoint("checkpoint")

    val topicpMap = topics.split(",").map((_,numThreads.toInt)).toMap

    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap).map(_._2)

    val words = lines.flatMap(_.split(" "))

    val wordCounts = words.map(x => (x, 1L))

      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)

    wordCounts.print()

    ssc.start()

    ssc.awaitTermination()

  }

}
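To get a feel for what the job computes, the counting step can be approximated offline with ordinary shell tools. The sketch below is not how Spark does it: it ignores the sliding window entirely and simply counts words in whatever text arrives on stdin. The function name word_count is hypothetical, and unlike split(" ") in the Scala code it drops empty tokens produced by repeated spaces.

```shell
# word_count (hypothetical helper): read lines on stdin, split them on
# spaces, and print one "word count" pair per distinct word.
# LC_ALL=C keeps the sort order independent of the local locale.
word_count() {
  tr -s ' ' '\n' \
    | awk 'NF { n[$0]++ } END { for (w in n) print w, n[w] }' \
    | LC_ALL=C sort
}

# Usage with the two test messages typed into the console producer earlier:
#   printf 'This is a message\nThis is another message\n' | word_count
```

Comparing this output with what KafkaWordCount prints for the same input is a quick way to confirm that the streaming job sees the messages.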

With the motivation for this post out of the way, let's look at how to actually run KafkaWordCount.

Step 1: Stop the kafka-console-producer and kafka-console-consumer started earlier

Step 2: Run KafkaWordCountProducer

bin/run-example org.apache.spark.examples.streaming.KafkaWordCountProducer localhost:9092 test 3 5

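The arguments are as follows: localhost:9092 is the address and port of the Kafka broker the producer sends to, test is the topic, 3 is the number of messages sent per second, and 5 is the number of words in each message.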
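To make the last two arguments concrete, here is a rough sketch of what one batch of such messages could look like. It is not the producer's actual implementation: make_batch is a hypothetical helper, and using random single digits as the "words" is an assumption made purely for illustration.

```shell
# make_batch MESSAGES WORDS (hypothetical helper): print MESSAGES lines,
# each containing WORDS space-separated "words".
# Assumption: each word is a random digit 0-9, chosen only for illustration.
make_batch() {
  awk -v m="$1" -v w="$2" 'BEGIN {
    srand()
    for (i = 1; i <= m; i++) {
      line = ""
      for (j = 1; j <= w; j++) {
        line = line (j > 1 ? " " : "") int(rand() * 10)
      }
      print line
    }
  }'
}

# Usage: print one batch of 3 messages with 5 words each
#   make_batch 3 5
```

Because the console producer from step 6 reads messages from standard input, a batch like this could also be piped into it as an alternative way to feed the topic.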

Step 3: Run KafkaWordCount

bin/run-example org.apache.spark.examples.streaming.KafkaWordCount localhost:2181 test-consumer-group test 1

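The arguments are as follows: localhost:2181 is the address ZooKeeper listens on; test-consumer-group is the name of the consumer group, which must match the group.id setting in $KAFKA_HOME/config/consumer.properties; test is the topic; and 1 is the number of threads.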
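Rather than copying the group name by hand, it could be read from the properties file directly. The following is a minimal sketch; get_property is a hypothetical helper, and it assumes the setting is written as KEY=VALUE with no spaces around the equals sign.

```shell
# get_property FILE KEY (hypothetical helper): print the value of the last
# uncommented "KEY=VALUE" line in FILE; print nothing if KEY is absent.
get_property() {
  awk -F= -v k="$2" '
    $1 == k { v = substr($0, length(k) + 2) }   # text after "KEY="
    END { if (v != "") print v }
  ' "$1"
}

# Usage: look up the group name, then pass it to KafkaWordCount
# (run-example is started from the Spark directory):
#   group=$(get_property "$KAFKA_HOME/config/consumer.properties" group.id)
#   bin/run-example org.apache.spark.examples.streaming.KafkaWordCount \
#     localhost:2181 "$group" test 1
```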
