Spark: implementing word count with the Java API
Approach 1: using reduceByKey
Data file word.txt:
张三
李四
王五
李四
王五
李四
王五
李四
王五
王五
李四
李四
李四
李四
李四
Code:
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class HelloWord {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").appName("Spark").getOrCreate();
        final JavaSparkContext ctx = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // Read the input file; each line is one word.
        RDD<String> rdd = spark.sparkContext().textFile("C:\\Users\\boco\\Desktop\\word.txt", 1);
        JavaRDD<String> javaRDD = rdd.toJavaRDD();

        // Map every word to a (word, 1) pair.
        JavaPairRDD<String, Integer> javaRDDMap = javaRDD.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        // Sum the counts for each word.
        JavaPairRDD<String, Integer> result = javaRDDMap.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer integer, Integer integer2) throws Exception {
                return integer + integer2;
            }
        });

        System.out.println(result.collect());
    }
}
Output:
[(张三,1), (李四,9), (王五,5)]
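For reference, the same job can be written more compactly with Java 8 lambdas, since the Spark 2.x Java API accepts lambdas in place of the PairFunction/Function2 classes. This is a sketch; the class name HelloWordLambda is illustrative, and the file path is the same local example file used above.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class HelloWordLambda {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").appName("Spark").getOrCreate();
        JavaSparkContext ctx = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // Each line of word.txt is a single name, so the line itself is the "word".
        JavaRDD<String> lines = ctx.textFile("C:\\Users\\boco\\Desktop\\word.txt", 1);

        JavaPairRDD<String, Integer> counts = lines
                .mapToPair(s -> new Tuple2<>(s, 1))   // (word, 1)
                .reduceByKey((a, b) -> a + b);        // sum per word

        System.out.println(counts.collect());
        spark.stop();
    }
}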
Approach 2: using Spark SQL
Implementation with Spark SQL:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import java.util.ArrayList;

public class HelloWord {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").appName("Spark").getOrCreate();
        final JavaSparkContext ctx = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // Read the file as rows of text.
        JavaRDD<Row> rows = spark.read().text("C:\\Users\\boco\\Desktop\\word.txt").toJavaRDD();

        // Build a one-column schema named "key".
        ArrayList<StructField> fields = new ArrayList<StructField>();
        StructField field = DataTypes.createStructField("key", DataTypes.StringType, true);
        fields.add(field);
        StructType schema = DataTypes.createStructType(fields);

        // Register the DataFrame as a temporary view and aggregate with SQL.
        Dataset<Row> ds = spark.createDataFrame(rows, schema);
        ds.createOrReplaceTempView("words");
        Dataset<Row> result = spark.sql("select key, count(0) as key_count from words group by key");
        result.show();
    }
}
Result:
+---+---------+
|key|key_count|
+---+---------+
| 王五| 5|
| 李四| 9|
| 张三| 1|
+---+---------+
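The same aggregation can also be expressed with the DataFrame API instead of a SQL string, without building a schema by hand: spark.read().text() already returns a single StringType column named "value", which can be grouped directly. A short sketch, reusing the spark session and imports from the listing above:

// Group directly on the built-in "value" column produced by spark.read().text().
Dataset<Row> counted = spark.read().text("C:\\Users\\boco\\Desktop\\word.txt")
        .groupBy("value")
        .count();   // adds a LongType column named "count"
counted.show();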
Approach 3: real-time stream processing with Spark Streaming
Reference: http://spark.apache.org/docs/latest/streaming-programming-guide.html
First, we create a JavaStreamingContext object, which is the main entry point for all streaming functionality. We create a local StreamingContext with two execution threads, and a batch interval of 1 second.
import org.apache.spark.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;
import scala.Tuple2;

// Create a local StreamingContext with two working threads and a batch interval of 1 second
SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
Using this context, we can create a DStream that represents streaming data from a TCP source, specified as hostname (e.g. localhost) and port (e.g. 9999).
// Create a DStream that will connect to hostname:port, like localhost:9999
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
This lines DStream represents the stream of data that will be received from the data server. Each record in this stream is a line of text. Then, we want to split the lines by space into words.
// Split each line into words
JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());
flatMap is a DStream operation that creates a new DStream by generating multiple new records from each record in the source DStream. In this case, each line will be split into multiple words and the stream of words is represented as the words DStream. Note that we defined the transformation using a FlatMapFunction object. As we will discover along the way, there are a number of such convenience classes in the Java API that help define DStream transformations.
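For comparison, the same split written with the FlatMapFunction convenience class instead of a lambda might look like the sketch below (assuming Spark 2.x, where call returns an Iterator; java.util.Arrays, java.util.Iterator and org.apache.spark.api.java.function.FlatMapFunction would need to be imported):

JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterator<String> call(String x) {
        // Split the line on spaces and return the words one by one.
        return Arrays.asList(x.split(" ")).iterator();
    }
});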
Next, we want to count these words.
// Count each word in each batch
JavaPairDStream<String, Integer> pairs = words.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey((i1, i2) -> i1 + i2);

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print();
The words DStream is further mapped (one-to-one transformation) to a DStream of (word, 1) pairs, using a PairFunction object. Then, it is reduced to get the frequency of words in each batch of data, using a Function2 object. Finally, wordCounts.print() will print a few of the counts generated every second.
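Written with the convenience classes instead of lambdas, these two counting steps would look roughly like the sketch below (equivalent to the lambda forms above; PairFunction, Function2 and Tuple2 come from the same packages used in Approach 1):

JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
        return new Tuple2<>(s, 1);           // map each word to (word, 1)
    }
});
JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer i1, Integer i2) {
        return i1 + i2;                      // sum the counts for each word
    }
});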
Note that when these lines are executed, Spark Streaming only sets up the computation it will perform after it is started, and no real processing has started yet. To start the processing after all the transformations have been setup, we finally call start method.
jssc.start(); // Start the computation
jssc.awaitTermination(); // Wait for the computation to terminate
The complete code can be found in the Spark Streaming example JavaNetworkWordCount.
If you have already downloaded and built Spark, you can run this example as follows. You will first need to run Netcat (a small utility found in most Unix-like systems) as a data server by using
$ nc -lk 9999
Then, in a different terminal, you can start the example by using
$ ./bin/run-example streaming.JavaNetworkWordCount localhost 9999
Complete code:
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class HelloWord {
    public static void main(String[] args) throws InterruptedException {
        // Create a local StreamingContext with a batch interval of 60 seconds
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("NetworkWordCount");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        jsc.setLogLevel("WARN");
        JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.seconds(60));

        // Create a DStream that will connect to hostname:port, like localhost:9999
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("xx.xx.xx.xx", 19999);

        // Split each line into words
        JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());

        // Count each word in each batch
        JavaPairDStream<String, Integer> pairs = words.mapToPair(s -> new Tuple2<>(s, 1));
        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey((i1, i2) -> i1 + i2);

        // Print the first ten elements of each RDD generated in this DStream to the console
        wordCounts.print();

        jssc.start();            // Start the computation
        jssc.awaitTermination(); // Wait for the computation to terminate
    }
}
Test (input typed into the nc session):
[root@abced dx]# nc -lk 19999
hellow wrd
hello word
hello word
hello dkk
hl
hello
hello
hello word
hello word
hello java
hello c@
hello hadoop]
hello spark
hello word
hello kafka
hello c
hello c#
hello .net core
net cre
workd
hle
hello words
hke hjh
hek
hel
hl3
hhk
hke
Program output:
-------------------------------------------
Time: ms
-------------------------------------------
(c,)
(spark,)
(kafka,)
(c#,)
(hello,)
(java,)
(c@,)
(hadoop],)
(word,)
-------------------------------------------
Time: ms
-------------------------------------------
(hle,)
(words,)
(.net,)
(hello,)
(workd,)
(cre,)
(net,)
(core,)
-------------------------------------------
Time: ms
-------------------------------------------
(,)
(hhk,)
(hek,)
(hel,)
(,)
(hjh,)
(,)
(hke,)
(,)
(hl3,)
Conclusion: processing happens batch by batch and is not cumulative; each batch's statistics cover only the data received in that batch, not a running total over previous batches.
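If a running total across batches is wanted, Spark Streaming's stateful operator updateStateByKey can keep per-word counts between batches. A minimal sketch, inserted before jssc.start() in the complete code above; the checkpoint directory is a placeholder, and org.apache.spark.api.java.Optional, org.apache.spark.api.java.function.Function2 and java.util.List would also need to be imported:

// Stateful counting requires a checkpoint directory (placeholder path).
jssc.checkpoint("/tmp/wordcount-checkpoint");

Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction =
        (newValues, state) -> {
            int sum = state.orElse(0);       // previous total, 0 for a new word
            for (Integer v : newValues) {
                sum += v;                    // add this batch's occurrences
            }
            return Optional.of(sum);
        };

JavaPairDStream<String, Integer> runningCounts = pairs.updateStateByKey(updateFunction);
runningCounts.print();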