Approach 1: Using reduceByKey

Data file word.txt:

张三
李四
王五
李四
王五
李四
王五
李四
王五
王五
李四
李四
李四
李四
李四

Code:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class HelloWord {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").appName("Spark").getOrCreate();
        final JavaSparkContext ctx = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // Read the file as an RDD of lines, one name per line
        RDD<String> rdd = spark.sparkContext().textFile("C:\\Users\\boco\\Desktop\\word.txt", 1);
        JavaRDD<String> javaRDD = rdd.toJavaRDD();

        // Map each line to a (word, 1) pair
        JavaPairRDD<String, Integer> javaRDDMap = javaRDD.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        // Sum the counts for each word
        JavaPairRDD<String, Integer> result = javaRDDMap.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer integer, Integer integer2) throws Exception {
                return integer + integer2;
            }
        });

        System.out.println(result.collect());
    }
}

Output:

[(张三,1), (李四,9), (王五,5)]
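
The two anonymous inner classes above can be replaced with Java 8 lambdas. Below is a minimal sketch of the same count, assuming the same SparkSession (spark) and file path as in the code above:

// Word count with Java 8 lambdas instead of anonymous inner classes
JavaRDD<String> lines = spark.read().textFile("C:\\Users\\boco\\Desktop\\word.txt").javaRDD();
JavaPairRDD<String, Integer> counts = lines
        .mapToPair(s -> new Tuple2<>(s, 1))   // map each line to (word, 1)
        .reduceByKey((a, b) -> a + b);        // sum the counts per word
System.out.println(counts.collect());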

Approach 2: Using Spark SQL

Code using Spark SQL:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.ArrayList;

public class HelloWord {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").appName("Spark").getOrCreate();
        final JavaSparkContext ctx = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // Read each line of the file as a Row with a single string column
        JavaRDD<Row> rows = spark.read().text("C:\\Users\\boco\\Desktop\\word.txt").toJavaRDD();

        // Build a one-column schema named "key" and turn the rows into a DataFrame
        ArrayList<StructField> fields = new ArrayList<StructField>();
        fields.add(DataTypes.createStructField("key", DataTypes.StringType, true));
        StructType schema = DataTypes.createStructType(fields);
        Dataset<Row> ds = spark.createDataFrame(rows, schema);

        // Register a temporary view and aggregate with SQL
        ds.createOrReplaceTempView("words");
        Dataset<Row> result = spark.sql("select key,count(0) as key_count from words group by key");
        result.show();
    }
}

Result:

+---+---------+
|key|key_count|
+---+---------+
| 王五| 5|
| 李四| 9|
| 张三| 1|
+---+---------+
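
The same aggregation can also be expressed with the DataFrame API instead of a SQL string. A minimal sketch, assuming the ds DataFrame registered above (note that the count column is then named "count" rather than "key_count"):

// Equivalent aggregation using the DataFrame API instead of SQL
Dataset<Row> grouped = ds.groupBy("key").count(); // produces columns "key" and "count"
grouped.show();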

Approach 3: Real-time streaming analysis with Spark Streaming

Reference: http://spark.apache.org/docs/latest/streaming-programming-guide.html

First, we create a JavaStreamingContext object, which is the main entry point for all streaming functionality. We create a local StreamingContext with two execution threads, and a batch interval of 1 second.

import org.apache.spark.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;
import scala.Tuple2;

// Create a local StreamingContext with two working threads and a batch interval of 1 second
SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

Using this context, we can create a DStream that represents streaming data from a TCP source, specified as hostname (e.g. localhost) and port (e.g. 9999).

// Create a DStream that will connect to hostname:port, like localhost:9999
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

This lines DStream represents the stream of data that will be received from the data server. Each record in this stream is a line of text. Then, we want to split the lines by space into words.

// Split each line into words
JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());

flatMap is a DStream operation that creates a new DStream by generating multiple new records from each record in the source DStream. In this case, each line will be split into multiple words and the stream of words is represented as the words DStream. Note that we defined the transformation using a FlatMapFunction object. As we will discover along the way, there are a number of such convenience classes in the Java API that help define DStream transformations.

Next, we want to count these words.

// Count each word in each batch
JavaPairDStream<String, Integer> pairs = words.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey((i1, i2) -> i1 + i2);

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print();

The words DStream is further mapped (one-to-one transformation) to a DStream of (word, 1) pairs, using a PairFunction object. Then, it is reduced to get the frequency of words in each batch of data, using a Function2 object. Finally, wordCounts.print() will print a few of the counts generated every second.

Note that when these lines are executed, Spark Streaming only sets up the computation it will perform after it is started, and no real processing has started yet. To start the processing after all the transformations have been set up, we finally call the start method.

jssc.start();              // Start the computation
jssc.awaitTermination(); // Wait for the computation to terminate

The complete code can be found in the Spark Streaming example JavaNetworkWordCount.

If you have already downloaded and built Spark, you can run this example as follows. You will first need to run Netcat (a small utility found in most Unix-like systems) as a data server by using

$ nc -lk 9999

Then, in a different terminal, you can start the example by using

$ ./bin/run-example streaming.JavaNetworkWordCount localhost 9999

Complete code:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class HelloWord {
    public static void main(String[] args) throws InterruptedException {
        // Create a local StreamingContext with a batch interval of 60 seconds
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("NetworkWordCount");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        jsc.setLogLevel("WARN");
        JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.seconds(60));

        // Create a DStream that will connect to hostname:port, like localhost:9999
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("xx.xx.xx.xx", 19999);

        // Split each line into words
        JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());

        // Count each word in each batch
        JavaPairDStream<String, Integer> pairs = words.mapToPair(s -> new Tuple2<>(s, 1));
        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey((i1, i2) -> i1 + i2);

        // Print the first ten elements of each RDD generated in this DStream to the console
        wordCounts.print();

        jssc.start();            // Start the computation
        jssc.awaitTermination(); // Wait for the computation to terminate
    }
}

Test (lines typed into Netcat):

[root@abced dx]# nc -lk 19999
hellow wrd
hello word
hello word
hello dkk
hl
hello
hello
hello word
hello word
hello java
hello c@
hello hadoop]
hello spark
hello word
hello kafka
hello c
hello c#
hello .net core
net cre
workd
hle
hello words
hke hjh
hek
hel
hl3
hhk
hke

Program output:

-------------------------------------------
Time: ms
-------------------------------------------
(c,)
(spark,)
(kafka,)
(c#,)
(hello,)
(java,)
(c@,)
(hadoop],)
(word,)
-------------------------------------------
Time: ms
-------------------------------------------
(hle,)
(words,)
(.net,)
(hello,)
(workd,)
(cre,)
(net,)
(core,)
-------------------------------------------
Time: ms
-------------------------------------------
(,)
(hhk,)
(hek,)
(hel,)
(,)
(hjh,)
(,)
(hke,)
(,)
(hl3,)

Conclusion: the data is processed batch by batch, without accumulation. Each batch's counts cover only the data received in that batch, not a running total over earlier batches.
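
If a running total across batches is wanted instead of per-batch counts, one option is Spark Streaming's updateStateByKey, which keeps per-key state and requires a checkpoint directory. A minimal sketch against the pairs DStream from the complete code above; the checkpoint path is only an example:

import java.util.List;
import org.apache.spark.api.java.Optional;

// Keep a running count per word across all batches (requires checkpointing)
jssc.checkpoint("/tmp/spark-checkpoint"); // example path, adjust as needed
JavaPairDStream<String, Integer> totalCounts = pairs.updateStateByKey(
        (List<Integer> newValues, Optional<Integer> state) -> {
            Integer sum = state.orElse(0);
            for (Integer v : newValues) {
                sum += v;   // add this batch's occurrences to the stored total
            }
            return Optional.of(sum);
        });
totalCounts.print();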
