Approach 1: Using reduceByKey

Data file word.txt:

张三
李四
王五
李四
王五
李四
王五
李四
王五
王五
李四
李四
李四
李四
李四

Code:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class HelloWord {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").appName("Spark").getOrCreate();
        final JavaSparkContext ctx = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // Read the file with one name per line
        RDD<String> rdd = spark.sparkContext().textFile("C:\\Users\\boco\\Desktop\\word.txt", 1);
        JavaRDD<String> javaRDD = rdd.toJavaRDD();

        // Map each name to a (name, 1) pair
        JavaPairRDD<String, Integer> javaRDDMap = javaRDD.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        // Sum the counts for each name
        JavaPairRDD<String, Integer> result = javaRDDMap.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer integer, Integer integer2) throws Exception {
                return integer + integer2;
            }
        });

        System.out.println(result.collect());
    }
}

Output:

[(张三,1), (李四,9), (王五,5)]
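
On Java 8 and later, the same job can be written more compactly with lambdas. Below is a minimal sketch, assuming the same local file path; it reads the file through spark.read().textFile() instead of the SparkContext API, but the logic is identical and it prints the same result:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class HelloWordLambda {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").appName("Spark").getOrCreate();
        // Each line of word.txt holds exactly one name
        JavaRDD<String> lines = spark.read().textFile("C:\\Users\\boco\\Desktop\\word.txt").javaRDD();
        // Build (name, 1) pairs, then sum the 1s per name
        JavaPairRDD<String, Integer> counts = lines
                .mapToPair(s -> new Tuple2<>(s, 1))
                .reduceByKey((a, b) -> a + b);
        System.out.println(counts.collect());
        spark.stop();
    }
}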

Approach 2: Using Spark SQL

Implementation using Spark SQL:

import java.util.ArrayList;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class HelloWord {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").appName("Spark").getOrCreate();
        final JavaSparkContext ctx = JavaSparkContext.fromSparkContext(spark.sparkContext());
        // Each line of the file becomes a Row with a single string column
        JavaRDD<Row> rows = spark.read().text("C:\\Users\\boco\\Desktop\\word.txt").toJavaRDD();
        // Build a one-column schema named "key" for the DataFrame
        ArrayList<StructField> fields = new ArrayList<StructField>();
        StructField field = DataTypes.createStructField("key", DataTypes.StringType, true);
        fields.add(field);
        StructType schema = DataTypes.createStructType(fields);
        Dataset<Row> ds = spark.createDataFrame(rows, schema);
        ds.createOrReplaceTempView("words");
        // Count the occurrences of each name with SQL
        Dataset<Row> result = spark.sql("select key,count(0) as key_count from words group by key");
        result.show();
    }
}

Result:

+---+---------+
|key|key_count|
+---+---------+
| 王五| 5|
| 李四| 9|
| 张三| 1|
+---+---------+
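
The same aggregation can also be expressed with the DataFrame API directly, without registering a temp view or building a schema by hand. A minimal sketch, relying on the fact that spark.read().text() names its single column "value" (the count column is then named "count" by default):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HelloWordDataset {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").appName("Spark").getOrCreate();
        // text() produces a DataFrame with one string column called "value"
        Dataset<Row> words = spark.read().text("C:\\Users\\boco\\Desktop\\word.txt");
        // Group on that column and count the rows in each group
        words.groupBy("value").count().show();
        spark.stop();
    }
}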

Approach 3: Real-time stream analysis with Spark Streaming

Reference: http://spark.apache.org/docs/latest/streaming-programming-guide.html

First, we create a JavaStreamingContext object, which is the main entry point for all streaming functionality. We create a local StreamingContext with two execution threads, and a batch interval of 1 second.

import org.apache.spark.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;
import scala.Tuple2;

// Create a local StreamingContext with two working threads and batch interval of 1 second
SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

Using this context, we can create a DStream that represents streaming data from a TCP source, specified as hostname (e.g. localhost) and port (e.g. 9999).

// Create a DStream that will connect to hostname:port, like localhost:9999
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

This lines DStream represents the stream of data that will be received from the data server. Each record in this stream is a line of text. Then, we want to split the lines by space into words.

// Split each line into words
JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());

flatMap is a DStream operation that creates a new DStream by generating multiple new records from each record in the source DStream. In this case, each line will be split into multiple words and the stream of words is represented as the words DStream. Note that we defined the transformation using a FlatMapFunction object. As we will discover along the way, there are a number of such convenience classes in the Java API that help define DStream transformations.
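
For reference, the same transformation can be spelled out with an explicit FlatMapFunction anonymous class instead of a lambda. This is a minimal sketch; it assumes the `lines` DStream from the snippet above and only adds the imports this form needs:

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.streaming.api.java.JavaDStream;

// Same split-into-words step, written with an anonymous FlatMapFunction
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterator<String> call(String x) {
        return Arrays.asList(x.split(" ")).iterator();
    }
});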

Next, we want to count these words.

// Count each word in each batch
JavaPairDStream<String, Integer> pairs = words.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey((i1, i2) -> i1 + i2);

// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print();

The words DStream is further mapped (one-to-one transformation) to a DStream of (word, 1) pairs, using a PairFunction object. Then, it is reduced to get the frequency of words in each batch of data, using a Function2 object. Finally, wordCounts.print() will print a few of the counts generated every second.

Note that when these lines are executed, Spark Streaming only sets up the computation it will perform after it is started, and no real processing has started yet. To start the processing after all the transformations have been setup, we finally call start method.

jssc.start();              // Start the computation
jssc.awaitTermination(); // Wait for the computation to terminate

The complete code can be found in the Spark Streaming example JavaNetworkWordCount.

If you have already downloaded and built Spark, you can run this example as follows. You will first need to run Netcat (a small utility found in most Unix-like systems) as a data server by using

$ nc -lk 9999

Then, in a different terminal, you can start the example by using

$ ./bin/run-example streaming.JavaNetworkWordCount localhost 9999

Complete code:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class HelloWord {
    public static void main(String[] args) throws InterruptedException {
        // Create a local StreamingContext with a 60-second batch interval
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("NetworkWordCount");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        jsc.setLogLevel("WARN");
        JavaStreamingContext jssc = new JavaStreamingContext(jsc, Durations.seconds(60));

        // Create a DStream that will connect to hostname:port, like localhost:9999
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("xx.xx.xx.xx", 19999);

        // Split each line into words
        JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());

        // Count each word in each batch
        JavaPairDStream<String, Integer> pairs = words.mapToPair(s -> new Tuple2<>(s, 1));
        JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey((i1, i2) -> i1 + i2);

        // Print the first ten elements of each RDD generated in this DStream to the console
        wordCounts.print();

        jssc.start();            // Start the computation
        jssc.awaitTermination(); // Wait for the computation to terminate
    }
}

Test:

[root@abced dx]# nc -lk 19999
hellow wrd
hello word
hello word
hello dkk
hl
hello
hello
hello word
hello word
hello java
hello c@
hello hadoop]
hello spark
hello word
hello kafka
hello c
hello c#
hello .net core
net cre
workd
hle
hello words
hke hjh
hek
hel
hl3
hhk
hke

Program output:

-------------------------------------------
Time: ms
-------------------------------------------
(c,)
(spark,)
(kafka,)
(c#,)
(hello,)
(java,)
(c@,)
(hadoop],)
(word,)
-------------------------------------------
Time: ms
-------------------------------------------
(hle,)
(words,)
(.net,)
(hello,)
(workd,)
(cre,)
(net,)
(core,)
-------------------------------------------
Time: ms
-------------------------------------------
(,)
(hhk,)
(hek,)
(hel,)
(,)
(hjh,)
(,)
(hke,)
(,)
(hl3,)

Conclusion: Spark Streaming processes the data batch by batch and does not accumulate counts. The statistics for each batch cover only the data received in that batch, not the data from earlier batches.
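
If a running total across batches is what you actually want, Spark Streaming can keep per-key state with updateStateByKey (which requires a checkpoint directory); mapWithState is a newer alternative with similar semantics. The sketch below is illustrative only: the checkpoint path, hostname, and port are placeholders, and it assumes the same nc-based input as above.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StatefulWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("StatefulNetworkWordCount");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(60));
        // updateStateByKey needs a checkpoint directory to persist state between batches
        // (placeholder path, adjust to your environment)
        jssc.checkpoint("/tmp/spark-wordcount-checkpoint");

        // Placeholder host/port; use the address your nc server listens on
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).iterator());
        JavaPairDStream<String, Integer> pairs = words.mapToPair(s -> new Tuple2<>(s, 1));

        // Merge this batch's counts into the running total kept for each word
        JavaPairDStream<String, Integer> totals = pairs.updateStateByKey(
                (List<Integer> batchValues, Optional<Integer> state) -> {
                    int sum = state.orElse(0);
                    for (Integer v : batchValues) {
                        sum += v;
                    }
                    return Optional.of(sum);
                });

        totals.print();
        jssc.start();
        jssc.awaitTermination();
    }
}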
