Without further ado, straight to the good stuff!

https://beam.apache.org/get-started/wordcount-example/

  From the official site:

The WordCount examples demonstrate how to set up a processing pipeline that can read text, tokenize the text lines into individual words, and perform a frequency count on each of those words. The Beam SDKs contain a series of these four successively more detailed WordCount examples that build on each other. The input text for all the examples is a set of Shakespeare’s texts.

Each WordCount example introduces different concepts in the Beam programming model. Begin by understanding Minimal WordCount, the simplest of the examples. Once you feel comfortable with the basic principles in building a pipeline, continue on to learn more concepts in the other examples.

  • Minimal WordCount demonstrates the basic principles involved in building a pipeline.
  • WordCount introduces some of the more common best practices in creating re-usable and maintainable pipelines.
  • Debugging WordCount introduces logging and debugging practices.
  • Windowed WordCount demonstrates how you can use Beam’s programming model to handle both bounded and unbounded datasets.

  Here I will only walk through the Minimal WordCount example.

  First, a note: for simplicity, I explicitly set the PipelineRunner in code, as in the following snippet:

PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(DirectRunner.class);
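
Note that the DirectRunner ships as a separate artifact, so if it is not already on the classpath you also need a dependency along these lines (version matching the beam.version discussed below):

<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-direct-java</artifactId>
    <version>${beam.version}</version>
</dependency>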

  When deploying to a server, you can instead specify the PipelineRunner on the command line. For example, to run on a Spark cluster, the command looks something like this:

spark-submit --class org.shirdrn.beam.examples.MinimalWordCountBasedSparkRunner --master spark://myserver:7077 target/my-beam-apps-0.0.1-SNAPSHOT-shaded.jar --runner=SparkRunner
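
For the command-line form to work, the program has to build its options from args rather than hard-coding the runner. A minimal sketch (PipelineOptionsFactory.fromArgs parses flags such as --runner):

// parse --runner=SparkRunner and any other pipeline flags from the command line
PipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .create();
Pipeline pipeline = Pipeline.create(options);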

  Below, we use a few typical examples (adapted from the examples shipped with the Apache Beam packages) to see how Apache Beam builds a Pipeline and runs it on the chosen PipelineRunner:

  • WordCount (Count/Source/Sink)

  We start from Apache Beam's MinimalWordCount example to see how a Pipeline is built and ultimately executed. The MinimalWordCount implementation is as follows:

package org.shirdrn.beam.examples;

import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;

public class MinimalWordCount {

    @SuppressWarnings("serial")
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.create();
        options.setRunner(DirectRunner.class); // explicitly set the PipelineRunner: DirectRunner (local mode)
        Pipeline pipeline = Pipeline.create(options);

        pipeline.apply(TextIO.Read.from("/tmp/dataset/apache_beam.txt")) // read a local file: the first PTransform
                .apply("ExtractWords", ParDo.of(new DoFn<String, String>() { // process each line (i.e. split it into words)
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        for (String word : c.element().split("[\\s:\\,\\.\\-]+")) {
                            if (!word.isEmpty()) {
                                c.output(word);
                            }
                        }
                    }
                }))
                .apply(Count.<String>perElement()) // count the occurrences of each word
                .apply("ConcatResultKVs", MapElements.via( // format the final output (key = word, value = count)
                        new SimpleFunction<KV<String, Long>, String>() {
                            @Override
                            public String apply(KV<String, Long> input) {
                                return input.getKey() + ": " + input.getValue();
                            }
                        }))
                .apply(TextIO.Write.to("wordcount")); // write the results

        pipeline.run().waitUntilFinish();
    }
}
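
With the DirectRunner, the results land in the working directory as sharded files named like wordcount-00000-of-00003. If a single output file is wanted, the TextIO.Write used above can pin the shard count (a sketch; assuming withNumShards is available in this Beam version):

.apply(TextIO.Write.to("wordcount").withNumShards(1)); // force a single output shard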

  The meaning of each Pipeline step can be read from the comments in the code above. Next, consider using HDFS as the Source; the first PTransform is built as in the following snippet:

PCollection<KV<LongWritable, Text>> resultCollection = pipeline.apply(HDFSFileSource.readFrom(
        "hdfs://myserver:8020/data/ds/beam.txt",
        TextInputFormat.class, LongWritable.class, Text.class));

  As you can see, this returns a collection of KV pairs whose key and value types are LongWritable and Text, respectively; the subsequent processing is analogous to the logic above. If you build the project with Maven, the following dependency is required (beam.version here can be the latest release, 0.4.0):

<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-hdfs</artifactId>
    <version>${beam.version}</version>
</dependency>
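
Since this source yields KV<LongWritable, Text> pairs (the byte offset of each line plus its text), a typical first step is to drop the offset and keep only the line. A minimal sketch in the same MapElements/SimpleFunction style as above (resultCollection is the PCollection from the previous snippet):

PCollection<String> lines = resultCollection
        .apply("ExtractLines", MapElements.via(
                new SimpleFunction<KV<LongWritable, Text>, String>() {
                    @Override
                    public String apply(KV<LongWritable, Text> input) {
                        return input.getValue().toString(); // keep the line text, drop the byte-offset key
                    }
                }));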
  • Distinct (deduplication)

Deduplication is another common operation on datasets. Implemented with Apache Beam, it looks like this:

package org.shirdrn.beam.examples;

import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Distinct;

public class DistinctExample {

    public static void main(String[] args) throws Exception {
        PipelineOptions options = PipelineOptionsFactory.create();
        options.setRunner(DirectRunner.class); // explicitly set the PipelineRunner: DirectRunner (local mode)
        Pipeline pipeline = Pipeline.create(options);

        pipeline.apply(TextIO.Read.from("/tmp/dataset/MY_ID_FILE.txt"))
                .apply(Distinct.<String>create()) // a Distinct PTransform over String elements
                .apply(TextIO.Write.to("deduped.txt")); // write the results
        pipeline.run().waitUntilFinish();
    }
}
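
Conceptually, deduplication is just grouping equal elements and emitting one representative per group. A roughly equivalent hand-rolled version, sketched with the core Count and Keys transforms (assuming lines is the PCollection<String> read by TextIO; this is an illustration, not Distinct's actual internal implementation):

PCollection<String> deduped = lines
        .apply(Count.<String>perElement()) // KV<element, occurrence count>
        .apply(Keys.<String>create());     // keep each distinct element exactly once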
  • Grouping (GroupByKey)

Grouping a dataset is also very common. Here is an example built on GroupByKey, one of the most fundamental PTransforms:

package org.shirdrn.beam.examples;

import org.apache.beam.runners.direct.DirectRunner;
import com.google.common.base.Joiner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;

public class GroupByKeyExample {

    @SuppressWarnings("serial")
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.create();
        options.setRunner(DirectRunner.class); // explicitly set the PipelineRunner: DirectRunner (local mode)
        Pipeline pipeline = Pipeline.create(options);

        pipeline.apply(TextIO.Read.from("/tmp/dataset/MY_INFO_FILE.txt"))
                .apply("ExtractFields", ParDo.of(new DoFn<String, KV<String, String>>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        // file format example: 35451605324179	3G	CMCC
                        String[] values = c.element().split("\t");
                        if (values.length == 3) {
                            c.output(KV.of(values[1], values[0])); // key: network type, value: device ID
                        }
                    }
                }))
                .apply("GroupByKey", GroupByKey.<String, String>create()) // a GroupByKey PTransform instance
                .apply("ConcatResults", MapElements.via(
                        new SimpleFunction<KV<String, Iterable<String>>, String>() {
                            @Override
                            public String apply(KV<String, Iterable<String>> input) {
                                return new StringBuilder()
                                        .append(input.getKey()).append("\t")
                                        .append(Joiner.on(",").join(input.getValue()))
                                        .toString();
                            }
                        }))
                .apply(TextIO.Write.to("grouppedResults"));

        pipeline.run().waitUntilFinish();
    }
}

  Running with the DirectRunner, the output files are named like grouppedResults-00000-of-00002, grouppedResults-00001-of-00002, and so on.
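
As an aside, when you only need an aggregate per group rather than the full member list, Count.perKey (or Combine.perKey with a combine function) is usually cheaper than a full GroupByKey, because values can be combined before the shuffle. A minimal sketch against the same KV<network type, ID> pairs built above:

.apply(Count.<String, String>perKey()) // KV<network type, record count>
.apply("FormatCounts", MapElements.via(
        new SimpleFunction<KV<String, Long>, String>() {
            @Override
            public String apply(KV<String, Long> input) {
                return input.getKey() + "\t" + input.getValue(); // e.g. "3G	42"
            }
        }))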

  • Join

  Finally, we implement a Join example. One input file holds basic user information, an ID and a name per line (tab-separated); the sample file contains the users Jack, Jim, John, and Linda, with lines of the form:

35451605324179	Jack

  The other file holds information about each user's mobile usage, again one tab-separated record per line (ID, network type, carrier); the sample covers 3G/China Mobile, 2G/China Telecom, and 4G/China Mobile records, with lines of the form:

35237005342309	3G	CMCC

  After the Join we want to know which network each user is on (user name + network). The Apache Beam implementation is as follows:

package org.shirdrn.beam.examples;

import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

public class JoinExample {

    @SuppressWarnings("serial")
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.create();
        options.setRunner(DirectRunner.class); // explicitly set the PipelineRunner: DirectRunner (local mode)
        Pipeline pipeline = Pipeline.create(options);

        // create the ID info collection
        final PCollection<KV<String, String>> idInfoCollection = pipeline
                .apply(TextIO.Read.from("/tmp/dataset/MY_ID_INFO_FILE.txt"))
                .apply("CreateUserIdInfoPairs", MapElements.via(
                        new SimpleFunction<String, KV<String, String>>() {
                            @Override
                            public KV<String, String> apply(String input) {
                                // line format example: 35451605324179	Jack
                                String[] values = input.split("\t");
                                return KV.of(values[0], values[1]); // key: user ID, value: user name
                            }
                        }));

        // create the operation collection
        final PCollection<KV<String, String>> opCollection = pipeline
                .apply(TextIO.Read.from("/tmp/dataset/MY_ID_OP_INFO_FILE.txt"))
                .apply("CreateIdOperationPairs", MapElements.via(
                        new SimpleFunction<String, KV<String, String>>() {
                            @Override
                            public KV<String, String> apply(String input) {
                                // line format example: 35237005342309	3G	CMCC
                                String[] values = input.split("\t");
                                return KV.of(values[0], values[1]); // key: user ID, value: network type
                            }
                        }));

        final TupleTag<String> idInfoTag = new TupleTag<String>();
        final TupleTag<String> opInfoTag = new TupleTag<String>();

        final PCollection<KV<String, CoGbkResult>> cogrouppedCollection = KeyedPCollectionTuple
                .of(idInfoTag, idInfoCollection)
                .and(opInfoTag, opCollection)
                .apply(CoGroupByKey.<String>create());

        final PCollection<KV<String, String>> finalResultCollection = cogrouppedCollection
                .apply("CreateJoinedIdInfoPairs", ParDo.of(new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        KV<String, CoGbkResult> e = c.element();
                        String id = e.getKey();
                        String name = e.getValue().getOnly(idInfoTag);
                        for (String opInfo : c.element().getValue().getAll(opInfoTag)) {
                            // generate a string that combines information from both collections
                            c.output(KV.of(id, "\t" + name + "\t" + opInfo));
                        }
                    }
                }));

        PCollection<String> formattedResults = finalResultCollection
                .apply("FormatFinalResults", ParDo.of(new DoFn<KV<String, String>, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        c.output(c.element().getKey() + "\t" + c.element().getValue());
                    }
                }));

        formattedResults.apply(TextIO.Write.to("joinedResults"));
        pipeline.run().waitUntilFinish();
    }
}
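
One caveat on the join logic above: getOnly(idInfoTag) throws if a key has no matching id-info record (or more than one), so this behaves like an inner join on well-formed input. CoGbkResult also offers a getOnly overload that takes a default value, giving left-outer-join behavior; a one-line variant (the "unknown" placeholder is just an example):

String name = e.getValue().getOnly(idInfoTag, "unknown"); // fall back when no id-info record matches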