How to Pass Parameters in Flink

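Flink is a stream processing engine built to handle a continuous flow of data, but while processing that data you often need to pass parameters into the computation. What mechanisms are available for this, and when should each one be used? This post attempts a brief summary.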

Using Configuration

Define the parameter in the main function:

 // Class in Flink to store parameters
 Configuration configuration = new Configuration();
 configuration.setString("genre", "Action");

 lines.filter(new FilterGenreWithParameters())
         // Pass parameters to a function
         .withParameters(configuration)
         .print();

A function that uses the parameters must extend one of the rich function classes, so that it can read them in its open method.

 class FilterGenreWithParameters extends RichFilterFunction<Tuple3<Long, String, String>> {
     String genre;

     @Override
     public void open(Configuration parameters) throws Exception {
         // Read the parameter
         genre = parameters.getString("genre", "");
     }

     @Override
     public boolean filter(Tuple3<Long, String, String> movie) throws Exception {
         String[] genres = movie.f2.split("\\|");
         return Stream.of(genres).anyMatch(g -> g.equals(genre));
     }
 }

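Using ParameterTool

Configuration does pass parameters, but it is not very dynamic: the values are hard-coded, so every change to a parameter means changing the program. Since the main function can accept command-line arguments, Flink provides a matching mechanism for receiving them: ParameterTool.

With ParameterTool, the parameters are passed as follows: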

 public static void main(String... args) throws Exception {
     // Read command line arguments
     ParameterTool parameterTool = ParameterTool.fromArgs(args);
     final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
     env.getConfig().setGlobalJobParameters(parameterTool);
     ...
     // This function will be able to read these global parameters
     lines.filter(new FilterGenreWithGlobalEnv())
             .print();
 }

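As the code above shows, ParameterTool receives the arguments of the main function, and the environment distributes them as global job parameters. Any function that extends a rich function class can then read these global parameters.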

 class FilterGenreWithGlobalEnv extends RichFilterFunction<Tuple3<Long, String, String>> {
     @Override
     public boolean filter(Tuple3<Long, String, String> movie) throws Exception {
         String[] genres = movie.f2.split("\\|");
         // Get global parameters
         ParameterTool parameterTool = (ParameterTool) getRuntimeContext().getExecutionConfig().getGlobalJobParameters();
         // Read parameter
         String genre = parameterTool.get("genre");
         return Stream.of(genres).anyMatch(g -> g.equals(genre));
     }
 }
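
To make the command-line side more concrete, here is a minimal plain-Java sketch of how arguments of the form `--key value` can be turned into lookups with default values. The class name ArgsSketch is hypothetical, and this is a simplified model for illustration only, not Flink's ParameterTool implementation; in particular, how it treats a key without a value is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the idea behind ParameterTool.fromArgs.
// It is NOT Flink code and may differ from the real class in edge cases.
final class ArgsSketch {
    private final Map<String, String> values = new HashMap<>();

    // Parses arguments of the form: --key value [--key value ...]
    static ArgsSketch fromArgs(String... args) {
        ArgsSketch parsed = new ArgsSketch();
        for (int i = 0; i < args.length; i++) {
            if (!args[i].startsWith("--")) {
                throw new IllegalArgumentException("Expected --key but found: " + args[i]);
            }
            String key = args[i].substring(2);
            // Assumption of this sketch: a key with no following value maps to "".
            boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("--");
            parsed.values.put(key, hasValue ? args[++i] : "");
        }
        return parsed;
    }

    // Returns the value for the key, or the default when the key is absent.
    String get(String key, String defaultValue) {
        return values.getOrDefault(key, defaultValue);
    }
}
```

With this model, a job started with the arguments `--genre Action` would see "Action" for the key "genre", and the program no longer needs to change when the genre does.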

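Using broadcast variables

Passing parameters with Configuration and ParameterTool is convenient, but it is only suitable for a small number of parameters. For larger amounts of data, Flink provides other mechanisms. One of them is the broadcast variable, a technique that is also widely used in other compute engines.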

 // Get a dataset with words to ignore
 DataSet<String> wordsToIgnore = ...

 data.flatMap(new RichFlatMapFunction<String, Tuple2<String, Integer>>() {
     // A collection to store words. This will be stored in memory
     // of a task manager
     Collection<String> wordsToIgnore;

     @Override
     public void open(Configuration parameters) throws Exception {
         // Read a collection of words to ignore
         wordsToIgnore = getRuntimeContext().getBroadcastVariable("wordsToIgnore");
     }

     @Override
     public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
         String[] words = line.split("\\W+");
         for (String word : words)
             // Emit only the words that are not in the collection of words to ignore
             if (!wordsToIgnore.contains(word))
                 out.collect(new Tuple2<>(word, 1));
     }
     // Pass a dataset via a broadcast variable
 }).withBroadcastSet(wordsToIgnore, "wordsToIgnore");

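The wordsToIgnore data set declared at the top is the data to be broadcast, and the withBroadcastSet call at the end specifies the operator that the data set is broadcast to, together with the name under which the function looks it up.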
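
getBroadcastVariable returns a List, so each contains call in the function above is a linear scan. A common refinement is to copy the broadcast words into a HashSet once and look words up there. The following plain-Java sketch shows just that lookup logic; the class name IgnoreWordsSketch is hypothetical, and the code deliberately leaves out the Flink classes so that it stays self-contained.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: models the per-line logic of the flatMap function.
final class IgnoreWordsSketch {
    private final Set<String> ignored;

    // In a rich function, this copy would be made once in open(),
    // from the collection returned by getBroadcastVariable(...).
    IgnoreWordsSketch(Collection<String> broadcastWords) {
        this.ignored = new HashSet<>(broadcastWords);
    }

    // Splits a line into words and keeps those that are not ignored.
    List<String> keptWords(String line) {
        List<String> kept = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            // split can yield an empty first token when the line starts with punctuation
            if (!word.isEmpty() && !ignored.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }
}
```

The set still lives in the memory of every task that uses it, so this changes lookup cost only, not the memory footprint.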

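Broadcast variables are kept in the memory of the TaskManager. That memory is limited, so very large data sets cannot be broadcast this way.

How, then, can larger data be distributed? Flink also provides a file caching mechanism, shown below.

Using the distributed cache

As before, the cached file has to be registered while the job graph (DAG) is being defined: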

 ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
 // Register a file from HDFS
 env.registerCachedFile("hdfs:///path/to/file", "machineLearningModel");
 ...
 env.execute();

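Flink also supports registering a local file, but in general it is better to register a file on distributed storage such as HDFS and to give it a name.

Using the file is simple: retrieve it in the open method of a rich function.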

 class MyClassifier extends RichMapFunction<String, Integer> {
     @Override
     public void open(Configuration config) {
       File machineLearningModel = getRuntimeContext().getDistributedCache().getFile("machineLearningModel");
       ...
     }

     @Override
     public Integer map(String value) throws Exception {
       ...
     }
 }

The code above leaves out the processing of the file contents.
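
As one possible way to fill in that omitted step, the sketch below reads a cached file into an in-memory map. The file format (one `word,label` pair per line) and the class name ModelFileSketch are assumptions made purely for illustration; the original post does not say what the model file contains.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical loader for an assumed "word,label" file format.
final class ModelFileSketch {
    // Parses lines of the form "word,label"; blank lines are skipped.
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> model = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split(",", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed line: " + line);
            }
            model.put(parts[0].trim(), Integer.parseInt(parts[1].trim()));
        }
        return model;
    }

    // Would be called from open() with the File obtained from the distributed cache.
    static Map<String, Integer> load(File file) {
        try {
            return parse(Files.readAllLines(file.toPath(), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Loading once in open and keeping the result in a field means that map only performs a lookup per record instead of touching the file again.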

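In all of the approaches above, the parameters are static: they do not change once the job is running. What if the parameters themselves change over time?

In that case, use ConnectedStreams, which in effect combines two streams.

Using ConnectedStreams

The precondition for ConnectedStreams is, of course, a second, dynamic stream. Suppose that besides the main data there is some rule data, that the rule data is published through a RESTful service, and that the main data comes from Kafka.

The job can then be written as follows: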

 // KafkaStreamFactory is presumably a project-specific helper, not a Flink class
 DataStreamSource<String> input = (DataStreamSource<String>) KafkaStreamFactory
                 .getKafka08Stream(env, srcCluster, srcTopic, srcGroup);
 DataStream<Tuple2<String, String>> appkeyMeta = env.addSource(new AppKeySourceFunction(), "appkey");
 ConnectedStreams<String, Tuple2<String, String>> connectedStreams = input.connect(appkeyMeta.broadcast());
 DataStream<String> cleanData = connectedStreams.flatMap(new DataCleanFlatMapFunction());

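The code above does four things. The first statement defines the stream that reads the main data. The second defines the stream that reads the rule data, with the logic for calling the RESTful service implemented in AppKeySourceFunction. The third broadcasts the rule data and connects it to the main data. The last one obtains the processed data from the connected streams. The key piece is DataCleanFlatMapFunction.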

 public class DataCleanFlatMapFunction extends RichCoFlatMapFunction<String, Tuple2<String, String>, String> {
     @Override
     public void flatMap1(String s, Collector<String> collector) {...}

     @Override
     public void flatMap2(Tuple2<String, String> s, Collector<String> collector) {...}
 }

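This is an abridged snippet, and the important part is the class declaration. First, the function has to extend the abstract class RichCoFlatMapFunction. Second, within the class, flatMap2 receives the elements of the rule stream and flatMap1 receives the elements of the main stream.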
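
Since the bodies of the two methods are elided, the following plain-Java sketch models one way they might cooperate: the rule side updates an in-memory map, and the data side consults it for every record. The record format (`appkey,payload`), the meaning of the rule tuple (appkey mapped to a metadata string), and the class name DataCleanSketch are all assumptions for illustration, not the author's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a co-flat-map: one side updates rules, the other applies them.
final class DataCleanSketch {
    // Latest known rules; each parallel instance would hold its own copy.
    private final Map<String, String> appkeyMeta = new HashMap<>();

    // Counterpart of flatMap2: a rule update arrives on the rule stream.
    void onRule(String appkey, String meta) {
        appkeyMeta.put(appkey, meta);
    }

    // Counterpart of flatMap1: a main record arrives; emits zero or one result.
    List<String> onRecord(String record) {
        List<String> out = new ArrayList<>();
        int comma = record.indexOf(',');
        if (comma < 0) {
            return out; // malformed record, dropped in this sketch
        }
        String meta = appkeyMeta.get(record.substring(0, comma));
        if (meta != null) {
            out.add(meta + ":" + record.substring(comma + 1));
        }
        return out;
    }
}
```

Two properties of this pattern are worth noting. Records that arrive before the matching rule see no rule at all, so the function has to decide whether to drop, pass through, or buffer them. And because every parallel instance keeps its own map, the rule stream has to be broadcast so that each instance receives every update.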

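Parameters can be sent from the client to the tasks, but sometimes values also need to travel from the tasks back to the client. This is usually done with accumulators.

Using accumulators

Start with a simple example that counts words and also counts the number of text records processed: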

 DataSet<String> lines = ...

 // Word count algorithm
 lines.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
     @Override
     public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
         String[] words = line.split("\\W+");
         for (String word : words) {
             out.collect(new Tuple2<>(word, 1));
         }
     }
 })
 .groupBy(0)
 .sum(1)
 .print();

 // Count a number of lines in the text to process
 long linesCount = lines.count();
 System.out.println(linesCount);

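In the code above, the pipeline that ends in print() computes the word counts, and lines.count() computes the number of records processed. Unfortunately this produces two jobs: the single lines.count() statement triggers a job of its own, which is clearly inefficient.

Flink provides accumulators to send data back, that is, from the TaskManagers to the JobManager.

Flink ships with several built-in accumulators: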

IntCounter, LongCounter, DoubleCounter: sum up int, long, and double values sent from the task managers
AverageAccumulator: calculates the average of double values
LongMaximum, LongMinimum, IntMaximum, IntMinimum, DoubleMaximum, DoubleMinimum: determine the maximum and minimum values for the different types
Histogram: computes the distribution of values sent from the task managers

First define an accumulator, then register it inside a user-defined function. The client can then read the resulting value.

 lines.flatMap(new RichFlatMapFunction<String, Tuple2<String, Integer>>() {
     // Create an accumulator
     private IntCounter linesNum = new IntCounter();

     @Override
     public void open(Configuration parameters) throws Exception {
         // Register accumulator
         getRuntimeContext().addAccumulator("linesNum", linesNum);
     }

     @Override
     public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
         String[] words = line.split("\\W+");
         for (String word : words) {
             out.collect(new Tuple2<>(word, 1));
         }
         // Increment after each line is processed
         linesNum.add(1);
     }
 })
 .groupBy(0)
 .sum(1)
 .print();

 // Get accumulator result
 int linesNum = env.getLastJobExecutionResult().getAccumulatorResult("linesNum");
 System.out.println(linesNum);

If the built-in accumulators do not meet your needs, you can write a custom one. It only has to implement one of two interfaces: Accumulator or SimpleAccumulator.
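
To illustrate what a custom accumulator has to provide, here is a plain-Java model that collects distinct words. It mirrors the core operations of an accumulator (add a value locally, expose the local value, reset it, and merge another instance), but it intentionally does not implement Flink's interface so that it stays self-contained. The class name DistinctWordsSketch is hypothetical; a real version would declare that it implements Accumulator<String, HashSet<String>> and would also provide the clone method the interface requires.

```java
import java.util.HashSet;

// Hypothetical model of a custom accumulator that gathers distinct words.
final class DistinctWordsSketch {
    private final HashSet<String> words = new HashSet<>();

    // Called by a task for every value it wants to record.
    void add(String word) {
        words.add(word);
    }

    // The partial result held by this instance.
    HashSet<String> getLocalValue() {
        return words;
    }

    // Clears the partial result.
    void resetLocal() {
        words.clear();
    }

    // Combines the partial result of another instance into this one.
    void merge(DistinctWordsSketch other) {
        words.addAll(other.words);
    }
}
```

The merge step is what makes the final result independent of how records were spread across parallel tasks: each task fills its own instance, and the partial results are combined afterwards.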

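The sections above cover several ways of passing parameters. In day-to-day use you will often combine them rather than rely on a single one, for example passing an HDFS path through ParameterTool and then reading the file through the distributed cache.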
