Flink DataSet API: Usage and Internals

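As big data technology is adopted across industries, demand for real-time processing of massive data sets keeps growing, and the business logic involved keeps getting more complex. Traditional batch processing and early stream-processing frameworks increasingly struggle to meet these requirements for latency, throughput, fault tolerance, and ease of use.

Against this background, Flink, a newer stream-processing framework, applies modern massively parallel processing techniques to stream processing and substantially alleviates the problems of earlier stream-processing frameworks.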

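1. Overview

Flink provides the DataSet API for processing batch data. Flink first converts the ingested data into a DataSet, which is distributed in parallel across the nodes of the cluster. It then applies transformations (map, filter, and so on) to the DataSet, and finally writes the resulting data set to an external system through a DataSink operation.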

2. Data ingestion

Input is handled by InputFormat. Its Javadoc in the Flink source reads:

/**
 * The base interface for data sources that produces records.
 * <p>
 * The input format handles the following:
 * <ul>
 *   <li>It describes how the input is split into splits that can be processed in parallel.</li>
 *   <li>It describes how to read records from the input split.</li>
 *   <li>It describes how to gather basic statistics from the input.</li>
 * </ul>
 * <p>
 * The life cycle of an input format is the following:
 * <ol>
 *   <li>After being instantiated (parameterless), it is configured with a {@link Configuration} object.
 *       Basic fields are read from the configuration, such as a file path, if the format describes
 *       files as input.</li>
 *   <li>Optionally: It is called by the compiler to produce basic statistics about the input.</li>
 *   <li>It is called to create the input splits.</li>
 *   <li>Each parallel input task creates an instance, configures it and opens it for a specific split.</li>
 *   <li>All records are read from the input</li>
 *   <li>The input format is closed</li>
 * </ol>
 * <p>
 * IMPORTANT NOTE: Input formats must be written such that an instance can be opened again after it was closed. That
 * is due to the fact that the input format is used for potentially multiple splits. After a split is done, the
 * format's close function is invoked and, if another split is available, the open function is invoked afterwards for
 * the next split.
 *
 * @see InputSplit
 * @see BaseStatistics
 *
 * @param <OT> The type of the produced records.
 * @param <T> The type of input split.
 */
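The lifecycle described in the Javadoc above (configure, create splits, open a split, read all records, close, then possibly open again) can be illustrated with a small plain-Java model. This is only a sketch under stated assumptions: `ListFormat`, `RangeSplit`, and `readAll` are hypothetical names invented for illustration, the input is an in-memory list, and none of this is Flink's actual `InputFormat` interface. The method names merely echo the steps listed in the comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the lifecycle described in the Javadoc.
// NOT Flink's InputFormat; every type here is hypothetical.
final class InputFormatLifecycleSketch {

    /** A split modeled as a half-open index range [start, end) into the input. */
    static final class RangeSplit {
        final int start;
        final int end;

        RangeSplit(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Toy format that reads String records from an in-memory list. */
    static final class ListFormat {
        private List<String> source;
        private int position;
        private int limit;

        // Configure the instance after parameterless instantiation.
        void configure(List<String> source) {
            this.source = source;
        }

        // Describe how the input is divided into splits that can be read in parallel.
        List<RangeSplit> createInputSplits(int minNumSplits) {
            int wanted = Math.max(1, minNumSplits);
            int size = source.size();
            int chunk = Math.max(1, (size + wanted - 1) / wanted);
            List<RangeSplit> splits = new ArrayList<>();
            for (int start = 0; start < size; start += chunk) {
                splits.add(new RangeSplit(start, Math.min(size, start + chunk)));
            }
            return splits;
        }

        // Open for one split; this must also work after a previous close().
        void open(RangeSplit split) {
            position = split.start;
            limit = split.end;
        }

        boolean reachedEnd() {
            return position >= limit;
        }

        String nextRecord() {
            return source.get(position++);
        }

        // Reset per-split state so the instance can be opened again.
        void close() {
            position = 0;
            limit = 0;
        }
    }

    /** Drives a single format instance over several splits, one after another. */
    static List<String> readAll(List<String> input, int numSplits) {
        ListFormat format = new ListFormat();
        format.configure(input);
        List<String> out = new ArrayList<>();
        for (RangeSplit split : format.createInputSplits(numSplits)) {
            format.open(split);
            while (!format.reachedEnd()) {
                out.add(format.nextRecord());
            }
            format.close();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(readAll(Arrays.asList("a", "b", "c", "d", "e"), 2));
    }
}
```

Reusing one `ListFormat` instance across all splits in `readAll` is deliberate: it exercises the "open again after close" requirement that the Javadoc's IMPORTANT NOTE calls out.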

3. Data transformation

A DataSet is a collection of elements of the same type. A DataSet can be turned into another DataSet by applying a transformation, for example:

DataSet#map(org.apache.flink.api.common.functions.MapFunction)

DataSet#reduce(org.apache.flink.api.common.functions.ReduceFunction)

DataSet#join(DataSet)

DataSet#coGroup(DataSet)

Here, the Function is the user-defined business logic; Java 8 lambda expressions are supported.
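Because a user function is a single-method interface, the same logic can be supplied either as an anonymous class or as a lambda. The following plain-Java sketch shows the two forms side by side. `ToyMapFunction` and `applyOnce` are hypothetical names made up for this illustration and only resemble the shape of a map function; they are not Flink types.

```java
// Illustration only: a hypothetical single-method interface, not Flink's MapFunction.
final class LambdaFunctionSketch {

    /** One input element in, exactly one output element out. */
    @FunctionalInterface
    interface ToyMapFunction<T, O> {
        O map(T value);
    }

    /** Applies the user function to a single value. */
    static <T, O> O applyOnce(ToyMapFunction<T, O> fn, T value) {
        return fn.map(value);
    }

    /** User logic written as an anonymous class. */
    static int doubleViaAnonymousClass(int x) {
        return applyOnce(new ToyMapFunction<Integer, Integer>() {
            @Override
            public Integer map(Integer value) {
                return value * 2;
            }
        }, x);
    }

    /** The same user logic written as a Java 8 lambda. */
    static int doubleViaLambda(int x) {
        ToyMapFunction<Integer, Integer> doubler = value -> value * 2;
        return applyOnce(doubler, x);
    }

    public static void main(String[] args) {
        System.out.println(doubleViaAnonymousClass(21));
        System.out.println(doubleViaLambda(21));
    }
}
```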

A Function is carried out through an Operator. Taking map as an example:

    /**
     * Applies a Map transformation on this DataSet.
     *
     * <p>The transformation calls a {@link org.apache.flink.api.common.functions.MapFunction} for each element of the DataSet.
     * Each MapFunction call returns exactly one element.
     *
     * @param mapper The MapFunction that is called for each element of the DataSet.
     * @return A MapOperator that represents the transformed DataSet.
     *
     * @see org.apache.flink.api.common.functions.MapFunction
     * @see org.apache.flink.api.common.functions.RichMapFunction
     * @see MapOperator
     */
    public <R> MapOperator<T, R> map(MapFunction<T, R> mapper) {
        if (mapper == null) {
            throw new NullPointerException("Map function must not be null.");
        }
        String callLocation = Utils.getCallLocationName();
        TypeInformation<R> resultType = TypeExtractor.getMapReturnTypes(mapper, getType(), callLocation, true);
        return new MapOperator<>(this, resultType, clean(mapper), callLocation);
    }

Here, the returned Operator (a MapOperator in this case) is what represents the transformed DataSet.
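The pattern in the map method above, where calling map does not process data itself but returns an operator object that wraps the input data set and the user function, can be sketched in plain Java. This is a toy model under stated assumptions: `ToyDataSet`, `ToySource`, and `ToyMapOperator` are hypothetical names, evaluation happens in a local `collect()` call, and type extraction, closure cleaning, and distributed execution are all left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.function.Function;

// Toy model, NOT Flink's classes: map() only builds a plan node;
// nothing is computed until collect() is called.
final class MapOperatorSketch {

    /** A data set is a node in a plan; collect() evaluates the plan locally. */
    interface ToyDataSet<T> {
        List<T> collect();

        /** Wraps this data set and the user function in a new operator node. */
        default <R> ToyDataSet<R> map(Function<T, R> mapper) {
            Objects.requireNonNull(mapper, "Map function must not be null.");
            return new ToyMapOperator<>(this, mapper);
        }
    }

    /** Source node backed by an in-memory list. */
    static final class ToySource<T> implements ToyDataSet<T> {
        private final List<T> elements;

        ToySource(List<T> elements) {
            this.elements = new ArrayList<>(elements);
        }

        @Override
        public List<T> collect() {
            return new ArrayList<>(elements);
        }
    }

    /** Operator node: holds its input and the user function, applied once per element. */
    static final class ToyMapOperator<I, O> implements ToyDataSet<O> {
        private final ToyDataSet<I> input;
        private final Function<I, O> mapper;

        ToyMapOperator(ToyDataSet<I> input, Function<I, O> mapper) {
            this.input = input;
            this.mapper = mapper;
        }

        @Override
        public List<O> collect() {
            List<O> out = new ArrayList<>();
            for (I element : input.collect()) {
                out.add(mapper.apply(element));
            }
            return out;
        }
    }

    /** Builds source -> map, then evaluates the plan. */
    static List<Integer> lengths(List<String> words) {
        return new ToySource<>(words).map(String::length).collect();
    }

    public static void main(String[] args) {
        System.out.println(lengths(Arrays.asList("flink", "dataset")));
    }
}
```

Since each operator is itself a data set, further transformations can be chained on the result of `map`, which mirrors how a chain of transformations forms a plan before anything runs.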

4. Data output

A DataSink is an operation that stores the resulting data.

Output is handled by OutputFormat.

For example, the result can be written as CSV:

    /**
     * Writes a {@link Tuple} DataSet as CSV file(s) to the specified location with the specified field and line delimiters.
     *
     * <p><b>Note: Only a Tuple DataSet can written as a CSV file.</b>
     * For each Tuple field the result of {@link Object#toString()} is written.
     *
     * @param filePath The path pointing to the location the CSV file is written to.
     * @param rowDelimiter The row delimiter to separate Tuples.
     * @param fieldDelimiter The field delimiter to separate Tuple fields.
     * @param writeMode The behavior regarding existing files. Options are NO_OVERWRITE and OVERWRITE.
     *
     * @see Tuple
     * @see CsvOutputFormat
     * @see DataSet#writeAsText(String) Output files and directories
     */
    public DataSink<T> writeAsCsv(String filePath, String rowDelimiter, String fieldDelimiter, WriteMode writeMode) {
        return internalWriteAsCsv(new Path(filePath), rowDelimiter, fieldDelimiter, writeMode);
    }

    @SuppressWarnings("unchecked")
    private <X extends Tuple> DataSink<T> internalWriteAsCsv(Path filePath, String rowDelimiter, String fieldDelimiter, WriteMode wm) {
        Preconditions.checkArgument(getType().isTupleType(), "The writeAsCsv() method can only be used on data sets of tuples.");
        CsvOutputFormat<X> of = new CsvOutputFormat<>(filePath, rowDelimiter, fieldDelimiter);
        if (wm != null) {
            of.setWriteMode(wm);
        }
        return output((OutputFormat<T>) of);
    }
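To make the roles of rowDelimiter and fieldDelimiter concrete, here is a small plain-Java sketch that renders tuple-like rows to a CSV string, calling toString on each field as the Javadoc describes. It rests on assumptions: rows are modeled as `Object[]` rather than Flink Tuple types, `toCsv` is a hypothetical helper, the row delimiter is appended after every row, and null handling, quoting, file output, and write modes are ignored. It is not how CsvOutputFormat is implemented.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of CSV rendering; hypothetical helper, not Flink's CsvOutputFormat.
final class CsvSinkSketch {

    /**
     * Joins the fields of each row with fieldDelimiter and terminates
     * each row with rowDelimiter (an assumption of this sketch).
     */
    static String toCsv(List<Object[]> rows, String rowDelimiter, String fieldDelimiter) {
        StringBuilder sb = new StringBuilder();
        for (Object[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) {
                    sb.append(fieldDelimiter);
                }
                // Each field is written via its string representation.
                sb.append(String.valueOf(row[i]));
            }
            sb.append(rowDelimiter);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
                new Object[] {"flink", 1},
                new Object[] {"dataset", 2});
        System.out.print(toCsv(rows, "\n", ","));
    }
}
```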

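5. Summary

1. Flink uses InputFormat to read data from various sources and convert it into a DataSet.

2. Flink offers a rich set of transformations. A DataSet can be turned into another DataSet through a transformation, which is implemented internally by a Function and an Operator.

3. Flink uses OutputFormat to turn a DataSet into a DataSink, which finally writes the data to the target storage system.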

