18 Nov 2014 by Fabian Hüske (@fhueske)

Apache Hadoop is an industry standard for scalable analytical data processing. Many data analysis applications have been implemented as Hadoop MapReduce jobs and run in clusters around the world. Apache Flink can be an alternative to MapReduce and improves it in many dimensions. Among other features, Flink provides much better performance and offers APIs in Java and Scala, which are very easy to use. Similar to Hadoop, Flink’s APIs provide interfaces for Mapper and Reducer functions, as well as Input- and OutputFormats, along with many more operators. While conceptually equivalent, Hadoop’s and Flink’s interfaces for these functions are unfortunately not source compatible.

Flink’s Hadoop Compatibility Package

To close this gap, Flink provides a Hadoop Compatibility package to wrap functions implemented against Hadoop’s MapReduce interfaces and embed them in Flink programs. This package was developed as part of a Google Summer of Code 2014 project.

With the Hadoop Compatibility package, you can reuse all your Hadoop

  • InputFormats (mapred and mapreduce APIs)
  • OutputFormats (mapred and mapreduce APIs)
  • Mappers (mapred API)
  • Reducers (mapred API)

in Flink programs without changing a line of code. Moreover, Flink natively supports all Hadoop data types (Writables and WritableComparables).
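
To give an idea of what native Writable support means, here is a minimal sketch (not part of the original post) of a purely native Flink program that uses Hadoop’s Text and LongWritable types directly; the element values, the output path argument, and the job name are made up for illustration.

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableTypesExample {

    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hadoop Writables can be used as regular Flink data types.
        DataSet<Tuple2<Text, LongWritable>> pairs = env.fromElements(
            new Tuple2<Text, LongWritable>(new Text("flink"), new LongWritable(1L)),
            new Tuple2<Text, LongWritable>(new Text("hadoop"), new LongWritable(1L)),
            new Tuple2<Text, LongWritable>(new Text("flink"), new LongWritable(1L)));

        // Grouping on a WritableComparable key (Text) works like on any other key type.
        DataSet<Tuple2<Text, LongWritable>> counts = pairs
            .groupBy(0)
            .reduce(new ReduceFunction<Tuple2<Text, LongWritable>>() {
                @Override
                public Tuple2<Text, LongWritable> reduce(
                        Tuple2<Text, LongWritable> a, Tuple2<Text, LongWritable> b) {
                    // Sum the LongWritable values of two records with the same Text key.
                    return new Tuple2<Text, LongWritable>(a.f0, new LongWritable(a.f1.get() + b.f1.get()));
                }
            });

        counts.writeAsText(args[0]); // illustrative output path
        env.execute("Writable types in a native Flink program");
    }
}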

The following code snippet shows a simple Flink WordCount program that solely uses Hadoop data types, InputFormat, OutputFormat, Mapper, and Reducer functions.

// Definition of Hadoop Mapper function
public class Tokenizer implements Mapper<LongWritable, Text, Text, LongWritable> { ... }

// Definition of Hadoop Reducer function
public class Counter implements Reducer<Text, LongWritable, Text, LongWritable> { ... }

public static void main(String[] args) {
    final String inputPath = args[0];
    final String outputPath = args[1];

    final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Setup Hadoop’s TextInputFormat
    HadoopInputFormat<LongWritable, Text> hadoopInputFormat =
        new HadoopInputFormat<LongWritable, Text>(
            new TextInputFormat(), LongWritable.class, Text.class, new JobConf());
    TextInputFormat.addInputPath(hadoopInputFormat.getJobConf(), new Path(inputPath));

    // Read a DataSet with the Hadoop InputFormat
    DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopInputFormat);

    DataSet<Tuple2<Text, LongWritable>> words = text
        // Wrap Tokenizer Mapper function
        .flatMap(new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(new Tokenizer()))
        .groupBy(0)
        // Wrap Counter Reducer function (used as Reducer and Combiner)
        .reduceGroup(new HadoopReduceCombineFunction<Text, LongWritable, Text, LongWritable>(
            new Counter(), new Counter()));

    // Setup Hadoop’s TextOutputFormat
    HadoopOutputFormat<Text, LongWritable> hadoopOutputFormat =
        new HadoopOutputFormat<Text, LongWritable>(
            new TextOutputFormat<Text, LongWritable>(), new JobConf());
    hadoopOutputFormat.getJobConf().set("mapred.textoutputformat.separator", " ");
    TextOutputFormat.setOutputPath(hadoopOutputFormat.getJobConf(), new Path(outputPath));

    // Output & Execute
    words.output(hadoopOutputFormat);
    env.execute("Hadoop Compat WordCount");
}
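The bodies of the Tokenizer and Counter functions are elided above. A possible implementation against Hadoop’s classic mapred API could look like the following sketch (each class in its own source file; the tokenization rule is only an example):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Splits each input line into words and emits (word, 1) pairs.
public class Tokenizer implements Mapper<LongWritable, Text, Text, LongWritable> {

    public void configure(JobConf conf) { /* no configuration needed */ }

    public void close() throws IOException { /* nothing to clean up */ }

    public void map(LongWritable offset, Text line,
            OutputCollector<Text, LongWritable> out, Reporter reporter) throws IOException {
        for (String word : line.toString().toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.collect(new Text(word), new LongWritable(1L));
            }
        }
    }
}

// Sums up all counts that were emitted for a word.
public class Counter implements Reducer<Text, LongWritable, Text, LongWritable> {

    public void configure(JobConf conf) { /* no configuration needed */ }

    public void close() throws IOException { /* nothing to clean up */ }

    public void reduce(Text word, Iterator<LongWritable> counts,
            OutputCollector<Text, LongWritable> out, Reporter reporter) throws IOException {
        long sum = 0;
        while (counts.hasNext()) {
            sum += counts.next().get();
        }
        out.collect(word, new LongWritable(sum));
    }
}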

As you can see, Flink represents Hadoop key-value pairs as Tuple2<key, value> tuples. Note that the program uses Flink’s groupBy() transformation to group data on the key field (field 0 of the Tuple2<key, value>) before it is given to the Reducer function. At the moment, the compatibility package does not evaluate custom Hadoop partitioners, sorting comparators, or grouping comparators.

Hadoop functions can be used at any position within a Flink program and can, of course, be mixed with native Flink functions. This means that, instead of assembling a workflow of Hadoop jobs in an external driver method or using a workflow scheduler such as Apache Oozie, you can implement an arbitrarily complex Flink program consisting of multiple Hadoop Input- and OutputFormats and Mapper and Reducer functions. When executing such a Flink program, data will be pipelined between your Hadoop functions and will not be written to HDFS just for the purpose of data exchange.
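
For illustration (this continuation is not part of the original post), the result of the wrapped Counter Reducer above could be post-processed with native Flink functions, for example to unwrap the Hadoop Writables into plain Java types and filter the counts. MapFunction and FilterFunction are assumed to be Flink’s interfaces from org.apache.flink.api.common.functions, and the threshold is arbitrary:

// 'words' is the DataSet<Tuple2<Text, LongWritable>> produced by the wrapped Counter Reducer above.
DataSet<Tuple2<String, Long>> frequentWords = words
    // Unwrap the Hadoop Writables into native Java types with a plain Flink MapFunction.
    .map(new MapFunction<Tuple2<Text, LongWritable>, Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> map(Tuple2<Text, LongWritable> t) {
            return new Tuple2<String, Long>(t.f0.toString(), t.f1.get());
        }
    })
    // Keep only words that occur more than 1000 times (arbitrary threshold).
    .filter(new FilterFunction<Tuple2<String, Long>>() {
        @Override
        public boolean filter(Tuple2<String, Long> t) {
            return t.f1 > 1000L;
        }
    });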

What comes next?

While the Hadoop compatibility package is already very useful, we are currently working on a dedicated Hadoop Job operation to embed and execute Hadoop jobs as a whole in Flink programs, including their custom partitioning, sorting, and grouping code. With this feature, you will be able to chain multiple Hadoop jobs, mix them with Flink functions, and use other operations such as Spargel operations (Pregel/Giraph-style jobs).

Summary

Flink lets you reuse a lot of the code you wrote for Hadoop MapReduce, including all data types, all Input- and OutputFormats, and the Mappers and Reducers of the mapred API. Hadoop functions can be used within Flink programs and mixed with all other Flink functions. Due to Flink’s pipelined execution, Hadoop functions can be arbitrarily assembled without data exchange via HDFS. Moreover, the Flink community is currently working on a dedicated Hadoop Job operation to support the execution of Hadoop jobs as a whole.

If you want to use Flink’s Hadoop compatibility package, check out our documentation.
