Hadoop combiners are a very powerful tool to speed up our computations. We already saw what a combiner is in a previous post and we also have seen another form of optimization inthis post. Let's put all together to get the broader idea. 
The combiners are optimizations that can be used with Hadoop to make a local-reduction: the idea is to reduce the key-value pairs directly on the mapper, to avoid transmitting all of them to the reducers. 
Let's get back to the Top20 example from the previous post, which finds the top 20 words most used in a text. The Hadoop output of this job is shown below:

...
Map input records=4239
Map output records=37817
Map output bytes=359621
Input split bytes=118
Combine input records=0
Combine output records=0
Reduce input groups=4987
Reduce shuffle bytes=435261
Reduce input records=37817
Reduce output records=20
...

As we can see in the lines highlighted in bold, without a combiner we have 4239 lines in input for the mappers and 37817 key-value pairs emitted (the number of different words of the text). Having defined no combiner, the input and output records of combiners are 0, and so the input records for the reducers are exactly those emitted by the mappers, 37817.

Let's define a simple combiner:

    public static class WordCountCombiner extends Reducer<text, intwritable,="" text,="" intwritable=""> {

        @Override
public void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException { // computes the number of occurrences of a single word
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}

As we can see, the code has the same logic of the reducer, since its target is the same: reducing key/value pairs. 
Running the job having set the combiner gives us this result:

...
Map input records=4239
Map output records=37817
Map output bytes=359621
Input split bytes=116
Combine input records=37817
Combine output records=20

Reduce input groups=20
Reduce shuffle bytes=194
Reduce input records=20
Reduce output records=20
...

Looking at the output from Hadoop, we see that now the combiner has 37817 input records: this means that the records emitted from the mappers were all sent to the combiners; the result of the combiners is of 20 records emitted, which is the number of records received by the reducers. 
Wow, that's a great result! We avoided the transmission of a lot of data: just 20 records instead of 37817 that we had without the combiner.

But there's a big disadvantage using combiners: since is an optimization, Hadoop does not guarantee their execution. So, what can we do to ensure a reduction at the mapper-level? Simple: we can put the logic of the reducer inside the mapper!

This is exactly what we've done in the mapper of this post. This pattern is called "in-mapper combiner". The reduce part is started at mapper level, so that the key-value pairs sent to the reducers are minimized. 
Let's see Hadoop output with this pattern (in-mapper combiner and without the stand-alone combiner):

...
Map input records=4239
Map output records=4987
Map output bytes=61522
Input split bytes=118
Combine input records=0
Combine output records=0

Reduce input groups=4987
Reduce shuffle bytes=71502
Reduce input records=4987
Reduce output records=20...

Compared to the execution of the other mapper (without combining), this mapper outputs only 4987 records instead of the 37817 that are emitted to the reducers. A big reduction, even if not as big as the one obtained with the stand-alone combiner. 
And what happens if we decide to couple the in-mapper combiner pattern and the stand-alone combiner? Well, we've got the best of the two:

...
Map input records=4239
Map output records=4987
Map output bytes=61522
Input split bytes=116
Combine input records=4987
Combine output records=20

Reduce input groups=20
Reduce shuffle bytes=194
Reduce input records=20
Reduce output records=20
...

In this last case, we have the best performance because we're emitting from the mapper a reduced number of records, the combiners (if it's executed) reduce even more the size of the data to be emitted. The only downside of this approach I can think of is that it takes a lot of time to be coded.

from: http://andreaiacono.blogspot.com/2014/05/more-about-hadoop-combiners.html

更为详细的介绍Hadoop combiners-More about Hadoop combiners的更多相关文章

  1. 原来你是这样的BERT,i了i了! —— 超详细BERT介绍(一)BERT主模型的结构及其组件

    原来你是这样的BERT,i了i了! -- 超详细BERT介绍(一)BERT主模型的结构及其组件 BERT(Bidirectional Encoder Representations from Tran ...

  2. Window VNC远程控制LINUX:VNC详细配置介绍

    Window VNC远程控制LINUX:VNC详细配置介绍 //---------------------------------------vnc linux下的详细配置 1.VNC的启动/停止/重 ...

  3. Hadoop介绍及最新稳定版Hadoop 2.4.1下载地址及单节点安装

     Hadoop介绍 Hadoop是一个能对大量数据进行分布式处理的软件框架.其基本的组成包括hdfs分布式文件系统和可以运行在hdfs文件系统上的MapReduce编程模型,以及基于hdfs和MapR ...

  4. ThinkPHP 自动创建数据、自动验证、自动完成详细例子介绍(十九)

    原文:ThinkPHP 自动创建数据.自动验证.自动完成详细例子介绍(十九) 1:自动创建数据 //$name=$_POST['name']; //$password=$_POST['password ...

  5. hadoop学习第一天-hadoop初步环境搭建&伪分布式计算配置(详细)

    一.虚拟机环境搭建 我们用的虚拟机为vmware,Linux镜像为centOS6.5. vmware安装 安装没什么多说的,一路下一步,但是在新建虚拟机的时候有两个地方需要注意: 1.分配处理器1个就 ...

  6. [原]Redis详细配置介绍

    Redis详细配置介绍 # redis 配置文件示例 # 当你需要为某个配置项指定内存大小的时候,必须要带上单位, # 通常的格式就是 1k 5gb 4m 等酱紫: # # 1k => 1000 ...

  7. 更为详细的Txtsetup.sif文件解释

    更为详细的Txtsetup.sif文件解释;代码页定义, 以免文本安装模式下无法正常显示简体中文 (以下基本都是跟简体中文相关的, 不同语言版本的 Windows, 此处定义也不同)[nls]Ansi ...

  8. 详细版在虚拟机安装和使用hadoop分布式集群

    集群模式: 一台master 192.168.85.2 一台slave  192.168.85.3 jdk jdk1.8.0_74(版本不重要,看喜欢) hadoop版本 2.7.2(版本不重要,2. ...

  9. Hadoop(三) HADOOP常用命令参数介绍

    -help 功能:输出这个命令参数手册 -ls                  功能:显示目录信息 示例: hadoop fs -ls hdfs://hadoop-server01:9000/ 备注 ...

随机推荐

  1. django的orm中F对象的使用

    今天不巧就用上了. 就是将数据库的字段,自增1的场景. from django.db.models import F DeployPool.objects.filter(name=deployvers ...

  2. day5模块

    模块,用一砣代码实现了某个功能的代码集合. 类似于函数式编程和面向过程编程,函数式编程则完成一个功能,其他代码用来调用即可,提供了代码的重用性和代码间的耦合.而对于一个复杂的功能来,可能需要多个函数才 ...

  3. Linux下Diff命令

    一般正常比较两个文件用vimdiff,算是直接进入vim界面,如果比较两个文件夹下面的文件,可以用diff,注意,这里只会比较文件夹下面的同名文件,他会列出不一样的点. 参考Linux下Diff命令使 ...

  4. Python爬虫-urllib的基本用法

    from urllib import response,request,parse,error from http import cookiejar if __name__ == '__main__' ...

  5. bzoj 1816 二分

    思路:二分答案,然后我们贪心地先不填最小的一堆,看在最小的一堆消耗完之前能不能填满其他堆. #include<bits/stdc++.h> #define LL long long #de ...

  6. java 读入文件 FileInputStream

    package com.mkyong.io; import java.io.File; import java.io.FileInputStream; import java.io.IOExcepti ...

  7. HDU 5692 Snacks

    题目链接[http://acm.hdu.edu.cn/showproblem.php?pid=5692] 题意:一棵树,每个节点有权值,有两种操作:1.修改某个点的权值,2.求以x根的子树中的节点到根 ...

  8. 【9.22校内测试】【可持久化并查集(主席树实现)】【DP】【点双联通分量/割点】

    1 build1.1 Description从前有一个王国,里面有n 座城市,一开始两两不连通.现在国王将进行m 次命令,命令可能有两种,一种是在u 和v 之间修建道路,另一种是询问在第u 次命令执行 ...

  9. bzoj 3784

    第三道点分治. 首先找到黄学长的题解,他叫我参考XXX的题解,但已经没有了,然后找到另一个博客的简略题解,没看懂,最后看了一个晚上黄学长代码,写出来然后,写暴力都拍了小数据,但居然超时,....然后改 ...

  10. bzoj 2038 小z的袜子 莫队例题

    莫队,利用可以快速地通过一个问题的答案得到另一问题的答案这一特性,合理地组织问题的求解顺序,将已解决的问题帮助解决当前问题,来优化时间复杂度. 典型用法:处理静态(无修改)离线区间查询问题. 线段树也 ...