In this post we'll see how to find the top-n items of a dataset. We'll again use the Flatland book from a previous post: in that example we used the WordCount program to count the occurrences of every single word in the book; now we want to find the top-n most used words.

Let's start with the mapper:

public static class TopNMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

        // replaces punctuation and special characters with spaces
        String cleanLine = value.toString().toLowerCase().replaceAll("[_|$#<>\\^=\\[\\]\\*/\\\\,;,.\\-:()?!\"']", " ");
        StringTokenizer itr = new StringTokenizer(cleanLine);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken().trim());
            context.write(word, one);
        }
    }
}

The mapper is really straightforward: the TopNMapper class defines an IntWritable set to 1 and a Text object; its map() method, like in the previous post, cleans and splits every line of the book into single words and sends every word to the reducers with a value of 1.
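To make the cleaning step concrete, here is a small standalone snippet, not part of the original program, that shows what the regex and the tokenizer produce for a sample line (the famous "Upward, not Northward" from Flatland):

import java.util.StringTokenizer;

public class CleanLineDemo {

    public static void main(String[] args) {
        String line = "Upward, not Northward!";
        // same regex used by the mapper: punctuation and special characters become spaces
        String cleanLine = line.toLowerCase().replaceAll("[_|$#<>\\^=\\[\\]\\*/\\\\,;,.\\-:()?!\"']", " ");
        StringTokenizer itr = new StringTokenizer(cleanLine);
        while (itr.hasMoreTokens()) {
            // prints "upward", "not" and "northward", one per line
            System.out.println(itr.nextToken().trim());
        }
    }
}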

The reducer is more interesting:

public static class TopNReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private Map<Text, IntWritable> countMap = new HashMap<>();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        // computes the number of occurrences of a single word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }

        // puts the number of occurrences of this word into the map;
        // we need a new Text object because Hadoop reuses the key instance
        countMap.put(new Text(key), new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {

        // sorts the map by values and emits the first 20 entries
        Map<Text, IntWritable> sortedMap = sortByValues(countMap);

        int counter = 0;
        for (Text key : sortedMap.keySet()) {
            if (counter++ == 20) {
                break;
            }
            context.write(key, sortedMap.get(key));
        }
    }
}

We override two methods: reduce() and cleanup(). Let's examine the reduce() method. 
As we've seen in the mapper's code, the keys the reducer receives are the single words contained in the book; at the beginning of the method, we compute the sum of all the values received from the mappers for this key, which is the number of occurrences of this word inside the book; then we put the word and the number of occurrences into a HashMap. Note that we're not directly putting into the map the Text object that contains the word, because that instance is reused many times by Hadoop for performance reasons; instead, we put a new Text object based on the received one.

To output the top-n values, we have to compute the number of occurrences of every word, sort the words by their number of occurrences and then extract the first n. In the reduce() method we don't write any value to the output, because we can sort the words only after we have collected them all; the cleanup() method is called by Hadoop after the reducer has received all its data, so we override this method to be sure that our HashMap is filled up with all the words.
Let's look at the method: first we sort the HashMap by values (using code from this post); then we loop over the key set and output the first 20 items.
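The sorting code itself is not shown in the listing above; this is a minimal sketch of what a sortByValues() helper could look like, assuming descending order by value, Comparable values and that java.util.List, LinkedList, Collections, Comparator and LinkedHashMap are imported (the implementation in the linked post may differ in details):

private static <K, V extends Comparable<V>> Map<K, V> sortByValues(Map<K, V> map) {

    // copies the entries into a list and sorts it by descending value
    List<Map.Entry<K, V>> entries = new LinkedList<>(map.entrySet());
    Collections.sort(entries, new Comparator<Map.Entry<K, V>>() {
        @Override
        public int compare(Map.Entry<K, V> e1, Map.Entry<K, V> e2) {
            return e2.getValue().compareTo(e1.getValue());
        }
    });

    // a LinkedHashMap preserves the insertion (i.e. sorted) order
    Map<K, V> sortedMap = new LinkedHashMap<>();
    for (Map.Entry<K, V> entry : entries) {
        sortedMap.put(entry.getKey(), entry.getValue());
    }
    return sortedMap;
}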

The complete code is available on my GitHub.
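For completeness, the job also needs the usual driver; this is a minimal sketch of what it could look like (the TopN class name and the use of a single reducer are my assumptions, not necessarily what the repository does):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TopN {

    // TopNMapper and TopNReducer as defined above...

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "top n");
        job.setJarByClass(TopN.class);
        job.setMapperClass(TopNMapper.class);
        job.setReducerClass(TopNReducer.class);
        // a single reducer sees all the words, so the top-n it emits is global;
        // with more reducers, each one would emit only the top-n of its own partition
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}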

The output of the reducer gives us the 20 most used words in Flatland:

the 2286
of 1634
and 1098
to 1088
a 936
i 735
in 713
that 499
is 429
you 419
my 334
it 330
as 322
by 317
not 317
or 299
but 279
with 273
for 267
be 252

Predictably, the most used words in the book are articles, conjunctions, adjectives, prepositions and personal pronouns.

This MapReduce program is not very efficient: the mappers transfer a lot of data to the reducers, since every single word of the book is emitted together with the number 1, causing a very high network load. The phase in which mappers send data to the reducers is called "Shuffle and sort" and is explained in more detail in the free chapter of "Hadoop: The Definitive Guide" by Tom White.

In the next posts we'll see how to improve the performance of the Shuffle and sort phase.

from: http://andreaiacono.blogspot.com/2014/03/mapreduce-for-top-n-items.html
