Top N之MapReduce程序加强版Enhanced MapReduce for Top N items

In the last post we saw how to write a MapReduce program for finding the top-n items of a dataset.

The code in the mapper emits a pair key-value for every word found, passing the word as the key and 1 as the value. Since the book has roughly 38,000 words, this means that the information transmitted from mappers to reducers is proportional to that number. A way to improve network performance of this program is to rewrite the mapper as follows:

public static class TopNMapper extends Mapper<object, text,="" intwritable=""> {

        private Map<String, Integer> countMap = new HashMap<>();

        @Override

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

            String cleanLine = value.toString().toLowerCase().replaceAll("[_|$#<>\\^=\\[\\]\\*/\\\\,;,.\\-:()?!\"']", " ");

            StringTokenizer itr = new StringTokenizer(cleanLine);

            while (itr.hasMoreTokens()) {

                String word = itr.nextToken().trim();

                if (countMap.containsKey(word)) {

                    countMap.put(word, countMap.get(word)+1);

                }

                else {

                    countMap.put(word, 1);

                }

            }

        }

        @Override

        protected void cleanup(Context context) throws IOException, InterruptedException {

            for (String key: countMap.keySet()) {

                context.write(new Text(key), new IntWritable(countMap.get(key)));

            }

        }

    }

As we can see, we define an HashMap that uses words as the keys and the number of occurrences as the values; inside the loop, instead of emitting every word to the reducer, we put it into the map: if the word was already put, we increase its value, otherwise we set it to one. We also overrode the cleanup method, which is a method that Hadoop calls when the mapper has finished computing its input; in this method we now can emit the words to the reducers: doing this way, we can save a lot of network transmissions because we send to the reducers every word only once.

The complete code of this class is available on my github.
In the next post we'll see how to use combiners to leverage this approach.

from: http://andreaiacono.blogspot.com/2014/03/enhanced-mapreduce-for-top-n-items.html

Top N之MapReduce程序加强版Enhanced MapReduce for Top N items的更多相关文章

hadoop 第一个 mapreduce 程序（对MapReduce的几种固定代码的理解）
1.2MapReduce 和 HDFS 是如何工作的 MapReduce 其实是两部分,先是 Map 过程,然后是 Reduce 过程.从词频计算来说,假设某个文件块里的一行文字是”Thisis a ...
hive--构建于hadoop之上、让你像写SQL一样编写MapReduce程序
hive介绍什么是hive? hive:由Facebook开源用于解决海量结构化日志的数据统计 hive是基于hadoop的一个数据仓库工具,可以将结构化的数据映射为数据库的一张表,并提供类SQL查 ...
攻城狮在路上（陆）-- 配置hadoop本地windows运行MapReduce程序环境
本文的目的是实现在windows环境下实现模拟运行Map/Reduce程序.最终实现效果:MapReduce程序不会被提交到实际集群,但是运算结果会写入到集群的HDFS系统中. 一.环境说明: ...
windows环境下Eclipse开发MapReduce程序遇到的四个问题及解决办法
按此文章<Hadoop集群(第7期)_Eclipse开发环境设置>进行MapReduce开发环境搭建的过程中遇到一些问题,饶了一些弯路,解决办法记录在此: 文档目的: 记录windows环 ...
编写简单的Mapreduce程序并部署在Hadoop2.2.0上运行
今天主要来说说怎么在Hadoop2.2.0分布式上面运行写好的 Mapreduce 程序. 可以在eclipse写好程序,export或用fatjar打包成jar文件. 先给出这个程序所依赖的Mave ...
如何在Hadoop的MapReduce程序中处理JSON文件
简介: 最近在写MapReduce程序处理日志时,需要解析JSON配置文件,简化Java程序和处理逻辑.但是Hadoop本身似乎没有内置对JSON文件的解析功能,我们不得不求助于第三方JSON工具包. ...
hadoop——在命令行下编译并运行map-reduce程序 2
hadoop map-reduce程序的编译需要依赖hadoop的jar包,我尝试javac编译map-reduce时指定-classpath的包路径,但无奈hadoop的jar分布太散乱,根据自己 ...
hadoop-初学者写map-reduce程序中容易出现的问题 3
1.写hadoop的map-reduce程序之前所必须知道的基础知识: 1)hadoop map-reduce的自带的数据类型: Hadoop提供了如下内容的数据类型,这些数据类型都实现了Writab ...
mapreduce程序编写(WordCount)
折腾了半天.终于编写成功了第一个自己的mapreduce程序,并通过打jar包的方式运行起来了. 运行环境: windows 64bit eclipse 64bit jdk6.0 64bit 一.工程 ...

随机推荐

面试题41：和为s的两个数字 || 和为s的连续正数序列
和为s的两个数字题目:输入一个递增排序的数组和一个数字s,在数组中查找两个数,使得它们的和正好是s.如果有多对数字的和等于s,输出任意一对即可. 有点类似于夹逼的思想注意两个int相加的和要用lo ...
【LOJ】#2105. 「TJOI2015」概率论
题解可以说是什么找规律好题了但是要推生成函数,非常神奇-- 任何的一切都可以用\(n^2\)dp说起我们所求即是所有树的叶子总数/所有树的方案数我们可以列出一个递推式,设\(g(x)\)为\ ...
【51nod】1742 开心的小Q
题解我们由于莫比乌斯函数如果有平方数因子就是0,那么我们可以列出这样的式子 \(\sum_{i = 1}^{n} \sum_{d|i} (1 - |\mu(d)|)\) 然后枚举倍数 \(\sum_ ...
bzoj 1271
思路:因为被占奇数次的点只有一个, 那么我们可以将数轴分成两部分,奇数次点之前的前缀和为偶数,之后的前缀和为奇数, 然后就可以二分了. #include<bits/stdc++.h> #d ...
基于 Laravel 开发博客应用系列 —— 设置 Windows 本地开发环境
1.安装原生PHP 下载/解压 PHP 到 PHP 下载页下载最新版本的 PHP(如果使用 Laravel 5.1 的话需要 PHP 5.5.9+ 版本),解压下载的zip格式压缩文件到本地目录,比如 ...
哪来的gou zi 阿龙（最新更新于1.21日）
众所周知,信息竞赛教室有一个特gou zi的人,叫做阿龙. 这个人呢,特别好玩,特别gou zi 还有一个人,叫Sugar,这个人特别喜欢和阿龙闹,so,一系列爆笑无脑的事就发生了! 1.谁是鱼? 一 ...
JAVAEE——宜立方商城09：Activemq整合spring的应用场景、添加商品同步索引库、商品详情页面动态展示与使用缓存
1. 学习计划 1.Activemq整合spring的应用场景 2.添加商品同步索引库 3.商品详情页面动态展示 4.展示详情页面使用缓存 2. Activemq整合spring 2.1. 使用方法 ...
Redis和MySQL数据一致中出现的几种情况
1. MySQL持久化数据,Redis只读数据 redis在启动之后,从数据库加载数据. 读请求: 不要求强一致性的读请求,走redis,要求强一致性的直接从mysql读取写请求: 数据首先都写到数 ...
Openstack_通用模块_Oslo_vmware 创建 vCenter 虚拟机快照
创建虚拟机快照 vSphere Create Snapshot 文档 Snapshot 是虚拟机磁盘文件(VMDK)在某个点及时的复本.包含了虚拟机所有虚拟磁盘上的数据状态和这个虚拟机的电源状态(on ...
QT学习笔记7：C++函数默认参数
C++中允许为函数提供默认参数,又名缺省参数. 使用默认参数时的注意事项: ① 有函数声明(原型)时,默认参数可以放在函数声明或者定义中,但只能放在二者之一.建议放在函数声明中. double sqr ...

Top N之MapReduce程序加强版Enhanced MapReduce for Top N items

Top N之MapReduce程序加强版Enhanced MapReduce for Top N items的更多相关文章

随机推荐

热门专题