mapreduce (5): Inverted index with MapReduce, revised version (a combiner pre-aggregates map output locally before it is sent to the reducers)

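(I kept feeling that the implementation in the previous post was flawed: http://www.cnblogs.com/i80386/p/3444726.html) A combiner pre-aggregates map output locally, on the machine that ran the map, before the data is shuffled to the reducers.
So here is a new implementation: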

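Approach:
The first MapReduce job only computes <word_docid, count>, i.e. how many times a given word occurs in a given document. (It works just like wordcount, except that the key is word_docid instead of word.)
The second MapReduce job splits word_docid apart in the map phase and regroups the records as <word, docid_count>. The combine and reduce phases (combine and reduce use the same function) then merge them into the form <word, doc1:count1,doc2:count2,doc3:count3>.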

1 Approach, walked through on three sample files (file name, then content):
0.txt MapReduce is simple
1.txt MapReduce is powerfull is simple
2.txt Hello MapReduce bye MapReduce

Implemented as two jobs.
I. The first job (same as wordcount, except that the key is word:docid instead of word)
1 map function: context.write(word:docid, 1), i.e. word:docid is the map output key
output key        output value
MapReduce:0.txt 1
is:0.txt 1
simple:0.txt 1
MapReduce:1.txt 1
is:1.txt 1
powerfull:1.txt 1
is:1.txt 1
simple:1.txt 1
Hello:2.txt 1
MapReduce:2.txt 1
bye:2.txt 1
MapReduce:2.txt 1
2 Partitioner: HashPartitioner
Omitted; it partitions on the map output key (word:docid)
3 reduce function: sums the input values
output key        output value
MapReduce:0.txt 1
is:0.txt 1
simple:0.txt 1
MapReduce:1.txt 1
is:1.txt 2
powerfull:1.txt 1
simple:1.txt 1
Hello:2.txt 1
MapReduce:2.txt 2
bye:2.txt 1
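
The first job can be imitated in plain Java, without Hadoop, to check the counts listed above. This is only a sketch: the class and method names (WordDocCountSketch, countWordDoc, sampleDocs) are made up for illustration, and a single in-memory map stands in for the map, shuffle and reduce steps.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the first job; this is not Hadoop code.
class WordDocCountSketch {

    // Emits word:docid for every token and sums the 1s per key,
    // which is what SplitMapper plus IntSumReducer do together.
    static Map<String, Integer> countWordDoc(Map<String, String> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String token : doc.getValue().trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // an empty document yields one empty token
                }
                counts.merge(token + ":" + doc.getKey(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // The three sample files from the walkthrough.
    static Map<String, String> sampleDocs() {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("0.txt", "MapReduce is simple");
        docs.put("1.txt", "MapReduce is powerfull is simple");
        docs.put("2.txt", "Hello MapReduce bye MapReduce");
        return docs;
    }

    public static void main(String[] args) {
        countWordDoc(sampleDocs())
                .forEach((key, count) -> System.out.println(key + " " + count));
    }
}
```

The TreeMap is used only to get a stable print order; the real job makes no promise about the order of keys across reducers.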
II. The second job
1 map function: splits the key at the ':' and moves the docid into the value
input key  input value  =>  output key  output value
MapReduce:0.txt 1 => MapReduce 0.txt:1
is:0.txt 1        => is 0.txt:1
simple:0.txt 1    => simple 0.txt:1
MapReduce:1.txt 1 => MapReduce 1.txt:1
is:1.txt 2        => is 1.txt:2
powerfull:1.txt 1 => powerfull 1.txt:1
simple:1.txt 1    => simple 1.txt:1
Hello:2.txt 1     => Hello 2.txt:1
MapReduce:2.txt 2 => MapReduce 2.txt:2
bye:2.txt 1       => bye 2.txt:1
2 reduce function (concatenates the values)
output key    output value
MapReduce 0.txt:1,1.txt:1,2.txt:2
is 0.txt:1,1.txt:2
simple 0.txt:1,1.txt:1
powerfull 1.txt:1
Hello 2.txt:1
bye 2.txt:1
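
The regrouping done by the second job can likewise be sketched in plain Java. The names PostingsSketch, buildPostings and sampleCounts are hypothetical. The input is the <word:docid, count> table produced by the first job, hard-coded here so the example stands on its own.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the second job; this is not Hadoop code.
class PostingsSketch {

    // Splits each word:docid key, then joins the docid:count pairs per word.
    static Map<String, String> buildPostings(Map<String, Integer> wordDocCounts) {
        Map<String, StringJoiner> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : wordDocCounts.entrySet()) {
            int sep = e.getKey().lastIndexOf(':');
            String word = e.getKey().substring(0, sep);
            String docid = e.getKey().substring(sep + 1);
            grouped.computeIfAbsent(word, w -> new StringJoiner(","))
                    .add(docid + ":" + e.getValue());
        }
        Map<String, String> postings = new TreeMap<>();
        grouped.forEach((word, joiner) -> postings.put(word, joiner.toString()));
        return postings;
    }

    // Output of the first job for the three sample files.
    static Map<String, Integer> sampleCounts() {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("MapReduce:0.txt", 1);
        counts.put("is:0.txt", 1);
        counts.put("simple:0.txt", 1);
        counts.put("MapReduce:1.txt", 1);
        counts.put("is:1.txt", 2);
        counts.put("powerfull:1.txt", 1);
        counts.put("simple:1.txt", 1);
        counts.put("Hello:2.txt", 1);
        counts.put("MapReduce:2.txt", 2);
        counts.put("bye:2.txt", 1);
        return counts;
    }

    public static void main(String[] args) {
        buildPostings(sampleCounts())
                .forEach((word, list) -> System.out.println(word + " " + list));
    }
}
```

Because the input map is sorted, the postings of each word come out in docid order here. In the real job the order of values reaching a reducer is not guaranteed.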

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

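public class MyInvertIndex {

    // Job 1 mapper: emits (word:docid, 1) for every token of the input line.
    public static class SplitMapper extends
            Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // The docid is the name of the file this split belongs to.
            FileSplit split = (FileSplit) context.getInputSplit();
            String name = split.getPath().getName();
            // Split on runs of whitespace and skip the empty token that a
            // blank line or leading whitespace would otherwise produce.
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                outKey.set(token + ":" + name);
                context.write(outKey, ONE);
            }
        }
    }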

    // Job 2 mapper: turns (word:docid, count) into (word, docid:count).
    public static class CombineMapper extends
            Mapper<Text, IntWritable, Text, Text> {
        @Override
        public void map(Text key, IntWritable value, Context context)
                throws IOException, InterruptedException {
            String wordDoc = key.toString();
            // Split at the last ':' so that a word containing ':' stays intact.
            int splitIndex = wordDoc.lastIndexOf(":");
            context.write(new Text(wordDoc.substring(0, splitIndex)),
                    new Text(wordDoc.substring(splitIndex + 1) + ":"
                            + value.toString()));
        }
    }

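    // Job 2 combiner and reducer: concatenates the docid:count values of a word.
    // Every pass appends a "," after each value, so running it once as the
    // combiner and once more as the reducer leaves ",," at the end of a line.
    public static class CombineReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder buff = new StringBuilder();
            for (Text val : values) {
                buff.append(val.toString()).append(",");
            }
            context.write(key, new Text(buff.toString()));
        }
    }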

    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        String dir_in = "hdfs://localhost:9000/in_invertedindex";
        String dir_out = "hdfs://localhost:9000/out_invertedindex";
        Path in = new Path(dir_in);
        Path out = new Path(dir_out);
        // Temporary directory that carries job 1's output over to job 2.
        Path path_tmp = new Path("word_docid"
                + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
        Configuration conf = new Configuration();
        try {
            // Job 1: count the occurrences of every word:docid pair.
            // (On Hadoop 2+, new Job(conf, name) is deprecated in favour of
            // Job.getInstance(conf, name).)
            Job countJob = new Job(conf, "invertedindex_count");
            countJob.setJarByClass(MyInvertIndex.class);
            countJob.setInputFormatClass(TextInputFormat.class);
            countJob.setMapperClass(SplitMapper.class);
            countJob.setCombinerClass(IntSumReducer.class);
            countJob.setPartitionerClass(HashPartitioner.class);
            countJob.setMapOutputKeyClass(Text.class);
            countJob.setMapOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(countJob, in);
            countJob.setReducerClass(IntSumReducer.class);
            // countJob.setNumReduceTasks(1);
            countJob.setOutputKeyClass(Text.class);
            countJob.setOutputValueClass(IntWritable.class);
            // SequenceFile output keeps the Text/IntWritable types for job 2.
            countJob.setOutputFormatClass(SequenceFileOutputFormat.class);
            FileOutputFormat.setOutputPath(countJob, path_tmp);
            // Do not start job 2 on missing or partial input.
            if (!countJob.waitForCompletion(true)) {
                throw new IOException("invertedindex_count failed");
            }

            // Job 2: regroup the counts by word.
            Job combineJob = new Job(conf, "invertedindex_combine");
            combineJob.setJarByClass(MyInvertIndex.class);
            combineJob.setInputFormatClass(SequenceFileInputFormat.class);
            combineJob.setMapperClass(CombineMapper.class);
            combineJob.setCombinerClass(CombineReducer.class);
            combineJob.setPartitionerClass(HashPartitioner.class);
            combineJob.setMapOutputKeyClass(Text.class);
            combineJob.setMapOutputValueClass(Text.class);
            FileInputFormat.addInputPath(combineJob, path_tmp);
            combineJob.setReducerClass(CombineReducer.class);
            // combineJob.setNumReduceTasks(1);
            combineJob.setOutputKeyClass(Text.class);
            combineJob.setOutputValueClass(Text.class);
            combineJob.setOutputFormatClass(TextOutputFormat.class);
            FileOutputFormat.setOutputPath(combineJob, out);
            combineJob.waitForCompletion(true);
        } finally {
            // Remove the intermediate output whether or not the jobs succeeded.
            FileSystem.get(conf).delete(path_tmp, true);
        }
    }
}
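
One thing to keep in mind when the same class serves as combiner and reducer: Hadoop may run a combiner zero, one, or several times, so the merge step should give the same result however often it is applied. The sketch below is plain Java rather than a Hadoop Reducer, and PostingJoinSketch and join are hypothetical names. It shows one possible way to write such a merge: join with "," and drop empty pieces, so that no trailing or doubled separators build up.

```java
import java.util.Arrays;
import java.util.StringJoiner;

// Hypothetical helper; the same logic could back both a combiner and a reducer.
class PostingJoinSketch {

    // Joins partial posting lists with "," and skips empty pieces, so the
    // result has no leading, trailing, or doubled separators.
    static String join(Iterable<String> parts) {
        StringJoiner joiner = new StringJoiner(",");
        for (String part : parts) {
            for (String posting : part.split(",")) {
                if (!posting.isEmpty()) {
                    joiner.add(posting);
                }
            }
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // First pass, as a combiner would see it on one map's output.
        String combined = join(Arrays.asList("2.txt:2", "1.txt:1"));
        // Second pass, as the reducer would see it.
        String reduced = join(Arrays.asList(combined, "0.txt:1"));
        System.out.println(reduced);
    }
}
```

Applying join to its own output leaves the list unchanged, which is the property a combiner needs.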

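Output (each line ends in ",," because CombineReducer appended one "," when it ran as the combiner and another when it ran as the reducer):

Hello    2.txt:1,,
MapReduce    2.txt:2,1.txt:1,0.txt:1,,
bye    2.txt:1,,
is    1.txt:2,0.txt:1,,
powerfull    1.txt:1,,
simple    1.txt:1,0.txt:1,,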
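
A consumer of this index has to cope with the trailing commas. The following sketch parses one output line back into docid/count pairs. It assumes the key and the posting list are separated by whitespace (TextOutputFormat writes a tab by default), and the names IndexLineParserSketch and parsePostings are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for one line of the job's text output.
class IndexLineParserSketch {

    // Returns docid -> count in the order the postings appear on the line.
    static Map<String, Integer> parsePostings(String line) {
        Map<String, Integer> postings = new LinkedHashMap<>();
        String[] keyAndList = line.split("\\s+", 2);
        if (keyAndList.length < 2) {
            return postings; // no posting list on this line
        }
        for (String posting : keyAndList[1].split(",")) {
            int sep = posting.lastIndexOf(':');
            if (sep < 0) {
                continue; // skips empty pieces left by stray commas
            }
            postings.put(posting.substring(0, sep),
                    Integer.parseInt(posting.substring(sep + 1).trim()));
        }
        return postings;
    }

    public static void main(String[] args) {
        System.out.println(parsePostings("MapReduce\t2.txt:2,1.txt:1,0.txt:1,,"));
    }
}
```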
