mapreduce (3): Implementing an Inverted Index with MapReduce (Part 2)

Hadoop API reference:

http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Reducer.html

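A change in requirements: the per-document term-frequency list for each word must now be sorted, i.e. entries with a higher occurrence count come first.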
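To make the requirement concrete, here is a minimal plain-Java sketch (no Hadoop involved) that builds such an index in memory and orders each word's entries by descending count. The class name `LocalInvertedIndexDemo`, the sample documents, the assumption that each document is a single line, and the tie-break by document name are all illustrative choices, not part of the original job. The `count:docid;` output format mirrors what the reducer in this post emits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local demo of the desired result; not the MapReduce job itself.
class LocalInvertedIndexDemo {

    // Builds term -> "count:docid;count:docid;..." with the highest count first.
    static Map<String, String> build(Map<String, String> docs) {
        // term -> list of {docid, count} postings, terms kept in sorted order
        Map<String, List<String[]>> postings = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // Count each token within this document
            Map<String, Integer> counts = new HashMap<>();
            for (String token : doc.getValue().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                postings.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                        .add(new String[] {doc.getKey(), String.valueOf(e.getValue())});
            }
        }

        Map<String, String> index = new LinkedHashMap<>();
        for (Map.Entry<String, List<String[]>> e : postings.entrySet()) {
            List<String[]> list = e.getValue();
            // Highest count first; ties broken by document name (an assumption).
            list.sort((x, y) -> {
                int byCount = Integer.compare(Integer.parseInt(y[1]), Integer.parseInt(x[1]));
                return byCount != 0 ? byCount : x[0].compareTo(y[0]);
            });
            StringBuilder sb = new StringBuilder();
            for (String[] p : list) {
                sb.append(p[1]).append(':').append(p[0]).append(';');
            }
            index.put(e.getKey(), sb.toString());
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1.txt", "fish fish cat");
        docs.put("d2.txt", "fish cat cat cat");
        build(docs).forEach((term, list) -> System.out.println(term + "\t" + list));
    }
}
```

For the two sample documents, `cat` maps to `3:d2.txt;1:d1.txt;` and `fish` maps to `2:d1.txt;1:d2.txt;`.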

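Approach (as reflected in the code that follows): the mapper counts each word within its input record and emits <word:count, docid>; a custom partitioner is meant to send all keys for the same word to the same reducer; the reducer relies on the keys arriving in sorted order and concatenates the count:docid entries of each word into one line.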

Code:

package proj;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.util.GenericOptionsParser;

public class InvertedIndexSortByFreq {

    // Mapper: for each input record, emits <word:count, docid>
    public static class InvertedIndexMapper extends
            Mapper<Object, Text, Text, Text> {

        private Text keyInfo = new Text();
        private Text valInfo = new Text();
        private FileSplit split;

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = value.toString().split(" ");

            // The input file name serves as the document id
            split = (FileSplit) context.getInputSplit();
            String docid = split.getPath().getName();

            // Count the occurrences of each token within this record
            Map<String, Integer> map = new HashMap<String, Integer>();
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue; // skip empty tokens produced by consecutive spaces
                }
                if (map.containsKey(token)) {
                    map.put(token, map.get(token) + 1);
                } else {
                    map.put(token, 1);
                }
            }

            // Emit one <word:count, docid> pair per distinct token
            for (Map.Entry<String, Integer> entry : map.entrySet()) {
                keyInfo.set(entry.getKey() + ":" + entry.getValue());
                valInfo.set(docid);
                context.write(keyInfo, valInfo);
            }
        }
    }

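    // Partitioner: hashes only the word part of the "word:count" key, so that
    // every key for the same word is sent to the same reducer. (Including the
    // docid in the hash would scatter one word across several reducers.)
    public static class InvertedIndexPartioner extends
            HashPartitioner<Text, Text> {

        private Text term = new Text();

        @Override
        public int getPartition(Text key, Text value, int numReduceTasks) {
            term.set(key.toString().split(":")[0]);
            return super.getPartition(term, value, numReduceTasks);
        }
    }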

    // Reducer: assembles the inverted-index entry for each word.
    // Keys arrive sorted, so all "word:count" keys of one word are consecutive;
    // the buffered list is written out when the word changes.
    // Note: Text keys are compared byte-wise by default, so the counts of a word
    // arrive in ascending string order unless a custom sort comparator is set.
    public static class InvertedIndexReducer extends
            Reducer<Text, Text, Text, Text> {

        private Text keyInfo = new Text();
        private Text valInfo = new Text();
        private String tPrev = null;
        private StringBuffer buff = new StringBuffer();

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String[] tokens = key.toString().split(":");
            String current = tokens[0];

            if (tPrev != null && !tPrev.equals(current)) {
                // A new word starts: emit the entry of the previous word
                keyInfo.set(tPrev);
                valInfo.set(buff.toString());
                context.write(keyInfo, valInfo);
                buff = new StringBuffer();
            }
            tPrev = current;

            // values can only be iterated once, so append them in a single pass
            for (Text val : values) {
                buff.append(tokens[1] + ":" + val.toString() + ";");
            }
        }

        @Override
        public void cleanup(Context context) throws IOException, InterruptedException {
            // Emit the last buffered word, if this reducer received any input
            if (tPrev != null) {
                keyInfo.set(tPrev);
                valInfo.set(buff.toString());
                context.write(keyInfo, valInfo);
            }
            super.cleanup(context);
        }
    }

    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: InvertedIndexSortByFreq <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "InvertedIndex");
        // Must reference this class so Hadoop can locate the job jar
        job.setJarByClass(InvertedIndexSortByFreq.class);

        job.setMapperClass(InvertedIndexMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setPartitionerClass(InvertedIndexPartioner.class);

        job.setReducerClass(InvertedIndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
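One caveat about the ordering: the job leaves key comparison to Hadoop's default for Text, which is byte-wise, so the keys of one word arrive in ascending string order ("fish:10" before "fish:2" before "fish:9") rather than by descending count. The sketch below shows, in plain Java, the comparison logic a custom sort comparator would need: word ascending, then count descending as a number. `TermCountOrder` is a hypothetical name and is not part of the original code. To use it in the job, the same logic would presumably have to be wrapped in a WritableComparator subclass and registered with `job.setSortComparatorClass(...)`, which is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: orders "word:count" keys by word ascending,
// then by count descending (compared numerically, not as strings).
class TermCountOrder implements Comparator<String> {

    @Override
    public int compare(String a, String b) {
        int ia = a.lastIndexOf(':');
        int ib = b.lastIndexOf(':');
        int byTerm = a.substring(0, ia).compareTo(b.substring(0, ib));
        if (byTerm != 0) {
            return byTerm;
        }
        int ca = Integer.parseInt(a.substring(ia + 1));
        int cb = Integer.parseInt(b.substring(ib + 1));
        return Integer.compare(cb, ca); // larger count first
    }

    // Returns a sorted copy, leaving the input list untouched.
    static List<String> sorted(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(new TermCountOrder());
        return copy;
    }

    public static void main(String[] args) {
        // prints [cat:1, fish:10, fish:9, fish:2]
        System.out.println(sorted(Arrays.asList("fish:2", "fish:10", "cat:1", "fish:9")));
    }
}
```

Because the word is still the primary sort field, all keys of one word stay consecutive, which is what the reducer's buffering logic depends on.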
