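I. A Brief Introduction to the Inverted Index

An inverted index, also called a postings file or inverted file, is an indexing method used in full-text search to store a mapping from each word to its locations in a document or a set of documents.

It is the most commonly used data structure in document retrieval systems.

Take English as an example. The texts to be indexed are: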

T0 = "it is what it is"

T1 = "what is it"

T2 = "it is a banana"

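From these we get the following inverted file index:

 "a":      {2}
 "banana": {2}
 "is":     {0, 1, 2}
 "it":     {0, 1, 2}
 "what":   {0, 1}

A query for the terms "what", "is", and "it" returns the intersection of their posting sets: {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}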
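
The example above can be reproduced with a few lines of plain Java. This is only a minimal in-memory sketch, not part of the MapReduce job: the class name InvertedIndexSketch and its methods are made up for illustration, and tokens are split on whitespace, which works for English but not for Chinese.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of the original program.
public class InvertedIndexSketch {

    // Build term -> sorted set of document ids; tokens are split on whitespace.
    public static Map<String, Set<Integer>> build(List<String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    // AND query: intersect the posting sets of all query terms.
    public static Set<Integer> search(Map<String, Set<Integer>> index, String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Set<Integer> postings = index.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("it is what it is", "what is it", "it is a banana");
        Map<String, Set<Integer>> index = build(docs);
        // prints {a=[2], banana=[2], is=[0, 1, 2], it=[0, 1, 2], what=[0, 1]}
        System.out.println(index);
        // prints [0, 1]
        System.out.println(search(index, "what", "is", "it"));
    }
}
```

Sorted maps and sets are used here only to make the printed output deterministic; a hash-based structure would serve equally well for lookups.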

For Chinese text, an open-source Chinese word segmenter can be used; this post uses ik-analyzer.

Prepare a few text files with some content for testing.

Contents of file1.txt:

其实我们发现，互联网裁员潮频现甚至要高于其它行业领域

Contents of file2.txt:

面对寒冬，互联网企业不得不调整人员结构，优化雇员的投入产出

Contents of file3.txt:

在互联网内部，因为内部竞争机制以及要与竞争对手拼进度

Contents of file4.txt:

互联网大公司职员尽管能够从复杂性和专业分工中受益

互联网企业不得不调整人员结构
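
If you would rather create the four sample files programmatically, something like the following should work. It is a sketch under assumptions: the class name SampleFiles is hypothetical, the files are written as UTF-8, and the target directory input/InvertIndex is taken from the input path used by the MapReduce program in this post.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original program.
public class SampleFiles {

    // File name -> lines, copied from the sample contents listed above.
    public static Map<String, List<String>> samples() {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("file1.txt", Arrays.asList("其实我们发现，互联网裁员潮频现甚至要高于其它行业领域"));
        files.put("file2.txt", Arrays.asList("面对寒冬，互联网企业不得不调整人员结构，优化雇员的投入产出"));
        files.put("file3.txt", Arrays.asList("在互联网内部，因为内部竞争机制以及要与竞争对手拼进度"));
        files.put("file4.txt", Arrays.asList(
                "互联网大公司职员尽管能够从复杂性和专业分工中受益",
                "互联网企业不得不调整人员结构"));
        return files;
    }

    // Write every sample file into dir as UTF-8, creating dir if needed.
    public static void writeTo(Path dir) throws IOException {
        Files.createDirectories(dir);
        for (Map.Entry<String, List<String>> entry : samples().entrySet()) {
            Files.write(dir.resolve(entry.getKey()), entry.getValue(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        writeTo(Paths.get("input", "InvertIndex"));
    }
}
```

Writing the files with an explicit charset avoids depending on the platform default encoding, which matters for Chinese text.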

II. Adding Dependencies

In addition to the core Hadoop JARs, add lucene-analyzers-common and ik-analyzers for Chinese word segmentation:



    <!-- Lucene analysis module -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-common</artifactId>
      <version>6.0.0</version>
    </dependency>

    <!-- IK analyzer -->
    <dependency>
      <groupId>cn.bestwu</groupId>
      <artifactId>ik-analyzers</artifactId>
      <version>5.1.0</version>
    </dependency>

III. The MapReduce Program

For configuring the IK analyzer with Lucene 6.0, see http://blog.csdn.net/napoay/article/details/51911875. The MapReduce program is as follows.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

/**
 * Created by bee on 4/4/17.
 */
public class InvertIndexIk {

    public static class InvertMapper extends Mapper<Object, Text, Text, Text> {

        // Reused across map() calls to avoid allocating objects per record
        private final Text word = new Text();
        private final Text fname = new Text();
        // IKAnalyzer6x is the custom analyzer class from the post referenced above (not shown here)
        private IKAnalyzer6x analyzer;

        @Override
        protected void setup(Context context) {
            // The flag is assumed to select IK's smart segmentation mode
            analyzer = new IKAnalyzer6x(true);
            // Name of the file this input split belongs to
            fname.set(((FileSplit) context.getInputSplit()).getPath().getName());
        }

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // The first argument is a field name; the original code passed the line itself
            try (TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(value.toString()))) {
                CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
                tokenStream.reset();
                // Emit one (term, file name) pair per token
                while (tokenStream.incrementToken()) {
                    word.set(termAttribute.toString());
                    context.write(word, fname);
                }
                tokenStream.end();
            }
        }

        @Override
        protected void cleanup(Context context) {
            analyzer.close();
        }
    }

    public static class InvertReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // File name -> number of occurrences of the term in that file
            Map<String, Integer> fileCounts = new HashMap<String, Integer>();
            // Total number of occurrences across all files
            int termFreq = 0;
            for (Text val : values) {
                fileCounts.merge(val.toString(), 1, Integer::sum);
                termFreq++;
            }
            context.write(key, new Text(fileCounts + "  " + termFreq));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Input and output paths are hard-coded for this example
        String inputPath = "input/InvertIndex";
        String outputPath = "output";

        // HadoopUtil is a helper class that is not shown in this post;
        // it is expected to remove the output directory left by a previous run
        HadoopUtil.deleteDir(outputPath);

        Configuration conf = new Configuration();
        // Pass the configuration to the job (the original created it but never used it)
        Job job = Job.getInstance(conf, "InvertIndexIk");
        job.setJarByClass(InvertIndexIk.class);
        job.setMapperClass(InvertIndexIk.InvertMapper.class);
        job.setReducerClass(InvertIndexIk.InvertReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

IV. Results

The output is as follows:

专业分工    {file4.txt=1}  1
中    {file4.txt=1}  1
其实    {file1.txt=1}  1
互联网    {file1.txt=1, file3.txt=1, file4.txt=2, file2.txt=1}  5
人员    {file4.txt=1, file2.txt=1}  2
企业    {file4.txt=1, file2.txt=1}  2
优化    {file2.txt=1}  1
内部    {file3.txt=2}  2
发现    {file1.txt=1}  1
受益    {file4.txt=1}  1
复杂性    {file4.txt=1}  1
大公司    {file4.txt=1}  1
寒冬    {file2.txt=1}  1
投入产出    {file2.txt=1}  1
拼    {file3.txt=1}  1
潮    {file1.txt=1}  1
现    {file1.txt=1}  1
竞争对手    {file3.txt=1}  1
竞争机制    {file3.txt=1}  1
结构    {file4.txt=1, file2.txt=1}  2
职员    {file4.txt=1}  1
行业    {file1.txt=1}  1
裁员    {file1.txt=1}  1
要与    {file3.txt=1}  1
调整    {file4.txt=1, file2.txt=1}  2
进度    {file3.txt=1}  1
雇员    {file2.txt=1}  1
面对    {file2.txt=1}  1
领域    {file1.txt=1}  1
频    {file1.txt=1}  1
高于    {file1.txt=1}  1

The output has three columns: the term, the term's frequency in each file, and its total frequency across all files.
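
To make the second and third columns concrete, here is a small Hadoop-free sketch of how one output line can be derived from the file names that reach the reducer for a single term. The class name PostingFormatSketch is hypothetical, and a TreeMap is assumed so that the file names print in sorted order; the job itself uses a HashMap, so the order in its output differs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of the original program.
public class PostingFormatSketch {

    // Count occurrences per file, sum them, and format both as "<per-file map>  <total>".
    public static String format(List<String> fileNames) {
        Map<String, Integer> perFile = new TreeMap<>();
        int total = 0;
        for (String name : fileNames) {
            perFile.merge(name, 1, Integer::sum);
            total++;
        }
        return perFile + "  " + total;
    }

    public static void main(String[] args) {
        // File names emitted by the mappers for the term 互联网 (arrival order is arbitrary)
        List<String> values = Arrays.asList(
                "file1.txt", "file4.txt", "file3.txt", "file4.txt", "file2.txt");
        // prints 互联网, a tab, then {file1.txt=1, file2.txt=1, file3.txt=1, file4.txt=2}  5
        System.out.println("互联网\t" + format(values));
    }
}
```

The total in the third column is simply the sum of the per-file counts in the second, so it can be accumulated in the same loop.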

V. References

1. https://zh.wikipedia.org/wiki/倒排索引

2. Using the IK Analyzer with Lucene 6.0 (Lucene 6.0下使用IK分词器)
