Inverted Index with MapReduce

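Indexes:

What is an index: an index is a data structure that helps a database retrieve data efficiently. An index is built on a database table. It holds the values of certain columns of that table together with the addresses of the corresponding records, and stores them in a data structure. The most common choices for that structure are hash tables and B+ trees.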
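To make the idea concrete, here is a minimal sketch of a hash-table index in plain Java. It is only an illustration, not how any particular database implements indexes: the class name HashIndexSketch, the helper buildIndex, and the sample rows are all made up for this example, and the "record address" is simply the row position in an array.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashIndexSketch {

    // Build an index on one column: column value -> positions of the matching rows.
    static Map<String, List<Integer>> buildIndex(String[][] table, int column) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int row = 0; row < table.length; row++) {
            index.computeIfAbsent(table[row][column], k -> new ArrayList<>()).add(row);
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical rows of the form {id, city}.
        String[][] table = {
            {"1", "Beijing"},
            {"2", "Shanghai"},
            {"3", "Beijing"},
        };
        Map<String, List<Integer>> cityIndex = buildIndex(table, 1);
        // The lookup goes straight to the matching rows instead of scanning the table.
        System.out.println(cityIndex.get("Beijing")); // prints [0, 2]
    }
}
```

A B+ tree index plays the same role but keeps the keys ordered, which is why it also supports range queries.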

A detailed analysis of indexes: https://blog.csdn.net/meiLin_Ya/article/details/80854232

Let the code do the talking. First, take a look at my data:

package com.huhu.day05;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class InvertedIndex extends ToolRunner implements Tool {

	private Configuration conf;

	public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

		private FileSplit split;
		private Text va = new Text();

		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			// Remember which input file this mapper is reading.
			split = (FileSplit) context.getInputSplit();
		}

		@Override
		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			// Split the line into words (assumes words are separated by single spaces).
			String[] line = value.toString().split(" ");
			// Debug output: the words of the current line.
			System.err.println(String.join(",", line));
			String filename = split.getPath().getName();
			for (String s : line) {
				// key.get() is the byte offset of the line in the file;
				// indexOf(s) is the position of the first occurrence of the word in the line.
				va.set("fileName:" + filename + ":" + key.get() + "\tindex position:" + value.toString().indexOf(s) + "\t");
				context.write(new Text("search term:" + s + "\r"), new Text(va));
			}
		}
	}

	public static class MyReduce extends Reducer<Text, Text, Text, Text> {

		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
		}

		@Override
		protected void reduce(Text key, Iterable<Text> values, Context context)
				throws IOException, InterruptedException {
			// Concatenate every posting emitted for this search term.
			StringBuilder sb = new StringBuilder();
			for (Text v : values) {
				sb.append(v.toString());
			}
			context.write(new Text(key), new Text(sb.toString()));
		}

		@Override
		protected void cleanup(Context context) throws IOException, InterruptedException {
		}
	}

	public static void main(String[] args) throws Exception {
		InvertedIndex t = new InvertedIndex();
		Configuration conf = t.getConf();
		String[] other = new GenericOptionsParser(conf, args).getRemainingArgs();
		// Only a warning: the input and output paths are hard-coded in run().
		if (other.length != 2) {
			System.err.println("number is fail");
		}
		int run = ToolRunner.run(conf, t, args);
		System.exit(run);
	}

	@Override
	public Configuration getConf() {
		if (conf == null) {
			conf = new Configuration();
		}
		return conf;
	}

	@Override
	public void setConf(Configuration conf) {
		// Keep the configuration handed over by ToolRunner so that run() sees the parsed options.
		this.conf = conf;
	}

	@Override
	public int run(String[] other) throws Exception {
		Configuration con = getConf();
		Job job = Job.getInstance(con);
		// Use this class, not a class from another exercise, to locate the job jar.
		job.setJarByClass(InvertedIndex.class);
		job.setMapperClass(MyMapper.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);

		// Default partitioner
		// job.setPartitionerClass(HashPartitioner.class);

		job.setReducerClass(MyReduce.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);

		FileInputFormat.addInputPath(job, new Path("hdfs://ry-hadoop1:8020/in/day05/InvertedIndex"));
		Path path = new Path("hdfs://ry-hadoop1:8020/out/day05.txt");
		// Delete the output directory if it already exists, otherwise the job refuses to start.
		FileSystem fs = FileSystem.get(getConf());
		if (fs.exists(path)) {
			fs.delete(path, true);
		}
		FileOutputFormat.setOutputPath(job, path);
		return job.waitForCompletion(true) ? 0 : 1;
	}
}
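
To see what the job computes without a cluster, the map and reduce steps can be imitated in plain Java. This is only a rough sketch under my own assumptions, not the Hadoop job itself: the class LocalInvertedIndex, the methods mapPhase and reducePhase, and the two sample files are invented for illustration, and a posting here is fileName:lineNumber:position, whereas the job above records the byte offset of the line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalInvertedIndex {

    // "Map" plus "shuffle": emit one posting per word and group the postings by word.
    // A TreeMap keeps the terms sorted, much like keys reach a reducer in sorted order.
    static Map<String, List<String>> mapPhase(Map<String, String[]> files) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String[]> file : files.entrySet()) {
            String[] lines = file.getValue();
            for (int lineNo = 0; lineNo < lines.length; lineNo++) {
                for (String word : lines[lineNo].split(" ")) {
                    // indexOf returns the first occurrence only, same as in the mapper.
                    String posting = file.getKey() + ":" + lineNo + ":" + lines[lineNo].indexOf(word);
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(posting);
                }
            }
        }
        return grouped;
    }

    // "Reduce": join all postings of one term into a single output value.
    static Map<String, String> reducePhase(Map<String, List<String>> grouped) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            result.put(e.getKey(), String.join(" ", e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input files, not the data used in this post.
        Map<String, String[]> files = new LinkedHashMap<>();
        files.put("a.txt", new String[] {"hello hadoop", "hello index"});
        files.put("b.txt", new String[] {"hadoop index"});

        reducePhase(mapPhase(files))
                .forEach((term, postings) -> System.out.println(term + "\t" + postings));
    }
}
```

With these sample files, the term hadoop ends up with the postings a.txt:0:6 and b.txt:0:0. That is the whole point of an inverted index: you look up a word and get back the places where it occurs.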

Indexes matter:

Details: https://blog.csdn.net/meiLin_Ya/article/details/80854232
