Inverted Index with MapReduce

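Indexes:

What is an index: An index is a data structure that helps a database retrieve data efficiently. It is built on a database table: it holds the values of certain columns of that table together with the addresses of the corresponding records, and stores them in a data structure. The most common index structures are hash tables and B+ trees.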
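To make the "column value plus record address" idea concrete, here is a minimal sketch of a hash-table index in plain Java. The table contents, the class name HashIndexSketch and the use of a list position as the "record address" are all assumptions made for illustration; a real database index is far more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashIndexSketch {

    // Hypothetical table: each row is {id, city}.
    // The position in the list stands in for the record address.
    static final List<String[]> ROWS = Arrays.asList(
            new String[] {"1", "Beijing"},
            new String[] {"2", "Shanghai"},
            new String[] {"3", "Beijing"});

    // Build an index on one column: column value -> addresses of the rows holding it.
    static Map<String, List<Integer>> buildIndex(List<String[]> rows, int column) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int address = 0; address < rows.size(); address++) {
            index.computeIfAbsent(rows.get(address)[column], k -> new ArrayList<>()).add(address);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> cityIndex = buildIndex(ROWS, 1);
        // One hash lookup replaces a scan over every row.
        System.out.println(cityIndex.get("Beijing")); // prints [0, 2]
    }
}
```

A hash table gives fast equality lookups; a B+ tree additionally keeps the keys ordered, which is why databases usually prefer it for range queries.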

A detailed analysis of indexes: https://blog.csdn.net/meiLin_Ya/article/details/80854232

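Let the code do the talking. The input data is read from /in/day05/InvertedIndex on HDFS, and the job is as follows: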

package com.huhu.day05;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Extends Configured (not ToolRunner) so that getConf()/setConf() work as Tool expects.
public class InvertedIndex extends Configured implements Tool {

	public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

		private FileSplit split;
		private Text va = new Text();

		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			// The input split tells us which file this mapper is reading.
			split = (FileSplit) context.getInputSplit();
		}

		@Override
		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			String[] line = value.toString().split(" ");
			String filename = split.getPath().getName();
			for (String s : line) {
				// key.get() is the byte offset of the line within the file (default TextInputFormat);
				// indexOf(s) is the position of the word's first occurrence within the line.
				va.set("fileName: " + filename + ":" + key.get() + "\tindex position: " + value.toString().indexOf(s) + "\t");
				context.write(new Text("search term: " + s + "\r"), new Text(va));
			}
		}
	}

	public static class MyReduce extends Reducer<Text, Text, Text, Text> {

		@Override
		protected void reduce(Text key, Iterable<Text> values, Context context)
				throws IOException, InterruptedException {
			// Concatenate every posting emitted for this search term.
			StringBuilder sb = new StringBuilder();
			for (Text v : values) {
				sb.append(v.toString());
			}
			context.write(new Text(key), new Text(sb.toString()));
		}
	}

	public static void main(String[] args) throws Exception {
		InvertedIndex t = new InvertedIndex();
		Configuration conf = new Configuration();
		String[] other = new GenericOptionsParser(conf, args).getRemainingArgs();
		if (other.length != 2) {
			// Only a warning: the input and output paths are hard-coded in run().
			System.err.println("expected 2 arguments, got " + other.length);
		}
		int run = ToolRunner.run(conf, t, args);
		System.exit(run);
	}

	@Override
	public int run(String[] other) throws Exception {
		Configuration con = getConf();
		Job job = Job.getInstance(con);
		// Was ProgenyCount.class, which belongs to a different job.
		job.setJarByClass(InvertedIndex.class);
		job.setMapperClass(MyMapper.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);
		// Default partitioner
		// job.setPartitionerClass(HashPartitioner.class);
		job.setReducerClass(MyReduce.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		FileInputFormat.addInputPath(job, new Path("hdfs://ry-hadoop1:8020/in/day05/InvertedIndex"));
		Path path = new Path("hdfs://ry-hadoop1:8020/out/day05.txt");
		// Resolve the file system from the path itself so the check also works
		// when fs.defaultFS does not point at this cluster.
		FileSystem fs = path.getFileSystem(con);
		if (fs.exists(path)) {
			// The job fails if the output directory already exists, so remove it first.
			fs.delete(path, true);
		}
		FileOutputFormat.setOutputPath(job, path);
		return job.waitForCompletion(true) ? 0 : 1;
	}
}
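To see what the job computes without a Hadoop cluster, the same map and reduce logic can be sketched in plain Java. This is only an illustration under assumptions: the class LocalInvertedIndexSketch, the file names a.txt and b.txt and their contents are invented here, each file is treated as a single line, and empty tokens are skipped (the job above does not skip them).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalInvertedIndexSketch {

    // "Map" step: emit (word, "file:position") for every word of one line.
    static List<String[]> map(String fileName, String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (word.isEmpty()) {
                continue; // assumption: ignore empty tokens caused by repeated spaces
            }
            pairs.add(new String[] {word, fileName + ":" + line.indexOf(word)});
        }
        return pairs;
    }

    // "Shuffle + reduce" step: group the emitted postings by word.
    static Map<String, List<String>> build(Map<String, String> files) {
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String[] pair : map(file.getKey(), file.getValue())) {
                index.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical input: file name -> single line of text.
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.txt", "hello hadoop");
        files.put("b.txt", "hello mapreduce");
        build(files).forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

With this sample input, "hello" maps to both files while "hadoop" and "mapreduce" map to one file each, which is the word-to-documents direction that makes the index "inverted".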

Indexes matter:

Details: https://blog.csdn.net/meiLin_Ya/article/details/80854232
