MapReduce: A Simple Full-Text Search

The previous post built an inverted index, so why not try full-text search on top of it? Suppose we have this index:

Hello     file3.txt:1;
MapReduce     file3.txt:2;fil1.txt:1;fil2.txt:1;
bye     file3.txt:1;
is     fil1.txt:1;fil2.txt:2;
powerful     fil2.txt:1;
simple     fil2.txt:1;fil1.txt:1;

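Now suppose I search for MapReduce is simple. Files 1 and 2 (fil1.txt and fil2.txt in the index) contain all three words. The basic idea is to look up each word of MapReduce is simple in the index, one by one, for example: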

MapReduce 3,1,2

is 1,2

simple 2,1
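The lookup step above can be sketched in plain Java, without Hadoop. This is a minimal illustration that assumes each index line has the form word, whitespace, then file:count pairs separated by semicolons; the class name IndexLineParser and the method filesFor are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexLineParser {

    /**
     * Returns the files listed on one index line if the line belongs to
     * the given word, or an empty list otherwise.
     */
    static List<String> filesFor(String indexLine, String word) {
        List<String> files = new ArrayList<>();
        // Split into "word" and "file:count;file:count;" on any run of whitespace.
        String[] parts = indexLine.trim().split("\\s+");
        if (parts.length < 2 || !parts[0].equals(word)) {
            return files;
        }
        for (String posting : parts[1].split(";")) {
            if (!posting.isEmpty()) {
                files.add(posting.split(":")[0]); // drop the ":count" suffix
            }
        }
        return files;
    }

    public static void main(String[] args) {
        // prints [fil1.txt, fil2.txt]
        System.out.println(filesFor("is     fil1.txt:1;fil2.txt:2;", "is"));
    }
}
```

Splitting on a run of whitespace rather than a single space keeps the sketch working whether the index separates the word from its postings with a tab or with several spaces.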

Next, emit each file as the key and the word as the value. Grouped by key, this gives:

1 MapReduce is simple

2 MapReduce is simple

3 MapReduce

Then, in Reduce, count the words among the values of each key. If the count reaches 3 (the number of words in the query), the file is a match.
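The group-and-count idea can be tried out in memory before moving to Hadoop. The sketch below is only an illustration under stated assumptions: the index is a plain Map from word to file IDs (the same 1, 2, 3 used in the example above), and the names MatchCounter and matchingFiles are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MatchCounter {

    /** Returns the files that contain every word of the query. */
    static List<String> matchingFiles(Map<String, List<String>> index, List<String> query) {
        // "Map" step: for each query word, emit (file, 1) and sum per file.
        Map<String, Integer> hits = new TreeMap<>();
        for (String word : query) {
            for (String file : index.getOrDefault(word, Collections.<String>emptyList())) {
                hits.merge(file, 1, Integer::sum);
            }
        }
        // "Reduce" step: keep files whose count reaches the number of query words.
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            if (e.getValue() >= query.size()) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> index = new HashMap<>();
        index.put("MapReduce", Arrays.asList("3", "1", "2"));
        index.put("is", Arrays.asList("1", "2"));
        index.put("simple", Arrays.asList("2", "1"));
        // prints [1, 2]
        System.out.println(matchingFiles(index, Arrays.asList("MapReduce", "is", "simple")));
    }
}
```

Note that this counting scheme assumes the query words are distinct; a query that repeats a word would need its duplicates removed first.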

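The main technique here is how map and reduce get hold of the command-line arguments. In the main class, you can obtain the arguments with

String[] pathargs = new GenericOptionsParser(conf, args).getRemainingArgs();

But how do you pass them on to map and reduce? There are three ways to do this; I only looked at one, because it seemed to be enough.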

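We write the parameters into the Configuration instance in the main class with conf.set(key, value), where both key and value are Strings. One thing to remember: this call must come before Job.getInstance(conf). The Job takes its own copy of the configuration when it is created, so anything set on conf afterwards is not seen by the job. Then, inside map or reduce, call

Configuration conf = context.getConfiguration();

to obtain the job's configuration, after which you can read the values with conf.get(key).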
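The key layout used for the search words can be seen in isolation in the sketch below. It is not Hadoop code: a plain Map<String, String> stands in for Hadoop's Configuration so that the example runs without a cluster, and the names QueryArgsCodec, encode and decode are hypothetical. The keys "argsnum" and "args" + i, and the convention that the search words start at index 2 (after the input and output paths), follow the listing in this post.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class QueryArgsCodec {

    /** Driver side: store the argument count and each search word under a String key. */
    static Map<String, String> encode(String[] pathargs) {
        Map<String, String> conf = new HashMap<>(); // stand-in for Configuration
        conf.put("argsnum", Integer.toString(pathargs.length));
        for (int i = 2; i < pathargs.length; i++) {
            conf.put("args" + i, pathargs[i]);
        }
        return conf;
    }

    /** Task side: read the count back, then fetch the words stored from index 2 on. */
    static List<String> decode(Map<String, String> conf) {
        int argsnum = Integer.parseInt(conf.get("argsnum"));
        List<String> words = new ArrayList<>();
        for (int i = 2; i < argsnum; i++) {
            words.add(conf.get("args" + i));
        }
        return words;
    }

    public static void main(String[] args) {
        String[] pathargs = {"in", "out", "MapReduce", "is", "simple"};
        // prints [MapReduce, is, simple]
        System.out.println(decode(encode(pathargs)));
    }
}
```

Because every value is stored as a String, the count has to be converted with Integer.toString on the way in and Integer.parseInt on the way out.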

The full code is as follows:

// Find.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Find {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] pathargs = new GenericOptionsParser(conf, args).getRemainingArgs();

        // Expected arguments: <index dir> <output dir> <word> [<word> ...]
        if (pathargs.length < 3) {
            System.err.println("Usage: Find <in> <out> <word> [<word> ...]");
            System.exit(2);
        }

        // Store the search words in the configuration BEFORE creating the Job.
        conf.set("argsnum", Integer.toString(pathargs.length));
        for (int i = 2; i < pathargs.length; i++) {
            conf.set("args" + i, pathargs[i]);
            System.out.println(pathargs[i]);
        }

        Job job = Job.getInstance(conf, "JobName");
        job.setJarByClass(Find.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input and output are DIRECTORIES (not files).
        FileInputFormat.setInputPaths(job, new Path(pathargs[0]));
        FileOutputFormat.setOutputPath(job, new Path(pathargs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

// MyMapper.java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        // Read the search words that the main class stored in the configuration.
        Configuration conf = context.getConfiguration();
        int argsnum = Integer.parseInt(conf.get("argsnum"));
        List<String> content = new ArrayList<String>();
        for (int i = 2; i < argsnum; i++) {
            content.add(conf.get("args" + i));
        }

        // An index line looks like: word<whitespace>file:count;file:count;
        // Split on any run of whitespace; a single-space split breaks on tabs or repeated spaces.
        String[] parts = ivalue.toString().trim().split("\\s+");
        if (parts.length < 2) {
            return; // skip blank or malformed lines
        }
        String key = parts[0];
        if (!content.contains(key)) {
            return; // this word is not part of the query
        }

        // Emit (file, word) for every file that contains the word.
        for (String posting : parts[1].split(";")) {
            if (!posting.isEmpty()) {
                String file = posting.split(":")[0];
                context.write(new Text(file), new Text(key));
            }
        }
    }
}

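// MyReducer.java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text _key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int argsnum = Integer.parseInt(conf.get("argsnum"));

        // Rebuild the query string from the stored search words.
        StringBuilder query = new StringBuilder();
        for (int i = 2; i < argsnum; i++) {
            query.append(conf.get("args" + i)).append(" ");
        }

        // Count how many query words were found in this file.
        int sum = 0;
        for (Text val : values) {
            sum++;
        }

        // The file matches when every query word appears in it.
        if (sum >= argsnum - 2) {
            context.write(new Text(query.toString()), _key);
        }
    }
}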
