[Lucene 4.8 Tutorial, Part 1] Basic Indexing and Searching with Lucene 4.8

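Text processing in Lucene can be roughly divided into three parts:

1. Indexing: extract and analyze the document content, then build the index.

2. Searching: search the index and return the results that match the search keywords.

3. Analysis: analyze the search terms and generate a Query object.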

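Note: in practice, every kind of search except the most basic exact-match search requires an analysis step before searching.

Without the analysis step, a search for JAVA returns no results: all terms were converted to lowercase during indexing, while an unanalyzed search requires the keyword to match the indexed term exactly.

Once the QueryParser class is used, the search terms are analyzed according to the concrete Analyzer implementation. This covers, for example, case conversion and the interpretation of query expressions such as java AND ant (Boolean operators are written in uppercase in the query syntax).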
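To make the case mismatch concrete, here is a minimal sketch in plain Java. It is a toy model and not Lucene code: the class ToyIndex, its methods, and the splitting rule are all hypothetical names and assumptions made for illustration. It only shows why an unanalyzed lookup for JAVA misses terms that were lowercased at indexing time, while a lookup that applies the same analysis finds them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: this is NOT Lucene code, just an illustration of the idea.
class ToyIndex {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Hypothetical "analyzer": split on non-letter/non-digit characters, lowercase each token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Indexing: every analyzed token of the text points back to the document id.
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Exact lookup without analysis: the term must equal an indexed term.
    Set<Integer> exactLookup(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptySet());
    }

    // Lookup that first runs the query text through the same analysis as indexing.
    Set<Integer> analyzedLookup(String queryText) {
        Set<Integer> result = new TreeSet<>();
        for (String term : analyze(queryText)) {
            result.addAll(exactLookup(term));
        }
        return result;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(0, "Java and Ant build tools");
        index.add(1, "Lucene is written in JAVA");
        System.out.println(index.exactLookup("JAVA"));    // prints []
        System.out.println(index.analyzedLookup("JAVA")); // prints [0, 1]
    }
}
```

The point of the sketch is only that the index and the query must go through compatible analysis; a real analyzer does considerably more than this splitting rule.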

I. Indexing Files

The basic procedure is as follows:

1. Create the index writer (IndexWriter).

2. Create a Document for each file.

3. Write the document to the index.

package com.ljh.search.index;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// 1. Create the index writer (IndexWriter)
// 2. Create a Document for each file
// 3. Write the document to the index
public class IndexFiles {

	public static void main(String[] args) throws IOException {
		String usage = "java IndexFiles"
				+ " -index INDEX_PATH -docs DOCS_PATH \n\n"
				+ "This indexes the documents in DOCS_PATH, creating a Lucene index "
				+ "in INDEX_PATH that can be searched with Searcher";
		String indexPath = null;
		String docsPath = null;
		// Stop one short of the end so that args[i + 1] always exists
		for (int i = 0; i < args.length - 1; i++) {
			if ("-index".equals(args[i])) {
				indexPath = args[i + 1];
				i++;
			} else if ("-docs".equals(args[i])) {
				docsPath = args[i + 1];
				i++;
			}
		}
		// Both paths are required; a missing -index would otherwise fail later with a NullPointerException
		if (indexPath == null || docsPath == null) {
			System.err.println("Usage: " + usage);
			System.exit(1);
		}

		final File docDir = new File(docsPath);
		if (!docDir.exists() || !docDir.canRead()) {
			System.err.println("Document directory '"
					+ docDir.getAbsolutePath()
					+ "' does not exist or is not readable, please check the path");
			System.exit(1);
		}

		// 1. Create the IndexWriter; try-with-resources closes it even if indexing fails
		try (IndexWriter writer = getIndexWriter(indexPath)) {
			index(writer, docDir);
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

	private static IndexWriter getIndexWriter(String indexPath)
			throws IOException {
		Directory indexDir = FSDirectory.open(new File(indexPath));
		IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48,
				new StandardAnalyzer(Version.LUCENE_48));
		// Default open mode is CREATE_OR_APPEND: running again adds to an existing index
		return new IndexWriter(indexDir, iwc);
	}

	// Recursively indexes a file, or every file below a directory
	private static void index(IndexWriter writer, File file) throws IOException {
		if (file.isDirectory()) {
			String[] files = file.list();
			if (files != null) {
				for (int i = 0; i < files.length; i++) {
					index(writer, new File(file, files[i]));
				}
			}
		} else {
			// 2. Create a Document for the file
			Document doc = new Document();
			// Path: indexed as a single term and stored, so it can be read back from a hit
			Field pathField = new StringField("path", file.getPath(),
					Field.Store.YES);
			doc.add(pathField);
			// Last-modified time: indexed but not stored
			doc.add(new LongField("modified", file.lastModified(),
					Field.Store.NO));
			// Contents: tokenized by the analyzer, not stored.
			// FileReader uses the platform default charset; it is closed after the document is added.
			try (FileReader reader = new FileReader(file)) {
				doc.add(new TextField("contents", reader));
				System.out.println("Indexing " + file.getName());
				// 3. Write the document to the index
				writer.addDocument(doc);
			}
		}
	}
}

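(1) Run the program with "java com.ljh.search.index.IndexFiles -index d:/index -docs d:/tmp" (with the Lucene jars on the classpath; the class name is case-sensitive). This indexes the files in d:/tmp and writes the index files to d:/index.

(2) The generated index files can be inspected with Luke. Luke has since been moved to GitHub.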

II. Searching Files

1. Open the index with an IndexSearcher.
2. Search by keyword.
3. Iterate over the results and process them.

package com.ljh.search.search;

// 1. Open the index (IndexSearcher)
// 2. Search by keyword
// 3. Iterate over the results and process them

import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class Searcher {

	public static void main(String[] args) throws IOException {
		String indexPath = null;
		String term = null;
		// Stop one short of the end so that args[i + 1] always exists
		for (int i = 0; i < args.length - 1; i++) {
			if ("-index".equals(args[i])) {
				indexPath = args[i + 1];
				i++;
			} else if ("-term".equals(args[i])) {
				term = args[i + 1];
				i++;
			}
		}
		// Both arguments are required
		if (indexPath == null || term == null) {
			System.err.println("Usage: java Searcher -index INDEX_PATH -term TERM");
			System.exit(1);
		}
		System.out.println("Searching " + term + " in " + indexPath);

		// 1. Open the index; try-with-resources closes the reader even if the search fails
		Directory indexDir = FSDirectory.open(new File(indexPath));
		try (IndexReader ir = DirectoryReader.open(indexDir)) {
			IndexSearcher searcher = new IndexSearcher(ir);

			// 2. Search by keyword
			// TermQuery matches the indexed term exactly; the keyword is not analyzed
			TopDocs docs = searcher.search(
					new TermQuery(new Term("contents", term)), 20);

			// 3. Iterate over the results and process them
			ScoreDoc[] hits = docs.scoreDocs;
			System.out.println(hits.length);
			for (ScoreDoc hit : hits) {
				System.out.println("doc: " + hit.doc + " score: " + hit.score);
			}
		}
	}
}

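III. Analysis

In practice, every kind of search except the most basic exact-match search requires an analysis step before searching.

Without the analysis step, a search for JAVA returns no results: all terms were converted to lowercase during indexing, while the TermQuery used above requires the keyword to match the indexed term exactly.

Once the QueryParser class is used, the search terms are analyzed according to the concrete Analyzer implementation. This covers, for example, case conversion and the interpretation of query expressions such as java AND ant (Boolean operators are written in uppercase in the query syntax).

Analysis consists of two basic steps:

1. Create a QueryParser object.

2. Call QueryParser.parse() to generate a Query object.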

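Concretely, replace the following code:

		// 2. Search by keyword
		TopDocs docs = searcher.search(
				new TermQuery(new Term("contents", term)), 20);

with the following:

		// 2. Search by keyword
		/*TopDocs docs = searcher.search(
				new TermQuery(new Term("contents", term)), 20);*/
		// Note: the index above was built with StandardAnalyzer; using the same
		// analyzer at index time and at query time is generally the safer choice.
		QueryParser parser = new QueryParser(Version.LUCENE_48, "contents",
				new SimpleAnalyzer(Version.LUCENE_48));
		Query query;
		try {
			query = parser.parse(term);
		} catch (ParseException e) {
			// Fail fast: searching with a null query would throw a NullPointerException
			throw new IllegalArgumentException("Cannot parse query: " + term, e);
		}
		TopDocs docs = searcher.search(query, 30);

The replacement code needs these additional imports: org.apache.lucene.analysis.core.SimpleAnalyzer, org.apache.lucene.queryparser.classic.ParseException, org.apache.lucene.queryparser.classic.QueryParser, org.apache.lucene.search.Query and org.apache.lucene.util.Version. QueryParser lives in the separate lucene-queryparser module, which must be on the classpath.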
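The two steps, parsing the text into a query and then running that query, can be pictured with a small plain-Java sketch. This is a toy model and not Lucene's QueryParser: the class ToyAndQuery, the hand-built postings map, and the mini-syntax (terms joined by an uppercase AND, each term lowercased) are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: this is NOT Lucene's QueryParser.
class ToyAndQuery {

    // Step 1 (parse): hypothetical mini-syntax, terms separated by an uppercase AND.
    // Each term is lowercased, mimicking a lowercasing analyzer.
    static List<String> parse(String queryText) {
        List<String> terms = new ArrayList<>();
        for (String part : queryText.trim().split("\\s+AND\\s+")) {
            if (!part.isEmpty()) {
                terms.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Step 2 (search): a document matches only if it contains every parsed term.
    static Set<Integer> search(Map<String, Set<Integer>> postings, String queryText) {
        Set<Integer> result = null;
        for (String term : parse(queryText)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term: start from its documents
            } else {
                result.retainAll(docs);         // later terms: keep the intersection
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    // Hand-built postings (term -> document ids), standing in for a real index.
    static Map<String, Set<Integer>> samplePostings() {
        Map<String, Set<Integer>> postings = new HashMap<>();
        postings.put("java", new TreeSet<>(Arrays.asList(0, 1, 2)));
        postings.put("ant", new TreeSet<>(Arrays.asList(1, 2, 3)));
        return postings;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> postings = samplePostings();
        System.out.println(search(postings, "JAVA"));         // prints [0, 1, 2]
        System.out.println(search(postings, "Java AND Ant")); // prints [1, 2]
    }
}
```

In the real code above, QueryParser.parse() plays the role of parse() and IndexSearcher.search() the role of search(); the actual query syntax and scoring are far richer than this sketch suggests.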
