[Lucene 4.8 Tutorial, Part 2] Indexing

I. Basics

0. What the official documentation says

(1) org.apache.lucene.index provides two primary classes:
IndexWriter, which creates and adds documents to indices; and
IndexReader, which accesses the data in the index.

(2) The two main packages involved are:

org.apache.lucene.index: Code to maintain and access indices.

org.apache.lucene.document: The logical representation of a Document for indexing and searching.

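1. The important classes involved in creating an index are:

(1) IndexWriter: the core component of the indexing process. It creates a new index or opens an existing one, and adds, deletes, and updates the documents in the index.

(2) Document: a collection of fields.

(3) Field and its subclasses: a single field, such as the document's creation time, author, or content.

(4) Analyzer: the analyzer, which turns field text into the terms that are indexed.

(5) Directory: describes where the Lucene index is stored.

2. The basic steps for indexing documents are:

(1) Create the IndexWriter for the index

(2) Create a Document from each file

(3) Write the document to the index

The basic program is as follows: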

package org.jediael.search.index;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.jediael.util.LoadProperties;

// 1. Create the IndexWriter
// 2. Create a Document from each file
// 3. Write the document to the index
public class IndexFiles {

	private IndexWriter writer = null;

	public void indexAllFileinDirectory(String indexPath, String docsPath)
			throws IOException {
		// Location of the files to index. If the argument is null,
		// use the default set in search.properties.
		if (docsPath == null) {
			docsPath = LoadProperties.getProperties("docsDir");
		}
		final File docDir = new File(docsPath);
		if (!docDir.exists() || !docDir.canRead()) {
			System.out.println("Document directory '"
					+ docDir.getAbsolutePath()
					+ "' does not exist or is not readable, please check the path");
			System.exit(1);
		}

		// Location of the index files. If the argument is null,
		// use the default set in search.properties.
		if (indexPath == null) {
			indexPath = LoadProperties.getProperties("indexDir");
		}
		final File indexDir = new File(indexPath);
		if (!indexDir.exists() || !indexDir.canRead()) {
			System.out.println("Index directory '"
					+ indexDir.getAbsolutePath()
					+ "' does not exist or is not readable, please check the path");
			System.exit(1);
		}

		try {
			// 1. Create the IndexWriter
			if (writer == null) {
				initialIndexWriter(indexDir);
			}
			index(writer, docDir);
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			// writer is still null if initialIndexWriter() failed
			if (writer != null) {
				writer.close();
				writer = null; // a closed writer cannot be reused
			}
		}
	}

	// Simple lazy initialization that keeps a single IndexWriter.
	// Not thread-safe; needs further work.
	private void initialIndexWriter(File indexDir) throws IOException {
		Directory returnIndexDir = FSDirectory.open(indexDir);
		IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48,
				new StandardAnalyzer(Version.LUCENE_48));
		writer = new IndexWriter(returnIndexDir, iwc);
	}

	// Recursively indexes a file, or every file under a directory.
	private void index(IndexWriter writer, File filetoIndex) throws IOException {
		if (filetoIndex.isDirectory()) {
			String[] files = filetoIndex.list();
			if (files != null) {
				for (int i = 0; i < files.length; i++) {
					index(writer, new File(filetoIndex, files[i]));
				}
			}
		} else {
			// 2. Create a Document from the file (consider whether the
			// Document object could be reused instead of created each time)
			Document doc = new Document();
			Field pathField = new StringField("path", filetoIndex.getPath(),
					Field.Store.YES);
			doc.add(pathField);
			doc.add(new LongField("modified", filetoIndex.lastModified(),
					Field.Store.YES));
			doc.add(new StringField("title", filetoIndex.getName(), Field.Store.YES));
			// FileReader decodes with the platform default charset.
			// The reader is consumed by addDocument(), so close it afterwards.
			try (Reader reader = new FileReader(filetoIndex)) {
				doc.add(new TextField("contents", reader));
				// 3. Write the document to the index
				writer.addDocument(doc);
			}
		}
	}
}

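Some notes:

(1) The code uses the simplest form of the singleton pattern (lazy initialization) to keep a single IndexWriter. It is not thread-safe and needs further work.

(2) Creating IndexWriter, IndexReader, and similar instances is expensive, so unless there is a specific reason to do otherwise, create one instance and reuse it.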
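One possible way to make the lazy initialization from note (1) thread-safe is double-checked locking on a volatile field. The sketch below is plain Java and makes no Lucene calls. LazyHolder and its factory parameter are hypothetical names introduced here for illustration, and the code assumes Java 8 or later for java.util.function.Supplier. In the indexing program, the factory would be the code that builds the IndexWriter.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of Lucene): creates a value at most once,
 * even when get() is called from several threads at the same time.
 */
final class LazyHolder<T> {
    private final Supplier<T> factory;
    // volatile so that a fully constructed instance is visible to all threads
    private volatile T instance;

    LazyHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    T get() {
        T result = instance;
        if (result == null) {              // first check, without locking
            synchronized (this) {
                result = instance;
                if (result == null) {      // second check, under the lock
                    result = factory.get();
                    instance = result;
                }
            }
        }
        return result;
    }
}
```

A holder like this would take the place of the writer == null check: every thread calls get() and receives the same instance. IndexWriter itself is safe to share between threads, which is what makes a single shared instance practical.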

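3. The relationship between the index, Document, and Field

In short, multiple Fields make up a Document, and multiple Documents make up an index.

They are connected through the following calls: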

Document doc = new Document();

Field pathField = new StringField("path", filetoIndex.getPath(),Field.Store.YES);

doc.add(pathField);

writer.addDocument(doc);

II. Field

(A) Basic ways to create a field

1. Before Lucene 4.x, a Field was created like this:

Field field = new Field("filename", f.getName(),  Field.Store.YES, Field.Index.NOT_ANALYZED);

Field field = new Field("contents", new FileReader(f));

Field field = new Field("fullpath", f.getCanonicalPath(), Field.Store.YES, Field.Index.NOT_ANALYZED);

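The four arguments to Field are:

the field name

the field value

whether to store the value

whether to analyze the value. File names, URLs, file paths, and similar content do not need to be analyzed.

2. From Lucene 4 on, many concrete Field subclasses are defined. Pick the one that fits the need and use it directly, instead of creating fields through the generic Field class.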

Direct Known Subclasses:

BinaryDocValuesField, DoubleField, FloatField, IntField, LongField, NumericDocValuesField, SortedDocValuesField, SortedSetDocValuesField, StoredField, StringField, TextField

For example, fields like the three above can now be written as:

Field field = new StringField("path", filetoIndex.getPath(),Field.Store.YES);

Field field = new LongField("modified", filetoIndex.lastModified(),Field.Store.NO);

Field field = new TextField("contents", new FileReader(filetoIndex));

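From 4.x on, StringField is NOT_ANALYZED (the field value is not split up by an analyzer), while TextField is ANALYZED, so this property no longer has to be specified when creating a Field. See http://stackoverflow.com/questions/19042587/how-to-prevent-a-field-from-not-analyzing-in-lucene

In other words, every Field subclass has its own defaults for whether it is INDEXED and ANALYZED, and these no longer need to be given explicitly.

From the official documentation: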

StringField: A field that is indexed but not tokenized: the entire String value is indexed as a single token. For example this might be used for a 'country' field or an 'id' field, or any field that you intend to use for sorting or access through the field cache.

TextField: A field that is indexed and tokenized, without term vectors. For example this would be used on a 'body' field, that contains the bulk of a document's text.
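To make the "indexed but not tokenized" distinction concrete, here is a toy illustration in plain Java. It makes no Lucene calls: TokenizationDemo and both methods are hypothetical, and splitting on whitespace plus lowercasing is only an assumed stand-in for what an analyzer does, not the actual behavior of StandardAnalyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Toy illustration only; this is not how Lucene's analyzers are implemented. */
final class TokenizationDemo {

    /** StringField-like behavior: the whole value becomes exactly one term. */
    static List<String> asSingleTerm(String value) {
        return Collections.singletonList(value);
    }

    /** TextField-like behavior: the value is split into several lowercase terms. */
    static List<String> asTokens(String value) {
        List<String> terms = new ArrayList<String>();
        for (String part : value.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                terms.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }
}
```

With these helpers, asSingleTerm("New Zealand") yields the single term New Zealand, so only a lookup for the exact value would match, whereas asTokens("New Zealand") yields the two terms new and zealand, so a lookup for either word would match.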

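(B) Some Field options

1. Field.Store.YES/NO

When creating a Field, you pass an argument that specifies whether its value is stored in the index. Stored values can be returned with the search results and shown to the user.

The most visible difference between the two is whether document.get("fileName") returns the value; for a field that was not stored it returns null.

For example, a file's title is usually Field.Store.YES, because it generally has to be shown to the user. The same goes for the author, the abstract, and similar information.

A file's body, however, may not need to be stored. The body can be very large, and there is no need to keep a copy in the index when the user can simply be directed to the original file.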

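2. Boosting

A Field can be boosted with Field.setBoost(). In Lucene 4.x a Document no longer has a boost of its own, so to weight a whole document, apply the boost to each of its fields. Boosting is one factor that influences the order of the results, but only one; it combines with other factors in Lucene's scoring algorithm.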

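(C) Indexing rich (non-plain-text) documents

The statement used above to index the body:

Field field = new TextField("contents", new FileReader(filetoIndex));

only works for plain text. For rich formats such as Word, Excel, and PDF, FileReader only yields garbage characters, which cannot form a useful index.

To index such files, first extract the body text with a tool such as Tika, then index the extracted text.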

http://stackoverflow.com/questions/16640292/lucene-4-2-0-index-pdf

Lucene doesn't handle files at all, really. That demo handles plain text files, but core Lucene doesn't. FileStreamReader is a Java standard stream reader, and for your purposes, it will only handle plain text. This works on the Unix philosophy. Lucene indexes content. Tika extracts content from rich documents. I've added links to a couple of examples using Tika, one with Lucene directly, the other using Solr (which you might want to consider as well).

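A simple example:

First extract the body of the Word file with Tika, then index the text with a TextField.

doc.add(new TextField("contents", TikaBasicUtil.extractContent(filetoIndex),Field.Store.NO));

Note that StringField cannot be used here. A StringField indexes its whole value as a single term, and a term may not be longer than 32766 bytes; otherwise an exception is thrown: IllegalArgumentException: Document contains at least one immense term in field="contents" (whose UTF8 encoding is longer than the max length 32766)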
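The limit in that message is measured in UTF-8 bytes, not in characters. As a minimal sketch in plain Java, a value could be checked before deciding to index it as a single untokenized term. TermLengthCheck and fitsInSingleTerm are hypothetical names introduced here, and the constant simply copies the 32766 from the exception message rather than reading it from Lucene.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical pre-check helper; it makes no Lucene calls. */
final class TermLengthCheck {

    /** Maximum UTF-8 length of one term, copied from the exception message. */
    static final int MAX_TERM_BYTES = 32766;

    /** Returns true if the value is short enough to be indexed as one term. */
    static boolean fitsInSingleTerm(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length <= MAX_TERM_BYTES;
    }
}
```

Because the count is in bytes, text in which each character takes three bytes in UTF-8 (most Chinese characters, for example) reaches the limit after 10922 characters, while pure ASCII text reaches it after 32766.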

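A simple example of extracting rich text with Tika so that it can be indexed:

Note that this example handles not only Word files but also PDF, Excel, and so on.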

package org.jediael.util;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class TikaBasicUtil {

	public static String extractContent(File f) {
		// 1. Create a parser; AutoDetectParser chooses the concrete parser by file type
		Parser parser = new AutoDetectParser();
		// try-with-resources closes the stream even if parsing fails
		try (InputStream is = new FileInputStream(f)) {
			Metadata metadata = new Metadata();
			metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
			// The no-argument BodyContentHandler stops at 100,000 characters;
			// new BodyContentHandler(-1) removes that limit
			ContentHandler handler = new BodyContentHandler();
			ParseContext context = new ParseContext();
			context.set(Parser.class, parser);
			// 2. Run the parser's parse() method
			parser.parse(is, handler, metadata, context);
			return handler.toString();
		} catch (IOException | SAXException | TikaException e) {
			// FileNotFoundException is an IOException, so it is covered here
			e.printStackTrace();
		}
		// Fallback value when extraction fails
		return "No Contents";
	}
}

III. Document

FSDocument RAMDocument

IV. IndexWriter

1. Creating an IndexWriter

		Directory returnIndexDir = FSDirectory.open(indexDir);

		IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48,new StandardAnalyzer(Version.LUCENE_48));

		iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

		writer = new IndexWriter(returnIndexDir, iwc);

		System.out.println(writer.getConfig().getOpenMode()+"");

		System.out.println(iwc.getOpenMode());

Creating an IndexWriter takes two arguments: a Directory object, which specifies where the index is written, and an IndexWriterConfig object, which holds the writer's configuration.

2. IndexWriterConfig

(1) Inheritance hierarchy

java.lang.Object
- org.apache.lucene.index.LiveIndexWriterConfig
- - org.apache.lucene.index.IndexWriterConfig

All Implemented Interfaces:

Cloneable

(2) Holds all the configuration that is used to create an IndexWriter. Once IndexWriter has been created with this object, changes to this object will not affect the IndexWriter instance.

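(3) IndexWriterConfig.OpenMode: specifies how the index directory is opened. There are three modes:

APPEND: Opens an existing index. The newly indexed content is appended to the existing index, whether or not the documents duplicate ones already there. So if the same documents are indexed twice, a search returns twice as many results. If no index exists yet, opening the writer fails.

CREATE: Creates a new index or overwrites an existing one. If an index already exists, it is removed first and a new one is created.

CREATE_OR_APPEND (the default): Creates a new index if one does not exist, otherwise it opens the index and documents will be appended.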

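3. Optimizing the index

During indexing, the results are written to multiple index files (segments). This keeps indexing efficient, but at search time the results from the separate files have to be combined, which makes searching slower.

To return search results faster, the index can be optimized.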

writer.addDocument(doc);

writer.forceMerge(2);

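Optimizing merges the index files into one or a limited number of segments; forceMerge(2) leaves at most two. It increases the cost of indexing and reduces the cost of searching.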

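V. Analyzer

This section only covers analyzers as they relate to indexing. For more detail on analyzers, see http://blog.csdn.net/jediael_lu/article/details/33303499 ([Lucene 4.8 Tutorial, Part 4] Analysis)

An analyzer must be specified when creating the IndexWriter, for example: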

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48,new StandardAnalyzer(Version.LUCENE_48));

writer = new IndexWriter(IndexDir, iwc);

In addition, each time a document is added to the writer, an analyzer can be specified for that particular document, for example:

writer.addDocument(doc, new SimpleAnalyzer(Version.LUCENE_48));

VI. Directory
