[Lucene 4.8 Tutorial, Part 2] Indexing 2014-06-16 11:30

I. Basics

0. Official documentation

(1) org.apache.lucene.index provides two primary classes:
IndexWriter, which creates and adds documents to indices; and
IndexReader, which accesses the data in the index.

(2) The two main packages involved are:

org.apache.lucene.index: Code to maintain and access indices.

org.apache.lucene.document: The logical representation of a Document for indexing and searching.

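1. The important classes involved in creating an index are:

(1) IndexWriter: the core component of the indexing process. It creates a new index or opens an existing one, and adds, deletes, and updates documents in the index.

(2) Document: a collection of fields.

(3) Field and its subclasses: a single field, such as a document's creation time, author, or content.

(4) Analyzer: the analyzer, which turns text into the tokens that are indexed.

(5) Directory: describes where a Lucene index is stored.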

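2. The basic steps for indexing documents are:

(1) Create the index writer (IndexWriter)

(2) Create a Document from each file

(3) Write the Document to the index

The basic program is as follows: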

package org.jediael.search.index;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.jediael.util.LoadProperties;

// 1. Create the index writer (IndexWriter)
// 2. Create a Document from each file
// 3. Write the Document to the index
public class IndexFiles {

	private IndexWriter writer = null;

	public void indexAllFileinDirectory(String indexPath, String docsPath)
			throws IOException {
		// Location of the files to index. If the argument is null, fall back
		// to the default value set in search.properties.
		if (docsPath == null) {
			docsPath = LoadProperties.getProperties("docsDir");
		}
		final File docDir = new File(docsPath);
		if (!docDir.exists() || !docDir.canRead()) {
			System.out.println("Document directory '"
					+ docDir.getAbsolutePath()
					+ "' does not exist or is not readable, please check the path");
			System.exit(1);
		}

		// Location of the index. If the argument is null, fall back
		// to the default value set in search.properties.
		if (indexPath == null) {
			indexPath = LoadProperties.getProperties("indexDir");
		}
		final File indexDir = new File(indexPath);
		if (!indexDir.exists() || !indexDir.canRead()) {
			System.out.println("Index directory '"
					+ indexDir.getAbsolutePath()
					+ "' does not exist or is not readable, please check the path");
			System.exit(1);
		}

		try {
			// 1. Create the index writer (IndexWriter)
			if (writer == null) {
				initialIndexWriter(indexDir);
			}
			index(writer, docDir);
		} catch (IOException e) {
			e.printStackTrace();
		} finally {
			// writer is still null if initialIndexWriter() failed
			if (writer != null) {
				writer.close();
				// A closed IndexWriter cannot be reused; create a new one next time
				writer = null;
			}
		}
	}

	// A very simple singleton-style pattern that provides a single IndexWriter.
	// Note that it is not thread-safe and needs further work.
	private void initialIndexWriter(File indexDir) throws IOException {
		Directory returnIndexDir = FSDirectory.open(indexDir);
		IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48,
				new StandardAnalyzer(Version.LUCENE_48));
		writer = new IndexWriter(returnIndexDir, iwc);
	}

	private void index(IndexWriter writer, File filetoIndex) throws IOException {
		if (filetoIndex.isDirectory()) {
			// Recurse into subdirectories
			String[] files = filetoIndex.list();
			if (files != null) {
				for (int i = 0; i < files.length; i++) {
					index(writer, new File(filetoIndex, files[i]));
				}
			}
		} else {
			// 2. Create a Document from the file. Consider whether creating a
			// new Document object every time can be avoided.
			Document doc = new Document();
			Field pathField = new StringField("path", filetoIndex.getPath(),
					Field.Store.YES);
			doc.add(pathField);
			doc.add(new LongField("modified", filetoIndex.lastModified(),
					Field.Store.YES));
			doc.add(new StringField("title", filetoIndex.getName(), Field.Store.YES));
			// The reader is consumed inside addDocument(); close it afterwards.
			try (FileReader reader = new FileReader(filetoIndex)) {
				doc.add(new TextField("contents", reader));
				//System.out.println("Indexing " + filetoIndex.getName());
				// 3. Write the Document to the index
				writer.addDocument(doc);
			}
		}
	}
}

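Some notes:

(1) The code uses a very simple singleton-style pattern to provide a single IndexWriter. Note that it is not thread-safe and needs further work.

(2) Instances of IndexWriter, IndexReader, and similar classes are expensive to create, so unless there is a reason not to, create one instance with the singleton pattern and reuse it.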
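The thread-safety caveat in note (1) can be handled by guarding the lazy creation with a lock. The following is a minimal JDK-only sketch, assuming Java 8 or later for Supplier. LazyHolder and its factory argument are hypothetical names introduced for illustration; they are not part of Lucene or of the program above. In the indexing program, the factory would be the code that opens the IndexWriter.

```java
import java.util.function.Supplier;

/** Hypothetical helper: builds an expensive object at most once, even under concurrent access. */
final class LazyHolder<T> {
    private final Supplier<T> factory; // how to build the instance, e.g. code that opens an IndexWriter
    private T instance;                // guarded by the lock on "this"

    LazyHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the shared instance, creating it on first use. */
    synchronized T get() {
        if (instance == null) {
            instance = factory.get();
        }
        return instance;
    }

    public static void main(String[] args) {
        LazyHolder<StringBuilder> holder = new LazyHolder<>(StringBuilder::new);
        // Both calls return the same object
        System.out.println(holder.get() == holder.get()); // true
    }
}
```

A synchronized getter is the simplest correct choice; it costs one lock acquisition per call, which is negligible next to the cost of indexing a document.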

3. The relationship between the index, Document, and Field

In short, multiple Fields make up a Document, and multiple Documents make up an index.

They are tied together by the following calls:

Document doc = new Document();
Field pathField = new StringField("path", filetoIndex.getPath(), Field.Store.YES);
doc.add(pathField);
writer.addDocument(doc);

II. About Field

(I) Basic ways to create a field

1. Before Lucene 4.x, a Field was created like this:

Field field = new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED);
Field field = new Field("contents", new FileReader(f));
Field field = new Field("fullpath", f.getCanonicalPath(), Field.Store.YES, Field.Index.NOT_ANALYZED);

The four arguments of the Field constructor are:

the field name

the field value

whether to store the value

whether to analyze the value; file names, URLs, file paths, and the like do not need to be analyzed.

2. Since Lucene 4, many concrete Field implementations are defined. Pick the one you need and use it directly instead of creating fields with the generic Field class.

Direct Known Subclasses:

BinaryDocValuesField, DoubleField, FloatField, IntField, LongField, NumericDocValuesField, SortedDocValuesField, SortedSetDocValuesField, StoredField, StringField, TextField

For example, the three Fields above can be rewritten along these lines:

Field field = new StringField("path", filetoIndex.getPath(), Field.Store.YES);
Field field = new LongField("modified", filetoIndex.lastModified(), Field.Store.NO);
Field field = new TextField("contents", new FileReader(filetoIndex));

From 4.x on, StringField is NOT_ANALYZED (the field value is not split into tokens), while TextField is ANALYZED, so this property no longer has to be specified when creating a Field. See http://stackoverflow.com/questions/19042587/how-to-prevent-a-field-from-not-analyzing-in-lucene

In other words, every Field subclass has its own default indexed and analyzed settings, which no longer need to be given explicitly.

From the official documentation:

StringField: A field that is indexed but not tokenized: the entire String value is indexed as a single token. For example this might be used for a 'country' field or an 'id' field, or any field that you intend to use for sorting or access through the field cache.

TextField: A field that is indexed and tokenized, without term vectors. For example this would be used on a 'body' field, that contains the bulk of a document's text.
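The practical difference between "indexed as a single token" and "tokenized" can be sketched without Lucene. The toy code below is not Lucene's implementation: FieldTokenSketch, singleToken, and tokenize are hypothetical names, and the lowercase-and-split rule is only a rough stand-in for what a real analyzer does (a real analyzer may also drop stop words, for instance).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class FieldTokenSketch {
    /** StringField-like behaviour: the whole value becomes one term, unchanged. */
    static List<String> singleToken(String value) {
        return Collections.singletonList(value);
    }

    /** TextField-like behaviour (simplified): lowercase, then split on non-letter, non-digit runs. */
    static List<String> tokenize(String value) {
        List<String> terms = new ArrayList<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String value = "Apache Lucene Tutorial";
        System.out.println(singleToken(value)); // [Apache Lucene Tutorial]
        System.out.println(tokenize(value));    // [apache, lucene, tutorial]
    }
}
```

With the single-token form, only a lookup for the exact full value matches; with the tokenized form, a lookup for the term "lucene" matches. This is why identifiers and paths suit StringField and body text suits TextField.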

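(II) Some Field options

1. Field.Store.YES/NO

When creating a Field, you pass an argument that says whether the value should be stored in the index. Stored values can be returned with the search results and shown to the user.

The most visible difference between the two is whether document.get("fileName") returns the value.

For example, a file's title is usually Field.Store.YES, because it generally has to be shown to the user; the same goes for the author, the abstract, and similar information.

The file's content, on the other hand, may not need to be stored: it can be very large, and there is no need to keep it in the index when the user can simply be directed to the original file.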

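2. Boosting

A Field can be boosted with Field.setBoost(). Lucene 4.x no longer has a Document-level boost, so boosting a whole Document means boosting its Fields. Note that boost is one factor that affects the order of the results, but only one; together with other factors it makes up Lucene's scoring algorithm.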

(III) Indexing rich (non-plain-text) documents

The statement used above to index the body text:

Field field = new TextField("contents", new FileReader(filetoIndex));

works only for plain text. For rich formats such as Word, Excel, and PDF, what FileReader reads is just garbage characters, which cannot produce a useful index.

To index such documents, first extract the body text with a tool such as Tika, then index it.

http://stackoverflow.com/questions/16640292/lucene-4-2-0-index-pdf

Lucene doesn't handle files at all, really. That demo handles plain text files, but core Lucene doesn't. FileStreamReader is a Java standard stream reader, and for your purposes, it will only handle plain text. This works on the Unix philosophy. Lucene indexes content. Tika extracts content from rich documents. I've added links to a couple of examples using Tika, one with Lucene directly, the other using Solr (which you might want to consider as well).

A simple example:

First extract the body text of the Word file with Tika, then index the text with a TextField.

doc.add(new TextField("contents", TikaBasicUtil.extractContent(filetoIndex), Field.Store.NO));

Note that StringField cannot be used here. A StringField value is indexed as a single term, and a term cannot be longer than 32766 bytes; otherwise an exception is thrown: IllegalArgumentException: Document contains at least one immense term in field="contents" (whose UTF8 encoding is longer than the max length 32766)
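The limit quoted in the exception message is measured in bytes of the UTF-8 encoding, not in characters, so text outside ASCII reaches it sooner. A minimal JDK-only sketch of a pre-check follows. TermLengthCheck and fitsInSingleTerm are hypothetical names, and the constant simply repeats the 32766 from the message; this is not a Lucene API.

```java
import java.nio.charset.StandardCharsets;

final class TermLengthCheck {
    /** Maximum term length in UTF-8 bytes, as quoted in the exception message. */
    static final int MAX_TERM_BYTES = 32766;

    /** Returns true if the value is short enough to be indexed as one term. */
    static boolean fitsInSingleTerm(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length <= MAX_TERM_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(fitsInSingleTerm("report.docx")); // true

        // 11000 CJK characters at 3 bytes each are 33000 bytes in UTF-8
        StringBuilder big = new StringBuilder();
        for (int i = 0; i < 11000; i++) {
            big.append('\u7d22');
        }
        System.out.println(fitsInSingleTerm(big.toString())); // false
    }
}
```

A check like this could decide between StringField and TextField for values of unpredictable size, although for extracted body text TextField is the right choice regardless of length.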

A simple example of extracting rich-text content with Tika:

Note that this example handles not only Word files but also PDF, Excel, and other formats.

package org.jediael.util;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class TikaBasicUtil {

	public static String extractContent(File f) {
		// 1. Create a parser
		Parser parser = new AutoDetectParser();
		// try-with-resources closes the stream even if parsing fails
		try (InputStream is = new FileInputStream(f)) {
			Metadata metadata = new Metadata();
			metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
			// The default BodyContentHandler stops after 100,000 characters;
			// new BodyContentHandler(-1) removes that limit.
			ContentHandler handler = new BodyContentHandler();
			ParseContext context = new ParseContext();
			context.set(Parser.class, parser);
			// 2. Run the parser's parse() method
			parser.parse(is, handler, metadata, context);
			String returnString = handler.toString();
			System.out.println(returnString.length());
			return returnString;
		} catch (IOException | SAXException | TikaException e) {
			e.printStackTrace();
		}
		// Fallback value when extraction fails
		return "No Contents";
	}
}

III. About Document

FSDocument RAMDocument

IV. About IndexWriter

1. Creating an IndexWriter

Directory returnIndexDir = FSDirectory.open(indexDir);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
writer = new IndexWriter(returnIndexDir, iwc);
System.out.println(writer.getConfig().getOpenMode() + "");
System.out.println(iwc.getOpenMode());

Creating an IndexWriter takes two arguments: a Directory object, which specifies where the index is written, and an IndexWriterConfig object, which holds the writer's configuration.

2. IndexWriterConfig

(1) Inheritance

java.lang.Object
- org.apache.lucene.index.LiveIndexWriterConfig
- - org.apache.lucene.index.IndexWriterConfig

All Implemented Interfaces:

Cloneable

(2) Holds all the configuration that is used to create an IndexWriter. Once IndexWriter has been created with this object, changes to this object will not affect the IndexWriter instance.

(3) IndexWriterConfig.OpenMode: specifies how the index directory is opened. There are three modes:

APPEND: Opens an existing index. Newly indexed content is appended to the existing index, whether or not the documents duplicate ones already there. If the same documents are indexed twice, a search therefore returns twice as many results. Opening fails if no index exists yet.

CREATE: Creates a new index or overwrites an existing one. If an index already exists, it is deleted first and a new index is created.

CREATE_OR_APPEND (default): Creates a new index if one does not exist, otherwise it opens the index and documents will be appended.
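The three modes can be modelled with a toy in-memory "index" to make the duplication effect of appending concrete. This is a sketch of the semantics described above, not Lucene code: ToyIndex, its Mode enum, and the list-backed storage are all hypothetical, and the failure of APPEND on a missing index is modelled with IllegalStateException.

```java
import java.util.ArrayList;
import java.util.List;

final class ToyIndex {
    enum Mode { CREATE, APPEND, CREATE_OR_APPEND }

    // null means "no index exists yet"
    private List<String> docs;

    /** Opens the toy index in the given mode. */
    void open(Mode mode) {
        switch (mode) {
            case CREATE:
                docs = new ArrayList<>(); // discard whatever was there
                break;
            case APPEND:
                if (docs == null) {
                    throw new IllegalStateException("no existing index to append to");
                }
                break;
            case CREATE_OR_APPEND:
                if (docs == null) {
                    docs = new ArrayList<>();
                }
                break;
        }
    }

    /** Adds a document; open() must have been called first. */
    void add(String doc) {
        docs.add(doc);
    }

    /** Counts stored documents equal to the query (a stand-in for a search). */
    int hits(String query) {
        int n = 0;
        for (String d : docs) {
            if (d.equals(query)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.open(Mode.CREATE_OR_APPEND);
        index.add("a.txt");
        index.open(Mode.CREATE_OR_APPEND); // second run appends to the first
        index.add("a.txt");
        System.out.println(index.hits("a.txt")); // 2

        index.open(Mode.CREATE); // starts over
        index.add("a.txt");
        System.out.println(index.hits("a.txt")); // 1
    }
}
```

In a real application that re-indexes the same files repeatedly, the usual choices are CREATE for a full rebuild or updating documents by a unique key rather than adding them again.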

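3. Optimizing the index

During indexing, the results are written to multiple index files (segments). This makes indexing more efficient, but at search time the results from the individual files have to be combined, so searching is slower.

To speed up searches, the index can be optimized.

writer.addDocument(doc);
writer.forceMerge(2);

Optimizing merges the index files into one file or a limited number of files. It increases the cost of indexing and reduces the cost of searching.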

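V. About Analyzer

This section covers only the analyzer as it relates to indexing. For more detail on analyzers, see http://blog.csdn.net/jediael_lu/article/details/33303499 [Lucene 4.8 Tutorial, Part 4] Analysis

An analyzer must be specified when the IndexWriter is created, for example:

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
writer = new IndexWriter(returnIndexDir, iwc);

In addition, each time a document is added to the writer, an analyzer can be specified for that document, for example:

writer.addDocument(doc, new SimpleAnalyzer(Version.LUCENE_48));

VI. About Directory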
