Lucene Source Code Analysis (2): A Read Process Example

1. Official demo code

        Analyzer analyzer = new StandardAnalyzer();

        // Store the index in memory:
        Directory directory = new RAMDirectory();
        // To store an index on disk, use this instead
        // (FSDirectory.open takes a java.nio.file.Path in Lucene 5 and later):
        //Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));

        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter iwriter = new IndexWriter(directory, config);

        Document doc = new Document();
        String text = "This is the text to be indexed.";
        doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
        iwriter.addDocument(doc);
        iwriter.close();

2. Classes involved and their relationships

2.1 TokenStream

/**
 * A <code>TokenStream</code> enumerates the sequence of tokens, either from
 * {@link Field}s of a {@link Document} or from query text.
 * <p>
 * This is an abstract class; concrete subclasses are:
 * <ul>
 * <li>{@link Tokenizer}, a <code>TokenStream</code> whose input is a Reader; and
 * <li>{@link TokenFilter}, a <code>TokenStream</code> whose input is another
 * <code>TokenStream</code>.
 * </ul>
 * A new <code>TokenStream</code> API has been introduced with Lucene 2.9. This API
 * has moved from being {@link Token}-based to {@link Attribute}-based. While
 * {@link Token} still exists in 2.9 as a convenience class, the preferred way
 * to store the information of a {@link Token} is to use {@link AttributeImpl}s.
 * <p>
 * <code>TokenStream</code> now extends {@link AttributeSource}, which provides
 * access to all of the token {@link Attribute}s for the <code>TokenStream</code>.
 * Note that only one instance per {@link AttributeImpl} is created and reused
 * for every token. This approach reduces object creation and allows local
 * caching of references to the {@link AttributeImpl}s. See
 * {@link #incrementToken()} for further details.
 * <p>
 * <b>The workflow of the new <code>TokenStream</code> API is as follows:</b>
 * <ol>
 * <li>Instantiation of <code>TokenStream</code>/{@link TokenFilter}s which add/get
 * attributes to/from the {@link AttributeSource}.
 * <li>The consumer calls {@link TokenStream#reset()}.
 * <li>The consumer retrieves attributes from the stream and stores local
 * references to all attributes it wants to access.
 * <li>The consumer calls {@link #incrementToken()} until it returns false
 * consuming the attributes after each call.
 * <li>The consumer calls {@link #end()} so that any end-of-stream operations
 * can be performed.
 * <li>The consumer calls {@link #close()} to release any resource when finished
 * using the <code>TokenStream</code>.
 * </ol>
 * To make sure that filters and consumers know which attributes are available,
 * the attributes must be added during instantiation. Filters and consumers are
 * not required to check for availability of attributes in
 * {@link #incrementToken()}.
 * <p>
 * You can find some example code for the new API in the analysis package level
 * Javadoc.
 * <p>
 * Sometimes it is desirable to capture a current state of a <code>TokenStream</code>,
 * e.g., for buffering purposes (see {@link CachingTokenFilter},
 * TeeSinkTokenFilter). For this usecase
 * {@link AttributeSource#captureState} and {@link AttributeSource#restoreState}
 * can be used.
 * <p>The {@code TokenStream}-API in Lucene is based on the decorator pattern.
 * Therefore all non-abstract subclasses must be final or have at least a final
 * implementation of {@link #incrementToken}! This is checked when Java
 * assertions are enabled.
 */
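The workflow listed in the Javadoc above (reset, fetch attributes, loop on incrementToken, end, close) and the decorator structure can be sketched with a small stdlib-only toy model. Every class below (`ToyTermAttribute`, `ToyTokenStream`, `ToyWhitespaceTokenizer`, `ToyLowerCaseFilter`, `ToyTokenStreamDemo`) is a hypothetical stand-in invented for illustration, not one of Lucene's classes; it only mirrors the shape of the contract, under the assumption that a single mutable attribute object is shared along the whole chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for a term attribute: one mutable instance, reused for every token. */
final class ToyTermAttribute {
    String term = "";
}

/** Hypothetical stand-in for the TokenStream contract (not Lucene's class). */
abstract class ToyTokenStream implements AutoCloseable {
    final ToyTermAttribute termAtt;

    ToyTokenStream(ToyTermAttribute termAtt) {
        this.termAtt = termAtt;
    }

    abstract void reset();

    abstract boolean incrementToken();

    void end() {
        // end-of-stream hook; nothing to do in this toy model
    }

    @Override
    public void close() {
        // resource release hook; nothing to do in this toy model
    }
}

/** Source stage, playing the role of a Tokenizer: splits the text on whitespace. */
final class ToyWhitespaceTokenizer extends ToyTokenStream {
    private final String text;
    private String[] parts = new String[0];
    private int pos;

    ToyWhitespaceTokenizer(String text) {
        super(new ToyTermAttribute());
        this.text = text;
    }

    @Override
    void reset() {
        String trimmed = text.trim();
        parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        pos = 0;
    }

    @Override
    boolean incrementToken() {
        if (pos >= parts.length) {
            return false;
        }
        termAtt.term = parts[pos++]; // overwrite the shared attribute in place
        return true;
    }
}

/** Decorator stage, playing the role of a TokenFilter: wraps another stream and shares its attribute. */
final class ToyLowerCaseFilter extends ToyTokenStream {
    private final ToyTokenStream input;

    ToyLowerCaseFilter(ToyTokenStream input) {
        super(input.termAtt);
        this.input = input;
    }

    @Override
    void reset() {
        input.reset();
    }

    @Override
    boolean incrementToken() {
        if (!input.incrementToken()) {
            return false;
        }
        termAtt.term = termAtt.term.toLowerCase(Locale.ROOT);
        return true;
    }

    @Override
    void end() {
        input.end();
    }

    @Override
    public void close() {
        input.close();
    }
}

final class ToyTokenStreamDemo {
    /** Consumer side: follows the reset / incrementToken / end / close sequence. */
    static List<String> consume(String text) {
        List<String> terms = new ArrayList<>();
        try (ToyTokenStream stream = new ToyLowerCaseFilter(new ToyWhitespaceTokenizer(text))) {
            stream.reset();
            ToyTermAttribute termAtt = stream.termAtt; // local reference, fetched once
            while (stream.incrementToken()) {
                terms.add(termAtt.term); // copy the value; the attribute is overwritten by the next call
            }
            stream.end();
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(consume("This is the text to be indexed."));
    }
}
```

Both concrete classes are declared `final`, which matches the rule quoted in the Javadoc that non-abstract subclasses must be final or have a final `incrementToken`.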

2.2 Analyzer

/**
 * An Analyzer builds TokenStreams, which analyze text.  It thus represents a
 * policy for extracting index terms from text.
 * <p>
 * In order to define what analysis is done, subclasses must define their
 * {@link TokenStreamComponents TokenStreamComponents} in {@link #createComponents(String)}.
 * The components are then reused in each call to {@link #tokenStream(String, Reader)}.
 * <p>
 * Simple example:
 * <pre class="prettyprint">
 * Analyzer analyzer = new Analyzer() {
 *  {@literal @Override}
 *   protected TokenStreamComponents createComponents(String fieldName) {
 *     Tokenizer source = new FooTokenizer(reader);
 *     TokenStream filter = new FooFilter(source);
 *     filter = new BarFilter(filter);
 *     return new TokenStreamComponents(source, filter);
 *   }
 *   {@literal @Override}
 *   protected TokenStream normalize(TokenStream in) {
 *     // Assuming FooFilter is about normalization and BarFilter is about
 *     // stemming, only FooFilter should be applied
 *     return new FooFilter(in);
 *   }
 * };
 * </pre>
 * For more examples, see the {@link org.apache.lucene.analysis Analysis package documentation}.
 * <p>
 * For some concrete implementations bundled with Lucene, look in the analysis modules:
 * <ul>
 *   <li><a href="{@docRoot}/../analyzers-common/overview-summary.html">Common</a>:
 *       Analyzers for indexing content in different languages and domains.
 *   <li><a href="{@docRoot}/../analyzers-icu/overview-summary.html">ICU</a>:
 *       Exposes functionality from ICU to Apache Lucene.
 *   <li><a href="{@docRoot}/../analyzers-kuromoji/overview-summary.html">Kuromoji</a>:
 *       Morphological analyzer for Japanese text.
 *   <li><a href="{@docRoot}/../analyzers-morfologik/overview-summary.html">Morfologik</a>:
 *       Dictionary-driven lemmatization for the Polish language.
 *   <li><a href="{@docRoot}/../analyzers-phonetic/overview-summary.html">Phonetic</a>:
 *       Analysis for indexing phonetic signatures (for sounds-alike search).
 *   <li><a href="{@docRoot}/../analyzers-smartcn/overview-summary.html">Smart Chinese</a>:
 *       Analyzer for Simplified Chinese, which indexes words.
 *   <li><a href="{@docRoot}/../analyzers-stempel/overview-summary.html">Stempel</a>:
 *       Algorithmic Stemmer for the Polish Language.
 *   <li><a href="{@docRoot}/../analyzers-uima/overview-summary.html">UIMA</a>:
 *       Analysis integration with Apache UIMA.
 * </ul>
 */

2.3 Directory

/** A Directory is a flat list of files.  Files may be written once, when they
 * are created.  Once a file is created it may only be opened for read, or
 * deleted.  Random access is permitted both when reading and writing.
 *
 * <p> Java's i/o APIs not used directly, but rather all i/o is
 * through this API.  This permits things such as: <ul>
 * <li> implementation of RAM-based indices;
 * <li> implementation indices stored in a database, via JDBC;
 * <li> implementation of an index as a single file;
 * </ul>
 *
 * Directory locking is implemented by an instance of {@link
 * LockFactory}.
 *
 */
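The "flat list of files, written once" contract described above can be illustrated with a tiny in-memory model. `ToyDirectory` and its method names are hypothetical and are not Lucene's `Directory` API; the sketch only assumes the three rules stated in the Javadoc: a file is written once at creation, afterwards it can only be read or deleted, and there are no sub-directories.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of a write-once, flat file list (not Lucene's Directory). */
final class ToyDirectory {
    // Flat namespace: file name -> immutable content, no nesting.
    private final Map<String, byte[]> files = new TreeMap<>();

    /** Writes a file exactly once; a second write to the same name is rejected. */
    void createFile(String name, byte[] content) {
        if (files.containsKey(name)) {
            throw new IllegalStateException("file already exists: " + name);
        }
        files.put(name, content.clone());
    }

    /** Opens an existing file for reading and returns a copy of its bytes. */
    byte[] readFile(String name) {
        byte[] content = files.get(name);
        if (content == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
        return content.clone();
    }

    /** Deleting is the only way to get rid of a file's content. */
    void deleteFile(String name) {
        if (files.remove(name) == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
    }

    /** Lists all file names in sorted order. */
    String[] listAll() {
        return files.keySet().toArray(new String[0]);
    }
}
```

Because existing files never change in this model, a reader that already opened a set of files keeps seeing consistent content while new files are added next to them, which is one plausible reason for the write-once design.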

2.4 IndexWriter

/**
  An <code>IndexWriter</code> creates and maintains an index.
  <p>The {@link OpenMode} option on
  {@link IndexWriterConfig#setOpenMode(OpenMode)} determines
  whether a new index is created, or whether an existing index is
  opened. Note that you can open an index with {@link OpenMode#CREATE}
  even while readers are using the index. The old readers will
  continue to search the "point in time" snapshot they had opened,
  and won't see the newly created index until they re-open. If
  {@link OpenMode#CREATE_OR_APPEND} is used IndexWriter will create a
  new index if there is not already an index at the provided path
  and otherwise open the existing index.</p>
  <p>In either case, documents are added with {@link #addDocument(Iterable)
  addDocument} and removed with {@link #deleteDocuments(Term...)} or {@link
  #deleteDocuments(Query...)}. A document can be updated with {@link
  #updateDocument(Term, Iterable) updateDocument} (which just deletes
  and then adds the entire document). When finished adding, deleting
  and updating documents, {@link #close() close} should be called.</p>
  <a name="sequence_numbers"></a>
  <p>Each method that changes the index returns a {@code long} sequence number, which
  expresses the effective order in which each change was applied.
  {@link #commit} also returns a sequence number, describing which
  changes are in the commit point and which are not.  Sequence numbers
  are transient (not saved into the index in any way) and only valid
  within a single {@code IndexWriter} instance.</p>
  <a name="flush"></a>
  <p>These changes are buffered in memory and periodically
  flushed to the {@link Directory} (during the above method
  calls). A flush is triggered when there are enough added documents
  since the last flush. Flushing is triggered either by RAM usage of the
  documents (see {@link IndexWriterConfig#setRAMBufferSizeMB}) or the
  number of added documents (see {@link IndexWriterConfig#setMaxBufferedDocs(int)}).
  The default is to flush when RAM usage hits
  {@link IndexWriterConfig#DEFAULT_RAM_BUFFER_SIZE_MB} MB. For
  best indexing speed you should flush by RAM usage with a
  large RAM buffer. Additionally, if IndexWriter reaches the configured number of
  buffered deletes (see {@link IndexWriterConfig#setMaxBufferedDeleteTerms})
  the deleted terms and queries are flushed and applied to existing segments.
  In contrast to the other flush options {@link IndexWriterConfig#setRAMBufferSizeMB} and
  {@link IndexWriterConfig#setMaxBufferedDocs(int)}, deleted terms
  won't trigger a segment flush. Note that flushing just moves the
  internal buffered state in IndexWriter into the index, but
  these changes are not visible to IndexReader until either
  {@link #commit()} or {@link #close} is called.  A flush may
  also trigger one or more segment merges which by default
  run with a background thread so as not to block the
  addDocument calls (see <a href="#mergePolicy">below</a>
  for changing the {@link MergeScheduler}).</p>
  <p>Opening an <code>IndexWriter</code> creates a lock file for the directory in use. Trying to open
  another <code>IndexWriter</code> on the same directory will lead to a
  {@link LockObtainFailedException}.</p>
  <a name="deletionPolicy"></a>
  <p>Expert: <code>IndexWriter</code> allows an optional
  {@link IndexDeletionPolicy} implementation to be specified.  You
  can use this to control when prior commits are deleted from
  the index.  The default policy is {@link KeepOnlyLastCommitDeletionPolicy}
  which removes all prior commits as soon as a new commit is
  done.  Creating your own policy can allow you to explicitly
  keep previous "point in time" commits alive in the index for
  some time, either because this is useful for your application,
  or to give readers enough time to refresh to the new commit
  without having the old commit deleted out from under them.
  The latter is necessary when multiple computers take turns opening
  their own {@code IndexWriter} and {@code IndexReader}s
  against a single shared index mounted via remote filesystems
  like NFS which do not support "delete on last close" semantics.
  A single computer accessing an index via NFS is fine with the
  default deletion policy since NFS clients emulate "delete on
  last close" locally.  That said, accessing an index via NFS
  will likely result in poor performance compared to a local IO
  device. </p>
  <a name="mergePolicy"></a> <p>Expert:
  <code>IndexWriter</code> allows you to separately change
  the {@link MergePolicy} and the {@link MergeScheduler}.
  The {@link MergePolicy} is invoked whenever there are
  changes to the segments in the index.  Its role is to
  select which merges to do, if any, and return a {@link
  MergePolicy.MergeSpecification} describing the merges.
  The default is {@link LogByteSizeMergePolicy}.  Then, the {@link
  MergeScheduler} is invoked with the requested merges and
  it decides when and how to run the merges.  The default is
  {@link ConcurrentMergeScheduler}. </p>
  <a name="OOME"></a><p><b>NOTE</b>: if you hit a
  VirtualMachineError, or disaster strikes during a checkpoint
  then IndexWriter will close itself.  This is a
  defensive measure in case any internal state (buffered
  documents, deletions, reference counts) were corrupted.
  Any subsequent calls will throw an AlreadyClosedException.</p>
  <a name="thread-safety"></a><p><b>NOTE</b>: {@link
  IndexWriter} instances are completely thread
  safe, meaning multiple threads can call any of its
  methods, concurrently.  If your application requires
  external synchronization, you should <b>not</b>
  synchronize on the <code>IndexWriter</code> instance as
  this may cause deadlock; use your own (non-Lucene) objects
  instead. </p>
  <p><b>NOTE</b>: If you call
  <code>Thread.interrupt()</code> on a thread that's within
  IndexWriter, IndexWriter will try to catch this (eg, if
  it's in a wait() or Thread.sleep()), and will then throw
  the unchecked exception {@link ThreadInterruptedException}
  and <b>clear</b> the interrupt status on the thread.</p>
*/
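The relationship between buffering, flushing, committing and sequence numbers described in the Javadoc above can be sketched with a deliberately simplified model. `ToyIndexWriter` is a hypothetical class, not Lucene's `IndexWriter`: it assumes flushing is triggered only by a document count, and it ignores deletes, merges, RAM accounting, locking and threads. It shows only that a flush creates a segment without making it visible, and that visibility changes on commit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, heavily simplified model of buffer / flush / commit (not Lucene's IndexWriter). */
final class ToyIndexWriter {
    private final int maxBufferedDocs;
    private final List<String> buffer = new ArrayList<>();          // added but not yet flushed
    private final List<List<String>> segments = new ArrayList<>();  // flushed segments
    private int committedSegments = 0;                              // segments referenced by the last commit
    private long nextSequenceNumber = 1;                            // transient, valid for this instance only

    ToyIndexWriter(int maxBufferedDocs) {
        if (maxBufferedDocs < 1) {
            throw new IllegalArgumentException("maxBufferedDocs must be >= 1");
        }
        this.maxBufferedDocs = maxBufferedDocs;
    }

    /** Buffers a document, flushing when the buffer is full; returns a sequence number. */
    long addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() >= maxBufferedDocs) {
            flush();
        }
        return nextSequenceNumber++;
    }

    /** Moves buffered documents into a new segment; readers still cannot see it. */
    void flush() {
        if (!buffer.isEmpty()) {
            segments.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    /** Flushes and then publishes every segment written so far; returns a sequence number. */
    long commit() {
        flush();
        committedSegments = segments.size();
        return nextSequenceNumber++;
    }

    /** Number of segments written so far, committed or not. */
    int segmentCount() {
        return segments.size();
    }

    /** Number of documents a newly opened reader would see in this model. */
    int visibleDocCount() {
        int count = 0;
        for (int i = 0; i < committedSegments; i++) {
            count += segments.get(i).size();
        }
        return count;
    }
}
```

With `maxBufferedDocs` set to 2, adding two documents produces one flushed segment while `visibleDocCount()` stays at 0; only after `commit()` do the documents count as visible, mirroring the statement that flushed changes are not visible to a reader until commit or close.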

/*
 * Clarification: Check Points (and commits)
 * IndexWriter writes new index files to the directory without writing a new segments_N
 * file which references these new files. It also means that the state of
 * the in memory SegmentInfos object is different than the most recent
 * segments_N file written to the directory.
 *
 * Each time the SegmentInfos is changed, and matches the (possibly
 * modified) directory files, we have a new "check point".
 * If the modified/new SegmentInfos is written to disk - as a new
 * (generation of) segments_N file - this check point is also an
 * IndexCommit.
 *
 * A new checkpoint always replaces the previous checkpoint and
 * becomes the new "front" of the index. This allows the IndexFileDeleter
 * to delete files that are referenced only by stale checkpoints.
 * (files that were created since the last commit, but are no longer
 * referenced by the "front" of the index). For this, IndexFileDeleter
 * keeps track of the last non commit checkpoint.
 */
