Lucene file formats (notes, to be organized)

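This is the index format produced earlier by Lucene 3.0.

Table a

Table b

c. This is a picture found online (the segments in the two tables above had already been merged).

Index built with Lucene 4.9: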

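Index:
        In Lucene, an index lives in a single directory.
        As in the figure above, all the files in the same directory together make up one Lucene index.
    Segment:
        An index can contain multiple segments. Segments are independent of one another; adding new documents can create a new segment, and separate segments can be merged.
        As in the figure above, files that share a prefix belong to the same segment; the figure shows two segments, "_0" and "_1".
        segments.gen and segments_5 are the segment metadata files, i.e. they record the attributes of the segments.
    Document:
        The document is the basic unit of indexing. Documents may be stored in different segments, and one segment can hold many documents.
        Newly added documents are stored on their own in a newly created segment; as segments are merged, documents from different segments end up in the same segment.
    Field:
        A document carries different kinds of information that can be indexed separately, such as title, time, body and author, each stored in its own field.
        Different fields can be indexed in different ways; the details are covered when we look at how fields are actually stored.
    Term:
        A term is the smallest unit of the index: a string produced by lexical analysis and language processing.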
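
The rule that files sharing a prefix belong to the same segment can be sketched in a few lines of Python. This is only an illustration, not Lucene code: the function name `group_by_segment`, the regular expression, and the directory listing are assumptions made up for the example. The only things taken from the notes are the segment names "_0" and "_1" and the metadata files segments.gen and segments_5.

```python
import re
from collections import defaultdict

# Assumption: a per-segment file name starts with "_" followed by a
# lowercase alphanumeric segment counter, e.g. "_0.fdt" or "_1.tis".
SEGMENT_PREFIX = re.compile(r"^(_[0-9a-z]+)")


def group_by_segment(filenames):
    """Group index file names by their segment prefix.

    Files without a segment prefix (segments.gen, segments_N) are
    collected under the key None as index-level metadata.
    """
    groups = defaultdict(list)
    for name in filenames:
        match = SEGMENT_PREFIX.match(name)
        key = match.group(1) if match else None
        groups[key].append(name)
    return dict(groups)


# Hypothetical directory listing, invented for illustration.
listing = ["_0.fdt", "_0.fdx", "_1.fdt", "_1.fdx", "segments.gen", "segments_5"]
groups = group_by_segment(listing)
print(groups)
```

With this listing, the two segments each collect their own files, and the two metadata files fall into the `None` group.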

Segment info. This contains metadata about a segment, such as the number of documents and what files it uses.
Field names. This contains the set of field names used in the index.
Stored Field values. This contains, for each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, url, or an identifier to access a database. The set of stored fields are what is returned for each hit when searching. This is keyed by document number.
Term dictionary. A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents which contain the term, and pointers to the term's frequency and proximity data.
Term Frequency data. For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in that document, unless frequencies are omitted (IndexOptions.DOCS_ONLY).
Term Proximity data. For each term in the dictionary, the positions that the term occurs in each document. Note that this will not exist if all fields in all documents omit position data.
Normalization factors. For each field in each document, a value is stored that is multiplied into the score for hits on that field. (In other words, this value contributes to the score whenever a query matches on that field.)
Term Vectors. For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add Term Vectors to your index see the Field constructors.
Per-document values. Like stored values, these are also keyed by document number, but are generally intended to be loaded into main memory for fast access. Whereas stored values are generally intended for summary results from searches, per-document values are useful for things like scoring factors.
Deleted documents. An optional file indicating which documents are deleted.
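
To make the relationship between the term dictionary, the frequency data and the proximity data more concrete, here is a small in-memory sketch. It is not how Lucene encodes these structures on disk. The function `build_postings` and the two sample documents are hypothetical, and the tokens are assumed to be already analyzed.

```python
from collections import defaultdict


def build_postings(docs):
    """Return {term: {doc_number: [positions]}} for a list of token lists.

    Document numbers are simply list indices. Tokens are assumed to be
    already analyzed (lowercased and split).
    """
    postings = defaultdict(dict)
    for doc_number, tokens in enumerate(docs):
        for position, term in enumerate(tokens):
            postings[term].setdefault(doc_number, []).append(position)
    return dict(postings)


# Two made-up documents, already tokenized.
docs = [
    ["lucene", "stores", "an", "index"],
    ["an", "index", "has", "an", "inverted", "index"],
]
postings = build_postings(docs)

term = "index"
# Term dictionary entry: how many documents contain the term.
doc_freq = len(postings[term])
# Frequency data: how often the term occurs in each document.
freqs = {doc: len(pos) for doc, pos in postings[term].items()}
# Proximity data: where the term occurs in each document.
positions = postings[term]
print(term, doc_freq, freqs, positions)
```

Dropping the position lists and keeping only the document numbers corresponds to the case where position data is omitted.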

File extensions and what they stand for

.fdt field data

.fdx field index

.fnm field names: the set of field names used in the index

.frq frequencies

.nrm norms

.prx proximity data (term positions)

.tii term info index

.tis term infos

.si segment info: metadata about a segment, such as the number of documents and what files it uses

segments.gen

segments_N
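
As a quick way to read a directory listing, the abbreviations above can be put into a lookup table. This is a minimal sketch: the descriptions come from the list above, the treatment of segments.gen and segments_N as segment metadata follows the earlier definition of a segment, and the helper name `describe` is made up. Extensions not covered in these notes are reported as "unknown" rather than guessed.

```python
import os

# Descriptions taken from the list above; nothing else is assumed.
EXTENSIONS = {
    ".fdt": "field data",
    ".fdx": "field index",
    ".fnm": "field names",
    ".frq": "frequencies",
    ".nrm": "norms",
    ".prx": "proximity data (term positions)",
    ".tii": "term info index",
    ".tis": "term infos",
    ".si": "segment info",
}


def describe(filename):
    """Return a short description of an index file based on its name."""
    if filename == "segments.gen" or filename.startswith("segments_"):
        return "index-level segment metadata"
    _, ext = os.path.splitext(filename)
    return EXTENSIONS.get(ext, "unknown")


for name in ["_0.fdt", "_0.tis", "_1.si", "segments_5", "_0.cfs"]:
    print(name, "->", describe(name))
```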

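Forward information:

The containment relationships are stored level by level, from the index down to the term: Index -> Segment -> Document -> Field -> Term.
That is, which segments the index contains, which documents each segment contains, which fields each document contains, and which terms each field contains.
Because this is a hierarchy, each level stores its own information plus the metadata (attributes) of the level below it. Think of a book on the geography of China: it first gives an overview of China and says how many provinces it has; each province chapter gives an overview of the province and says how many cities it has; each city section gives an overview of the city and says how many counties it has; and each county entry describes that county in detail.
As in the figure above, the files that hold forward information include segments_N, .fnm, .fdx and .fdt.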

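Inverted information:

This stores the mapping from the dictionary to the postings lists: Term -> Document.
As in the figure above, the files that hold inverted information include .tis, .tii, .frq and .prx.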
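
The difference between the two directions can be shown with a toy example that turns a forward structure (document -> field -> terms) into an inverted one (term -> documents). This is a sketch under stated assumptions, not Lucene's implementation: the function `invert` and the sample data are hypothetical, and a term is keyed here by the pair (field, text) so that the same word in different fields stays separate.

```python
def invert(forward):
    """Turn {doc_number: {field: [terms]}} into {(field, term): [doc_numbers]}."""
    inverted = {}
    # Visit documents in ascending order so each postings list stays sorted.
    for doc_number in sorted(forward):
        for field, terms in forward[doc_number].items():
            for term in terms:
                docs = inverted.setdefault((field, term), [])
                # Record each document once, even if the term repeats in it.
                if not docs or docs[-1] != doc_number:
                    docs.append(doc_number)
    return inverted


# Forward information: made-up documents with two fields each.
forward = {
    0: {"title": ["lucene"], "body": ["index", "file", "format"]},
    1: {"title": ["index"], "body": ["lucene", "index", "index"]},
}
inverted = invert(forward)
for key in sorted(inverted):
    print(key, "->", inverted[key])
```

The forward structure answers "what does document 1 contain?", while the inverted one answers "which documents contain this term in the body field?", which is the question a search has to answer.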
