As an open-source project, Lucene has drawn an enormous response from the open-source community since its release. Programmers use it not only to build concrete full-text search applications, but also to integrate it into system software and web applications; even some commercial products have adopted Lucene as the core of their internal full-text retrieval subsystems. The Apache Software Foundation's own website uses Lucene as its full-text search engine; IBM's open-source Eclipse adopted Lucene as the full-text indexing engine for its help subsystem in version 2.1, and IBM's commercial product WebSphere uses Lucene as well. With its open-source nature, excellent index structure, and sound architecture, Lucene keeps gaining adoption.

As a full-text search engine, Lucene has the following notable strengths:

(1) The index file format is independent of the application platform. Lucene defines a set of index file formats based on 8-bit bytes, so that compatible systems, or applications on different platforms, can share the index files they build.

(2) On top of the inverted index used by traditional full-text engines, Lucene implements segmented indexing: new documents can be indexed quickly into small new index files, and merging them with the existing index later serves as the optimization step. A rough sketch of this workflow follows below.
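A minimal sketch of incremental indexing plus an explicit merge (directory, analyzer, and doc stand for objects set up as in the test class later in this article):

    IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
    try (IndexWriter writer = new IndexWriter(directory, cfg)) {
        writer.addDocument(doc); // written into a small, freshly created segment
        writer.forceMerge(1);    // merge all segments into one (the old "optimize")
    }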

(3) An excellent object-oriented architecture lowers the learning curve for extending Lucene and makes it easy to add new features.

(4) A text-analysis interface that is independent of language and file format. The indexer builds the index by consuming a stream of tokens, so supporting a new language or file format only requires implementing this text-analysis interface.

(5) A powerful query engine is implemented by default, so users gain strong query capabilities without writing any code of their own; out of the box, Lucene's query implementation supports Boolean operators, fuzzy search, grouped queries, and more, as the sketch below illustrates.
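A minimal sketch of the bundled query syntax ("fieldname" and analyzer are placeholders, and parse() throws ParseException):

    QueryParser parser = new QueryParser("fieldname", analyzer);
    Query boolQuery = parser.parse("+lucene +(score rank)"); // Boolean operators
    Query fuzzyQuery = parser.parse("lucene~1");             // fuzzy search with edit distance 1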

Lucene's index structure

1. Preparation

1.1 Download the latest source code: https://github.com/apache/lucene-solr

1.2 Build it with ant as described in the project documentation (I used "ant eclipse").

1.3 Import the built project into Eclipse, STS, or IDEA.

2. Create a test class

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.KeywordTokenizer;
    import org.apache.lucene.analysis.ngram.NGramTokenFilter;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.similarities.BM25Similarity;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public void test() throws IOException, ParseException {
        Analyzer analyzer = new NGramAnalyzer();
        // Store the index in memory:
        Directory directory = new RAMDirectory();
        // To store an index on disk, use this instead:
        // Path path = FileSystems.getDefault().getPath("E:\\demo\\data", "access.data");
        // Directory directory = FSDirectory.open(path);
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter iwriter = new IndexWriter(directory, config);
        Document doc = new Document();
        String text = "我是中国人.";
        doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
        iwriter.addDocument(doc);
        iwriter.close();

        // Now search the index:
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        isearcher.setSimilarity(new BM25Similarity());
        // Parse a simple query:
        QueryParser parser = new QueryParser("fieldname", analyzer);
        Query query = parser.parse("中国,人");
        ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
        // Iterate through the results:
        for (int i = 0; i < hits.length; i++) {
            Document hitDoc = isearcher.doc(hits[i].doc);
            System.out.println(hitDoc.getFields().toString());
        }
        ireader.close();
        directory.close();
    }

    private static class NGramAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // Emit 1-4 character n-grams, keeping the original token as well:
            final Tokenizer tokenizer = new KeywordTokenizer();
            return new TokenStreamComponents(tokenizer, new NGramTokenFilter(tokenizer, 1, 4, true));
        }
    }

Tokenization uses the custom NGramAnalyzer above. It extends Analyzer, which analyzes text and converts it into a TokenStream. The Javadoc explains:

/**
* An Analyzer builds TokenStreams, which analyze text. It thus represents a
* policy for extracting index terms from text.
* <p>
* In order to define what analysis is done, subclasses must define their
* {@link TokenStreamComponents TokenStreamComponents} in {@link #createComponents(String)}.
* The components are then reused in each call to {@link #tokenStream(String, Reader)}.
* <p>
* Simple example:
* <pre class="prettyprint">
* Analyzer analyzer = new Analyzer() {
* {@literal @Override}
* protected TokenStreamComponents createComponents(String fieldName) {
* Tokenizer source = new FooTokenizer(reader);
* TokenStream filter = new FooFilter(source);
* filter = new BarFilter(filter);
* return new TokenStreamComponents(source, filter);
* }
* {@literal @Override}
* protected TokenStream normalize(TokenStream in) {
* // Assuming FooFilter is about normalization and BarFilter is about
* // stemming, only FooFilter should be applied
* return new FooFilter(in);
* }
* };
* </pre>
* For more examples, see the {@link org.apache.lucene.analysis Analysis package documentation}.
* <p>
* For some concrete implementations bundled with Lucene, look in the analysis modules:
* <ul>
* <li><a href="{@docRoot}/../analyzers-common/overview-summary.html">Common</a>:
* Analyzers for indexing content in different languages and domains.
* <li><a href="{@docRoot}/../analyzers-icu/overview-summary.html">ICU</a>:
* Exposes functionality from ICU to Apache Lucene.
* <li><a href="{@docRoot}/../analyzers-kuromoji/overview-summary.html">Kuromoji</a>:
* Morphological analyzer for Japanese text.
* <li><a href="{@docRoot}/../analyzers-morfologik/overview-summary.html">Morfologik</a>:
* Dictionary-driven lemmatization for the Polish language.
* <li><a href="{@docRoot}/../analyzers-phonetic/overview-summary.html">Phonetic</a>:
* Analysis for indexing phonetic signatures (for sounds-alike search).
* <li><a href="{@docRoot}/../analyzers-smartcn/overview-summary.html">Smart Chinese</a>:
* Analyzer for Simplified Chinese, which indexes words.
* <li><a href="{@docRoot}/../analyzers-stempel/overview-summary.html">Stempel</a>:
* Algorithmic Stemmer for the Polish Language.
* </ul>
*
* @since 3.1
*/
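To see what tokens the NGramAnalyzer above actually produces, you can print its token stream (a minimal sketch; the expected output in the comment assumes 1-4 character n-grams with preserveOriginal=true):

    // needs org.apache.lucene.analysis.TokenStream and
    // org.apache.lucene.analysis.tokenattributes.CharTermAttribute
    try (TokenStream ts = new NGramAnalyzer().tokenStream("fieldname", "中国人")) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print(term + " "); // expected: 中 中国 中国人 国 国人 人
        }
        ts.end();
    }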

ClassicSimilarity is a wrapper around TFIDFSimilarity, which is an abstract class and therefore cannot be instantiated directly. That algorithm was the default scoring implementation in early versions of Lucene.

Put the test class into the lucene-solr source tree and debug it there. If you want to analyze the TF-IDF algorithm instead, simply new a ClassicSimilarity and set it on the IndexSearcher; other similarities work the same way.
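For example (a two-line sketch, reusing the ireader from the test class):

    IndexSearcher isearcher = new IndexSearcher(ireader);
    isearcher.setSimilarity(new ClassicSimilarity()); // TF-IDF scoring instead of BM25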

3. Introduction to the algorithm

Newer versions of Lucene use BM25Similarity as the default scoring implementation; the test class above also sets it explicitly. The algorithm is described in detail in [1]. Briefly, the score of document D for query Q is:

    score(D, Q) = Σ_i IDF(q_i) · f(q_i, D) · (k1 + 1) / (f(q_i, D) + k1 · (1 - b + b · |D| / avgdl))

where:

D is the document and Q the query; score(D, Q) is the score of document D under query Q, and the q_i are the terms Q is tokenized into.

IDF is the inverse document frequency of a term:

    IDF(q_i) = ln(1 + (N - n(q_i) + 0.5) / (n(q_i) + 0.5))

where N is the total number of documents and n(q_i) is the number of documents in which term q_i appears.

f(q_i, D) is the frequency of term q_i in document D.

k1 and b are tunable parameters, with default values 1.2 and 0.75.

|D| is the length of the document in words, and avgdl is the average document length in the collection.
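As a quick sanity check with made-up numbers: suppose N = 100 documents, the term occurs in n(q_i) = 10 of them, f(q_i, D) = 3, |D| = 10, avgdl = 20, and the defaults k1 = 1.2, b = 0.75. Then IDF(q_i) = ln(1 + 90.5 / 10.5) ≈ 2.26, the denominator is 3 + 1.2 · (0.25 + 0.75 · 10 / 20) = 3.75, and the term contributes 2.26 · (3 · 2.2) / 3.75 ≈ 3.98 to the score.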

4. Implementation

1. IDF

Single-term IDF:

    /** Implemented as <code>log(1 + (docCount - docFreq + 0.5)/(docFreq + 0.5))</code>. */
    protected float idf(long docFreq, long docCount) {
        return (float) Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }

Summing IDF over a set of terms (computeWeight, shown in full in section 4 below, calls this):

    /**
     * Computes a score factor for a phrase.
     *
     * <p>
     * The default implementation sums the idf factor for
     * each term in the phrase.
     *
     * @param collectionStats collection-level statistics
     * @param termStats term-level statistics for the terms in the phrase
     * @return an Explain object that includes both an idf
     *         score factor for the phrase and an explanation
     *         for each term.
     */
    public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats[]) {
        double idf = 0d; // sum into a double before casting into a float
        List<Explanation> details = new ArrayList<>();
        for (final TermStatistics stat : termStats) {
            Explanation idfExplain = idfExplain(collectionStats, stat);
            details.add(idfExplain);
            idf += idfExplain.getValue();
        }
        return Explanation.match((float) idf, "idf(), sum of:", details);
    }

2. The k1 and b parameters

    public BM25Similarity(float k1, float b) {
        if (Float.isFinite(k1) == false || k1 < 0) {
            throw new IllegalArgumentException("illegal k1 value: " + k1 + ", must be a non-negative finite value");
        }
        if (Float.isNaN(b) || b < 0 || b > 1) {
            throw new IllegalArgumentException("illegal b value: " + b + ", must be between 0 and 1");
        }
        this.k1 = k1;
        this.b = b;
    }

    /** BM25 with these default values:
     * <ul>
     *   <li>{@code k1 = 1.2}</li>
     *   <li>{@code b = 0.75}</li>
     * </ul>
     */
    public BM25Similarity() {
        this(1.2f, 0.75f);
    }
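If the defaults do not suit your data, pass your own values (a minimal sketch; the constants are arbitrary):

    // weaker length normalization (b = 0.3), higher tf saturation point (k1 = 1.5)
    isearcher.setSimilarity(new BM25Similarity(1.5f, 0.3f));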

3. Computing the average document length avgdl

    /** The default implementation computes the average as <code>sumTotalTermFreq / docCount</code> */
    protected float avgFieldLength(CollectionStatistics collectionStats) {
        final long sumTotalTermFreq;
        if (collectionStats.sumTotalTermFreq() == -1) {
            // frequencies are omitted (tf=1), its # of postings
            if (collectionStats.sumDocFreq() == -1) {
                // theoretical case only: remove!
                return 1f;
            }
            sumTotalTermFreq = collectionStats.sumDocFreq();
        } else {
            sumTotalTermFreq = collectionStats.sumTotalTermFreq();
        }
        final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
        return (float) (sumTotalTermFreq / (double) docCount);
    }
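For example, if a field contains sumTotalTermFreq = 30 term occurrences spread across docCount = 3 documents, then avgdl = 30 / 3 = 10.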

4. Computing the weight parameter

    /** Cache of decoded bytes. */
    private static final float[] OLD_LENGTH_TABLE = new float[256];
    private static final float[] LENGTH_TABLE = new float[256];

    static {
        for (int i = 1; i < 256; i++) {
            float f = SmallFloat.byte315ToFloat((byte) i);
            OLD_LENGTH_TABLE[i] = 1.0f / (f * f);
        }
        OLD_LENGTH_TABLE[0] = 1.0f / OLD_LENGTH_TABLE[255]; // otherwise inf

        for (int i = 0; i < 256; i++) {
            LENGTH_TABLE[i] = SmallFloat.byte4ToInt((byte) i);
        }
    }

    @Override
    public final SimWeight computeWeight(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
        Explanation idf = termStats.length == 1 ? idfExplain(collectionStats, termStats[0]) : idfExplain(collectionStats, termStats);
        float avgdl = avgFieldLength(collectionStats);
        float[] oldCache = new float[256];
        float[] cache = new float[256];
        for (int i = 0; i < cache.length; i++) {
            oldCache[i] = k1 * ((1 - b) + b * OLD_LENGTH_TABLE[i] / avgdl);
            cache[i] = k1 * ((1 - b) + b * LENGTH_TABLE[i] / avgdl);
        }
        return new BM25Stats(collectionStats.field(), boost, idf, avgdl, oldCache, cache);
    }

This corresponds to precomputing the length-dependent part of the BM25 denominator, k1 · (1 - b + b · |D| / avgdl), for each of the 256 possible encoded document lengths.

5. Computing weightValue

    BM25Stats(String field, float boost, Explanation idf, float avgdl, float[] oldCache, float[] cache) {
        this.field = field;
        this.boost = boost;
        this.idf = idf;
        this.avgdl = avgdl;
        this.weight = idf.getValue() * boost;
        this.oldCache = oldCache;
        this.cache = cache;
    }

    BM25DocScorer(BM25Stats stats, int indexCreatedVersionMajor, NumericDocValues norms) throws IOException {
        this.stats = stats;
        this.weightValue = stats.weight * (k1 + 1);
        this.norms = norms;
        if (indexCreatedVersionMajor >= 7) {
            lengthCache = LENGTH_TABLE;
            cache = stats.cache;
        } else {
            lengthCache = OLD_LENGTH_TABLE;
            cache = stats.oldCache;
        }
    }

This corresponds to the numerator factor of the formula that does not depend on the document: weightValue = IDF(q_i) · boost · (k1 + 1).

6. The overall score

    @Override
    public float score(int doc, float freq) throws IOException {
        // if there are no norms, we act as if b=0
        float norm;
        if (norms == null) {
            norm = k1;
        } else {
            if (norms.advanceExact(doc)) {
                norm = cache[((byte) norms.longValue()) & 0xFF];
            } else {
                norm = cache[0];
            }
        }
        return weightValue * freq / (freq + norm);
    }

Here norm is looked up in the cache, which, as computed in computeWeight above, holds k1 · (1 - b + b · |D| / avgdl) indexed by the document's encoded length. Substituting everything, score = IDF(q_i) · boost · (k1 + 1) · f(q_i, D) / (f(q_i, D) + k1 · (1 - b + b · |D| / avgdl)), and the complete formula emerges.
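Continuing the made-up numbers from section 3 (IDF ≈ 2.26, boost = 1, freq = 3, |D| = 10, avgdl = 20): weightValue = 2.26 · 1 · 2.2 ≈ 4.97, norm = 1.2 · (0.25 + 0.75 · 10 / 20) = 0.75, so score ≈ 4.97 · 3 / (3 + 0.75) ≈ 3.98, matching the value the formula gave directly.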

7. Digging deeper

The scoring inputs come from CollectionStatistics, TermStatistics, and freq. Where are those obtained?

    SynonymWeight(Query query, IndexSearcher searcher, float boost) throws IOException {
        super(query);
        CollectionStatistics collectionStats = searcher.collectionStatistics(terms[0].field()); // 1
        long docFreq = 0;
        long totalTermFreq = 0;
        termContexts = new TermContext[terms.length];
        for (int i = 0; i < termContexts.length; i++) {
            termContexts[i] = TermContext.build(searcher.getTopReaderContext(), terms[i]);
            TermStatistics termStats = searcher.termStatistics(terms[i], termContexts[i]); // 2
            docFreq = Math.max(termStats.docFreq(), docFreq);
            if (termStats.totalTermFreq() == -1) {
                totalTermFreq = -1;
            } else if (totalTermFreq != -1) {
                totalTermFreq += termStats.totalTermFreq();
            }
        }
        TermStatistics[] statics = new TermStatistics[terms.length];
        for (int i = 0; i < terms.length; i++) {
            TermStatistics pseudoStats = new TermStatistics(terms[i].bytes(), docFreq, totalTermFreq, query.getKeyword());
            statics[i] = pseudoStats;
        }
        this.similarity = searcher.getSimilarity(true);
        this.simWeight = similarity.computeWeight(boost, collectionStats, statics);
    }

The source of CollectionStatistics:

    /**
     * Returns {@link CollectionStatistics} for a field.
     *
     * This can be overridden for example, to return a field's statistics
     * across a distributed collection.
     * @lucene.experimental
     */
    public CollectionStatistics collectionStatistics(String field) throws IOException {
        final int docCount;
        final long sumTotalTermFreq;
        final long sumDocFreq;
        assert field != null;
        Terms terms = MultiFields.getTerms(reader, field);
        if (terms == null) {
            docCount = 0;
            sumTotalTermFreq = 0;
            sumDocFreq = 0;
        } else {
            docCount = terms.getDocCount();
            sumTotalTermFreq = terms.getSumTotalTermFreq();
            sumDocFreq = terms.getSumDocFreq();
        }
        return new CollectionStatistics(field, reader.maxDoc(), docCount, sumTotalTermFreq, sumDocFreq);
    }

The source of TermStatistics:

    /**
     * Returns {@link TermStatistics} for a term.
     *
     * This can be overridden for example, to return a term's statistics
     * across a distributed collection.
     * @lucene.experimental
     */
    public TermStatistics termStatistics(Term term, TermContext context) throws IOException {
        return new TermStatistics(term.bytes(), context.docFreq(), context.totalTermFreq(), term.text());
    }

The source of freq (tf):

    @Override
    protected float score(DisiWrapper topList) throws IOException {
        return similarity.score(topList.doc, tf(topList));
    }

    /** combines TF of all subs. */
    final int tf(DisiWrapper topList) throws IOException {
        int tf = 0;
        for (DisiWrapper w = topList; w != null; w = w.next) {
            tf += ((TermScorer) w.scorer).freq();
        }
        return tf;
    }

At the lowest level, the frequencies are decoded from the postings lists in Lucene50PostingsReader.BlockPostingsEnum:

    @Override
    public int nextDoc() throws IOException {
        if (docUpto == docFreq) {
            return doc = NO_MORE_DOCS;
        }
        if (docBufferUpto == BLOCK_SIZE) {
            refillDocs();
        }
        accum += docDeltaBuffer[docBufferUpto];
        freq = freqBuffer[docBufferUpto];
        posPendingCount += freq;
        docBufferUpto++;
        docUpto++;
        doc = accum;
        position = 0;
        return doc;
    }

8. Summary

BM25's full name is Okapi BM25. It is an extension of the binary independence model and can also be used for relevance ranking in search. This article walked through Lucene's BM25Similarity implementation to build a thorough understanding of the whole scoring formula.

On top of that, we analyzed how the parameters CollectionStatistics, TermStatistics, and freq are computed.

The takeaway from the whole analysis: to customize the scoring formula, you only need to extend Similarity or SimilarityBase and implement your own business-specific scoring, as sketched below.
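A minimal sketch of such a similarity, assuming the Lucene 7.x SimilarityBase API (DavidSimilarity is a hypothetical name, matching the factory used in the note below):

    import org.apache.lucene.search.similarities.BasicStats;
    import org.apache.lucene.search.similarities.SimilarityBase;

    public class DavidSimilarity extends SimilarityBase {
        @Override
        protected float score(BasicStats stats, float freq, float docLen) {
            // Toy business rule: rank by raw term frequency, ignoring idf and length.
            return stats.getBoost() * freq;
        }

        @Override
        public String toString() {
            return "DavidSimilarity";
        }
    }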

Note: a custom Similarity class cannot be used from Solr directly. Wrap it in a factory, put that factory in the org.apache.solr.search.similarities package, and configure managed-schema as follows:

<similarity class="solr.DavidSimilarityFactory"/>

Note that the class name here is solr.DavidSimilarityFactory, not org.apache.solr.search.similarities.DavidSimilarityFactory; the fully qualified form fails with a ClassNotFoundException.
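The factory itself can be as small as this (a sketch; DavidSimilarityFactory wraps the hypothetical DavidSimilarity above):

    package org.apache.solr.search.similarities;

    import org.apache.lucene.search.similarities.Similarity;
    import org.apache.solr.schema.SimilarityFactory;

    public class DavidSimilarityFactory extends SimilarityFactory {
        @Override
        public Similarity getSimilarity() {
            return new DavidSimilarity();
        }
    }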

References

[1] https://en.wikipedia.org/wiki/Okapi_BM25

[2] https://www.elastic.co/cn/blog/found-bm-vs-lucene-default-similarity

[3] http://www.blogjava.net/hoojo/archive/2012/09/06/387140.html

[4] https://cwiki.apache.org/confluence/display/GEODE/Lucene+Internals
