As an open-source project, Lucene has drawn an enormous response from the open-source community since its release. Programmers use it not only to build concrete full-text search applications, but also to integrate it into system software and web applications; even some commercial products have adopted Lucene as the core of their internal full-text retrieval subsystems. The Apache Software Foundation's own website uses Lucene as its full-text search engine; IBM's open-source Eclipse adopted Lucene as the full-text indexing engine for its help subsystem in version 2.1, and IBM's commercial product WebSphere uses Lucene as well. With its open-source nature, excellent index structure, and sound architecture, Lucene keeps gaining adoption.

As a full-text search engine, Lucene has the following notable strengths:

(1) The index file format is independent of the application platform. Lucene defines a set of index file formats based on 8-bit bytes, so that compatible systems, or applications on different platforms, can share the index files they build.

(2) On top of the inverted index used by traditional full-text engines, Lucene implements segmented indexing: new documents can be indexed quickly into small new index files, and merging them with the existing index later serves as the optimization step. A rough sketch of this workflow follows below.
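A minimal sketch of incremental indexing plus an explicit merge (directory, analyzer, and doc stand for objects set up as in the test class later in this article):

    IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
    try (IndexWriter writer = new IndexWriter(directory, cfg)) {
        writer.addDocument(doc); // written into a small, freshly created segment
        writer.forceMerge(1);    // merge all segments into one (the old "optimize")
    }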

(3) An excellent object-oriented architecture lowers the learning curve for extending Lucene and makes it easy to add new features.

(4) A text-analysis interface that is independent of language and file format. The indexer builds the index by consuming a stream of tokens, so supporting a new language or file format only requires implementing this text-analysis interface.

(5) A powerful query engine is implemented by default, so users gain strong query capabilities without writing any code of their own; out of the box, Lucene's query implementation supports Boolean operators, fuzzy search, grouped queries, and more, as the sketch below illustrates.
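A minimal sketch of the bundled query syntax ("fieldname" and analyzer are placeholders, and parse() throws ParseException):

    QueryParser parser = new QueryParser("fieldname", analyzer);
    Query boolQuery = parser.parse("+lucene +(score rank)"); // Boolean operators
    Query fuzzyQuery = parser.parse("lucene~1");             // fuzzy search with edit distance 1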

Lucene's index structure

1. Preparation

1.1 Download the latest source code: https://github.com/apache/lucene-solr

1.2 Build it with ant as described in the project documentation (I used "ant eclipse").

1.3 Import the built project into Eclipse, STS, or IDEA.

2. Create a test class

    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.KeywordTokenizer;
    import org.apache.lucene.analysis.ngram.NGramTokenFilter;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.similarities.BM25Similarity;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public void test() throws IOException, ParseException {
        Analyzer analyzer = new NGramAnalyzer();
        // Store the index in memory:
        Directory directory = new RAMDirectory();
        // To store an index on disk, use this instead:
        // Path path = FileSystems.getDefault().getPath("E:\\demo\\data", "access.data");
        // Directory directory = FSDirectory.open(path);
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter iwriter = new IndexWriter(directory, config);
        Document doc = new Document();
        String text = "我是中国人.";
        doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
        iwriter.addDocument(doc);
        iwriter.close();

        // Now search the index:
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        isearcher.setSimilarity(new BM25Similarity());
        // Parse a simple query:
        QueryParser parser = new QueryParser("fieldname", analyzer);
        Query query = parser.parse("中国,人");
        ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
        // Iterate through the results:
        for (int i = 0; i < hits.length; i++) {
            Document hitDoc = isearcher.doc(hits[i].doc);
            System.out.println(hitDoc.getFields().toString());
        }
        ireader.close();
        directory.close();
    }

    private static class NGramAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // Emit 1-4 character n-grams, keeping the original token as well:
            final Tokenizer tokenizer = new KeywordTokenizer();
            return new TokenStreamComponents(tokenizer, new NGramTokenFilter(tokenizer, 1, 4, true));
        }
    }

Tokenization uses the custom NGramAnalyzer above. It extends Analyzer, which analyzes text and converts it into a TokenStream. The Javadoc explains:

/**
* An Analyzer builds TokenStreams, which analyze text. It thus represents a
* policy for extracting index terms from text.
* <p>
* In order to define what analysis is done, subclasses must define their
* {@link TokenStreamComponents TokenStreamComponents} in {@link #createComponents(String)}.
* The components are then reused in each call to {@link #tokenStream(String, Reader)}.
* <p>
* Simple example:
* <pre class="prettyprint">
* Analyzer analyzer = new Analyzer() {
* {@literal @Override}
* protected TokenStreamComponents createComponents(String fieldName) {
* Tokenizer source = new FooTokenizer(reader);
* TokenStream filter = new FooFilter(source);
* filter = new BarFilter(filter);
* return new TokenStreamComponents(source, filter);
* }
* {@literal @Override}
* protected TokenStream normalize(TokenStream in) {
* // Assuming FooFilter is about normalization and BarFilter is about
* // stemming, only FooFilter should be applied
* return new FooFilter(in);
* }
* };
* </pre>
* For more examples, see the {@link org.apache.lucene.analysis Analysis package documentation}.
* <p>
* For some concrete implementations bundled with Lucene, look in the analysis modules:
* <ul>
* <li><a href="{@docRoot}/../analyzers-common/overview-summary.html">Common</a>:
* Analyzers for indexing content in different languages and domains.
* <li><a href="{@docRoot}/../analyzers-icu/overview-summary.html">ICU</a>:
* Exposes functionality from ICU to Apache Lucene.
* <li><a href="{@docRoot}/../analyzers-kuromoji/overview-summary.html">Kuromoji</a>:
* Morphological analyzer for Japanese text.
* <li><a href="{@docRoot}/../analyzers-morfologik/overview-summary.html">Morfologik</a>:
* Dictionary-driven lemmatization for the Polish language.
* <li><a href="{@docRoot}/../analyzers-phonetic/overview-summary.html">Phonetic</a>:
* Analysis for indexing phonetic signatures (for sounds-alike search).
* <li><a href="{@docRoot}/../analyzers-smartcn/overview-summary.html">Smart Chinese</a>:
* Analyzer for Simplified Chinese, which indexes words.
* <li><a href="{@docRoot}/../analyzers-stempel/overview-summary.html">Stempel</a>:
* Algorithmic Stemmer for the Polish Language.
* </ul>
*
* @since 3.1
*/
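To see what tokens the NGramAnalyzer above actually produces, you can print its token stream (a minimal sketch; the expected output in the comment assumes 1-4 character n-grams with preserveOriginal=true):

    // needs org.apache.lucene.analysis.TokenStream and
    // org.apache.lucene.analysis.tokenattributes.CharTermAttribute
    try (TokenStream ts = new NGramAnalyzer().tokenStream("fieldname", "中国人")) {
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print(term + " "); // expected: 中 中国 中国人 国 国人 人
        }
        ts.end();
    }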

ClassicSimilarity is a wrapper around TFIDFSimilarity, which is an abstract class and therefore cannot be instantiated directly. That algorithm was the default scoring implementation in early versions of Lucene.

Put the test class into the lucene-solr source tree and debug it there. If you want to analyze the TF-IDF algorithm instead, simply new a ClassicSimilarity and set it on the IndexSearcher; other similarities work the same way.
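For example (a two-line sketch, reusing the ireader from the test class):

    IndexSearcher isearcher = new IndexSearcher(ireader);
    isearcher.setSimilarity(new ClassicSimilarity()); // TF-IDF scoring instead of BM25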

3. Introduction to the algorithm

Newer versions of Lucene use BM25Similarity as the default scoring implementation; the test class above also sets it explicitly. The algorithm is described in detail in [1]. Briefly, the score of document D for query Q is:

    score(D, Q) = Σ_i IDF(q_i) · f(q_i, D) · (k1 + 1) / (f(q_i, D) + k1 · (1 - b + b · |D| / avgdl))

where:

D is the document and Q the query; score(D, Q) is the score of document D under query Q, and the q_i are the terms Q is tokenized into.

IDF is the inverse document frequency of a term:

    IDF(q_i) = ln(1 + (N - n(q_i) + 0.5) / (n(q_i) + 0.5))

where N is the total number of documents and n(q_i) is the number of documents in which term q_i appears.

f(q_i, D) is the frequency of term q_i in document D.

k1 and b are tunable parameters, with default values 1.2 and 0.75.

|D| is the length of the document in words, and avgdl is the average document length in the collection.
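As a quick sanity check with made-up numbers: suppose N = 100 documents, the term occurs in n(q_i) = 10 of them, f(q_i, D) = 3, |D| = 10, avgdl = 20, and the defaults k1 = 1.2, b = 0.75. Then IDF(q_i) = ln(1 + 90.5 / 10.5) ≈ 2.26, the denominator is 3 + 1.2 · (0.25 + 0.75 · 10 / 20) = 3.75, and the term contributes 2.26 · (3 · 2.2) / 3.75 ≈ 3.98 to the score.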

4. Implementation

1. IDF

Single-term IDF:

    /** Implemented as <code>log(1 + (docCount - docFreq + 0.5)/(docFreq + 0.5))</code>. */
    protected float idf(long docFreq, long docCount) {
        return (float) Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }

Summing IDF over a set of terms (computeWeight, shown in full in section 4 below, calls this):

    /**
     * Computes a score factor for a phrase.
     *
     * <p>
     * The default implementation sums the idf factor for
     * each term in the phrase.
     *
     * @param collectionStats collection-level statistics
     * @param termStats term-level statistics for the terms in the phrase
     * @return an Explain object that includes both an idf
     *         score factor for the phrase and an explanation
     *         for each term.
     */
    public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats[]) {
        double idf = 0d; // sum into a double before casting into a float
        List<Explanation> details = new ArrayList<>();
        for (final TermStatistics stat : termStats) {
            Explanation idfExplain = idfExplain(collectionStats, stat);
            details.add(idfExplain);
            idf += idfExplain.getValue();
        }
        return Explanation.match((float) idf, "idf(), sum of:", details);
    }

2. The k1 and b parameters

    public BM25Similarity(float k1, float b) {
        if (Float.isFinite(k1) == false || k1 < 0) {
            throw new IllegalArgumentException("illegal k1 value: " + k1 + ", must be a non-negative finite value");
        }
        if (Float.isNaN(b) || b < 0 || b > 1) {
            throw new IllegalArgumentException("illegal b value: " + b + ", must be between 0 and 1");
        }
        this.k1 = k1;
        this.b = b;
    }

    /** BM25 with these default values:
     * <ul>
     *   <li>{@code k1 = 1.2}</li>
     *   <li>{@code b = 0.75}</li>
     * </ul>
     */
    public BM25Similarity() {
        this(1.2f, 0.75f);
    }
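If the defaults do not suit your data, pass your own values (a minimal sketch; the constants are arbitrary):

    // weaker length normalization (b = 0.3), higher tf saturation point (k1 = 1.5)
    isearcher.setSimilarity(new BM25Similarity(1.5f, 0.3f));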

3. Computing the average document length avgdl

    /** The default implementation computes the average as <code>sumTotalTermFreq / docCount</code> */
    protected float avgFieldLength(CollectionStatistics collectionStats) {
        final long sumTotalTermFreq;
        if (collectionStats.sumTotalTermFreq() == -1) {
            // frequencies are omitted (tf=1), its # of postings
            if (collectionStats.sumDocFreq() == -1) {
                // theoretical case only: remove!
                return 1f;
            }
            sumTotalTermFreq = collectionStats.sumDocFreq();
        } else {
            sumTotalTermFreq = collectionStats.sumTotalTermFreq();
        }
        final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
        return (float) (sumTotalTermFreq / (double) docCount);
    }
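For example, if a field contains sumTotalTermFreq = 30 term occurrences spread across docCount = 3 documents, then avgdl = 30 / 3 = 10.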

4. Computing the weight parameter

    /** Cache of decoded bytes. */
    private static final float[] OLD_LENGTH_TABLE = new float[256];
    private static final float[] LENGTH_TABLE = new float[256];

    static {
        for (int i = 1; i < 256; i++) {
            float f = SmallFloat.byte315ToFloat((byte) i);
            OLD_LENGTH_TABLE[i] = 1.0f / (f * f);
        }
        OLD_LENGTH_TABLE[0] = 1.0f / OLD_LENGTH_TABLE[255]; // otherwise inf

        for (int i = 0; i < 256; i++) {
            LENGTH_TABLE[i] = SmallFloat.byte4ToInt((byte) i);
        }
    }

    @Override
    public final SimWeight computeWeight(float boost, CollectionStatistics collectionStats, TermStatistics... termStats) {
        Explanation idf = termStats.length == 1 ? idfExplain(collectionStats, termStats[0]) : idfExplain(collectionStats, termStats);
        float avgdl = avgFieldLength(collectionStats);
        float[] oldCache = new float[256];
        float[] cache = new float[256];
        for (int i = 0; i < cache.length; i++) {
            oldCache[i] = k1 * ((1 - b) + b * OLD_LENGTH_TABLE[i] / avgdl);
            cache[i] = k1 * ((1 - b) + b * LENGTH_TABLE[i] / avgdl);
        }
        return new BM25Stats(collectionStats.field(), boost, idf, avgdl, oldCache, cache);
    }

This corresponds to precomputing the length-dependent part of the BM25 denominator, k1 · (1 - b + b · |D| / avgdl), for each of the 256 possible encoded document lengths.

5. Computing weightValue

    BM25Stats(String field, float boost, Explanation idf, float avgdl, float[] oldCache, float[] cache) {
        this.field = field;
        this.boost = boost;
        this.idf = idf;
        this.avgdl = avgdl;
        this.weight = idf.getValue() * boost;
        this.oldCache = oldCache;
        this.cache = cache;
    }

    BM25DocScorer(BM25Stats stats, int indexCreatedVersionMajor, NumericDocValues norms) throws IOException {
        this.stats = stats;
        this.weightValue = stats.weight * (k1 + 1);
        this.norms = norms;
        if (indexCreatedVersionMajor >= 7) {
            lengthCache = LENGTH_TABLE;
            cache = stats.cache;
        } else {
            lengthCache = OLD_LENGTH_TABLE;
            cache = stats.oldCache;
        }
    }

This corresponds to the numerator factor of the formula that does not depend on the document: weightValue = IDF(q_i) · boost · (k1 + 1).

6. The overall score

    @Override
    public float score(int doc, float freq) throws IOException {
        // if there are no norms, we act as if b=0
        float norm;
        if (norms == null) {
            norm = k1;
        } else {
            if (norms.advanceExact(doc)) {
                norm = cache[((byte) norms.longValue()) & 0xFF];
            } else {
                norm = cache[0];
            }
        }
        return weightValue * freq / (freq + norm);
    }

Here norm is looked up in the cache, which, as computed in computeWeight above, holds k1 · (1 - b + b · |D| / avgdl) indexed by the document's encoded length. Substituting everything, score = IDF(q_i) · boost · (k1 + 1) · f(q_i, D) / (f(q_i, D) + k1 · (1 - b + b · |D| / avgdl)), and the complete formula emerges.
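Continuing the made-up numbers from section 3 (IDF ≈ 2.26, boost = 1, freq = 3, |D| = 10, avgdl = 20): weightValue = 2.26 · 1 · 2.2 ≈ 4.97, norm = 1.2 · (0.25 + 0.75 · 10 / 20) = 0.75, so score ≈ 4.97 · 3 / (3 + 0.75) ≈ 3.98, matching the value the formula gave directly.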

7. Digging deeper

The scoring inputs come from CollectionStatistics, TermStatistics, and freq. Where are those obtained?

    SynonymWeight(Query query, IndexSearcher searcher, float boost) throws IOException {
        super(query);
        CollectionStatistics collectionStats = searcher.collectionStatistics(terms[0].field()); // 1
        long docFreq = 0;
        long totalTermFreq = 0;
        termContexts = new TermContext[terms.length];
        for (int i = 0; i < termContexts.length; i++) {
            termContexts[i] = TermContext.build(searcher.getTopReaderContext(), terms[i]);
            TermStatistics termStats = searcher.termStatistics(terms[i], termContexts[i]); // 2
            docFreq = Math.max(termStats.docFreq(), docFreq);
            if (termStats.totalTermFreq() == -1) {
                totalTermFreq = -1;
            } else if (totalTermFreq != -1) {
                totalTermFreq += termStats.totalTermFreq();
            }
        }
        TermStatistics[] statics = new TermStatistics[terms.length];
        for (int i = 0; i < terms.length; i++) {
            TermStatistics pseudoStats = new TermStatistics(terms[i].bytes(), docFreq, totalTermFreq, query.getKeyword());
            statics[i] = pseudoStats;
        }
        this.similarity = searcher.getSimilarity(true);
        this.simWeight = similarity.computeWeight(boost, collectionStats, statics);
    }

The source of CollectionStatistics:

    /**
     * Returns {@link CollectionStatistics} for a field.
     *
     * This can be overridden for example, to return a field's statistics
     * across a distributed collection.
     * @lucene.experimental
     */
    public CollectionStatistics collectionStatistics(String field) throws IOException {
        final int docCount;
        final long sumTotalTermFreq;
        final long sumDocFreq;
        assert field != null;
        Terms terms = MultiFields.getTerms(reader, field);
        if (terms == null) {
            docCount = 0;
            sumTotalTermFreq = 0;
            sumDocFreq = 0;
        } else {
            docCount = terms.getDocCount();
            sumTotalTermFreq = terms.getSumTotalTermFreq();
            sumDocFreq = terms.getSumDocFreq();
        }
        return new CollectionStatistics(field, reader.maxDoc(), docCount, sumTotalTermFreq, sumDocFreq);
    }

The source of TermStatistics:

    /**
     * Returns {@link TermStatistics} for a term.
     *
     * This can be overridden for example, to return a term's statistics
     * across a distributed collection.
     * @lucene.experimental
     */
    public TermStatistics termStatistics(Term term, TermContext context) throws IOException {
        return new TermStatistics(term.bytes(), context.docFreq(), context.totalTermFreq(), term.text());
    }

The source of freq (tf):

    @Override
    protected float score(DisiWrapper topList) throws IOException {
        return similarity.score(topList.doc, tf(topList));
    }

    /** combines TF of all subs. */
    final int tf(DisiWrapper topList) throws IOException {
        int tf = 0;
        for (DisiWrapper w = topList; w != null; w = w.next) {
            tf += ((TermScorer) w.scorer).freq();
        }
        return tf;
    }

At the lowest level, the frequencies are decoded from the postings lists in Lucene50PostingsReader.BlockPostingsEnum:

    @Override
    public int nextDoc() throws IOException {
        if (docUpto == docFreq) {
            return doc = NO_MORE_DOCS;
        }
        if (docBufferUpto == BLOCK_SIZE) {
            refillDocs();
        }
        accum += docDeltaBuffer[docBufferUpto];
        freq = freqBuffer[docBufferUpto];
        posPendingCount += freq;
        docBufferUpto++;
        docUpto++;
        doc = accum;
        position = 0;
        return doc;
    }

8. Summary

BM25's full name is Okapi BM25. It is an extension of the binary independence model and can also be used for relevance ranking in search. This article walked through Lucene's BM25Similarity implementation to build a thorough understanding of the whole scoring formula.

On top of that, we analyzed how the parameters CollectionStatistics, TermStatistics, and freq are computed.

The takeaway from the whole analysis: to customize the scoring formula, you only need to extend Similarity or SimilarityBase and implement your own business-specific scoring, as sketched below.
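A minimal sketch of such a similarity, assuming the Lucene 7.x SimilarityBase API (DavidSimilarity is a hypothetical name, matching the factory used in the note below):

    import org.apache.lucene.search.similarities.BasicStats;
    import org.apache.lucene.search.similarities.SimilarityBase;

    public class DavidSimilarity extends SimilarityBase {
        @Override
        protected float score(BasicStats stats, float freq, float docLen) {
            // Toy business rule: rank by raw term frequency, ignoring idf and length.
            return stats.getBoost() * freq;
        }

        @Override
        public String toString() {
            return "DavidSimilarity";
        }
    }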

Note: a custom Similarity class cannot be used from Solr directly. Wrap it in a factory, put that factory in the org.apache.solr.search.similarities package, and configure managed-schema as follows:

<similarity class="solr.DavidSimilarityFactory"/>

Note that the class name here is solr.DavidSimilarityFactory, not org.apache.solr.search.similarities.DavidSimilarityFactory; the fully qualified form fails with a ClassNotFoundException.
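The factory itself can be as small as this (a sketch; DavidSimilarityFactory wraps the hypothetical DavidSimilarity above):

    package org.apache.solr.search.similarities;

    import org.apache.lucene.search.similarities.Similarity;
    import org.apache.solr.schema.SimilarityFactory;

    public class DavidSimilarityFactory extends SimilarityFactory {
        @Override
        public Similarity getSimilarity() {
            return new DavidSimilarity();
        }
    }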

References

[1] https://en.wikipedia.org/wiki/Okapi_BM25

[2] https://www.elastic.co/cn/blog/found-bm-vs-lucene-default-similarity

[3] http://www.blogjava.net/hoojo/archive/2012/09/06/387140.html

[4] https://cwiki.apache.org/confluence/display/GEODE/Lucene+Internals
