Solr Similarity Algorithms, Part 2: Okapi BM25
In information retrieval, Okapi BM25 (BM stands for Best Matching) is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.
The name of the actual ranking function is BM25. To set the right context, however, it is usually referred to as "Okapi BM25", since the Okapi information retrieval system, implemented at London's City University in the 1980s and 1990s, was the first system to implement this function.
BM25, and its newer variants, e.g. BM25F (a version of BM25 that can take document structure and anchor text into account), represent state-of-the-art TF-IDF-like retrieval functions used in document retrieval, such as web search.
The ranking function
BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters. One of the most prominent instantiations of the function is as follows.
Given a query $Q$, containing keywords $q_1, \ldots, q_n$, the BM25 score of a document $D$ is:

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

where $f(q_i, D)$ is $q_i$'s term frequency in the document $D$, $|D|$ is the length of the document $D$ in words, and $\text{avgdl}$ is the average document length in the text collection from which documents are drawn. $k_1$ and $b$ are free parameters, usually chosen, in absence of an advanced optimization, as $k_1 \in [1.2, 2.0]$ and $b = 0.75$.[1] $\text{IDF}(q_i)$ is the IDF (inverse document frequency) weight of the query term $q_i$. It is usually computed as:

$$\text{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$

where $N$ is the total number of documents in the collection, and $n(q_i)$ is the number of documents containing $q_i$.
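The scoring function described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by Solr or any other engine: it assumes the commonly cited IDF form log((N - n + 0.5) / (n + 0.5)), documents that are already tokenized into lists of words, and the defaults k1 = 1.2 and b = 0.75. The names `bm25_scores` and `docs` are made up for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one BM25 score per tokenized document in `docs`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs  # average document length

    # n(q): number of documents that contain each term at least once
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)  # f(q, D) for every term in this document
        length_norm = 1 - b + b * len(d) / avgdl
        score = 0.0
        for q in query_terms:
            f = tf[q]
            if f == 0:
                continue  # a term that is absent contributes nothing
            idf = math.log((n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores


# Toy collection (hypothetical data, for illustration only)
docs = [
    "the quick brown fox".split(),
    "the lazy dog sleeps all day".split(),
    "a fox and a dog".split(),
    "nothing relevant here".split(),
    "birds sing in the morning".split(),
]
scores = bm25_scores(["fox"], docs)
```

In this toy collection both documents that contain "fox" contain it once, so the shorter one (the first) ends up with the higher score: the length normalization controlled by b is the only thing separating them.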
There are several interpretations for IDF and slight variations on its formula. In the original BM25 derivation, the IDF component is derived from the Binary Independence Model.
Note that the above IDF formula has a potentially major drawback when applied to terms appearing in more than half of the corpus documents. The IDF of such terms is negative, so for any two almost-identical documents, one containing the term and one not, the latter may get the larger score. In other words, terms appearing in more than half of the corpus contribute negatively to the final document score. This is often undesirable, so many real-world applications treat the IDF formula differently:
- Each summand can be given a floor of 0, to trim out common terms;
- The IDF function can be given a floor of a constant $\epsilon$, to avoid common terms being ignored at all;
- The IDF function can be replaced with a similarly shaped one which is non-negative, or strictly positive to avoid terms being ignored at all.
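The three workarounds listed above can be compared side by side. The sketch below assumes the IDF form log((N - n + 0.5) / (n + 0.5)); the value of `eps` is an arbitrary illustrative choice, and `idf_positive` (adding 1 inside the logarithm) is just one possible strictly positive replacement, not the only one. Because the term-frequency factor of BM25 is never negative, flooring the IDF at 0 has the same effect as flooring each summand at 0.

```python
import math


def idf_raw(n_docs, df):
    """IDF as commonly given for BM25; negative when df > n_docs / 2."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def idf_floor_zero(n_docs, df):
    """Floor at 0: very common terms are simply ignored."""
    return max(0.0, idf_raw(n_docs, df))


def idf_floor_eps(n_docs, df, eps=0.01):
    """Floor at a small constant so common terms still count a little.

    eps=0.01 is a hypothetical value chosen for this example.
    """
    return max(eps, idf_raw(n_docs, df))


def idf_positive(n_docs, df):
    """A similarly shaped, strictly positive variant (assumed form)."""
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))


# A term that occurs in 8 of 10 documents
common = {
    "raw": idf_raw(10, 8),
    "floor_zero": idf_floor_zero(10, 8),
    "floor_eps": idf_floor_eps(10, 8),
    "positive": idf_positive(10, 8),
}
```

For the common term, `raw` is negative, `floor_zero` is exactly 0, `floor_eps` equals the chosen constant, and `positive` stays above 0. For rare terms all four functions still rank terms in the same order.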
IDF information theoretic interpretation
Here is an interpretation from information theory. Suppose a query term $q$ appears in $n(q)$ documents. Then a randomly picked document $D$ will contain the term with probability $\frac{n(q)}{N}$ (where $N$ is again the cardinality of the set of documents in the collection). Therefore, the information content of the message "$D$ contains $q$" is:

$$-\log \frac{n(q)}{N} = \log \frac{N}{n(q)}$$

Now suppose we have two query terms $q_1$ and $q_2$. If the two terms occur in documents entirely independently of each other, then the probability of seeing both $q_1$ and $q_2$ in a randomly picked document $D$ is:

$$\frac{n(q_1)}{N} \cdot \frac{n(q_2)}{N}$$

and the information content of such an event is:

$$\sum_{i=1}^{2} \log \frac{N}{n(q_i)}$$
With a small variation, this is exactly what is expressed by the IDF component of BM25.
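The additivity argument can be checked numerically. The following sketch uses made-up collection statistics (`N`, `n_q1`, `n_q2`) and measures information in nats. It shows that, under the independence assumption, the information content of seeing both terms equals the sum of the per-term values.

```python
import math


def information_content(n_docs, df):
    """Self-information (in nats) of the message 'D contains q'."""
    return -math.log(df / n_docs)


# Hypothetical statistics: a rare term and a fairly common one
N = 1000
n_q1, n_q2 = 10, 200

# Probability that a random document contains both terms,
# assuming the terms occur independently of each other
p_both = (n_q1 / N) * (n_q2 / N)

joint = -math.log(p_both)
separate = information_content(N, n_q1) + information_content(N, n_q2)
```

`joint` and `separate` agree up to floating-point error, and the rarer term carries more information than the common one, which is the behaviour an IDF weight is meant to capture.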
Modifications
- At the extreme values of the coefficient $b$, BM25 turns into ranking functions known as BM11 (for $b = 1$) and BM15 (for $b = 0$).[2]
- BM25F[3] is a modification of BM25 in which the document is considered to be composed from several fields (such as headlines, main text, anchor text) with possibly different degrees of importance.
- BM25+[4] is an extension of BM25. BM25+ was developed to address one deficiency of the standard BM25 in which the component of term frequency normalization by document length is not properly lower-bounded; as a result of this deficiency, long documents which do match the query term can often be scored unfairly by BM25 as having a similar relevancy to shorter documents that do not contain the query term at all. The scoring formula of BM25+ only has one additional free parameter $\delta$ (a default value is $1.0$ in absence of training data) as compared with BM25:

  $$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \left[ \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} + \delta \right]$$
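The effect of the parameter b and of the BM25+ lower bound can be seen by isolating the term-frequency factor. The sketch below is illustrative only: `tf_component` is a hypothetical helper, and it assumes that the BM25+ constant `delta` is added only for terms that actually occur in the document, since that is what keeps a matching long document above a non-matching one.

```python
def tf_component(f, doc_len, avgdl, k1=1.2, b=0.75, delta=0.0):
    """Term-frequency factor of BM25 (delta=0) or BM25+ (delta>0).

    b=1 gives the BM11 behaviour (full length normalization),
    b=0 gives the BM15 behaviour (no length normalization).
    """
    if f == 0:
        return 0.0  # assumption: absent terms get no delta bonus
    norm = 1 - b + b * doc_len / avgdl
    return f * (k1 + 1) / (f + k1 * norm) + delta


avgdl = 100

# One match inside a document 1000 times longer than average
bm25_long = tf_component(1, 100_000, avgdl)
bm25plus_long = tf_component(1, 100_000, avgdl, delta=1.0)

# BM15 (b=0) ignores document length; BM11 (b=1) penalizes it fully
bm15_short = tf_component(2, 10, avgdl, b=0)
bm15_long = tf_component(2, 1000, avgdl, b=0)
bm11_short = tf_component(2, 10, avgdl, b=1)
bm11_long = tf_component(2, 1000, avgdl, b=1)
```

With plain BM25 the factor for the very long document is close to 0, nearly indistinguishable from a document that does not contain the term. With `delta=1.0` it stays above 1, so a match is always rewarded.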