ES search ranking and an introduction to document relevance scoring: Vector Space Model
Vector Space Model
The vector space model provides a way of comparing a multiterm query against a document. The output is a single score that represents how well the document matches the query. In order to do this, the model represents both the document and the query as vectors.
A vector is really just a one-dimensional array containing numbers, for example:
[1,2,5,22,3,8]
In the vector space model, each number in the vector is the weight of a term, as calculated with term frequency/inverse document frequency.

While TF/IDF is the classic way of calculating term weights for the vector space model, it is not the only way. Other models like Okapi-BM25 exist and are available in Elasticsearch. TF/IDF was the default in Elasticsearch before version 5.0 because it is a simple, efficient algorithm that produces high-quality search results and has stood the test of time; from version 5.0 onward, BM25 is the default similarity.
Imagine that we have a query for “happy hippopotamus.” A common word like happy will have a low weight, while an uncommon term like hippopotamus will have a high weight. Let’s assume that happy has a weight of 2 and hippopotamus has a weight of 5. We can plot this simple two-dimensional vector, [2,5], as a line on a graph starting at point (0,0) and ending at point (2,5), as shown in Figure 27, “A two-dimensional query vector for “happy hippopotamus” represented”.
Figure 27. A two-dimensional query vector for “happy hippopotamus” represented

Now, imagine we have three documents:
- I am happy in summer.
- After Christmas I’m a hippopotamus.
- The happy hippopotamus helped Harry.
We can create a similar vector for each document, consisting of the weight of each query term (happy and hippopotamus) that appears in the document, and plot these vectors on the same graph, as shown in Figure 28, “Query and document vectors for “happy hippopotamus””:
- Document 1: (happy,____________), vector [2,0]
- Document 2: ( ___ ,hippopotamus), vector [0,5]
- Document 3: (happy,hippopotamus), vector [2,5]
Figure 28. Query and document vectors for “happy hippopotamus”
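The construction of these document vectors can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the weights 2 and 5 are the ones assumed in the example (a real index would derive them from TF/IDF), and the names `QUERY_WEIGHTS`, `tokenize`, and `document_vector` are hypothetical helpers, not part of Elasticsearch or Lucene.

```python
import re

# Weights taken from the example in the text; in a real index they
# would be computed from TF/IDF rather than hard-coded.
QUERY_WEIGHTS = {"happy": 2, "hippopotamus": 5}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a toy analyzer)."""
    return re.findall(r"[a-z]+", text.lower())


def document_vector(text, weights=QUERY_WEIGHTS):
    """Return one entry per query term: its weight if the term occurs, else 0."""
    tokens = set(tokenize(text))
    return [weight if term in tokens else 0 for term, weight in weights.items()]


documents = [
    "I am happy in summer.",
    "After Christmas I'm a hippopotamus.",
    "The happy hippopotamus helped Harry.",
]

for number, text in enumerate(documents, start=1):
    print(f"Document {number}: {document_vector(text)}")
```

Running the sketch prints `[2, 0]`, `[0, 5]`, and `[2, 5]` for the three documents, matching the vectors listed above.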

The nice thing about vectors is that they can be compared. By measuring the angle between the query vector and the document vector, it is possible to assign a relevance score to each document. The angle between document 1 and the query is large, so it is of low relevance. Document 2 is closer to the query, meaning that it is reasonably relevant, and document 3 is a perfect match.
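To make the comparison concrete, the angles for the two-dimensional example can be computed directly. The sketch below is an illustration only, not how Lucene scores documents; `angle_degrees` is a hypothetical helper that works for two-dimensional vectors, and the vectors are the ones from the example.

```python
import math


def angle_degrees(a, b):
    """Angle between two 2D vectors in degrees (0 means the same direction)."""
    dot = a[0] * b[0] + a[1] * b[1]
    cross = a[0] * b[1] - a[1] * b[0]
    return abs(math.degrees(math.atan2(cross, dot)))


query = [2, 5]
documents = {
    "Document 1": [2, 0],
    "Document 2": [0, 5],
    "Document 3": [2, 5],
}

# A smaller angle means the document points in a direction closer to the query.
ranked = sorted(documents, key=lambda name: angle_degrees(query, documents[name]))
for name in ranked:
    print(f"{name}: {angle_degrees(query, documents[name]):.1f} degrees")
```

With these weights, document 3 sits at 0 degrees from the query, document 2 at roughly 22 degrees, and document 1 at roughly 68 degrees, which gives the ranking described above.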

In practice, only two-dimensional vectors (queries with two terms) can be plotted easily on a graph. Fortunately, linear algebra—the branch of mathematics that deals with vectors—provides tools to compare the angle between multidimensional vectors, which means that we can apply the same principles explained above to queries that consist of many terms.
You can read more about how to compare two vectors by using cosine similarity.
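For vectors with any number of dimensions, the cosine of the angle is the dot product of the two vectors divided by the product of their lengths. A minimal sketch of this textbook formula follows; it is not Lucene's actual scoring function, `cosine_similarity` is a hypothetical helper, and the three-term weights in the usage lines are invented purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    1.0 means the vectors point in the same direction, 0.0 means they
    share no weighted terms. An all-zero vector is treated as similarity 0.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical three-term query and two document vectors.
query = [2, 5, 3]
doc_a = [2, 0, 3]  # contains the first and third query terms
doc_b = [0, 5, 0]  # contains only the second query term

print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

Because the function only uses sums over matching positions, the same code handles a query with two terms or with two hundred.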
Now that we have talked about the theoretical basis of scoring, we can move on to see how scoring is implemented in Lucene.