ES搜索排序,文档相关度评分介绍——TF-IDF—term frequency, inverse document frequency, and field-length norm—are calculated and stored at index time.
Theory Behind Relevance Scoring
Lucene (and thus Elasticsearch) uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model but adds more-modern features like a coordination factor, field length normalization, and term or query clause boosting.
Don’t be alarmed! These concepts are not as complicated as the names make them appear. While this section mentions algorithms, formulae, and mathematical models, it is intended for consumption by mere humans. Understanding the algorithms themselves is not as important as understanding the factors that influence the outcome.
Boolean Model
The Boolean model simply applies the AND, OR, and NOT conditions expressed in the query to find all the documents that match. A query for
full AND text AND search AND (elasticsearch OR lucene)
will include only documents that contain all of the terms full, text, and search, and eitherelasticsearch or lucene.
This process is simple and fast. It is used to exclude any documents that cannot possibly match the query.
Term Frequency/Inverse Document Frequency (TF/IDF)
Once we have a list of matching documents, they need to be ranked by relevance. Not all documents will contain all the terms, and some terms are more important than others. The relevance score of the whole document depends (in part) on the weight of each query term that appears in that document.
The weight of a term is determined by three factors, which we already introduced in What Is Relevance?. The formulae are included for interest’s sake, but you are not required to remember them.
Term frequency
How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention. The term frequency is calculated as follows:
tf(t in d) = √frequency
![]()
|
|
The term frequency ( |
If you don’t care about how often a term appears in a field, and all you care about is that the term is present, then you can disable term frequencies in the field mapping:
PUT /my_index
{
"mappings": {
"doc": {
"properties": {
"text": {
"type": "string",
"index_options": "docs"
}
}
}
}
}
|
|
Setting |
Inverse document frequency
How often does the term appear in all documents in the collection? The more often, the lower the weight. Common terms like and or the contribute little to relevance, as they appear in most documents, while uncommon terms like elastic or hippopotamus help us zoom in on the most interesting documents. The inverse document frequency is calculated as follows:
idf(t) = 1 + log ( numDocs / (docFreq + 1))
![]()
|
|
The inverse document frequency ( |
ES搜索排序,文档相关度评分介绍——TF-IDF—term frequency, inverse document frequency, and field-length norm—are calculated and stored at index time.的更多相关文章
- ES搜索排序,文档相关度评分介绍——Vector Space Model
Vector Space Model The vector space model provides a way of comparing a multiterm query against a do ...
- ES搜索排序,文档相关度评分介绍——Field-length norm
Field-length norm How long is the field? The shorter the field, the higher the weight. If a term app ...
- ES 文档与索引介绍
在之前的文章中,介绍了 ES 整体的架构和内容,这篇主要针对 ES 最小的存储单位 - 文档以及由文档组成的索引进行详细介绍. 会涉及到如下的内容: 文档的 CURD 操作. Dynamic Mapp ...
- es搜索排序不正确
沿用该文章里的数据https://www.cnblogs.com/MRLL/p/12691763.html 查询时发现,一模一样的name,但是相关度不一样 GET /z_test/doc/_sear ...
- ES-PHP向ES批量添加文档报No alive nodes found in your cluster
ES-PHP向ES批量添加文档报No alive nodes found in your cluster 2016年12月14日 12:31:40 阅读数:2668 参考文章phpcurl 请求Chu ...
- atitit.vod search doc.doc 点播系统搜索功能设计文档
atitit.vod search doc.doc 点播系统搜索功能设计文档 按键的enter事件1 Left rig事件1 Up down事件2 key_events.key_search = fu ...
- es之对文档进行更新操作
5.7.1:更新整个文档 ES中并不存在所谓的更新操作,而是用新文档替换旧文档: 在内部,Elasticsearch已经标记旧文档为删除并添加了一个完整的新文档并建立索引.旧版本文档不会立即消失 ,但 ...
- MongoDB中的映射,限制记录和记录拼排序 文档的插入查询更新删除操作
映射 在 MongoDB 中,映射(Projection)指的是只选择文档中的必要数据,而非全部数据.如果文档有 5 个字段,而你只需要显示 3 个,则只需选择 3 个字段即可. find() 方法 ...
- rbac介绍、自动生成接口文档、jwt介绍与快速签发认证、jwt定制返回格式
今日内容概要 RBAC 自动生成接口文档 jwt介绍与快速使用 jwt定制返回格式 jwt源码分析 内容详细 1.RBAC(重要) # RBAC 是基于角色的访问控制(Role-Based Acces ...
随机推荐
- pythonkeywordis与 ==的差别
pythonkeywordis与 ==的差别 近期在学习Python.总结一下小知识点. Python中的对象包括三要素:id.type.value 当中id用来唯一标识一个对象.type标识对象的类 ...
- MySQL之慢查询-删除慢查询日志
一.环境 OS:CentOS release 5.8(64位) DB:MySQL5.5.17 二.操作 直接通过命令 rm -f 删除了慢查询日志 三.出现故障 慢查询日志没有自己主 ...
- 使用 mybatis + flying + 双向相关建模 的电商后端
代码地址如下:http://www.demodashi.com/demo/12468.html mybatis.flying 众所周知,mybatis 虽然易于上手,但放到互联网环境下使用时,不可避免 ...
- 我如何添加一个空目录到Git仓库?
新建了一个仓库,只是创建一些目录结构,还不里面放什么,要放的内容还没有,还不存在,应该怎么办呢? Git 是不跟踪空目录的,所以需要跟踪那么就需要添加文件! 也就是说 Git 中不存在真正意义上的空目 ...
- Vue.js 很好,但会比 Angular 或 React 更好吗?
文章转自:http://www.oschina.net/translate/vuejs-is-good-but-is-it-better-than-angular-or-rea Vue.js 是一个用 ...
- 双十一前4小时,CentOS 6.5server启动错误排查
11月10日晚上8点多.眼看要到双十一了... 但我要说的这段经历却和双十一毫无关系.哈哈. 这天准备向CentOS6.5server的svn上传一些文件,结果开机启动时,却出现了以下的界面: 这是肿 ...
- BC 1.2 模式(Battery Charging Specification 1.2)
转自:http://blog.csdn.net/liglei 转自:http://blog.csdn.net/liglei/article/details/22852755 USB BC1.2有以下三 ...
- 并发错误:事务(进程 ID )与另一个进程已被死锁在 lock 资源上,且该事务已被选作死锁牺牲品
这个是并发情况下导致的数据库事务错误,先介绍下背景. 背景 springboot+springmvc+sqlserver+mybatis 一个controller里有五六个接口,这些接口都用到了spr ...
- leetCode 61.Rotate List (旋转链表) 解题思路和方法
Rotate List Given a list, rotate the list to the right by k places, where k is non-negative. For ex ...
- log4j入门及常用配置
<pre class="java" name="code">import org.apache.log4j.BasicConfigurator; ...