ES搜索排序,文档相关度评分介绍——TF-IDF—term frequency, inverse document frequency, and field-length norm—are calculated and stored at index time.
Theory Behind Relevance Scoring
Lucene (and thus Elasticsearch) uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model but adds more-modern features like a coordination factor, field length normalization, and term or query clause boosting.
Don’t be alarmed! These concepts are not as complicated as the names make them appear. While this section mentions algorithms, formulae, and mathematical models, it is intended for consumption by mere humans. Understanding the algorithms themselves is not as important as understanding the factors that influence the outcome.
Boolean Model
The Boolean model simply applies the AND, OR, and NOT conditions expressed in the query to find all the documents that match. A query for
full AND text AND search AND (elasticsearch OR lucene)
will include only documents that contain all of the terms full, text, and search, and eitherelasticsearch or lucene.
This process is simple and fast. It is used to exclude any documents that cannot possibly match the query.
Term Frequency/Inverse Document Frequency (TF/IDF)
Once we have a list of matching documents, they need to be ranked by relevance. Not all documents will contain all the terms, and some terms are more important than others. The relevance score of the whole document depends (in part) on the weight of each query term that appears in that document.
The weight of a term is determined by three factors, which we already introduced in What Is Relevance?. The formulae are included for interest’s sake, but you are not required to remember them.
Term frequency
How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention. The term frequency is calculated as follows:
tf(t in d) = √frequency
![]()
|
|
The term frequency ( |
If you don’t care about how often a term appears in a field, and all you care about is that the term is present, then you can disable term frequencies in the field mapping:
PUT /my_index
{
"mappings": {
"doc": {
"properties": {
"text": {
"type": "string",
"index_options": "docs"
}
}
}
}
}
|
|
Setting |
Inverse document frequency
How often does the term appear in all documents in the collection? The more often, the lower the weight. Common terms like and or the contribute little to relevance, as they appear in most documents, while uncommon terms like elastic or hippopotamus help us zoom in on the most interesting documents. The inverse document frequency is calculated as follows:
idf(t) = 1 + log ( numDocs / (docFreq + 1))
![]()
|
|
The inverse document frequency ( |
ES搜索排序,文档相关度评分介绍——TF-IDF—term frequency, inverse document frequency, and field-length norm—are calculated and stored at index time.的更多相关文章
- ES搜索排序,文档相关度评分介绍——Vector Space Model
Vector Space Model The vector space model provides a way of comparing a multiterm query against a do ...
- ES搜索排序,文档相关度评分介绍——Field-length norm
Field-length norm How long is the field? The shorter the field, the higher the weight. If a term app ...
- ES 文档与索引介绍
在之前的文章中,介绍了 ES 整体的架构和内容,这篇主要针对 ES 最小的存储单位 - 文档以及由文档组成的索引进行详细介绍. 会涉及到如下的内容: 文档的 CURD 操作. Dynamic Mapp ...
- es搜索排序不正确
沿用该文章里的数据https://www.cnblogs.com/MRLL/p/12691763.html 查询时发现,一模一样的name,但是相关度不一样 GET /z_test/doc/_sear ...
- ES-PHP向ES批量添加文档报No alive nodes found in your cluster
ES-PHP向ES批量添加文档报No alive nodes found in your cluster 2016年12月14日 12:31:40 阅读数:2668 参考文章phpcurl 请求Chu ...
- atitit.vod search doc.doc 点播系统搜索功能设计文档
atitit.vod search doc.doc 点播系统搜索功能设计文档 按键的enter事件1 Left rig事件1 Up down事件2 key_events.key_search = fu ...
- es之对文档进行更新操作
5.7.1:更新整个文档 ES中并不存在所谓的更新操作,而是用新文档替换旧文档: 在内部,Elasticsearch已经标记旧文档为删除并添加了一个完整的新文档并建立索引.旧版本文档不会立即消失 ,但 ...
- MongoDB中的映射,限制记录和记录拼排序 文档的插入查询更新删除操作
映射 在 MongoDB 中,映射(Projection)指的是只选择文档中的必要数据,而非全部数据.如果文档有 5 个字段,而你只需要显示 3 个,则只需选择 3 个字段即可. find() 方法 ...
- rbac介绍、自动生成接口文档、jwt介绍与快速签发认证、jwt定制返回格式
今日内容概要 RBAC 自动生成接口文档 jwt介绍与快速使用 jwt定制返回格式 jwt源码分析 内容详细 1.RBAC(重要) # RBAC 是基于角色的访问控制(Role-Based Acces ...
随机推荐
- vuex 中关于 mapActions 的作用
mapActions 工具函数会将 store 中的 dispatch 方法映射到组件的 methods 中.和 mapState.mapGetters 也类似,只不过它映射的地方不是计算属性,而是组 ...
- 使用Apache Benchmark做压力测试遇上的5个常见问题
这一篇文章主要记录我在使用Apache Benchmark(一下检测ab)做网站压力测试的过程中,遇到的一些问题以及解决办法,方便日后使用. 这一篇文章主要记录我在使用Apache Benchmark ...
- python 安装protobuf
安装准备:python和protoc(编译proto到各个语言) 下载protobuf源代码(各种语言实现):https://github.com/google/protobuf 1.到Python ...
- Linux 文件系统IO性能优化
对于LINUX SA来说,服务器性能是需要我们特别关注的,包括CPU.IO.内存等等系统的优化变得至关重要,这里转载一篇非常不错的关于IO优化的文章,供大家参考和学习: 一.关于页面缓存的信息,可以用 ...
- [转]Win10输入法图标消失且只能输入英文的解决方法
今天电脑开机后发现输入法图标不见了,而且只能输入英文,上网查了很多资料终于找到了解决方案,现摘录如下,以防再次遇到问题,便于查找.谢谢提供解决方案的大牛,如有侵权,请联系本人进行删除(文末放置了原文地 ...
- maven项目The superclass "javax.servlet.http.HttpServlet" was not found on the Java Build Path
用Eclipse创建了一个maven web程序,使用tomcat8.5作为服务器,可以正常启动,但是却报如下错误 The superclass "javax.servlet.http.Ht ...
- Java提高(二)---- HashTable
阅读博客 java提高篇(二五)—–HashTable 这篇博客由chenssy 发表与2014年4月,基于源码是jdk1.7 ========================== 本文针对jdk1. ...
- FloatingActionButton 完全解析
鸿洋大神说的很明白这里就直接引用一下: FloatingActionButton 完全解析[Design Support Library(2)]
- MySQL -- Ubuntu下的操作命令
=======================安装======================参照MySQL官网的步骤:https://dev.mysql.com/doc/mysql-apt-repo ...
- Something Starts While Something Ends
(1)最终还是没能参加比赛,一次都没有机会. (2)有梦想,不到最后一刻不会放弃. (3)这里应该会搬次家,转到github上. (4)作为一个新手,什么东西都需要从头学起来,就从最基础的数据结构开始 ...