ES搜索排序,文档相关度评分介绍——TF-IDF—term frequency, inverse document frequency, and field-length norm—are calculated and stored at index time.
Theory Behind Relevance Scoring
Lucene (and thus Elasticsearch) uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model but adds more-modern features like a coordination factor, field length normalization, and term or query clause boosting.
Don’t be alarmed! These concepts are not as complicated as the names make them appear. While this section mentions algorithms, formulae, and mathematical models, it is intended for consumption by mere humans. Understanding the algorithms themselves is not as important as understanding the factors that influence the outcome.
Boolean Model
The Boolean model simply applies the AND, OR, and NOT conditions expressed in the query to find all the documents that match. A query for
full AND text AND search AND (elasticsearch OR lucene)
will include only documents that contain all of the terms full, text, and search, and eitherelasticsearch or lucene.
This process is simple and fast. It is used to exclude any documents that cannot possibly match the query.
Term Frequency/Inverse Document Frequency (TF/IDF)
Once we have a list of matching documents, they need to be ranked by relevance. Not all documents will contain all the terms, and some terms are more important than others. The relevance score of the whole document depends (in part) on the weight of each query term that appears in that document.
The weight of a term is determined by three factors, which we already introduced in What Is Relevance?. The formulae are included for interest’s sake, but you are not required to remember them.
Term frequency
How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention. The term frequency is calculated as follows:
tf(t in d) = √frequency
![]()
|
|
The term frequency ( |
If you don’t care about how often a term appears in a field, and all you care about is that the term is present, then you can disable term frequencies in the field mapping:
PUT /my_index
{
"mappings": {
"doc": {
"properties": {
"text": {
"type": "string",
"index_options": "docs"
}
}
}
}
}
|
|
Setting |
Inverse document frequency
How often does the term appear in all documents in the collection? The more often, the lower the weight. Common terms like and or the contribute little to relevance, as they appear in most documents, while uncommon terms like elastic or hippopotamus help us zoom in on the most interesting documents. The inverse document frequency is calculated as follows:
idf(t) = 1 + log ( numDocs / (docFreq + 1))
![]()
|
|
The inverse document frequency ( |
ES搜索排序,文档相关度评分介绍——TF-IDF—term frequency, inverse document frequency, and field-length norm—are calculated and stored at index time.的更多相关文章
- ES搜索排序,文档相关度评分介绍——Vector Space Model
Vector Space Model The vector space model provides a way of comparing a multiterm query against a do ...
- ES搜索排序,文档相关度评分介绍——Field-length norm
Field-length norm How long is the field? The shorter the field, the higher the weight. If a term app ...
- ES 文档与索引介绍
在之前的文章中,介绍了 ES 整体的架构和内容,这篇主要针对 ES 最小的存储单位 - 文档以及由文档组成的索引进行详细介绍. 会涉及到如下的内容: 文档的 CURD 操作. Dynamic Mapp ...
- es搜索排序不正确
沿用该文章里的数据https://www.cnblogs.com/MRLL/p/12691763.html 查询时发现,一模一样的name,但是相关度不一样 GET /z_test/doc/_sear ...
- ES-PHP向ES批量添加文档报No alive nodes found in your cluster
ES-PHP向ES批量添加文档报No alive nodes found in your cluster 2016年12月14日 12:31:40 阅读数:2668 参考文章phpcurl 请求Chu ...
- atitit.vod search doc.doc 点播系统搜索功能设计文档
atitit.vod search doc.doc 点播系统搜索功能设计文档 按键的enter事件1 Left rig事件1 Up down事件2 key_events.key_search = fu ...
- es之对文档进行更新操作
5.7.1:更新整个文档 ES中并不存在所谓的更新操作,而是用新文档替换旧文档: 在内部,Elasticsearch已经标记旧文档为删除并添加了一个完整的新文档并建立索引.旧版本文档不会立即消失 ,但 ...
- MongoDB中的映射,限制记录和记录拼排序 文档的插入查询更新删除操作
映射 在 MongoDB 中,映射(Projection)指的是只选择文档中的必要数据,而非全部数据.如果文档有 5 个字段,而你只需要显示 3 个,则只需选择 3 个字段即可. find() 方法 ...
- rbac介绍、自动生成接口文档、jwt介绍与快速签发认证、jwt定制返回格式
今日内容概要 RBAC 自动生成接口文档 jwt介绍与快速使用 jwt定制返回格式 jwt源码分析 内容详细 1.RBAC(重要) # RBAC 是基于角色的访问控制(Role-Based Acces ...
随机推荐
- url删除指定字符
var str = "http://www.xxx.com/?pn=0"; // 删除指定字符 pn=0 // 我将这个字符串里所可能想到的各种情况都列举出来 var a = [ ...
- Oracle 唯一 索引 约束 创建 删除
http://www.blogjava.net/lukangping/articles/340683.html/*给创建bitmap index分配的内存空间参数,以加速建索引*/ show para ...
- DisplayPort的时钟隐藏和时钟恢复
转:DisplayPort的时钟隐藏和时钟恢复 无时钟线的视频数据传输是DP协议的一大特点,将时钟信号隐藏在数据中是传输协议的设计趋势.时钟恢复技术也是DP芯片设计的关键技术.在这说一下在发送端时钟是 ...
- jquery瀑布流布局插件,兼容ie6不支持下拉加载,用于制作分类卡
调用方法 $('需要布局的块').sault() 如果要在图片加载后调用需要使用$(window).load(function(fx){});函数,即等待图片加载完成再调用 3个参数 1.left:左 ...
- linux脚本实现自己主动输入password
使用Linux的程序猿对输入password这个举动一定不陌生,在Linux下对用户有严格的权限限制,干非常多事情越过了权限就得输入password.比方使用超级用户运行命令,又比方ftp.ssh连接 ...
- Urho3D 在Win10下编辑器崩溃的解决方案
本解决方案来自于 https://github.com/urho3d/Urho3D/issues/2417 描述 在Win10中通过CMake启用URHO_ANGELSCRIPT选项的前提下生成Urh ...
- java eclipse使用不同jdk版本
因为开发需要,两个工程分别需要使用jdk1.6(elipse indigo)和jdk1.8(eclipse neon).因为两个eclipse对于jdk版本的要求不同,若只在环境变量中配置jdk版本, ...
- Oracle学习第一篇—安装和简单语句
一 安装 10G ----不适合Win7 Visual Machine-++++Visual Hard Disk 先安装介质(VM)---便于删除 11G-----适合Win7 1 把win64_1 ...
- Android模糊效果总结
1. 软件模糊 用软件的方法.利用cpu计算,无sdk版本号要求. 效果图: 关键模糊代码 github链接 原文 链接 译文 链接 演示样例 代码 本文地址 :http://blog.csdn.ne ...
- CGI的基本原理
一.基本原理 CGI:通用网关接口(Common Gateway Interface)是一个Webserver主机提供信息服务的标准接口.通过CGI接口,Webserver就行获取client提交的信 ...