ES Search Ranking and Document Relevance Scoring: TF-IDF (term frequency, inverse document frequency, and field-length norm are calculated and stored at index time)
Theory Behind Relevance Scoring
Lucene (and thus Elasticsearch) uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model but adds more-modern features like a coordination factor, field length normalization, and term or query clause boosting.
Don’t be alarmed! These concepts are not as complicated as the names make them appear. While this section mentions algorithms, formulae, and mathematical models, it is intended for consumption by mere humans. Understanding the algorithms themselves is not as important as understanding the factors that influence the outcome.
Boolean Model
The Boolean model simply applies the AND, OR, and NOT conditions expressed in the query to find all the documents that match. A query for
full AND text AND search AND (elasticsearch OR lucene)
will include only documents that contain all of the terms full, text, and search, and either elasticsearch or lucene.
This process is simple and fast. It is used to exclude any documents that cannot possibly match the query.
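As a rough illustration of the Boolean model, the query above can be evaluated with set operations over a toy inverted index. This is only a sketch: the sample documents, their IDs, and the `postings` helper are made up for the example, and it says nothing about how Lucene stores or traverses its postings internally.

```python
# Toy inverted index: term -> set of IDs of the documents containing it.
# The documents and IDs below are hypothetical sample data.
docs = {
    1: "full text search with elasticsearch",
    2: "full text search with lucene",
    3: "full text indexing with lucene",
    4: "search engines",
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)


def postings(term):
    """Return the set of document IDs that contain the term (hypothetical helper)."""
    return index.get(term, set())


# full AND text AND search AND (elasticsearch OR lucene)
# AND maps to set intersection (&), OR maps to set union (|).
matches = (
    postings("full")
    & postings("text")
    & postings("search")
    & (postings("elasticsearch") | postings("lucene"))
)

print(sorted(matches))  # [1, 2]
```

Document 3 is excluded because it lacks search, and document 4 because it lacks full and text. No scoring happens at this stage; the result is only the set of candidates to rank.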
Term Frequency/Inverse Document Frequency (TF/IDF)
Once we have a list of matching documents, they need to be ranked by relevance. Not all documents will contain all the terms, and some terms are more important than others. The relevance score of the whole document depends (in part) on the weight of each query term that appears in that document.
The weight of a term is determined by three factors, which we already introduced in What Is Relevance?. The formulae are included for interest’s sake, but you are not required to remember them.
Term frequency
How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention. The term frequency is calculated as follows:
tf(t in d) = √frequency
The term frequency (tf) for term t in document d is the square root of the number of times the term appears in the document.
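The term frequency formula can be sketched in a few lines of Python. The whitespace tokenizer and the `tf` function name are assumptions made for the example; a real analyzer would tokenize and normalize the field differently.

```python
import math


def tf(term, field_text):
    """Square root of the number of times `term` occurs in the field.

    Tokenization is a naive lowercase whitespace split, which is an
    assumption of this sketch and not how an Elasticsearch analyzer works.
    """
    frequency = field_text.lower().split().count(term)
    return math.sqrt(frequency)


print(tf("fox", "the quick brown fox"))   # 1.0
print(tf("fox", "fox fox fox fox"))       # 2.0
print(tf("fox", "the lazy dog"))          # 0.0
```

Because of the square root, four mentions weigh only twice as much as one mention, so repeating a term has diminishing returns.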
If you don’t care about how often a term appears in a field, and all you care about is that the term is present, then you can disable term frequencies in the field mapping:
PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "string",
          "index_options": "docs"
        }
      }
    }
  }
}
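Setting index_options to docs disables term frequencies and term positions. A field with this mapping does not count how many times a term appears, and cannot be used for phrase or proximity queries.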
Inverse document frequency
How often does the term appear in all documents in the collection? The more often, the lower the weight. Common terms like and or the contribute little to relevance, as they appear in most documents, while uncommon terms like elastic or hippopotamus help us zoom in on the most interesting documents. The inverse document frequency is calculated as follows:
idf(t) = 1 + log ( numDocs / (docFreq + 1))
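The inverse document frequency (idf) of term t is 1 plus the logarithm of the number of documents in the index divided by one more than the number of documents that contain the term.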
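A minimal Python sketch of the inverse document frequency formula follows. The function name, the sample counts, and the use of the natural logarithm are assumptions of the example, since the formula as written only says log.

```python
import math


def idf(num_docs, doc_freq):
    """1 + log(numDocs / (docFreq + 1)).

    num_docs: number of documents in the index.
    doc_freq: number of documents that contain the term.
    The natural logarithm is an assumption of this sketch.
    """
    return 1 + math.log(num_docs / (doc_freq + 1))


# Hypothetical index of 1,000 documents.
common = idf(1000, 999)   # term present in almost every document
rare = idf(1000, 9)       # term present in only 9 documents

print(common)         # 1.0
print(rare > common)  # True
```

With these sample counts, a term found in nearly every document contributes the minimum weight of 1.0, while the rare term scores several times higher.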