ES搜索排序,文档相关度评分介绍——TF-IDF—term frequency, inverse document frequency, and field-length norm—are calculated and stored at index time.
Theory Behind Relevance Scoring
Lucene (and thus Elasticsearch) uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model but adds more-modern features like a coordination factor, field length normalization, and term or query clause boosting.
Don’t be alarmed! These concepts are not as complicated as the names make them appear. While this section mentions algorithms, formulae, and mathematical models, it is intended for consumption by mere humans. Understanding the algorithms themselves is not as important as understanding the factors that influence the outcome.
Boolean Model
The Boolean model simply applies the AND, OR, and NOT conditions expressed in the query to find all the documents that match. A query for
full AND text AND search AND (elasticsearch OR lucene)
will include only documents that contain all of the terms full, text, and search, and eitherelasticsearch or lucene.
This process is simple and fast. It is used to exclude any documents that cannot possibly match the query.
Term Frequency/Inverse Document Frequency (TF/IDF)
Once we have a list of matching documents, they need to be ranked by relevance. Not all documents will contain all the terms, and some terms are more important than others. The relevance score of the whole document depends (in part) on the weight of each query term that appears in that document.
The weight of a term is determined by three factors, which we already introduced in What Is Relevance?. The formulae are included for interest’s sake, but you are not required to remember them.
Term frequency
How often does the term appear in this document? The more often, the higher the weight. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention. The term frequency is calculated as follows:
tf(t in d) = √frequency
![]()
|
|
The term frequency ( |
If you don’t care about how often a term appears in a field, and all you care about is that the term is present, then you can disable term frequencies in the field mapping:
PUT /my_index
{
"mappings": {
"doc": {
"properties": {
"text": {
"type": "string",
"index_options": "docs"
}
}
}
}
}
|
|
Setting |
Inverse document frequency
How often does the term appear in all documents in the collection? The more often, the lower the weight. Common terms like and or the contribute little to relevance, as they appear in most documents, while uncommon terms like elastic or hippopotamus help us zoom in on the most interesting documents. The inverse document frequency is calculated as follows:
idf(t) = 1 + log ( numDocs / (docFreq + 1))
![]()
|
|
The inverse document frequency ( |
ES搜索排序,文档相关度评分介绍——TF-IDF—term frequency, inverse document frequency, and field-length norm—are calculated and stored at index time.的更多相关文章
- ES搜索排序,文档相关度评分介绍——Vector Space Model
Vector Space Model The vector space model provides a way of comparing a multiterm query against a do ...
- ES搜索排序,文档相关度评分介绍——Field-length norm
Field-length norm How long is the field? The shorter the field, the higher the weight. If a term app ...
- ES 文档与索引介绍
在之前的文章中,介绍了 ES 整体的架构和内容,这篇主要针对 ES 最小的存储单位 - 文档以及由文档组成的索引进行详细介绍. 会涉及到如下的内容: 文档的 CURD 操作. Dynamic Mapp ...
- es搜索排序不正确
沿用该文章里的数据https://www.cnblogs.com/MRLL/p/12691763.html 查询时发现,一模一样的name,但是相关度不一样 GET /z_test/doc/_sear ...
- ES-PHP向ES批量添加文档报No alive nodes found in your cluster
ES-PHP向ES批量添加文档报No alive nodes found in your cluster 2016年12月14日 12:31:40 阅读数:2668 参考文章phpcurl 请求Chu ...
- atitit.vod search doc.doc 点播系统搜索功能设计文档
atitit.vod search doc.doc 点播系统搜索功能设计文档 按键的enter事件1 Left rig事件1 Up down事件2 key_events.key_search = fu ...
- es之对文档进行更新操作
5.7.1:更新整个文档 ES中并不存在所谓的更新操作,而是用新文档替换旧文档: 在内部,Elasticsearch已经标记旧文档为删除并添加了一个完整的新文档并建立索引.旧版本文档不会立即消失 ,但 ...
- MongoDB中的映射,限制记录和记录拼排序 文档的插入查询更新删除操作
映射 在 MongoDB 中,映射(Projection)指的是只选择文档中的必要数据,而非全部数据.如果文档有 5 个字段,而你只需要显示 3 个,则只需选择 3 个字段即可. find() 方法 ...
- rbac介绍、自动生成接口文档、jwt介绍与快速签发认证、jwt定制返回格式
今日内容概要 RBAC 自动生成接口文档 jwt介绍与快速使用 jwt定制返回格式 jwt源码分析 内容详细 1.RBAC(重要) # RBAC 是基于角色的访问控制(Role-Based Acces ...
随机推荐
- code::blocks(版本号10.05) 配置opencv2.4.3
(1)首先下载opencv2.4.3, 解压缩到D:下: (2)配置code::blocks, 详细操作例如以下: 第一步, 配置compiler, 操作步骤为Settings -> Comp ...
- 【原创】基于.NET的轻量级高性能 ORM - TZM.XFramework
[前言] 接上一篇<[原创]打造基于Dapper的数据访问层>,Dapper在应付多表自由关联.分组查询.匿名查询等应用场景时不免显得吃力,经常要手写SQL语句(或者用工具生成SQL配置文 ...
- Redis专题(2):Redis数据结构底层探秘
前言 上篇文章Redis闲谈(1):构建知识图谱介绍了redis的基本概念.优缺点以及它的内存淘汰机制,相信大家对redis有了初步的认识.互联网的很多应用场景都有着Redis的身影,它能做的事情远远 ...
- java String概述
class StringDemo { public static void main(String[] args) { String s1 = "abc";//s1 是一个类类 ...
- uva 11404 dp
UVA 11404 - Palindromic Subsequence 求给定字符串的最长回文子序列,长度一样的输出字典序最小的. 对于 [l, r] 区间的最长回文串.他可能是[l+1, r] 和[ ...
- android利用apkplug框架实现主应用与插件通讯(传递随意对象)实现UI替换
时光匆匆,乍一看已半年过去了,经过这半年的埋头苦干今天最终有满血复活了. 利用apkplug框架实现动态替换宿主Activity中的UI元素.以达到不用更新应用就能够更换UI样式的目的. 先看效果图: ...
- mysql日期格式转化
select DATE_FORMAT( '20170701', '%Y-%m-%d'); 先挖坑
- HDFS源码分析心跳汇报之BPServiceActor工作线程运行流程
在<HDFS源码分析心跳汇报之数据结构初始化>一文中,我们了解到HDFS心跳相关的BlockPoolManager.BPOfferService.BPServiceActor三者之间的关系 ...
- 修复open-ssl漏洞,升级open-ssl版本
升级openssl环境至openssl-1.0.1g 1.查看源版本 [root@zj ~]# openssl version -a OpenSSL 0.9.8e-fips-rhel5 01 Jul ...
- Java多线程面试问题
这篇文章主要是对多线程的面试问题进行总结的,罗列了40个多线程的问题. 1. 多线程有什么用? 一个可能在很多人看来很扯淡的一个问题:我会用多线程就好了,还管它有什么用?在我看来,这个回答更扯淡.所谓 ...