Field-length norm

How long is the field? The shorter the field, the higher the weight. If a term appears in a short field, such as a title field, it is more likely that the content of that field is about the term than if the same term appears in a much bigger body field. The field-length norm is calculated as follows:

norm(d) = 1 / √numTerms 

The field-length norm (norm) is the inverse square root of the number of terms in the field.
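
As a quick illustration of the formula, the following Python sketch computes the raw norm for a few field lengths. The function name field_length_norm is hypothetical, and the sketch ignores how the value is encoded for storage in the index, so stored values may differ slightly from these raw results.

```python
import math


def field_length_norm(num_terms: int) -> float:
    """Raw field-length norm: the inverse square root of the term count.

    Hypothetical helper for illustration; not part of any Elasticsearch API.
    """
    if num_terms <= 0:
        raise ValueError("num_terms must be positive")
    return 1.0 / math.sqrt(num_terms)


# A short, title-like field gets a higher norm than a long, body-like field.
for n in (1, 4, 100):
    print(n, field_length_norm(n))
```

A one-term field gets a norm of 1.0, a four-term field 0.5, and a hundred-term field 0.1, so the same term counts for less as the field grows.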

While the field-length norm is important for full-text search, many other fields don’t need norms. Norms consume approximately 1 byte per string field per document in the index, whether or not a document contains the field. Exact-value not_analyzed string fields have norms disabled by default, but you can use the field mapping to disable norms on analyzed fields as well:

PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "string",
          "norms": { "enabled": false }
        }
      }
    }
  }
}

This field will not take the field-length norm into account. A long field and a short field will be scored as if they were the same length.

For use cases such as logging, norms are not useful. All you care about is whether a field contains a particular error code or a particular browser identifier. The length of the field does not affect the outcome. Disabling norms can save a significant amount of memory.
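
To get a feel for the savings, here is a back-of-the-envelope estimate in Python based on the figure of roughly 1 byte per string field per document quoted above. The document and field counts are made-up inputs, and norms_bytes is a hypothetical helper, so treat the result as an order-of-magnitude guide only.

```python
def norms_bytes(num_docs: int, num_string_fields: int, bytes_per_norm: int = 1) -> int:
    """Approximate space used by norms, assuming about 1 byte per field per document."""
    return num_docs * num_string_fields * bytes_per_norm


# Hypothetical logging index: 100 million documents, 20 analyzed string fields.
estimate = norms_bytes(num_docs=100_000_000, num_string_fields=20)
print(f"{estimate / 1024**3:.2f} GiB")
```

Under these assumed numbers the norms alone account for about 2 GB, which is the cost that disabling norms on those fields would avoid.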

Putting it together

These three factors—term frequency, inverse document frequency, and field-length norm—are calculated and stored at index time. Together, they are used to calculate the weight of a single term in a particular document.

When we refer to documents in the preceding formulae, we are actually talking about a field within a document. Each field has its own inverted index and thus, for TF/IDF purposes, the value of the field is the value of the document.

When we run a simple term query with explain set to true (see Understanding the Score), we see that the only factors involved in calculating the score are the ones explained in the preceding sections:

PUT /my_index/doc/1
{ "text" : "quick brown fox" }

GET /my_index/doc/_search?explain
{
  "query": {
    "term": {
      "text": "fox"
    }
  }
}

The (abbreviated) explanation from the preceding request is as follows:

weight(text:fox in 0) [PerFieldSimilarity]:  0.15342641  (1)
result of:
    fieldWeight in 0                         0.15342641
    product of:
        tf(freq=1.0), with freq of 1:        1.0         (2)
        idf(docFreq=1, maxDocs=1):           0.30685282  (3)
        fieldNorm(doc=0):                    0.5         (4)

(1) The final score for term fox in field text in the document with internal Lucene doc ID 0.

(2) The term fox appears once in the text field in this document.

(3) The inverse document frequency of fox in the text field in all documents in this index.

(4) The field-length normalization factor for this field.
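
The numbers in the explanation can be checked by hand. The Python sketch below multiplies the three factors, assuming the classic TF/IDF definitions tf = √freq and idf = 1 + ln(maxDocs / (docFreq + 1)); these formulas are not restated in this section, so treat them as assumptions here. The field norm is copied from the explain output rather than computed: the raw value for a three-term field would be 1/√3 ≈ 0.577, and the reported 0.5 is presumably the result of storing the norm in a single byte.

```python
import math


def tf(freq: float) -> float:
    """Term frequency factor (assumed classic form: square root of the count)."""
    return math.sqrt(freq)


def idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency (assumed classic form, natural logarithm)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Values taken from the explain output above.
field_norm = 0.5  # copied as reported, not derived from the term count
weight = tf(1.0) * idf(doc_freq=1, max_docs=1) * field_norm

print(round(idf(1, 1), 6), round(weight, 6))
```

Up to floating-point precision, the product matches the fieldWeight of 0.15342641 reported in the explanation.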

Of course, queries usually consist of more than one term, so we need a way of combining the weights of multiple terms. For this, we turn to the vector space model.

 

 
