Elasticsearch Field Options Norms

What the norms Option Does When Defining Elasticsearch Fields

This article describes what the norms parameter does for two Elasticsearch field types (text and keyword).

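When creating an ES index, you generally specify two kinds of configuration: settings and mappings. settings concern data storage (how many shards, how many replicas), while mappings are the data model, similar to a table definition in MySQL. The mapping specifies the type of each field. Elasticsearch supports many field datatypes, such as String, Numeric, Date, and so on; String is further divided into two types: keyword and text. When creating an index, you define the fields and give each one a type, as in the following example: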

PUT my_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": true
      },
      "properties": {
        "title": {
          "type": "text",
          "norms": false
        },
        "overview": {
          "type": "text",
          "norms": true
        },
        "body": {
          "type": "text"
        },
        "author": {
          "type": "keyword",
          "norms": true
        },
        "chapters": {
          "type": "keyword",
          "norms": false
        },
        "email": {
          "type": "keyword"
        }
      }
    }
  }
}

In the my_index index, the title field has type text, while the author field has type keyword.

For text fields, norms are enabled by default; for keyword fields, norms are disabled by default.

Whether field-length should be taken into account when scoring queries. Accepts true (the default for the text field datatype) or false (the default for the keyword field datatype).

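Why are norms disabled by default for keyword fields? A keyword string can be understood as "Do index the field, but don't analyze the string value": a keyword field is not broken into individual terms by an analyzer. It is a single-token field, so it has no need for normalization factors such as field length (fieldNorm) or tfNorm (term frequency norm). A text field, by contrast, is processed by an analyzer into a number of terms. Of two text fields, one may contain many terms (the body of an article) and the other only a few (its title), so a multi-field query needs length normalization. That is presumably why text fields enable norms by default.

Of Lucene's two common scoring algorithms, TF-IDF and BM25, TF-IDF tends to give higher scores to shorter fields. Why? Lucene's similarity formula has three main parts: the IDF score, the TF score, and fieldNorms. In the classic TF-IDF formula, the IDF score is 1 + log(numDocs/(docFreq+1)), the TF score is sqrt(tf), and fieldNorms is 1/sqrt(length). The shorter the field, the larger its fieldNorms and the higher the score, which is why TF-IDF is heavily biased toward short text.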
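To make the length bias concrete, here is a minimal Python sketch of the three classic TF-IDF components discussed above. The function names are hypothetical, the `1 +` offset in the IDF term is an assumption about Lucene's classic similarity, and the plain product is a simplification: the full Lucene formula contains further factors.

```python
import math

def classic_tf(freq: int) -> float:
    """TF score: square root of the term frequency."""
    return math.sqrt(freq)

def classic_idf(num_docs: int, doc_freq: int) -> float:
    """IDF score; the leading 1 + is an assumption (see the note above)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def classic_field_norm(length: int) -> float:
    """fieldNorms: shrinks as the field gets longer."""
    return 1.0 / math.sqrt(length)

def simplified_score(freq: int, num_docs: int, doc_freq: int, length: int) -> float:
    # Simplified product of the three parts; not the full Lucene formula.
    return (classic_tf(freq)
            * classic_idf(num_docs, doc_freq)
            * classic_field_norm(length))

# Same term statistics, different field lengths (illustrative numbers only).
short_field = simplified_score(freq=1, num_docs=1000, doc_freq=9, length=4)
long_field = simplified_score(freq=1, num_docs=1000, doc_freq=9, length=100)
print(short_field > long_field)  # True
```

With everything else equal, the 4-term field scores five times higher than the 100-term field, because sqrt(100)/sqrt(4) = 5.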

What do norms do?

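norms are an adjustment factor used when computing the score of a document or field. Both TF-IDF and BM25 use norms when scoring documents; for details, see the Lucene document-scoring formula in the referenced article.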

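A document in Elasticsearch contains multiple fields. The query parser (QueryParser) turns the user's query string into terms. In a multi-field search, each term is matched against each field and a score is computed per field. The per-field scores are then combined in some way (term-centric search vs. field-centric search) to produce the final score of the document.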

The official ES documentation explains norms as follows:

Norms store various normalization factors that are later used at query time in order to compute the score of a document relatively to a query.

The normalization factors here adjust (boost) a document's score at query time. For example, when a document is scored with the BM25 formula (freq*(k1+1))/(freq+k1*(1-b+b*fieldLength/avgFieldLength)), the fieldLength/avgFieldLength term is the normalization factor.
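The BM25 expression above can be evaluated directly. The following Python sketch transcribes it term by term; the function name is hypothetical, and the values k1=1.2 and b=0.75 are commonly cited BM25 defaults that are treated here as assumptions.

```python
def bm25_tf_norm(freq: float, field_length: float, avg_field_length: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency part of BM25, including length normalization."""
    # fieldLength / avgFieldLength is the normalization factor from the text.
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))

# One occurrence of the term in fields of different lengths
# (average field length of 100 is an illustrative assumption).
avg = 100.0
for length in (10, 100, 1000):
    print(length, round(bm25_tf_norm(1, length, avg), 3))
```

A field shorter than average gets a larger value and a longer one gets a smaller value. Setting b to 0 removes the length term from the formula entirely.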

The cost of norms

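With norms enabled, each field of each document needs one byte to store its norm. Because text fields enable norms by default, you can disable norms on text fields that do not need scoring; this is a worthwhile tuning option.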

Although useful for scoring, norms also require quite a lot of disk (typically in the order of one byte per document per field in your index, even for documents that don’t have this specific field). As a consequence, if you don’t need scoring on a specific field, you should disable norms on that field
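As a back-of-the-envelope illustration of the "one byte per document per field" figure, the Python sketch below multiplies it out. The function name and the document and field counts are hypothetical, and the result ignores compression and any other on-disk overhead.

```python
def estimate_norms_bytes(num_docs: int, fields_with_norms: int,
                         bytes_per_norm: int = 1) -> int:
    """Rough norms footprint: one byte per document per field with norms."""
    return num_docs * fields_with_norms * bytes_per_norm

# Hypothetical index: 100 million documents, 5 fields with norms enabled.
total = estimate_norms_bytes(num_docs=100_000_000, fields_with_norms=5)
print(f"{total / 1_000_000:.0f} MB")  # 500 MB
```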

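norms are part of index-time boosting: when a document is indexed (written), all boosting factors are stored, and at query time they are read from memory and take part in the score calculation. See this passage from Lucene in Action: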

During indexing, all sources of index-time boosts are combined into a single floating point number for each indexed field in the document. The document may have its own boost; each field may have a boost; and Lucene computes an automatic boost based on the number of tokens in the field (shorter fields have a higher boost). These boosts are combined and then compactly encoded (quantized) into a single byte, which is stored per field per document. During searching, norms for any field being searched are loaded into memory, decoded back into a floating-point number, and used when computing the relevance score.
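The "quantized into a single byte" step in the passage above loses precision. The Python sketch below illustrates this with a one-byte float encoding that uses 3 mantissa bits and 5 exponent bits. It is modeled on my recollection of the encoding used by older Lucene versions and should be read as an assumption, not as the exact implementation of any particular release; newer versions encode norms differently. The function names are hypothetical.

```python
import math
import struct

LOW = (63 - 15) << 3  # smallest encodable exponent/mantissa pattern

def float_to_byte315(f: float) -> int:
    """Quantize a float into one byte (3 mantissa bits, 5 exponent bits)."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]  # raw 32-bit pattern
    small = bits >> (24 - 3)  # keep sign, exponent, top 3 mantissa bits
    if small <= LOW:
        return 0 if bits <= 0 else 1  # zero/negative, or underflow
    if small >= LOW + 0x100:
        return 255  # overflow: clamp to the largest byte value
    return small - LOW

def byte315_to_float(b: int) -> float:
    """Decode the byte back into a (lossy) float."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - 3)) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]

# Length norms 1/sqrt(length) for fields of 9 and 10 tokens.
for length in (9, 10):
    norm = 1.0 / math.sqrt(length)
    print(length, round(norm, 4), byte315_to_float(float_to_byte315(norm)))
```

Under this scheme, fields of 9 and 10 tokens collapse to the same stored value (0.3125), which shows how little resolution a single byte leaves for length normalization.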

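The other kind of boosting is search-time boosting: a boost factor is specified in the query, and the document score is computed dynamically. See Relevant Search: With Applications for Solr and Elasticsearch for details; this article does not go into it further. Note, however, that current ES versions no longer recommend index-time boosting and recommend search-time boosting instead. The official ES documentation gives the following reasons: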

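Boost factors stored at index time (with the norms option enabled) cannot be changed once stored; the only way to change them is to reindex.
Search-time boosting achieves the same effect as index-time boosting, and it lets you specify the boost factor dynamically (probably at the cost of more CPU when scoring), so it is more flexible. Index-time boosting also requires extra storage.
Index-time boost factors are stored in the norms, where they interfere with field length normalization and lead to lower quality relevance calculations.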
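As an illustration of search-time boosting, the Python sketch below builds the body of a multi_match query that uses the `field^boost` syntax. The helper name is hypothetical, the title and body fields are taken from the my_index mapping above, and the search text and boost values are arbitrary. The resulting JSON would be sent as the body of a search request.

```python
import json

def build_boosted_query(text: str, field_boosts: dict) -> dict:
    """Build a multi_match query body with per-field search-time boosts."""
    # "title^2" means: matches in title count twice as much.
    fields = [f"{name}^{boost}" for name, boost in field_boosts.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}

# Boost title over body at query time; the values can change per request
# without reindexing anything.
body = build_boosted_query("elasticsearch norms", {"title": 2, "body": 1})
print(json.dumps(body, indent=2))
```

Because the boost lives in the query and not in the index, changing it is just a matter of sending a different request.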

Appendix: the mapping of the my_index index:

GET my_index/_mapping
{
  "my_index": {
    "mappings": {
      "_doc": {
        "properties": {
          "author": {
            "type": "keyword",
            "norms": true
          },
          "body": {
            "type": "text"
          },
          "chapters": {
            "type": "keyword"
          },
          "email": {
            "type": "keyword"
          },
          "overview": {
            "type": "text"
          },
          "title": {
            "type": "text",
            "norms": false
          }
        }
      }
    }
  }
}

Original: https://www.cnblogs.com/hapjin/p/11254535.html