Elasticsearch from Basics to Depth (8): Search Engine: Mapping, Exact Match vs. Full-Text Search, Analyzers, and a Mapping Summary

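First, a brief description of what a mapping is.

A mapping is the data structure and related configuration that is built, automatically or manually, for a type in an index.
With dynamic mapping, ES automatically creates the index, the type, and the mapping for that type. The mapping records the data type of each field, along with settings such as how the field is analyzed.

When we insert a few documents, ES automatically creates an index for us: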

PUT /website/article/

{

  "post_date": "2019-08-21",

  "title": "my first article",

  "content": "this is my first article in this website",

  "author_id":

}

PUT /website/article/

{

  "post_date": "2019-08-22",

  "title": "my second article",

  "content": "this is my second article in this website",

  "author_id":

}

PUT /website/article/

{

  "post_date": "2019-08-23",

  "title": "my third article",

  "content": "this is my third article in this website",

  "author_id":

}

View the mapping

GET /website/_mapping

{

  "website": {

    "mappings": {

      "article": {

        "properties": {

          "author_id": {

            "type": "long"

          },

          "content": {

            "type": "text",

            "fields": {

              "keyword": {

                "type": "keyword",

                "ignore_above":

              }

            }

          },

          "post_date": {

            "type": "date"

          },

          "title": {

            "type": "text",

            "fields": {

              "keyword": {

                "type": "keyword",

                "ignore_above":

              }

            }

          }

        }

      }

    }

  }

}

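The mapping above was generated automatically when the data was inserted; a mapping can also be created manually. Either way, the data structure and related configuration built for a type in an index is called a mapping.

Try various searches

GET /website/article/_search?q=            // 3 results

GET /website/article/_search?q=--            // 3 results

GET /website/article/_search?q=post_date:--       // 1 result

GET /website/article/_search?q=post_date:         // 0 results

Why do the results differ? When ES built the mapping automatically, it assigned different data types to different fields, and different data types behave differently with respect to analysis and search. That is why a search against the _all field and a search against the post_date field behave so differently.
Below is a manually created mapping.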

PUT /test_mapping

{

  "mappings" : {

    "properties" : {

      "author_id" : {

        "type" : "long"

      },

      "content" : {

        "type" : "text",

        "fields" : {

          "keyword" : {

            "type" : "keyword",

            "ignore_above" :

          }

        }

      },

      "post_date" : {

        "type" : "date"

      },

      "title" : {

        "type" : "text",

        "fields" : {

          "keyword" : {

            "type" : "keyword",

            "ignore_above" :

          }

        }

      }

    }

  }

}

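Exact match vs. full-text search

exact value

The whole value of a field must match for the corresponding document to be returned.
Example:

GET /website/article/_search?q=post_date:--       // 1 result

GET /website/article/_search?q=post_date:         // 0 results

With an exact value, you must search for the full 2019-08-21 to find the document.
If you search for just 21, nothing is found.

full text

Full text differs from exact value: rather than matching only one complete value, the value is split into words (tokenized) before matching, and matches can also be made through abbreviations, tense, case, synonyms, and so on.
Example:

GET /website/article/_search?q=            // 3 results

GET /website/article/_search?q=--            // 3 results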
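The difference between the two behaviors can be sketched with a toy model in Python. This is only an illustration under simplifying assumptions, not how ES stores data internally: `index_exact` and `index_full_text` are hypothetical helpers, and the full-text version just lowercases and splits on non-word characters.

```python
import re

def index_exact(value):
    """Exact value: the whole value is the single index term."""
    return {value}

def index_full_text(value):
    """Full text: lowercase the value and split it into word terms."""
    return set(re.findall(r"\w+", value.lower()))

post_date = "2019-08-21"
title = "my first article"

# A query term matches only if it is one of the indexed terms.
print("2019-08-21" in index_exact(post_date))   # whole value is a term
print("21" in index_exact(post_date))           # a fragment is not a term
print("first" in index_full_text(title))        # a single word is a term
print("21" in index_full_text(post_date))       # date treated as text
```

In this model a fragment such as 21 matches only when the date string is treated as full text, which mirrors the difference between the two groups of searches above.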

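Core principle of the inverted index

The following walks through a simplified construction of an inverted index; in practice the process is far more complex.
doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

Tokenization: building the initial inverted index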

word    doc1    doc2

I        *        *

really   *

liked    *        *

my       *        *

small    *

dogs     *        *

and      *

think    *

mom      *        *

also     *

them     *

He                *

never             *

any               *

so                *

hope              *

that              *

will              *

not               *

expect            *

me                *

to                *

him               *

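Searching for mother like little dog returns no results, because none of the query terms is in the index:
mother
like
little
dog
This is certainly not what we want. For example, mother and mom mean the same thing, yet the search finds nothing. With a suitably configured analyzer, however, ES does find these documents. When ES builds the inverted index, it also processes each token to raise the chance that later searches find related documents: tense conversion, singular/plural conversion, synonym conversion, and case conversion. This process is called normalization. Which of these steps run depends on the analyzer configured for the field.
mother -> mom
liked -> like
small -> little
dogs -> dog
The inverted index is then rebuilt as follows: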

word    doc1    doc2

I        *        *

really   *

like     *        *

my       *        *

little   *

dog      *        *

and      *

think    *

mom      *        *

also     *

them     *

He                *

never             *

any               *

so                *

hope              *

that              *

will              *

not               *

expect            *

me                *

to                *

him               *

Query: mother like little dog, after tokenization and normalization:
mother -> mom
like -> like
little -> little
dog -> dog
Both doc1 and doc2 are returned:
doc1: I really liked my small dogs, and I think my mom also liked them.
doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
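The walkthrough above can be reproduced with a small Python sketch. It is a simplified model, not the ES implementation: the `NORMALIZE` table is a hypothetical stand-in for real stemmers and synonym filters, and, unlike the tables above, it also lowercases every word.

```python
import re
from collections import defaultdict

# Hypothetical normalization table taken from the walkthrough above;
# a real analyzer derives these with stemmers and synonym filters.
NORMALIZE = {"mother": "mom", "liked": "like", "small": "little", "dogs": "dog"}

def analyze(text):
    """Split into words, lowercase, then apply the normalization table."""
    words = re.findall(r"[a-z]+", text.lower())
    return [NORMALIZE.get(w, w) for w in words]

def build_index(docs):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain at least one query term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits

docs = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}
index = build_index(docs)
print(sorted(search(index, "mother like little dog")))
```

The key design point is that the same `analyze` function runs at index time and at query time, so that mother in the query and mom in the documents end up as the same term.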

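Analyzers

Splitting text into terms and normalizing them (to improve recall)

Given a sentence, an analyzer splits it into individual words and normalizes each word (tense conversion, singular/plural conversion).
Recall: the share of relevant documents that a search actually finds; normalization raises it by letting more related documents match.

character filter: preprocesses the text before it is tokenized. The most common examples are stripping HTML tags (<span>hello</span> --> hello) and replacing & with and (I&you --> I and you)
tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me
token filter: lowercase, stop word, synonym; dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little

The analyzer is important: it runs a piece of text through all of these steps, and only the final result is used to build the inverted index.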
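The three stages can be sketched as a small Python pipeline. This is an illustrative model only: the stop-word list, the `SYNONYMS` table, and the `STEMS` table are hypothetical and merely copy the examples given above; real ES analyzers are configured from built-in components.

```python
import re

STOP_WORDS = {"a", "an", "the"}
SYNONYMS = {"mother": "mom", "small": "little"}   # hypothetical synonym table
STEMS = {"dogs": "dog", "liked": "like"}          # stand-in for a real stemmer

def char_filter(text):
    """Preprocess raw text: strip HTML tags and spell out '&'."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")

def tokenizer(text):
    """Split the text into word tokens."""
    return re.findall(r"\w+", text)

def token_filter(tokens):
    """Lowercase, drop stop words, then apply stems and synonyms."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        tok = STEMS.get(tok, tok)
        out.append(SYNONYMS.get(tok, tok))
    return out

def analyze(text):
    """Run the three stages in order: char filter, tokenizer, token filter."""
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<span>Tom</span> liked the small dogs & mother"))
```

The stages run in a fixed order, so the tokenizer never sees the HTML tags and the token filters only ever see individual tokens.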

Built-in analyzers:

Text to analyze: Set the shape to semi-transparent by calling set_trans()

standard analyzer (the default): set, the, shape, to, semi, transparent, by, calling, set_trans

simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans

whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans()

language analyzer (a language-specific analyzer, for example english): set, shape, semi, transpar, call, set_tran
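To make the differences concrete, here is a rough Python approximation of three of these analyzers. The regular expressions are assumptions that happen to reproduce the outputs listed above for this sentence; they are not the actual rules ES applies (the standard analyzer, for instance, follows Unicode text segmentation).

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans()"

def whitespace_analyzer(text):
    """Split on whitespace only; keep case and punctuation."""
    return text.split()

def simple_analyzer(text):
    """Split on anything that is not a letter, then lowercase."""
    return re.findall(r"[a-z]+", text.lower())

def standard_like_analyzer(text):
    """Rough stand-in for standard: runs of word characters, lowercased."""
    return re.findall(r"\w+", text.lower())

for analyzer in (standard_like_analyzer, simple_analyzer, whitespace_analyzer):
    print(analyzer.__name__, analyzer(TEXT))
```

The only difference between the first two functions is whether the underscore counts as part of a word, which is why set_trans survives in one output and is split in the other.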

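Resolving the open question from the mapping introduction

GET /_search?q=

This searches the _all field: all fields of a document are concatenated into one large string, which is then analyzed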

2019-01-02 my second article this is my second article in this website 11400

        doc1        doc2        doc3

      *          *           *

        *

                   *

                               *

Searching _all for 2017 therefore naturally matches all 3 documents

GET /_search?q=post_date:--

A date field is indexed as an exact value

             doc1        doc2        doc3

--    *

--                 *

--                             *

Testing an analyzer

Syntax:

GET /_analyze

{

  "analyzer": "standard",

  "text": "Text to analyze"

}

{

  "tokens": [

    {

      "token": "text",

      "start_offset": ,

      "end_offset": ,

      "type": "<ALPHANUM>",

      "position":

    },

    {

      "token": "to",

      "start_offset": ,

      "end_offset": ,

      "type": "<ALPHANUM>",

      "position":

    },

    {

      "token": "analyze",

      "start_offset": ,

      "end_offset": ,

      "type": "<ALPHANUM>",

      "position":

    }

  ]

}
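The shape of this response can be imitated in Python to show what each key means. This is a hedged sketch: `analyze_with_offsets` is a hypothetical helper that tokenizes with a simple regular expression, and it omits the `type` key.

```python
import re

def analyze_with_offsets(text):
    """Return one dict per token with its character offsets and position."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),   # exclusive, as in Python slices
            "position": position,
        })
    return tokens

for tok in analyze_with_offsets("Text to analyze"):
    print(tok)
```

Here `start_offset` and `end_offset` locate the token in the original string, while `position` is the token's ordinal in the token stream.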

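Further summary of mapping

When you insert data directly into ES, ES automatically creates the index, along with the type and its mapping.
The mapping automatically defines the data type of each field.
Depending on the data type (for example text versus date), a field is treated either as an exact value or as full text.
For an exact value, the whole value is put into the inverted index as a single term. Full text first goes through analysis, that is, tokenization and normalization (tense conversion, synonym conversion, case conversion), and only then is it put into the inverted index.
At search time, whether a field is an exact value or full text also determines how it is searched, and the behavior matches how the inverted index was built: a search on an exact value field matches against the whole value, while a search on a full text field tokenizes and normalizes the query before looking it up in the inverted index.
You can rely on ES dynamic mapping to build the mapping automatically, including the data types; or you can create the mapping for the index and type manually in advance and configure each field yourself, including its data type, indexing behavior, and analyzer.

In essence, a mapping is the metadata of an index's type; it determines the data types, how the inverted index is built, and how searches behave.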

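Core mapping data types and dynamic mapping

Core data types

string / text: string type

byte: byte type

short: short integer type

integer: integer type

long: long integer type

float: floating-point type

boolean: boolean type

date: date type

There are also more complex kinds of values, such as arrays and objects. They need no separate low-level representation: an array is indexed as multiple values of one core type, and an object is flattened into its leaf fields, each of which has a core type.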

dynamic mapping

true or false -> boolean

 -> long

123.45 -> float

-- -> date

"hello world" -> string text

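The inference rules listed above can be sketched as a small Python function. This is a simplified model under stated assumptions: `infer_type` is a hypothetical helper, it recognizes only the `YYYY-MM-DD` date format, and the `author_id` value in the sample document is a placeholder.

```python
from datetime import datetime

def infer_type(value):
    """Guess a field type from a JSON value, mirroring the rules above."""
    if isinstance(value, bool):        # check bool first: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")   # assumed date format
            return "date"
        except ValueError:
            return "text"
    raise TypeError(f"unsupported value: {value!r}")

# author_id is a placeholder value chosen for illustration.
doc = {"post_date": "2019-08-21", "title": "my first article", "author_id": 11400}
print({field: infer_type(v) for field, v in doc.items()})
```

The point of the sketch is that the type is decided from the first value seen for a field, which is why a date-looking string and an ordinary sentence end up with different search behavior.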
View the mapping

Syntax:

GET /{index}/_mapping

GET /{index}/_mapping/{type}

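Manually creating and modifying a mapping, and controlling whether a string field is analyzed

Note: you can define a mapping manually only when creating the index, or add a mapping for a new field; you cannot update the mapping of an existing field.

```
"analyzer": "standard": analyzed with the standard analyzer
```
```
date: date
```
```
keyword: not analyzed
```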

# Create the index

PUT /website

{

  "mappings": {

    "properties": {

      "author_id": {

        "type": "long"

      },

      "title": {

        "type": "text",

        "analyzer": "standard"

      },

      "content": {

        "type": "text"

      },

      "post_date": {

        "type": "date"

      },

      "publisher_id": {

        "type": "keyword"

      }

    }

  }

}

# Try to modify a field's mapping (the request is rejected)

PUT /website

{

  "mappings": {

    "properties": {

      "author_id": {

        "type": "text"

      }

    }

  }

}

{

  "error": {

    "root_cause": [

      {

        "type": "resource_already_exists_exception",

        "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",

        "index_uuid": "5xLohnJITHqCwRYInmBFmA",

        "index": "website"

      }

    ],

    "type": "resource_already_exists_exception",

    "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",

    "index_uuid": "5xLohnJITHqCwRYInmBFmA",

    "index": "website"

  },

  "status":

}

# Add a field to the mapping

PUT /website/_mapping

{

  "properties": {

    "new_field": {

      "type": "text"

    }

  }

}

{

  "acknowledged" : true

}
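The rule that new fields can be added but existing fields cannot be changed can be modeled in a few lines of Python. This is a toy model only: `put_mapping` is a hypothetical function, and the error it raises is not the response ES returns.

```python
def put_mapping(existing, update):
    """Return a new properties dict, rejecting changes to existing fields."""
    merged = dict(existing)
    for field, definition in update.items():
        if field in existing and existing[field] != definition:
            raise ValueError(f"cannot change mapping of existing field [{field}]")
        merged[field] = definition
    return merged

properties = {"author_id": {"type": "long"}, "title": {"type": "text"}}

# Adding a new field is accepted.
properties = put_mapping(properties, {"new_field": {"type": "text"}})
print(sorted(properties))

# Changing the type of an existing field is rejected.
try:
    put_mapping(properties, {"author_id": {"type": "text"}})
except ValueError as err:
    print("rejected:", err)
```

The restriction makes sense because documents already indexed under the old type would not be re-indexed under the new one.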

Complex mapping types and the underlying structure of object fields

multivalue field
```
{

    "tags": ["tag1", "tag2"]

}
```
It is indexed the same way as a string field; the values in the array must not mix data types
empty field
```
null，[]，[null]
```

object field
Initialize the data:

PUT /company/employee/

{

  "address": {

    "country": "china",

    "province": "guangdong",

    "city": "guangzhou"

  },

  "name": "jack",

  "age": ,

  "join_date": "2017-01-01"

}

View the mapping

GET /company/_mapping/employee

{

  "company": {

    "mappings": {

      "employee": {

        "properties": {

          "address": {

            "properties": {

              "city": {

                "type": "text",

                "fields": {

                  "keyword": {

                    "type": "keyword",

                    "ignore_above":

                  }

                }

              },

              "country": {

                "type": "text",

                "fields": {

                  "keyword": {

                    "type": "keyword",

                    "ignore_above":

                  }

                }

              },

              "province": {

                "type": "text",

                "fields": {

                  "keyword": {

                    "type": "keyword",

                    "ignore_above":

                  }

                }

              }

            }

          },

          "age": {

            "type": "long"

          },

          "join_date": {

            "type": "date"

          },

          "name": {

            "type": "text",

            "fields": {

              "keyword": {

                "type": "keyword",

                "ignore_above":

              }

            }

          }

        }

      }

    }

  }

}

How an object field is stored internally

{

  "address": {

    "country": "china",

    "province": "guangdong",

    "city": "guangzhou"

  },

  "name": "jack",

  "age": ,

  "join_date": "2017-01-01"

}

↓↓↓↓

{

    "name":            [jack],

    "age":          [],

    "join_date":      [--],

    "address.country":         [china],

    "address.province":   [guangdong],

    "address.city":  [guangzhou]

}

{

    "authors": [

        { "age": , "name": "Jack White"},

        { "age": , "name": "Tom Jones"},

        { "age": , "name": "Kitty Smith"}

    ]

}

↓↓↓↓

{

    "authors.age":    [, , ],

    "authors.name":   [jack, white, tom, jones, kitty, smith]

}
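The flattening shown above can be sketched in Python. This is an illustrative model rather than the ES implementation: `flatten` is a hypothetical helper, it treats every string as analyzed text (so a date string would be split as well), and the ages in the sample document are placeholders because the original values are not given.

```python
import re

def flatten(doc, prefix=""):
    """Flatten nested objects and arrays into {dotted.path: [terms]}."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_path, terms in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(terms)
            elif isinstance(item, str):
                # text values are lowercased and split into word terms
                flat.setdefault(path, []).extend(re.findall(r"\w+", item.lower()))
            else:
                flat.setdefault(path, []).append(item)
    return flat

# The ages are hypothetical placeholders; the original values are not given.
doc = {
    "authors": [
        {"age": 30, "name": "Jack White"},
        {"age": 40, "name": "Tom Jones"},
        {"age": 50, "name": "Kitty Smith"},
    ]
}
print(flatten(doc))
```

Note that the flattened form no longer records which age belongs to which name: all values of one path are merged into a single list.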