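How Arrays Are Indexed in ElasticSearch

1. Introduction

ElasticSearch has no dedicated array type. Any field can hold zero or more values, and a field that holds more than one value is simply treated as an array. The mapping does not change; the only constraint is that all values in the array share the same data type.

The rest of this article uses video data as an example to show how ElasticSearch indexes array data and how to search the values inside an array.

The test video data has the following format: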

{
    "media_id": 88992211,
    "tags": ["电影","科技","恐怖","电竞"]
}

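media_id is the video ID and tags holds the video's tags; one video can carry several tags. The business requirement is to retrieve all videos that carry a given tag.

The demonstration uses an ElasticSearch cluster running version 7.6.2.

2. Demonstration

2.1 Create the index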

PUT test_arrays
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "media_id": {
        "type": "long"
      },
      "tags": {
        "type": "text"
      }
    }
  }
}

2.2 Write test data into the test_arrays index

POST test_arrays/_doc
{
  "media_id": 887722,
  "tags": [
    "电影",
    "科技",
    "恐怖",
    "电竞"
  ]
}
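Because the same field accepts zero, one, or many values, documents with different tag counts can go into the index side by side. The sketch below builds a request body for the bulk API (POST test_arrays/_bulk) from three sample documents. It is only an illustration: the media_id values 887723 and 887724 and the helper name build_bulk_body are made up for this example.

```python
import json

# Hypothetical sample documents; only media_id 887722 comes from the article.
docs = [
    {"media_id": 887722, "tags": ["电影", "科技", "恐怖", "电竞"]},  # several values
    {"media_id": 887723, "tags": "电影"},  # a single value in the same field
    {"media_id": 887724, "tags": []},  # an empty array, i.e. no values
]


def build_bulk_body(documents):
    """Build an NDJSON body for `POST test_arrays/_bulk`.

    Each document contributes an action line followed by its source line.
    """
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body(docs)
print(bulk_body)
```

All three documents fit the same mapping, since the mapping only records the type of the individual values.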

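2.3 Inspect how test_arrays indexes the tags field

Analyzing the four tag values against the tags field (for example with the _analyze API) returns the following tokens: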

{
  "tokens" : [
    {
      "token" : "电",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "影",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "科",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 102
    },
    {
      "token" : "技",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 103
    },
    {
      "token" : "恐",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 204
    },
    {
      "token" : "怖",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 205
    },
    {
      "token" : "电",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "<IDEOGRAPHIC>",
      "position" : 306
    },
    {
      "token" : "竞",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<IDEOGRAPHIC>",
      "position" : 307
    }
  ]
}

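The response shows that each value in the tags array is split into several tokens, one per character. The positions jump between array values (from 1 to 102, from 103 to 204, and so on) because a text field inserts a position gap between the values of an array; this gap is controlled by position_increment_gap and defaults to 100.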
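The offsets and positions in that token list can be reproduced with a small model. The function below is a simplified sketch, not ElasticSearch code: it assumes one token per character (which is how the default analyzer treats this Chinese text), a position gap of 100 between array values, and one extra offset unit between values. The name analyze_array is made up for this example.

```python
def analyze_array(values, position_increment_gap=100):
    """Model how a multi-valued text field is tokenized, one token per character."""
    tokens = []
    position = -1      # position of the last emitted token
    offset_base = 0    # start offset of the current array value
    for index, value in enumerate(values):
        if index > 0:
            # Gap between consecutive array values.
            position += position_increment_gap
        for i, ch in enumerate(value):
            position += 1
            tokens.append({
                "token": ch,
                "start_offset": offset_base + i,
                "end_offset": offset_base + i + 1,
                "position": position,
            })
        # One extra offset unit separates consecutive values.
        offset_base += len(value) + 1
    return tokens


for tok in analyze_array(["电影", "科技", "恐怖", "电竞"]):
    print(tok)
```

The gap keeps phrase queries from matching across two different array values: the last character of one tag and the first character of the next are never adjacent in position.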

2.4 Search the values in the tags array

POST test_arrays/_search
{
  "query": {
    "match": {
      "tags": "电影"
    }
  }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.68324494,
    "hits" : [
      {
        "_index" : "test_arrays",
        "_type" : "_doc",
        "_id" : "MyhnpXQBGXOapfjvSpOW",
        "_score" : 0.68324494,
        "_source" : {
          "media_id" : 887722,
          "tags" : [
            "电影",
            "科技",
            "恐怖",
            "电竞"
          ]
        }
      }
    ]
  }
}

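Partial match on a single character (this works because every character is indexed as its own token; it is not a fuzzy query):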

POST test_arrays/_search
{
  "query": {
    "match": {
      "tags": "影"
    }
  }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "test_arrays",
        "_type" : "_doc",
        "_id" : "MyhnpXQBGXOapfjvSpOW",
        "_score" : 0.2876821,
        "_source" : {
          "media_id" : 887722,
          "tags" : [
            "电影",
            "科技",
            "恐怖",
            "电竞"
          ]
        }
      }
    ]
  }
}
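One side effect of per-character tokens is worth spelling out. A match query analyzes its input with the same analyzer and combines the resulting terms with OR by default, so a query for 电影 also hits a document tagged only 电竞, because both contain 电. The sketch below models that behavior in plain Python; it is a simplified illustration rather than ElasticSearch's actual scoring logic, and the function names are made up for this example.

```python
def char_tokens(text):
    """Stand-in for the default analyzer on Chinese text: one token per character."""
    return [ch for ch in text if not ch.isspace()]


def match_or(doc_tags, query):
    """Model a match query with the default OR operator.

    The document is a hit as soon as it shares one token with the query.
    """
    indexed = {tok for tag in doc_tags for tok in char_tokens(tag)}
    return any(tok in indexed for tok in char_tokens(query))


print(match_or(["电影", "科技"], "电影"))  # True: the tag itself is present
print(match_or(["电竞"], "电影"))          # True: only the character 电 is shared
print(match_or(["科技"], "电影"))          # False: no character in common
```

The second call is the problem case: the video is returned although it does not carry the tag that was searched for.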

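The business requirement is an exact match on the tag: find all videos that carry a given tag. The text mapping splits every tag into single characters, which is too loose for that, so the tags field should be mapped as keyword instead. The type of an existing field cannot be changed in place, so delete test_arrays first (or create a new index and reindex into it), then create the index with the following mappings: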

PUT test_arrays
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "media_id": {
        "type": "long"
      },
      "tags": {
        "type": "keyword"
      }
    }
  }
}

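With this mapping, each value in the tags array is indexed as exactly one token, so an exact lookup by tag returns only the videos that carry that tag: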

{
  "tokens" : [
    {
      "token" : "电影",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "科技",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "恐怖",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "电竞",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    }
  ]
}
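For a keyword field, a term query is the usual way to ask for an exact, unanalyzed match. The sketch below builds such a request body (to be sent to POST test_arrays/_search) and pairs it with a plain-Python model of how a keyword array matches. The helper names build_term_query and keyword_match are made up for this example, and the model ignores scoring entirely.

```python
import json


def build_term_query(tag):
    """Request body for an exact, unanalyzed lookup on the tags field."""
    return {"query": {"term": {"tags": {"value": tag}}}}


def keyword_match(doc_tags, tag):
    """Model of a keyword field: every array value is one whole term."""
    return tag in set(doc_tags)


print(json.dumps(build_term_query("电影"), ensure_ascii=False, indent=2))

video_tags = ["电影", "科技", "恐怖", "电竞"]
print(keyword_match(video_tags, "电影"))  # True: the whole tag is indexed
print(keyword_match(video_tags, "影"))    # False: a partial tag no longer matches
print(keyword_match(["电竞"], "电影"))    # False: a shared character is not enough
```

Compared with the text mapping, both the single-character match and the accidental match on a shared character disappear.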

In real systems the tags are not always stored as an array. Sometimes all tags sit in a single string, separated by commas:

{
    "media_id": 88992211,
    "tags": "电影,科技,恐怖,电竞"
}

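Exact search for a single tag also works with this storage format. The field stays a text field; what changes is its analyzer, which splits the string on commas. After deleting the existing index, create test_arrays with the following configuration: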

PUT test_arrays
{
  "settings": {
    "number_of_shards": 1,
    "analysis" : {
      "analyzer" : {
        "comma_analyzer": {
          "tokenizer": "comma_tokenizer"
        }
      },
      "tokenizer" : {
        "comma_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": ","
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "media_id": {
        "type": "long"
      },
      "tags": {
        "search_analyzer" : "simple",
        "analyzer" : "comma_analyzer",
        "type" : "text"
      }
    }
  }
}

Write one test document into the test_arrays index:

POST test_arrays/_doc
{
  "media_id": 887722,
  "tags": "电影,科技,恐怖,电竞"
}

The tags field is analyzed as follows. Again, each tag maps to exactly one token:

{
  "tokens" : [
    {
      "token" : "电影",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "科技",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "恐怖",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "电竞",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    }
  ]
}

Query by exact tag match.

Request:

POST test_arrays/_search
{
  "query": {
    "match": {
      "tags": "电影"
    }
  }
}

Response:

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "test_arrays",
        "_type" : "_doc",
        "_id" : "3i2ipXQBGXOapfjv3THH",
        "_score" : 0.2876821,
        "_source" : {
          "media_id" : 887722,
          "tags" : "电影,科技,恐怖,电竞"
        }
      }
    ]
  }
}
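This configuration uses different analyzers at index time and at search time, and it helps to see where the two agree. The sketch below is a rough model, not ElasticSearch code: comma_analyzer is modeled as a plain split on commas, and the built-in simple analyzer as "runs of letters, lowercased". The tags SciFi and 4K are hypothetical examples that do not appear in the test data.

```python
import re


def comma_analyzer(text):
    """Model of the index-time analyzer: split on ',' and keep each piece unchanged."""
    return [piece for piece in text.split(",") if piece]


def simple_analyzer(text):
    """Model of the built-in simple analyzer: runs of letters, lowercased."""
    return [run.lower() for run in re.findall(r"[^\W\d_]+", text)]


def match(stored_tags, query):
    """A hit requires a search-time term that equals an index-time term."""
    indexed = set(comma_analyzer(stored_tags))
    return any(term in indexed for term in simple_analyzer(query))


print(match("电影,科技,恐怖,电竞", "电影"))  # True: same term on both sides
print(match("电影,科技,恐怖,电竞", "影"))    # False: partial tags do not match
print(match("SciFi,4K", "SciFi"))            # False: searched as "scifi", indexed as "SciFi"
print(match("SciFi,4K", "4K"))               # False: the digit is dropped at search time
```

For the Chinese tags in this article both analyzers produce the same terms, so the search works. If tags may contain uppercase letters or digits, one possible adjustment is to use comma_analyzer at search time as well; that is a suggestion beyond what was tested here.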

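3. Summary

ElasticSearch lets one data type serve both single-valued and multi-valued fields. This keeps the number of data types small and also reduces the complexity of index configuration, which makes it a very good design.

Tag data can be organized either as an array or as a delimiter-separated string, which shows how flexible ElasticSearch is.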