Building an ES Search Service from Scratch (Part 4): Pinyin Search

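1. Introduction

The previous post covered synonym search in ES, which made our search more capable. That is still not enough, though: in practice we may also want a search for "fanqie" to return results that contain "番茄" (tomato). This requires pinyin search, and this post shows how to implement it.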

2. Installing the ES Pinyin Plugin

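2.1 About the pinyin plugin

elasticsearch-analysis-pinyin is an analysis plugin that converts Chinese characters into pinyin.

GitHub: https://github.com/medcl/elasticsearch-analysis-pinyin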

2.2 Installation steps

① Change to the ES bin directory

$ cd /usr/local/elasticsearch/bin/

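② Install the pinyin plugin with the elasticsearch-plugin command. The plugin release must match your ES version; the URL below is for 5.5.3.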

$ ./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v5.5.3/elasticsearch-analysis-pinyin-5.5.3.zip

③ After a successful install, an analysis-pinyin folder appears in the plugins directory. Restart ES so that the plugin is loaded.
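
Besides looking at the plugins directory, a running node can be asked which plugins it has loaded. The following is a minimal sketch, assuming the node listens on http://localhost:9200 and that the plugin is reported under the name analysis-pinyin (the same name as the folder above). The helper names are hypothetical and only illustrate the idea.

```python
import urllib.request


def cat_plugins_url(base_url="http://localhost:9200"):
    """URL of the _cat/plugins endpoint, which lists plugins per node."""
    return base_url.rstrip("/") + "/_cat/plugins"


def has_plugin(cat_output, plugin="analysis-pinyin"):
    """Return True if any row of the _cat/plugins text contains the plugin name."""
    return any(plugin in line.split() for line in cat_output.splitlines())


def fetch_cat_plugins(base_url="http://localhost:9200"):
    """Fetch the plain-text plugin table from a running node (assumed address)."""
    with urllib.request.urlopen(cat_plugins_url(base_url)) as resp:
        return resp.read().decode("utf-8")


# Usage against a running node:
#   print(has_plugin(fetch_cat_plugins()))
```

The check only looks for the plugin name as a whitespace-separated column, so it does not depend on the exact column layout of the response.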

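3. Custom Analyzer

To use the pinyin plugin, the index must be created from a custom template, and that template must define a custom analyzer.

3.1 Configuration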

① In the yb_knowledge.json template created in the previous post, edit the "settings" section and add a custom analyzer to it

"analysis": {
    "filter": {
        ... (rest omitted) ...
        "pinyin_filter": {
            "type": "pinyin",
            "keep_first_letter": true,
            "keep_separate_first_letter": false,
            "keep_full_pinyin": true,
            "keep_joined_full_pinyin": true,
            "none_chinese_pinyin_tokenize": false,
            "remove_duplicated_term": true,
            "keep_original": true,
            "limit_first_letter_length": 50,
            "lowercase": true
        }
    },
    "analyzer": {
        ... (rest omitted) ...
        "ik_synonym_pinyin": {
            "type": "custom",
            "tokenizer": "ik_smart",
            "filter": ["synonym_filter", "pinyin_filter"]
        }
    }
}

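Notes on the custom analyzer:

First, declare a new token filter named pinyin_filter. Its type is pinyin, which refers to the pinyin plugin; the remaining options are described in the GitHub project's documentation.
Next, declare a new analyzer named ik_synonym_pinyin. Its type is custom; its tokenizer is ik_smart, the ik_smart segmentation mode of the IK analyzer; and filter lists the token filters to use, which are applied in the order given. There can be several: here it uses the pinyin_filter defined above together with the synonym_filter from the previous post.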
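
To get a feel for what the keep_* switches in pinyin_filter control, here is a toy model in Python. It is not the plugin's real algorithm: the two-character pinyin table is hard-coded, the function name and token order are made up for illustration, and only the options that are easy to show on a single term are modelled.

```python
# Toy model of some pinyin_filter options; NOT the plugin's real algorithm.
# The two-character table below is hard-coded for illustration only.
TOY_PINYIN = {"番": "fan", "茄": "qie"}


def toy_pinyin_tokens(term, keep_full_pinyin=True, keep_joined_full_pinyin=True,
                      keep_first_letter=True, keep_original=True,
                      remove_duplicated_term=True):
    """Return the tokens this toy model emits for one Chinese term."""
    syllables = [TOY_PINYIN[ch] for ch in term]
    tokens = []
    if keep_full_pinyin:
        tokens.extend(syllables)                         # one token per character
    if keep_original:
        tokens.append(term)                              # the untouched term
    if keep_joined_full_pinyin:
        tokens.append("".join(syllables))                # all syllables joined
    if keep_first_letter:
        tokens.append("".join(s[0] for s in syllables))  # initials only
    if remove_duplicated_term:
        tokens = list(dict.fromkeys(tokens))             # drop repeats, keep order
    return tokens


print(toy_pinyin_tokens("番茄"))
# ['fan', 'qie', '番茄', 'fanqie', 'fq']
```

Turning a switch off removes the corresponding token, for example keep_first_letter=False drops "fq".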

② At the same time, edit the properties section under "mappings" and add a fields parameter to the two search fields, knowledgeTitle and knowledgeContent. The fields parameter lets the same field be indexed in more than one way, turning the original simple mapping into a multi-field mapping. Here we add a sub-field named pinyin that uses the custom ik_synonym_pinyin analyzer defined above.

"mappings": {
    "knowledge": {
        ... (rest omitted) ...
        "properties": {
            ... (rest omitted) ...
            "knowledgeTitle": {
                "type": "text",
                "analyzer": "ik_synonym_max",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "analyzer": "ik_synonym_pinyin"
                    }
                }
            },
            "knowledgeContent": {
                "type": "text",
                "analyzer": "ik_synonym_max",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "analyzer": "ik_synonym_pinyin"
                    }
                }
            }
        }
    }
}
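
With this mapping in place, a query can target the pinyin sub-fields as knowledgeTitle.pinyin and knowledgeContent.pinyin. The sketch below builds one possible search body using a multi_match query over the original fields and their pinyin sub-fields. The function name is hypothetical and the choice of multi_match is an assumption for illustration, not necessarily the query used elsewhere in this series.

```python
import json


def build_pinyin_search(keyword):
    """Build a search body that matches both the Chinese text and its pinyin."""
    return {
        "query": {
            "multi_match": {
                "query": keyword,
                "fields": [
                    "knowledgeTitle",
                    "knowledgeTitle.pinyin",
                    "knowledgeContent",
                    "knowledgeContent.pinyin",
                ],
            }
        }
    }


body = build_pinyin_search("fanqie")
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The resulting JSON could be sent as the request body of a search against the yb_knowledge index.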

③ Finally, delete the previously created yb_knowledge index and restart Logstash so that the index is rebuilt from the updated template
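
Deleting an index is a single DELETE request on the index name. A minimal sketch, assuming ES is reachable at http://localhost:9200 (the helper name is hypothetical); the request is only built here, not sent, because sending it removes all data in the index.

```python
import urllib.request


def delete_index_request(index="yb_knowledge", base_url="http://localhost:9200"):
    """Build (but do not send) the DELETE request that drops an index."""
    url = f"{base_url.rstrip('/')}/{index}"
    return urllib.request.Request(url, method="DELETE")


req = delete_index_request()
print(req.get_method(), req.full_url)
# DELETE http://localhost:9200/yb_knowledge

# Sending it is destructive, so it is left commented out:
#   urllib.request.urlopen(req)
```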

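Note: once the index has been rebuilt, the _analyze API can be used to check the tokenization result

curl -XGET 'http://localhost:9200/yb_knowledge/_analyze?pretty' -H 'Content-Type: application/json' -d '
{
    "analyzer": "ik_synonym_pinyin",
    "text": "番茄"
}'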
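
The same _analyze call can be made from a script, which is handy when checking several terms. This is a minimal sketch using only the Python standard library, assuming the node address http://localhost:9200; the function name is hypothetical.

```python
import json
import urllib.request


def analyze_request(text, analyzer="ik_synonym_pinyin",
                    index="yb_knowledge", base_url="http://localhost:9200"):
    """Build an _analyze request for the given text and analyzer."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = analyze_request("番茄")

# With a node running, the tokens could be read like this:
#   with urllib.request.urlopen(req) as resp:
#       tokens = json.loads(resp.read().decode("utf-8"))["tokens"]
```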

Note: with the synonyms "番茄, 西红柿, 圣女果" already configured, the tokenization result is as follows

{
    "tokens": [
        { "token": "fan", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 0 },
        { "token": "番茄", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 0 },
        { "token": "fanqie", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 0 },
        { "token": "fq", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 0 },
        { "token": "qie", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 1 },
        { "token": "xi", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 2 },
        { "token": "hong", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 3 },
        { "token": "shi", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 4 },
        { "token": "西红柿", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 4 },
        { "token": "xihongshi", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 4 },
        { "token": "xhs", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 4 },
        { "token": "sheng", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 5 },
        { "token": "nv", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 6 },
        { "token": "guo", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 7 },
        { "token": "圣女果", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 7 },
        { "token": "shengnvguo", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 7 },
        { "token": "sng", "start_offset": 0, "end_offset": 2, "type": "SYNONYM", "position": 7 }
    ]
}
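
A response this long is easier to read when the tokens are grouped by position. The helper below is a small sketch for doing that on a parsed _analyze response; the function name is hypothetical, and the sample holds only the first five tokens of the response above, trimmed to the two keys the helper uses.

```python
from collections import defaultdict


def tokens_by_position(analyze_response):
    """Group the token strings of an _analyze response by their position."""
    grouped = defaultdict(list)
    for tok in analyze_response["tokens"]:
        grouped[tok["position"]].append(tok["token"])
    return dict(sorted(grouped.items()))


# First five tokens of the response above, trimmed to the keys used here.
sample = {"tokens": [
    {"token": "fan", "position": 0},
    {"token": "番茄", "position": 0},
    {"token": "fanqie", "position": 0},
    {"token": "fq", "position": 0},
    {"token": "qie", "position": 1},
]}

print(tokens_by_position(sample))
# {0: ['fan', '番茄', 'fanqie', 'fq'], 1: ['qie']}
```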

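4. Conclusion

That completes pinyin search. The last two posts dealt with ES plugins and Logstash custom template configuration and did not involve any Java code. The next post will show how to highlight search results through the Java API.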