Requirements

A product named 雪花啤酒 must be findable by searching for 雪花, 啤酒, 雪花啤酒, xh, pj, xh啤酒, or 雪花pj.

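Installing the ik plugin

Follow https://www.cnblogs.com/LQBlog/p/10443862.html; the step that modifies the source code can be skipped.

Installing the pinyin analyzer

As with ik, download the plugin, build the package, move it into the es plugins directory, and rename the directory to pinyin: https://github.com/medcl/elasticsearch-analysis-pinyin

Test

GET request: http://127.0.0.1:9200/_analyze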

body:

{

"analyzer":"pinyin",

"text":"雪花啤酒"

}

Response:

{

    "tokens": [

        {

            "token": "xue",

            "start_offset": 0,

            "end_offset": 0,

            "type": "word",

            "position": 0

        },

        {

            "token": "xhpj",

            "start_offset": 0,

            "end_offset": 0,

            "type": "word",

            "position": 0

        },

        {

            "token": "hua",

            "start_offset": 0,

            "end_offset": 0,

            "type": "word",

            "position": 1

        },

        {

            "token": "pi",

            "start_offset": 0,

            "end_offset": 0,

            "type": "word",

            "position": 2

        },

        {

            "token": "jiu",

            "start_offset": 0,

            "end_offset": 0,

            "type": "word",

            "position": 3

        }

    ]

}

This shows that the plugin was installed successfully.
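The same check can be scripted. The following is a minimal Python sketch, assuming the local node at 127.0.0.1:9200 used in the request above. The helper names `build_analyze_request` and `tokens_by_position` are illustrative and not part of any Elasticsearch client. It builds the `_analyze` request and groups the returned tokens by position, which makes it easy to see that xue and xhpj share position 0.

```python
import json
import urllib.request
from collections import defaultdict

ES_URL = "http://127.0.0.1:9200"  # assumption: local single-node instance, as in the post


def build_analyze_request(text, analyzer="pinyin"):
    """Build (but do not send) the _analyze request shown above."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        ES_URL + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # _analyze accepts GET or POST; POST avoids a GET with a body
    )


def tokens_by_position(response):
    """Group the tokens of an _analyze response by their position."""
    grouped = defaultdict(list)
    for tok in response["tokens"]:
        grouped[tok["position"]].append(tok["token"])
    return dict(grouped)


# Trimmed copy of the response shown above.
sample = {"tokens": [
    {"token": "xue", "position": 0},
    {"token": "xhpj", "position": 0},
    {"token": "hua", "position": 1},
    {"token": "pi", "position": 2},
    {"token": "jiu", "position": 3},
]}
print(tokens_by_position(sample))
```

To run it against a live node, pass the request to `urllib.request.urlopen` and decode the body with `json.load`.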

Testing Chinese plus pinyin search

Custom mapping and custom analyzers

PUT request: http://127.0.0.1:9200/opcm3

body:

{

    "settings": {

        "analysis": {

            "analyzer": {

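                "ik_pinyin_analyzer": { // custom analyzer named ik_pinyin_analyzer

                    "type": "custom", // marks this as a custom analyzer

                    "tokenizer": "ik_smart", // tokenize with ik: ik_smart is coarse-grained, ik_max_word is the finest-grained

                    "filter": ["my_pinyin"] // pass the ik tokens to this filter for further tokenization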

                },

                "onlyOne_analyzer": {

                    "tokenizer": "onlyOne_pinyin"

                }

            },

            "tokenizer": {

                "onlyOne_pinyin": {

                    "type": "pinyin",

                    "keep_separate_first_letter": "true",

                    "keep_full_pinyin":"false"

                }

            },

            "filter": {

                "my_pinyin": { // filter definition

                    "type": "pinyin",

                    "keep_joined_full_pinyin": true, // also emit the joined full pinyin, e.g. 雪花 -> xuehua

                    "keep_separate_first_letter": true, // also emit each first letter on its own, so 雪花 yields xue, hua, xuehua, xh, x, h

                    "none_chinese_pinyin_tokenize": true // xh is tokenized into x, h, xh

                }

            }

        }

    },

    "mappings": {

        "doc": {

            "properties": {

                "productName": {

                    "type": "text",

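                    "analyzer": "ik_pinyin_analyzer", // index with the custom analyzer: ik tokenizes the Chinese, then the filter hands each token to pinyin

                    "fields": { // not used yet; kept as a reminder that a different search analyzer can be chosen per condition this way

                        "keyword_once_pinyin": { // extra analyzed sub-field productName.keyword_once_pinyin: indexed only, not present in _source; query it when the input is a single letter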

                            "type": "text",

                            "analyzer": "onlyOne_analyzer"

                        }

                    }

                }

            }

        }

    }

}
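The `//` annotations in the body above are not part of strict JSON, so a parser such as Python's `json` module rejects the text as written. If the body is kept in a file and loaded by a script, the comments have to be stripped first. The following is a minimal sketch of one way to do that; `strip_line_comments` and the `doc_url` key in the sample are hypothetical names used only for illustration.

```python
import json


def strip_line_comments(text):
    """Remove // comments that are outside of JSON string literals."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < len(text):
                out.append(text[i + 1])  # keep the escaped character verbatim
                i += 1
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif text.startswith("//", i):
            # skip to the end of the line, leaving the newline in place
            while i < len(text) and text[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


annotated = '''
{
    "tokenizer": "ik_smart", // coarse-grained ik tokenizer
    "filter": ["my_pinyin"], // then the pinyin filter
    "doc_url": "http://127.0.0.1:9200/opcm3"
}
'''
settings = json.loads(strip_line_comments(annotated))
print(settings["tokenizer"], settings["doc_url"])
```

The string check matters because URLs inside values also contain `//` and must survive untouched.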

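My understanding of the filter

As I understand it, ik tokenizes the text first, and each resulting token is then handed through the filter to the pinyin tokenizer. ik splits 雪花啤酒 into 雪花 and 啤酒. pinyin then turns 雪花 into xue, hua, xh, x, h and 啤酒 into pi, jiu, pj, p, j.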
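The two-stage flow described above can be imitated in a few lines of Python. This is a toy simulation, not the plugins' code: the `IK_SMART` and `PINYIN` tables are hard-coded stand-ins for ik and the pinyin filter, and `pinyin_filter` and `analyze` are hypothetical helper names.

```python
# Toy lookup tables standing in for the real plugins (assumption: these
# outputs mirror the behaviour described above, not the plugins' code).
IK_SMART = {"雪花啤酒": ["雪花", "啤酒"]}
PINYIN = {"雪": "xue", "花": "hua", "啤": "pi", "酒": "jiu"}


def pinyin_filter(token):
    """Expand one ik token into full pinyin, abbreviation and single letters."""
    syllables = [PINYIN[ch] for ch in token]
    initials = [s[0] for s in syllables]
    return syllables + ["".join(initials)] + initials


def analyze(text):
    """Tokenize with the ik stand-in, then run every token through the filter."""
    return {token: pinyin_filter(token) for token in IK_SMART[text]}


for token, terms in analyze("雪花啤酒").items():
    print(token, terms)
```

The point of the sketch is the ordering: the filter never sees the whole string, only the tokens ik has already produced.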

Insert test data

PUT request: http://127.0.0.1:9200/opcm3/doc/1

body:

{

    "productName":"雪花纯生勇闯天涯9度100ml"

}

PUT request: http://127.0.0.1:9200/opcm3/doc/2

body:

{

    "productName":"金威纯生勇闯天涯9度100ml"

}

Inspect the indexed terms

GET request: http://127.0.0.1:9200/opcm3/doc/{id}/_termvectors?fields=productName

GET request: http://127.0.0.1:9200/opcm3/doc/{id}/_termvectors?fields=productName.keyword_once_pinyin
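A `_termvectors` response nests the terms under `term_vectors.<field>.terms`, which is awkward to read by eye. The following is a small Python sketch that regroups the terms by position; `terms_by_position` is a hypothetical helper, and the sample dictionary is hand-written in the shape of a response rather than copied from a real one.

```python
def terms_by_position(termvectors_response, field):
    """Return {position: sorted terms} for one field of a _termvectors response."""
    terms = termvectors_response["term_vectors"][field]["terms"]
    by_position = {}
    for term, info in terms.items():
        for occurrence in info.get("tokens", []):
            by_position.setdefault(occurrence["position"], []).append(term)
    return {pos: sorted(found) for pos, found in sorted(by_position.items())}


# Hand-written sample in the shape of a _termvectors response (not real output).
sample = {
    "term_vectors": {
        "productName": {
            "terms": {
                "xue": {"term_freq": 1, "tokens": [{"position": 0}]},
                "x": {"term_freq": 1, "tokens": [{"position": 0}]},
                "hua": {"term_freq": 1, "tokens": [{"position": 1}]},
                "h": {"term_freq": 1, "tokens": [{"position": 1}]},
            }
        }
    }
}
print(terms_by_position(sample, "productName"))
```

Seeing which terms share a position is what later explains unexpected phrase matches.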

Test search

http://127.0.0.1:9200/opcm3/_search

{

    "query":{

        "match":{

            "productName":{

                "query":"雪花纯生"

            }

        }

    }

}

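This returns both 雪花纯生 and 金威纯生. Choose match or match_phrase depending on whether you want loose matching or adjacent (phrase) matching.

My requirement is adjacent matching, so I changed the query to: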

{

    "query":{

        "match_phrase":{

            "productName":{

                "query":"雪花纯生"

            }

        }

    }

}

Now only 雪花纯生 is returned.
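The difference between the two query types can be sketched in plain Python. This is a simplified model, assuming a hypothetical tokenization of the two product names and ignoring scoring and the pinyin terms entirely: `match` accepts a document that contains any query term, while `match_phrase` requires the terms to be adjacent and in order.

```python
def match(doc_tokens, query_tokens):
    """match: a document qualifies if it contains any query term (default OR)."""
    return any(term in doc_tokens for term in query_tokens)


def match_phrase(doc_tokens, query_tokens):
    """match_phrase: all query terms must appear consecutively, in order."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


# Assumed tokenization, for illustration only.
docs = {
    "雪花纯生勇闯天涯": ["雪花", "纯生", "勇闯天涯"],
    "金威纯生勇闯天涯": ["金威", "纯生", "勇闯天涯"],
}
query = ["雪花", "纯生"]

for name, tokens in docs.items():
    print(name, match(tokens, query), match_phrase(tokens, query))
```

Both documents pass `match` because they share 纯生, but only the first passes `match_phrase`.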

Searching for the 雪花纯生9度 product

{

    "query":{

        "match_phrase":{

            "productName":{

                "query":"雪花纯生9度"

            }

        }

    }

}

This search returns no data.

For the reason, see: https://www.cnblogs.com/LQBlog/p/10580247.html

Changing the query as follows makes the search return the document:

{

    "query":{

        "match_phrase":{

            "productName":{

                "query":"雪花纯生9度",

                "slop":5

            }

        }

    }

}
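How `slop` relaxes a phrase query can be illustrated with a simplified model in Python. This is a sketch under stated assumptions: the token positions in `doc` are hypothetical, the function `sloppy_phrase_match` is not an Elasticsearch API, and the real scorer handles repeated terms and scoring that are ignored here. The idea is that each query term may sit at a shifted position in the document, and the spread of those shifts must not exceed `slop`.

```python
from itertools import product


def sloppy_phrase_match(doc_positions, query_terms, slop=0):
    """Simplified slop check: pick one position per query term and require that
    the spread of (doc position - query position) stays within slop."""
    candidates = [doc_positions.get(term, []) for term in query_terms]
    if any(not positions for positions in candidates):
        return False  # a query term is missing from the document
    for combo in product(*candidates):
        shifts = [pos - i for i, pos in enumerate(combo)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False


# Hypothetical positions: one extra token sits between 纯生 and 9度.
doc = {"雪花": [0], "纯生": [1], "勇闯天涯": [2], "9度": [3]}
query = ["雪花", "纯生", "9度"]

print(sloppy_phrase_match(doc, query, slop=0))
print(sloppy_phrase_match(doc, query, slop=5))
```

With `slop=0` the terms must be strictly adjacent, so the intervening token blocks the match; any `slop` of 1 or more lets this particular example through.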

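The pinyin analyzer supports many more parameters; see the plugin README: https://github.com/medcl/elasticsearch-analysis-pinyin

Troubleshooting and fixing the model above

Add test data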

{
"productName":"纯生"
}

{
"productName":"纯爽"
}

Test

Search

{

    "query":{

        "match_phrase":{

            "productName":{

                "query":"纯生",

                "slop":5

            }

        }

    }

}

Result

{

    "took": 3,

    "timed_out": false,

    "_shards": {

        "total": 5,

        "successful": 5,

        "skipped": 0,

        "failed": 0

    },

    "hits": {

        "total": 2,

        "max_score": 2.8277423,

        "hits": [

            {

                "_index": "opcm3",

                "_type": "doc",

                "_id": "1",

                "_score": 2.8277423,

                "_source": {

                    "productName": "纯爽"

                }

            },

            {

                "_index": "opcm3",

                "_type": "doc",

                "_id": "2",

                "_score": 1.4466299,

                "_source": {

                    "productName": "纯生"

                }

            }

        ]

    }

}

Note that 纯爽 is returned as well.

Investigation

1. Inspect the indexed terms of the two documents

http://127.0.0.1:9200/opcm3/doc/2/_termvectors?fields=productName

纯生: [c, chun, s, sheng]

纯爽: [c, chun, s, shuang]

2. Inspect how the query is analyzed

http://127.0.0.1:9200/opcm3/_validate/query?explain

{

    "query":{

        "match_phrase":{

            "productName":{

                "query":"纯生",

                "slop":5

            }

        }

    }

}

Response:

{

    "valid": true,

    "_shards": {

        "total": 1,

        "successful": 1,

        "failed": 0

    },

    "explanations": [

        {

            "index": "opcm3",

            "valid": true,

            "explanation": "productName:\"(c chun) (s sheng)\"~5"

        }

    ]

}

The explanation can be read as (c OR chun) AND (s OR sheng).

So the single-letter terms c and s are what match 纯爽.
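The false match can be reproduced in a few lines of Python. This is a minimal sketch, assuming that c and chun share position 0 and that s and sheng (or shuang) share position 1, as the term lists and the explain output above suggest. `phrase_match` is a hypothetical helper that ignores slop and scoring.

```python
def phrase_match(doc_terms, query_alternatives):
    """Each query position accepts any of its alternatives; positions must be
    consecutive in the document (slop is ignored in this sketch)."""
    n = len(query_alternatives)
    for start in range(len(doc_terms) - n + 1):
        if all(doc_terms[start + i] & query_alternatives[i] for i in range(n)):
            return True
    return False


# Terms stored per position, taken from the term lists above.
chun_sheng = [{"c", "chun"}, {"s", "sheng"}]    # 纯生
chun_shuang = [{"c", "chun"}, {"s", "shuang"}]  # 纯爽

# Query from the explain output: "(c chun) (s sheng)"
query = [{"c", "chun"}, {"s", "sheng"}]

print(phrase_match(chun_sheng, query))
print(phrase_match(chun_shuang, query))
print(chun_shuang[1] & query[1])  # the term responsible for the second position
```

Both documents match, and the intersection shows that the second position of 纯爽 is satisfied by the letter s alone.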

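Solution

Index with the finest-grained analysis and search with the coarsest-grained analysis.

For example, the document 纯生 is indexed as [chun, sheng, chunsheng, cs, c, s],

while the search text is analyzed as [chun, sheng, chunsheng].

The following model satisfies these searches: 雪花, 雪花cs, 雪花chunsheng, xhcs, xh纯生, and 雪花纯生 all return the correct data.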

{

    "settings": {

        "analysis": {

            "analyzer": {

                "ik_pinyin_analyzer": {

                    "type": "custom",

                    "tokenizer": "ik_smart",

                    "filter": ["pinyin_max_word_filter"]

                },

                "ik_pingying_smark": {

                    "type": "custom",

                    "tokenizer": "ik_smart",

                    "filter": ["pinyin_smark_word_filter"]

                }

            },

            "filter": {

                "pinyin_max_word_filter": {

                    "type": "pinyin",

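                    "keep_full_pinyin": "true", // emit the full pinyin of each character, e.g. 雪花 -> xue, hua

                    "keep_separate_first_letter":"true", // emit each first letter on its own, e.g. 雪花 -> x, h (the abbreviation xh comes from keep_first_letter, which is on by default)

                    "keep_joined_full_pinyin":true // also emit the joined full pinyin, e.g. 雪花 -> xuehua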

                },

                "pinyin_smark_word_filter": {

                    "type": "pinyin",

                    "keep_separate_first_letter": "false", // do not emit single first letters, e.g. no x, h for 雪花

                    "keep_first_letter":"false" // do not emit the first-letter abbreviation, e.g. no xh for 雪花

                }

            }

        }

    },

    "mappings": {

        "doc": {

            "properties": {

                "productName": {

                    "type": "text",

                    "analyzer": "ik_pinyin_analyzer", // analyzer used when indexing documents

                    "search_analyzer":"ik_pingying_smark" // analyzer used at search time

                }

            }

        }

    }

}
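The effect of using one analyzer for indexing and a coarser one for searching can be sketched as follows. This is a toy model in Python, not the plugin's behaviour: the term sets are modelled on the lists given above, positions are ignored, and `search` is a hypothetical helper that requires every query term to be present in the document.

```python
# Fine-grained index terms, modelled on the lists above (assumption).
INDEX_TERMS = {
    "纯生": {"chun", "sheng", "chunsheng", "cs", "c", "s"},
    "纯爽": {"chun", "shuang", "chunshuang", "cs", "c", "s"},
}


def search(query_terms):
    """Return the documents whose indexed terms contain every query term."""
    return sorted(name for name, terms in INDEX_TERMS.items()
                  if all(t in terms for t in query_terms))


# Old behaviour: the query 纯生 was expanded to first letters as well.
print(search(["c", "s"]))

# New behaviour: the search analyzer keeps only the full pinyin forms.
print(search(["chun", "sheng", "chunsheng"]))
```

The first query hits both documents; the second hits only 纯生, because the search side no longer produces the ambiguous single letters. A user who types the abbreviation cs still reaches both documents, since those terms remain in the index.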

Solution 2

