Source: https://github.com/medcl/elasticsearch-analysis-pinyin

Pinyin Analysis for Elasticsearch

This Pinyin Analysis plugin converts between Chinese characters and Pinyin, and integrates the NLP tools from https://github.com/NLPchina/nlp-lang.

--------------------------------------------------
| Pinyin Analysis Plugin        | Elasticsearch  |
--------------------------------------------------
| master                        | 5.x -> master  |
| 5.5.1                         | 5.5.1          |
| 5.3.3                         | 5.3.3          |
| 5.2.2                         | 5.2.2          |
| 5.1.2                         | 5.1.2          |
| 1.8.1                         | 2.4.1          |
| 1.7.5                         | 2.3.5          |
| 1.6.1                         | 2.2.1          |
| 1.5.0                         | 2.1.0          |
| 1.4.0                         | 2.0.x          |
| 1.3.0                         | 1.6.x          |
| 1.2.2                         | 1.0.x          |
--------------------------------------------------

The plugin includes an analyzer, a tokenizer, and a token filter, each named pinyin.

** Optional Parameters **

keep_first_letter: when enabled, emits the joined first letters, e.g. 刘德华 -> ldh, default: true
keep_separate_first_letter: when enabled, emits each first letter as a separate term, e.g. 刘德华 -> l,d,h, default: false. NOTE: query results may be too fuzzy because single-letter terms are very frequent
limit_first_letter_length: maximum length of the first-letter result, default: 16
keep_full_pinyin: when enabled, emits the full pinyin of each character, e.g. 刘德华 -> [liu,de,hua], default: true
keep_joined_full_pinyin: when enabled, emits the full pinyin joined into one term, e.g. 刘德华 -> [liudehua], default: false
keep_none_chinese: keep non-Chinese letters and numbers in the result, default: true
keep_none_chinese_together: keep non-Chinese letters together, e.g. DJ音乐家 -> DJ,yin,yue,jia; when set to false, DJ音乐家 -> D,J,yin,yue,jia, default: true. NOTE: keep_none_chinese must be enabled first
keep_none_chinese_in_first_letter: keep non-Chinese letters in the first-letter result, e.g. 刘德华AT2016 -> ldhat2016, default: true
keep_none_chinese_in_joined_full_pinyin: keep non-Chinese letters in the joined full pinyin, e.g. 刘德华2016 -> liudehua2016, default: false
none_chinese_pinyin_tokenize: split non-Chinese letters into separate pinyin terms when they are pinyin, e.g. liudehuaalibaba13zhuanghan -> liu,de,hua,a,li,ba,ba,13,zhuang,han, default: true. NOTE: keep_none_chinese and keep_none_chinese_together must be enabled first
keep_original: when enabled, also keeps the original input as a term, default: false
lowercase: lowercase non-Chinese letters, default: true
trim_whitespace: default: true
remove_duplicated_term: when enabled, duplicated terms are removed to save index space, e.g. de的 -> de, default: false. NOTE: position-related queries may be affected
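
The option names and defaults above can be collected into a small helper that builds a tokenizer or filter definition and rejects misspelled option names before they reach Elasticsearch. This is only a sketch: the helper name `pinyin_settings` and the constant `PINYIN_DEFAULTS` are hypothetical and not part of the plugin, and the defaults are copied from the list above.

```python
import json

# Defaults copied from the option list above (not read from the plugin itself).
PINYIN_DEFAULTS = {
    "keep_first_letter": True,
    "keep_separate_first_letter": False,
    "limit_first_letter_length": 16,
    "keep_full_pinyin": True,
    "keep_joined_full_pinyin": False,
    "keep_none_chinese": True,
    "keep_none_chinese_together": True,
    "keep_none_chinese_in_first_letter": True,
    "keep_none_chinese_in_joined_full_pinyin": False,
    "none_chinese_pinyin_tokenize": True,
    "keep_original": False,
    "lowercase": True,
    "trim_whitespace": True,
    "remove_duplicated_term": False,
}


def pinyin_settings(**overrides):
    """Build a {"type": "pinyin", ...} definition containing only the overrides.

    Hypothetical helper: raises ValueError for names missing from the list above.
    """
    unknown = sorted(set(overrides) - set(PINYIN_DEFAULTS))
    if unknown:
        raise ValueError("unknown pinyin option(s): " + ", ".join(unknown))
    settings = {"type": "pinyin"}
    settings.update(overrides)
    return settings


# Example: the two non-default options used by the tokenizer in step 1.
print(json.dumps(pinyin_settings(keep_original=True, remove_duplicated_term=True), indent=4))
```

Options left out of the call keep whatever default the installed plugin version applies, so the generated definition stays short.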

1. Create an index with a custom pinyin analyzer

curl -XPUT http://localhost:9200/medcl/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true
                }
            }
        }
    }
}'
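
The same request can be prepared from Python with the standard library. The sketch below only builds the request object; the function name `build_create_index_request` is hypothetical, and the host and index name are taken from the curl command above. The `Content-Type` header is an addition: the 5.x versions this walkthrough targets do not require it, but Elasticsearch 6.0 and later reject request bodies without one.

```python
import json
import urllib.request

# Same settings body as the curl example above.
INDEX_SETTINGS = {
    "index": {
        "analysis": {
            "analyzer": {"pinyin_analyzer": {"tokenizer": "my_pinyin"}},
            "tokenizer": {
                "my_pinyin": {
                    "type": "pinyin",
                    "keep_separate_first_letter": False,
                    "keep_full_pinyin": True,
                    "keep_original": True,
                    "limit_first_letter_length": 16,
                    "lowercase": True,
                    "remove_duplicated_term": True,
                }
            },
        }
    }
}


def build_create_index_request(base_url, index, settings):
    """Return a PUT request that would create `index` with `settings`."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/",
        data=json.dumps(settings).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_create_index_request("http://localhost:9200", "medcl", INDEX_SETTINGS)
print(request.get_method(), request.full_url)
# To actually send it, a running node with the plugin installed is needed:
#     urllib.request.urlopen(request)
```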

2. Test the analyzer by analyzing a Chinese name, such as 刘德华

http://localhost:9200/medcl/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e&analyzer=pinyin_analyzer

{
  "tokens" : [
    {
      "token" : "liu",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "de",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "hua",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "刘德华",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 4
    }
  ]
}
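
When checking analyzer output from a script, it can help to reduce the response to its token strings. The sketch below parses a response shaped like the one above and lists the tokens in position order; the function name `tokens_by_position` is hypothetical, and the sample is an abbreviated copy of the response above rather than live output.

```python
import json

# Abbreviated copy of the _analyze response shown above.
SAMPLE_RESPONSE = """
{"tokens": [
  {"token": "liu",    "start_offset": 0, "end_offset": 1, "type": "word", "position": 0},
  {"token": "de",     "start_offset": 1, "end_offset": 2, "type": "word", "position": 1},
  {"token": "hua",    "start_offset": 2, "end_offset": 3, "type": "word", "position": 2},
  {"token": "刘德华", "start_offset": 0, "end_offset": 3, "type": "word", "position": 3},
  {"token": "ldh",    "start_offset": 0, "end_offset": 3, "type": "word", "position": 4}
]}
"""


def tokens_by_position(response_text):
    """Return the token strings of an _analyze response, ordered by position."""
    tokens = json.loads(response_text)["tokens"]
    return [t["token"] for t in sorted(tokens, key=lambda t: t["position"])]


print(tokens_by_position(SAMPLE_RESPONSE))
# ['liu', 'de', 'hua', '刘德华', 'ldh']
```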

3. Create the mapping

curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
    "folks": {
        "properties": {
            "name": {
                "type": "keyword",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "store": "no",
                        "term_vector": "with_offsets",
                        "analyzer": "pinyin_analyzer",
                        "boost": 10
                    }
                }
            }
        }
    }
}'

4. Index a document

curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"刘德华"}'

5. Let's search

curl http://localhost:9200/medcl/folks/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E

curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:%e5%88%98%e5%be%b7

curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu

curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh

curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:de+hua
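
The percent-encoded sequences in these URLs are the UTF-8 bytes of the Chinese characters, and `+` stands for a space. A minimal sketch of producing such URLs with Python's standard library follows; the function name `search_url` is hypothetical. Note that `urlencode` also escapes the colon as `%3A`, which a server decodes back to `:`.

```python
from urllib.parse import quote, urlencode


def search_url(base_url, index, doc_type, field, text):
    """Build a URI search URL of the form .../_search?q=field:text."""
    query = urlencode({"q": f"{field}:{text}"})
    return f"{base_url}/{index}/{doc_type}/_search?{query}"


# 刘德华 is nine UTF-8 bytes, hence nine %XX escapes.
print(quote("刘德华"))
# %E5%88%98%E5%BE%B7%E5%8D%8E

print(search_url("http://localhost:9200", "medcl", "folks", "name.pinyin", "de hua"))
# http://localhost:9200/medcl/folks/_search?q=name.pinyin%3Ade+hua
```

Hex digits in percent-escapes are case-insensitive, so `%e5%88%98` and `%E5%88%98` refer to the same character.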

6. Using the pinyin token filter

curl -XPUT http://localhost:9200/medcl1/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "user_name_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : "pinyin_first_letter_and_full_pinyin_filter"
                }
            },
            "filter" : {
                "pinyin_first_letter_and_full_pinyin_filter" : {
                    "type" : "pinyin",
                    "keep_first_letter" : true,
                    "keep_full_pinyin" : false,
                    "keep_none_chinese" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "trim_whitespace" : true,
                    "keep_none_chinese_in_first_letter" : true
                }
            }
        }
    }
}'

Token test: 刘德华 张学友 郭富城 黎明 四大天王

curl -XGET http://localhost:9200/medcl1/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e+%e5%bc%a0%e5%ad%a6%e5%8f%8b+%e9%83%ad%e5%af%8c%e5%9f%8e+%e9%bb%8e%e6%98%8e+%e5%9b%9b%e5%a4%a7%e5%a4%a9%e7%8e%8b&analyzer=user_name_analyzer

{
  "tokens" : [
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zxy",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "gfc",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "lm",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "sdtw",
      "start_offset" : 15,
      "end_offset" : 19,
      "type" : "word",
      "position" : 4
    }
  ]
}
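
The offsets in this response are those of the whitespace-separated names in the input, because the whitespace tokenizer runs first and the pinyin filter then rewrites each token in place. The sketch below recomputes those spans in Python to show where the numbers come from. It does not call the plugin, and the function name `whitespace_offsets` is hypothetical.

```python
import re

# The test input, with the names separated by single spaces as in the request URL.
TEXT = "刘德华 张学友 郭富城 黎明 四大天王"


def whitespace_offsets(text):
    """Return (token, start, end) for each run of non-whitespace characters."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


for token, start, end in whitespace_offsets(TEXT):
    print(token, start, end)
# 刘德华 0 3
# 张学友 4 7
# 郭富城 8 11
# 黎明 12 14
# 四大天王 15 19
```

Python counts code points, which agrees with the reported offsets here because every character in the sample lies in the Basic Multilingual Plane.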

7. Use in phrase queries

option 1

  PUT /medcl/
  {
      "index" : {
          "analysis" : {
              "analyzer" : {
                  "pinyin_analyzer" : {
                      "tokenizer" : "my_pinyin"
                  }
              },
              "tokenizer" : {
                  "my_pinyin" : {
                      "type" : "pinyin",
                      "keep_first_letter" : false,
                      "keep_separate_first_letter" : false,
                      "keep_full_pinyin" : true,
                      "keep_original" : false,
                      "limit_first_letter_length" : 16,
                      "lowercase" : true
                  }
              }
          }
      }
  }

  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "刘德华"
    }}
  }

option 2

  PUT /medcl/
  {
      "index" : {
          "analysis" : {
              "analyzer" : {
                  "pinyin_analyzer" : {
                      "tokenizer" : "my_pinyin"
                  }
              },
              "tokenizer" : {
                  "my_pinyin" : {
                      "type" : "pinyin",
                      "keep_first_letter" : false,
                      "keep_separate_first_letter" : true,
                      "keep_full_pinyin" : false,
                      "keep_original" : false,
                      "limit_first_letter_length" : 16,
                      "lowercase" : true
                  }
              }
          }
      }
  }

  POST /medcl/folks/andy
  {"name":"刘德华"}

  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "刘德h"
    }}
  }

  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "刘dh"
    }}
  }

  GET /medcl/folks/_search
  {
    "query": {"match_phrase": {
      "name.pinyin": "dh"
    }}
  }
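
A match_phrase query only matches when the analyzed query terms appear in the field at consecutive positions and in the same order, which is why both options above switch off the outputs that would add extra terms (keep_first_letter, keep_original). The sketch below is a simplified model of that rule, not Elasticsearch's implementation: it ignores slop and terms that share a position. The function name `phrase_matches` is hypothetical, and the token lists are assumptions about what option 1 and option 2 would emit for 刘德华.

```python
def phrase_matches(doc_tokens, query_tokens):
    """True if query_tokens occurs in doc_tokens as a contiguous run."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


# Assumed terms for 刘德华 under option 1 (full pinyin only).
option1_doc = ["liu", "de", "hua"]
print(phrase_matches(option1_doc, ["liu", "de", "hua"]))  # True
print(phrase_matches(option1_doc, ["de", "hua"]))         # True
print(phrase_matches(option1_doc, ["liu", "hua"]))        # False: "de" is skipped

# Assumed terms for 刘德华 under option 2 (separate first letters only).
option2_doc = ["l", "d", "h"]
print(phrase_matches(option2_doc, ["d", "h"]))            # True
```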

8. That's all, have fun.
