来源:https://github.com/medcl/elasticsearch-analysis-pinyin

Pinyin Analysis for Elasticsearch

This Pinyin Analysis plugin is used to do conversion between Chinese characters and Pinyin, integrates NLP tools (https://github.com/NLPchina/nlp-lang).

--------------------------------------------------
| Pinyin Analysis Plugin | Elasticsearch |
--------------------------------------------------
| master | 5.x -> master |
--------------------------------------------------
| 5.5.1 | 5.5.1 |
--------------------------------------------------
| 5.3.3 | 5.3.3 |
--------------------------------------------------
| 5.2.2 | 5.2.2 |
--------------------------------------------------
| 5.1.2 | 5.1.2 |
--------------------------------------------------
| 1.8.1 | 2.4.1 |
--------------------------------------------------
| 1.7.5 | 2.3.5 |
--------------------------------------------------
| 1.6.1 | 2.2.1 |
--------------------------------------------------
| 1.5.0 | 2.1.0 |
--------------------------------------------------
| 1.4.0 | 2.0.x |
--------------------------------------------------
| 1.3.0 | 1.6.x |
--------------------------------------------------
| 1.2.2 | 1.0.x |
--------------------------------------------------

The plugin includes analyzer: pinyin , tokenizer: pinyin and token-filter: pinyin.

** Optional Parameters **

  • keep_first_letter when this option enabled, eg: 刘德华>ldh, default: true
  • keep_separate_first_letter when this option enabled, will keep first letters separately, eg: 刘德华>l,d,h, default: false, NOTE: query result maybe too fuzziness due to term too frequency
  • limit_first_letter_length set max length of the first_letter result, default: 16
  • keep_full_pinyin when this option enabled, eg: 刘德华> [liu,de,hua], default: true
  • keep_joined_full_pinyin when this option enabled, eg: 刘德华> [liudehua], default: false
  • keep_none_chinese keep non chinese letter or number in result, default: true
  • keep_none_chinese_together keep non chinese letter together, default: true, eg: DJ音乐家 -> DJ,yin,yue,jia, when set to false, eg: DJ音乐家 -> D,J,yin,yue,jia, NOTE: keep_none_chinese should be enabled first
  • keep_none_chinese_in_first_letter keep non Chinese letters in first letter, eg: 刘德华AT2016->ldhat2016, default: true
  • keep_none_chinese_in_joined_full_pinyin keep non Chinese letters in joined full pinyin, eg: 刘德华2016->liudehua2016, default: false
  • none_chinese_pinyin_tokenize break non chinese letters into separate pinyin term if they are pinyin, default: true, eg: liudehuaalibaba13zhuanghan -> liu,de,hua,a,li,ba,ba,13,zhuang,han, NOTE: keep_none_chinese and keep_none_chinese_together should be enabled first
  • keep_original when this option enabled, will keep original input as well, default: false
  • lowercase lowercase non Chinese letters, default: true
  • trim_whitespace default: true
  • remove_duplicated_term when this option enabled, duplicated term will be removed to save index, eg: de的>de, default: false, NOTE: position related query maybe influenced

1.Create a index with custom pinyin analyzer

curl -XPUT http://localhost:9200/medcl/ -d'
{
"index" : {
"analysis" : {
"analyzer" : {
"pinyin_analyzer" : {
"tokenizer" : "my_pinyin"
}
},
"tokenizer" : {
"my_pinyin" : {
"type" : "pinyin",
"keep_separate_first_letter" : false,
"keep_full_pinyin" : true,
"keep_original" : true,
"limit_first_letter_length" : 16,
"lowercase" : true,
"remove_duplicated_term" : true
}
}
}
}
}'

2.Test Analyzer, analyzing a chinese name, such as 刘德华

http://localhost:9200/medcl/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e&analyzer=pinyin_analyzer
{
"tokens" : [
{
"token" : "liu",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "de",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "hua",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 2
},
{
"token" : "刘德华",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 3
},
{
"token" : "ldh",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 4
}
]
}

3.Create mapping

curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
"folks": {
"properties": {
"name": {
"type": "keyword",
"fields": {
"pinyin": {
"type": "text",
"store": "no",
"term_vector": "with_offsets",
"analyzer": "pinyin_analyzer",
"boost": 10
}
}
}
}
}
}'

4.Indexing

curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"刘德华"}'

5.Let's search

http://localhost:9200/medcl/folks/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:%e5%88%98%e5%be%b7
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh
curl http://localhost:9200/medcl/folks/_search?q=name.pinyin:de+hua

6.Using Pinyin-TokenFilter

curl -XPUT http://localhost:9200/medcl1/ -d'
{
"index" : {
"analysis" : {
"analyzer" : {
"user_name_analyzer" : {
"tokenizer" : "whitespace",
"filter" : "pinyin_first_letter_and_full_pinyin_filter"
}
},
"filter" : {
"pinyin_first_letter_and_full_pinyin_filter" : {
"type" : "pinyin",
"keep_first_letter" : true,
"keep_full_pinyin" : false,
"keep_none_chinese" : true,
"keep_original" : false,
"limit_first_letter_length" : 16,
"lowercase" : true,
"trim_whitespace" : true,
"keep_none_chinese_in_first_letter" : true
}
}
}
}
}'

Token Test:刘德华 张学友 郭富城 黎明 四大天王

curl -XGET http://localhost:9200/medcl1/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e+%e5%bc%a0%e5%ad%a6%e5%8f%8b+%e9%83%ad%e5%af%8c%e5%9f%8e+%e9%bb%8e%e6%98%8e+%e5%9b%9b%e5%a4%a7%e5%a4%a9%e7%8e%8b&analyzer=user_name_analyzer
{
"tokens" : [
{
"token" : "ldh",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "zxy",
"start_offset" : 4,
"end_offset" : 7,
"type" : "word",
"position" : 1
},
{
"token" : "gfc",
"start_offset" : 8,
"end_offset" : 11,
"type" : "word",
"position" : 2
},
{
"token" : "lm",
"start_offset" : 12,
"end_offset" : 14,
"type" : "word",
"position" : 3
},
{
"token" : "sdtw",
"start_offset" : 15,
"end_offset" : 19,
"type" : "word",
"position" : 4
}
]
}

7.Used in phrase query

  • option 1

      PUT /medcl/
    {
    "index" : {
    "analysis" : {
    "analyzer" : {
    "pinyin_analyzer" : {
    "tokenizer" : "my_pinyin"
    }
    },
    "tokenizer" : {
    "my_pinyin" : {
    "type" : "pinyin",
    "keep_first_letter":false,
    "keep_separate_first_letter" : false,
    "keep_full_pinyin" : true,
    "keep_original" : false,
    "limit_first_letter_length" : 16,
    "lowercase" : true
    }
    }
    }
    }
    }
    GET /medcl/folks/_search
    {
    "query": {"match_phrase": {
    "name.pinyin": "刘德华"
    }}
    }
  • option 2

      PUT /medcl/
    {
    "index" : {
    "analysis" : {
    "analyzer" : {
    "pinyin_analyzer" : {
    "tokenizer" : "my_pinyin"
    }
    },
    "tokenizer" : {
    "my_pinyin" : {
    "type" : "pinyin",
    "keep_first_letter":false,
    "keep_separate_first_letter" : true,
    "keep_full_pinyin" : false,
    "keep_original" : false,
    "limit_first_letter_length" : 16,
    "lowercase" : true
    }
    }
    }
    }
    } POST /medcl/folks/andy
    {"name":"刘德华"} GET /medcl/folks/_search
    {
    "query": {"match_phrase": {
    "name.pinyin": "刘德h"
    }}
    } GET /medcl/folks/_search
    {
    "query": {"match_phrase": {
    "name.pinyin": "刘dh"
    }}
    } GET /medcl/folks/_search
    {
    "query": {"match_phrase": {
    "name.pinyin": "dh"
    }}
    }

8.That's all, have fun.

elasticsearch-analysis-pinyin的更多相关文章

  1. Elasticsearch IK+pinyin

    如何在Elasticsearch中安装中文分词器(IK+pinyin)   如果直接使用Elasticsearch的朋友在处理中文内容的搜索时,肯定会遇到很尴尬的问题——中文词语被分成了一个一个的汉字 ...

  2. Elasticsearch:Pinyin 分词器

    Elastic的Medcl提供了一种搜索Pinyin搜索的方法.拼音搜索在很多的应用场景中都有被用到.比如在百度搜索中,我们使用拼音就可以出现汉字: 对于我们中国人来说,拼音搜索也是非常直接的.那么在 ...

  3. ElasticSearch安装拼音插件(pinyin)

    环境介绍 集群环境如下: Ubuntu14.04 ElasticSearch 2.3.1(3节点) JDK1.8.0_60 开发环境: Windows10 JDK 1.8.0_66 Maven 3.3 ...

  4. elasticsearch+logstash_jdbc 实现mysql数据实时同步至es

    jdk安装1.8版本,es.ls.ik.kibana版本一致我这里使用的6.6.2版本 安装es tar xf elasticsearch-6.6.2.tar.gz mv elasticsearch- ...

  5. Elasticsearch搜索资料汇总

    Elasticsearch 简介 Elasticsearch(ES)是一个基于Lucene 构建的开源分布式搜索分析引擎,可以近实时的索引.检索数据.具备高可靠.易使用.社区活跃等特点,在全文检索.日 ...

  6. Elasticsearch实现搜索推荐词

    本篇介绍的是基于Elasticsearch实现搜索推荐词,其中需要用到Elasticsearch的pinyin插件以及ik分词插件,代码的实现这里提供了java跟C#的版本方便大家参考. 1.实现的结 ...

  7. (转)How to Use Elasticsearch, Logstash, and Kibana to Manage MySQL Logs

    A comprehensive log management and analysis strategy is vital, enabling organizations to understand ...

  8. linux环境下配置solr5.3详细步骤

    本人上周五刚刚配置了一遍centos下配置solr5.3版本,综合借鉴并改进了一些教程,贴出如下 单位使用内网,本教程暂无截图,抱歉 另,本人是使用.net编程调用solr的使用的是solrnet,在 ...

  9. solr5Ik分词2

    <!--IK分词器--><fieldType name="text_ik" class="solr.TextField"><ana ...

随机推荐

  1. eclipse中项目出现红色的!

    eclipse中项目出现红色的!的原因有二个:1.jdk不匹配    2.缺少jar包

  2. Redis搭建(七):Redis的Cluster集群动态增删节点

    一.引言 上一篇文章我们一步一步的教大家搭建了Redis的Cluster集群环境,形成了3个主节点和3个从节点的Cluster的环境.当然,大家可以使用 Cluster info 命令查看Cluste ...

  3. centos 使用windows7 存储

    1. 在Windows7上创建一个带密码的用户,如disk 2. 创建一个文件夹,如 D:\centos-disk2 3. 选中此文件夹,点击上方的  共享 -> 特定用户, 添加disk用户, ...

  4. Borg Maze(BFS+MST)

    Borg Maze http://poj.org/problem?id=3026 Time Limit: 1000MS   Memory Limit: 65536K Total Submissions ...

  5. 11-基于dev的bug(还没想通)

    十六进制转八进制 http://lx.lanqiao.cn/problem.page?gpid=T51 问题描述 给定n个十六进制正整数,输出它们对应的八进制数. 输入格式 输入的第一行为一个正整数n ...

  6. YourKit Java Profiler安装和破解

    YourKit Java Profiler是业界领先的Java性能剖析工具.其独立版本安装成功且首次启动 YourKit Java Profiler 后,会弹出一个对话框,让用户选择 YourKit ...

  7. qt学习(二) buttong layout spinbox slider 示例

    开启qt5 creator 新建项目 qt widgets 改写main.cpp #include "mainwindow.h" #include <QApplication ...

  8. msys2 + clion安装所需的mingw64编译环境

    pacman -S mingw-w64-x86_64-toolchain 输出结果为: $ pacman -S mingw-w64-x86_64-toolchain:: 在组 mingw-w64-x8 ...

  9. [原创]分享本人自己PY写的BOOST编译程序(源码)

    本程序WINDOWS专用,只做抛砖引玉,希望诸位按照各自需求自行修改,主要目的是为了让诸位编译时可以省一些组合指令的时间,只需要修改几个参数即可自动编译. 支持64位编译模式. 改进版本:http:/ ...

  10. WEB开发常见错误-php无法操作数据库

    Ubuntu 安装phpmyadmin 和配置   ubuntu 安装 phpmyadmin  两种 : 1: apt-get 安装  然后使用 已有的虚拟主机目录建立软连接  sudo  apt-g ...