elasticsearch-analysis-pinyin
Source: https://github.com/medcl/elasticsearch-analysis-pinyin
Pinyin Analysis for Elasticsearch
This Pinyin Analysis plugin converts between Chinese characters and Pinyin, and integrates NLP tools (https://github.com/NLPchina/nlp-lang).
--------------------------------------------------
| Pinyin Analysis Plugin | Elasticsearch         |
--------------------------------------------------
| master                 | 5.x -> master         |
| 5.5.1                  | 5.5.1                 |
| 5.3.3                  | 5.3.3                 |
| 5.2.2                  | 5.2.2                 |
| 5.1.2                  | 5.1.2                 |
| 1.8.1                  | 2.4.1                 |
| 1.7.5                  | 2.3.5                 |
| 1.6.1                  | 2.2.1                 |
| 1.5.0                  | 2.1.0                 |
| 1.4.0                  | 2.0.x                 |
| 1.3.0                  | 1.6.x                 |
| 1.2.2                  | 1.0.x                 |
--------------------------------------------------
The plugin includes an analyzer: pinyin, a tokenizer: pinyin, and a token filter: pinyin.
** Optional Parameters **
- keep_first_letter: when enabled, keeps the joined first letters of each syllable, eg: 刘德华 > ldh, default: true
- keep_separate_first_letter: when enabled, keeps the first letters as separate terms, eg: 刘德华 > l, d, h, default: false. NOTE: query results may be too fuzzy because these single-letter terms are very frequent
- limit_first_letter_length: sets the maximum length of the first-letter result, default: 16
- keep_full_pinyin: when enabled, keeps the full pinyin of each character, eg: 刘德华 > [liu, de, hua], default: true
- keep_joined_full_pinyin: when enabled, keeps the full pinyin joined into one term, eg: 刘德华 > [liudehua], default: false
- keep_none_chinese: keeps non-Chinese letters and numbers in the result, default: true
- keep_none_chinese_together: keeps non-Chinese letters together, default: true, eg: DJ音乐家 -> DJ, yin, yue, jia; when set to false, eg: DJ音乐家 -> D, J, yin, yue, jia. NOTE: keep_none_chinese must be enabled first
- keep_none_chinese_in_first_letter: keeps non-Chinese letters in the first-letter term, eg: 刘德华AT2016 -> ldhat2016, default: true
- keep_none_chinese_in_joined_full_pinyin: keeps non-Chinese letters in the joined full pinyin, eg: 刘德华2016 -> liudehua2016, default: false
- none_chinese_pinyin_tokenize: breaks non-Chinese letters into separate pinyin terms if they are valid pinyin, default: true, eg: liudehuaalibaba13zhuanghan -> liu, de, hua, a, li, ba, ba, 13, zhuang, han. NOTE: keep_none_chinese and keep_none_chinese_together must be enabled first
- keep_original: when enabled, also keeps the original input, default: false
- lowercase: lowercases non-Chinese letters, default: true
- trim_whitespace: default: true
- remove_duplicated_term: when enabled, duplicated terms are removed to save index space, eg: de的 > de, default: false. NOTE: position-related queries may be affected
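To make the interaction of the main options easier to see, here is a minimal Python sketch that mimics the documented examples for 刘德华. It is not the plugin's implementation: `TOY_PINYIN` and `toy_pinyin_terms` are hypothetical names, the lookup table covers only three characters, and the term order is simply chosen to match the sample output in step 2.

```python
# Toy illustration only: a hard-coded three-character table,
# not the dictionary or the algorithm used by the real plugin.
TOY_PINYIN = {"刘": "liu", "德": "de", "华": "hua"}


def toy_pinyin_terms(
    text,
    keep_first_letter=True,
    keep_separate_first_letter=False,
    limit_first_letter_length=16,
    keep_full_pinyin=True,
    keep_joined_full_pinyin=False,
    keep_original=False,
):
    """Return the terms the option list describes for `text`.

    Defaults mirror the documented defaults. Characters missing from
    the toy table are ignored, which the real plugin does not do.
    """
    syllables = [TOY_PINYIN[ch] for ch in text if ch in TOY_PINYIN]
    terms = []
    if keep_full_pinyin:
        terms.extend(syllables)                      # liu, de, hua
    if keep_joined_full_pinyin:
        terms.append("".join(syllables))             # liudehua
    if keep_original:
        terms.append(text)                           # 刘德华
    if keep_first_letter:
        joined = "".join(s[0] for s in syllables)
        terms.append(joined[:limit_first_letter_length])  # ldh
    if keep_separate_first_letter:
        terms.extend(s[0] for s in syllables)        # l, d, h
    return terms


print(toy_pinyin_terms("刘德华"))
print(toy_pinyin_terms("刘德华", keep_original=True))
```

With the defaults the sketch yields the full pinyin plus the joined first letters; turning on `keep_original` adds the input itself, as in the analyzer test further down.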
1. Create an index with a custom pinyin analyzer
curl -XPUT http://localhost:9200/medcl/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true
                }
            }
        }
    }
}'
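The same request can be built from a script. The following Python sketch uses only the standard library and reuses the settings from the curl command above; the function names `build_create_index_request` and `send` are hypothetical, and the explicit `Content-Type` header is an assumption added here (the curl example does not send one).

```python
import json
import urllib.request


def build_create_index_request(base_url="http://localhost:9200", index="medcl"):
    """Build (but do not send) the PUT request that creates the index."""
    settings = {
        "index": {
            "analysis": {
                "analyzer": {
                    "pinyin_analyzer": {"tokenizer": "my_pinyin"}
                },
                "tokenizer": {
                    "my_pinyin": {
                        "type": "pinyin",
                        "keep_separate_first_letter": False,
                        "keep_full_pinyin": True,
                        "keep_original": True,
                        "limit_first_letter_length": 16,
                        "lowercase": True,
                        "remove_duplicated_term": True,
                    }
                },
            }
        }
    }
    return urllib.request.Request(
        f"{base_url}/{index}/",
        data=json.dumps(settings).encode("utf-8"),
        method="PUT",
        # Assumption: sent for clarity; not part of the original curl call.
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send the request; needs a running node with the plugin installed."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Usage against a live node: print(send(build_create_index_request()))
```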
2. Test the analyzer by analyzing a Chinese name, such as 刘德华
http://localhost:9200/medcl/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e&analyzer=pinyin_analyzer
{
    "tokens" : [
        {
            "token" : "liu",
            "start_offset" : 0,
            "end_offset" : 1,
            "type" : "word",
            "position" : 0
        },
        {
            "token" : "de",
            "start_offset" : 1,
            "end_offset" : 2,
            "type" : "word",
            "position" : 1
        },
        {
            "token" : "hua",
            "start_offset" : 2,
            "end_offset" : 3,
            "type" : "word",
            "position" : 2
        },
        {
            "token" : "刘德华",
            "start_offset" : 0,
            "end_offset" : 3,
            "type" : "word",
            "position" : 3
        },
        {
            "token" : "ldh",
            "start_offset" : 0,
            "end_offset" : 3,
            "type" : "word",
            "position" : 4
        }
    ]
}
3. Create the mapping
curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
    "folks": {
        "properties": {
            "name": {
                "type": "keyword",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "store": false,
                        "term_vector": "with_offsets",
                        "analyzer": "pinyin_analyzer",
                        "boost": 10
                    }
                }
            }
        }
    }
}'
4. Index a document
curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"刘德华"}'
5. Let's search
curl 'http://localhost:9200/medcl/folks/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E'
curl 'http://localhost:9200/medcl/folks/_search?q=name.pinyin:%e5%88%98%e5%be%b7'
curl 'http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu'
curl 'http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh'
curl 'http://localhost:9200/medcl/folks/_search?q=name.pinyin:de+hua'
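The percent-encoded strings in these URLs are the UTF-8 bytes of the Chinese text, and `+` stands for a space in a query string. A short Python sketch using the standard library shows how such URLs can be produced; `search_url` is a hypothetical helper, and note that `urlencode` also escapes the colon as `%3A`, which decodes to the same query.

```python
from urllib.parse import quote, urlencode


def search_url(query, base="http://localhost:9200/medcl/folks/_search"):
    """Build a URI search URL with a properly encoded `q` parameter."""
    return base + "?" + urlencode({"q": query})


# Percent-encode the UTF-8 bytes of the name used in the examples.
print(quote("刘德华"))   # %E5%88%98%E5%BE%B7%E5%8D%8E

# The space between the two terms becomes "+", the colon becomes "%3A".
print(search_url("name.pinyin:de hua"))
```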
6. Using the pinyin token filter
curl -XPUT http://localhost:9200/medcl1/ -d'
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "user_name_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : "pinyin_first_letter_and_full_pinyin_filter"
                }
            },
            "filter" : {
                "pinyin_first_letter_and_full_pinyin_filter" : {
                    "type" : "pinyin",
                    "keep_first_letter" : true,
                    "keep_full_pinyin" : false,
                    "keep_none_chinese" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "trim_whitespace" : true,
                    "keep_none_chinese_in_first_letter" : true
                }
            }
        }
    }
}'
Token test: 刘德华 张学友 郭富城 黎明 四大天王
curl -XGET 'http://localhost:9200/medcl1/_analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e+%e5%bc%a0%e5%ad%a6%e5%8f%8b+%e9%83%ad%e5%af%8c%e5%9f%8e+%e9%bb%8e%e6%98%8e+%e5%9b%9b%e5%a4%a7%e5%a4%a9%e7%8e%8b&analyzer=user_name_analyzer'
{
    "tokens" : [
        {
            "token" : "ldh",
            "start_offset" : 0,
            "end_offset" : 3,
            "type" : "word",
            "position" : 0
        },
        {
            "token" : "zxy",
            "start_offset" : 4,
            "end_offset" : 7,
            "type" : "word",
            "position" : 1
        },
        {
            "token" : "gfc",
            "start_offset" : 8,
            "end_offset" : 11,
            "type" : "word",
            "position" : 2
        },
        {
            "token" : "lm",
            "start_offset" : 12,
            "end_offset" : 14,
            "type" : "word",
            "position" : 3
        },
        {
            "token" : "sdtw",
            "start_offset" : 15,
            "end_offset" : 19,
            "type" : "word",
            "position" : 4
        }
    ]
}
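The offsets in this response follow from the whitespace tokenizer: each name is one token, and the filter replaces it with its first letters. The Python sketch below reproduces the response for this one input under loud assumptions: `TOY_FIRST_LETTER` is a hypothetical hand-written table covering only these fifteen characters, and `toy_first_letter_tokens` is an illustration, not the plugin's code.

```python
import re

# Toy table: first pinyin letter for the characters in the test string only.
TOY_FIRST_LETTER = {
    "刘": "l", "德": "d", "华": "h",
    "张": "z", "学": "x", "友": "y",
    "郭": "g", "富": "f", "城": "c",
    "黎": "l", "明": "m",
    "四": "s", "大": "d", "天": "t", "王": "w",
}


def toy_first_letter_tokens(text, limit_first_letter_length=16):
    """Split on whitespace, then map each token to its first letters."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\S+", text)):
        # Characters outside the toy table are kept, lowercased.
        letters = "".join(
            TOY_FIRST_LETTER.get(ch, ch.lower()) for ch in match.group()
        )
        tokens.append({
            "token": letters[:limit_first_letter_length],
            "start_offset": match.start(),
            "end_offset": match.end(),
            "type": "word",
            "position": position,
        })
    return tokens


for token in toy_first_letter_tokens("刘德华 张学友 郭富城 黎明 四大天王"):
    print(token)
```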
7. Use in phrase queries
option 1
PUT /medcl/
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_first_letter" : false,
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true
                }
            }
        }
    }
}
GET /medcl/folks/_search
{
    "query": {"match_phrase": {
        "name.pinyin": "刘德华"
    }}
}
option 2
PUT /medcl/
{
    "index" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_first_letter" : false,
                    "keep_separate_first_letter" : true,
                    "keep_full_pinyin" : false,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true
                }
            }
        }
    }
}
POST /medcl/folks/andy
{"name":"刘德华"}
GET /medcl/folks/_search
{
    "query": {"match_phrase": {
        "name.pinyin": "刘德h"
    }}
}
GET /medcl/folks/_search
{
    "query": {"match_phrase": {
        "name.pinyin": "刘dh"
    }}
}
GET /medcl/folks/_search
{
    "query": {"match_phrase": {
        "name.pinyin": "dh"
    }}
}
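A phrase query only matches when the analyzed query terms appear in the document at consecutive positions and in the same order. The Python sketch below illustrates that idea for option 1, assuming the tokenizer turns 刘德华 into liu, de, hua as described in the parameter list. `phrase_matches` is a hypothetical helper; it ignores slop, position gaps and scoring, and it does not model how the plugin tokenizes mixed input such as the option 2 queries.

```python
def phrase_matches(doc_terms, query_terms):
    """True if query_terms occur in doc_terms as one consecutive run."""
    n = len(query_terms)
    if n == 0:
        return False
    return any(
        doc_terms[i:i + n] == query_terms
        for i in range(len(doc_terms) - n + 1)
    )


# Assumed analysis of the indexed name 刘德华 under option 1.
doc = ["liu", "de", "hua"]

print(phrase_matches(doc, ["liu", "de", "hua"]))  # full name
print(phrase_matches(doc, ["de", "hua"]))         # trailing part, in order
print(phrase_matches(doc, ["hua", "de"]))         # same terms, wrong order
```

The first two calls return True and the last returns False, which is why term order matters for the phrase queries above.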
8. That's all, have fun.