Elasticsearch Casual Notes (8): Analysis, Chinese Word Segmentation, and the ik Plugin

Let's start with an analyzer built on the standard tokenizer (standard), configured as follows:

curl -XPUT localhost:9200/local -d '{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "stem" : {
                    "tokenizer" : "standard",
                    "filter" : ["standard", "lowercase", "stop", "porter_stem"]
                }
            }
        }
    },
    "mappings" : {
        "article" : {
            "dynamic" : true,
            "properties" : {
                "title" : {
                    "type" : "string",
                    "analyzer" : "stem"
                }
            }
        }
    }
}'

index: local

type: article

analyzer: stem (filters: lowercase, stop words, etc.)

field: title
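The same index body can also be assembled in code before it is sent to the cluster. The following is only a sketch: the function name build_index_body and its parameters are made up for illustration, and the dict simply mirrors the JSON shown above.

```python
import json


def build_index_body(analyzer_name="stem", field="title"):
    """Return the index-creation body from the curl command above as a dict.

    The function and parameter names are hypothetical helpers for this note.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "tokenizer": "standard",
                        # Filters are applied in list order, stemming last.
                        "filter": ["standard", "lowercase", "stop", "porter_stem"],
                    }
                }
            }
        },
        "mappings": {
            "article": {
                "dynamic": True,
                "properties": {
                    field: {"type": "string", "analyzer": analyzer_name}
                },
            }
        },
    }


# Serialize to the JSON text that would go after curl's -d option.
body_json = json.dumps(build_index_body(), indent=4)
print(body_json)
```

Building the body as a dict makes it easy to swap the analyzer name or the field without hand-editing nested JSON.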

Test:

# Sample Analysis
curl -XGET 'localhost:9200/local/_analyze?analyzer=stem' -d '{Fight for your life}'
curl -XGET 'localhost:9200/local/_analyze?analyzer=stem' -d '{Bruno fights Tyson tomorrow}'

# Index Data
curl -XPUT localhost:9200/local/article/1 -d'{"title": "Fight for your life"}'
curl -XPUT localhost:9200/local/article/2 -d'{"title": "Fighting for your life"}'
curl -XPUT localhost:9200/local/article/3 -d'{"title": "My dad fought a dog"}'
curl -XPUT localhost:9200/local/article/4 -d'{"title": "Bruno fights Tyson tomorrow"}'

# search on the title field, which is stemmed on index and search
curl -XGET 'localhost:9200/local/_search?q=title:fight'

# searching on _all will not do any stemming, unless also configured on the mapping to be stemmed...
curl -XGET 'localhost:9200/local/_search?q=fight'

For example:

Fight for your life

is tokenized as follows:

{"tokens":[

{"token":"fight","start_offset":1,"end_offset":6,"type":"<ALPHANUM>","position":1},
{"token":"your","start_offset":11,"end_offset":15,"type":"<ALPHANUM>","position":3},
{"token":"life","start_offset":16,"end_offset":20,"type":"<ALPHANUM>","position":4}

]}
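To see where these offsets and positions come from, here is a rough Python sketch of the tokenize, lowercase and stop-word steps. It is not Lucene's implementation: the regex tokenizer, the tiny STOP_WORDS set and the analyze function are assumptions made for illustration, and the stemming step is left out. The offsets start at 1 because the opening brace of the request body sits at offset 0, and position 2 is missing because the stop word "for" is dropped while its position is still counted.

```python
import re

# A small illustrative stop-word list (an assumption, not the real default list).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def analyze(text):
    """Tokenize on alphanumeric runs, lowercase, and drop stop words.

    Positions are 1-based and are consumed even by dropped tokens,
    which leaves a gap where a stop word used to be.
    """
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text), start=1):
        term = match.group().lower()
        if term in STOP_WORDS:
            continue  # token removed, position number already used
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


for token in analyze("{Fight for your life}"):
    print(token)
```

Running it on the same input, braces included, yields fight (1 to 6, position 1), your (11 to 15, position 3) and life (16 to 20, position 4), the same numbers as in the response above.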

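Deploying the ik analyzer:

1) Copy the ik analyzer plugin (for es) into ./plugins/analyzerIK/

2) Configure it in elasticsearch.yml:

index.analysis.analyzer.ik.type : "ik"

3) Add the directory ./config/ik under config, containing these files: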

IKAnalyzer.cfg.xml

main.dic

quantifier.dic

ext.dic

stopword.dic
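Before restarting the node, it can help to check that all of these files are actually in place. The following is a minimal sketch; the helper name missing_ik_files and the "./config" path in the usage line are assumptions, and the file names are simply the ones listed above.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_IK_FILES = [
    "IKAnalyzer.cfg.xml",
    "main.dic",
    "quantifier.dic",
    "ext.dic",
    "stopword.dic",
]


def missing_ik_files(config_dir):
    """Return the expected ik files that are absent from <config_dir>/ik."""
    ik_dir = Path(config_dir) / "ik"
    return [name for name in EXPECTED_IK_FILES if not (ik_dir / name).is_file()]


# Hypothetical usage: point this at the Elasticsearch config directory.
print(missing_ik_files("./config"))
```

An empty list means every file from the list was found; anything else names the files that still need to be copied.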

Delete the index created earlier and recreate it with the following configuration:

curl -XPUT localhost:9200/local -d '{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "ik" : {
                    "tokenizer" : "ik"
                }
            }
        }
    },
    "mappings" : {
        "article" : {
            "dynamic" : true,
            "properties" : {
                "title" : {
                    "type" : "string",
                    "analyzer" : "ik"
                }
            }
        }
    }
}'

Test:

curl 'http://localhost:9200/index/_analyze?analyzer=ik&pretty=true' -d'
{
    "text":"中华人民共和国国歌"
}
'

{
  "tokens" : [ {
    "token" : "text",
    "start_offset" : 12,
    "end_offset" : 16,
    "type" : "ENGLISH",
    "position" : 1
  }, {
    "token" : "中华人民共和国",
    "start_offset" : 19,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "国歌",
    "start_offset" : 26,
    "end_offset" : 28,
    "type" : "CN_WORD",
    "position" : 3
  } ]
}
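The coarse result above (one long word plus 国歌) resembles what a greedy longest-match segmenter produces. The sketch below is a toy illustration of that idea, not ik's actual algorithm: TOY_DICTIONARY and longest_match_segment are assumptions made for this note, and the real plugin works from main.dic with its own disambiguation logic. It runs on the bare string, so its offsets start at 0 rather than 19. The extra token text in the response above appears to come from the JSON key, since the whole request body seems to be analyzed as raw text.

```python
# A toy dictionary (an assumption); ik loads its real vocabulary from main.dic.
TOY_DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}


def longest_match_segment(text, dictionary):
    """Greedy forward segmentation: at each position take the longest
    dictionary word, falling back to a single character."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    start = 0
    while start < len(text):
        for length in range(min(max_len, len(text) - start), 0, -1):
            candidate = text[start:start + length]
            if length == 1 or candidate in dictionary:
                tokens.append((candidate, start, start + length))
                start += length
                break
    return tokens


for token in longest_match_segment("中华人民共和国国歌", TOY_DICTIONARY):
    print(token)
```

With this toy dictionary the input is split into 中华人民共和国 (0 to 7) and 国歌 (7 to 9), the same two Chinese tokens as in the response above.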

---------------------------------------

To get the finest-grained segmentation result, configure elasticsearch.yml as follows:

index:
  analysis:
    analyzer:
      ik:
          alias: [ik_analyzer]
          type: org.elasticsearch.index.analysis.IkAnalyzerProvider
      ik_smart:
          type: ik
          use_smart: true
      ik_max_word:
          type: ik
          use_smart: false

Test:

curl 'http://localhost:9200/index/_analyze?analyzer=ik_max_word&pretty=true' -d'
{
    "text":"中华人民共和国国歌"
}
'

{
  "tokens" : [ {
    "token" : "text",
    "start_offset" : 12,
    "end_offset" : 16,
    "type" : "ENGLISH",
    "position" : 1
  }, {
    "token" : "中华人民共和国",
    "start_offset" : 19,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "中华人民",
    "start_offset" : 19,
    "end_offset" : 23,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "中华",
    "start_offset" : 19,
    "end_offset" : 21,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "华人",
    "start_offset" : 20,
    "end_offset" : 22,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "人民共和国",
    "start_offset" : 21,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 6
  }, {
    "token" : "人民",
    "start_offset" : 21,
    "end_offset" : 23,
    "type" : "CN_WORD",
    "position" : 7
  }, {
    "token" : "共和国",
    "start_offset" : 23,
    "end_offset" : 26,
    "type" : "CN_WORD",
    "position" : 8
  }, {
    "token" : "共和",
    "start_offset" : 23,
    "end_offset" : 25,
    "type" : "CN_WORD",
    "position" : 9
  }, {
    "token" : "国",
    "start_offset" : 25,
    "end_offset" : 26,
    "type" : "CN_CHAR",
    "position" : 10
  }, {
    "token" : "国歌",
    "start_offset" : 26,
    "end_offset" : 28,
    "type" : "CN_WORD",
    "position" : 11
  } ]
}
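The fine-grained output can be pictured as listing every dictionary word that starts at each character position. Here is a toy sketch of that idea under stated assumptions: TOY_DICTIONARY and all_matches are made up for illustration, the real plugin reads its vocabulary from main.dic, and this sketch does not reproduce the single-character token 国 or the text token seen above.

```python
# A toy dictionary (an assumption); ik loads its real vocabulary from main.dic.
TOY_DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}


def all_matches(text, dictionary):
    """For every start position, emit each dictionary word of two or more
    characters that begins there, longest first."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    for start in range(len(text)):
        for length in range(min(max_len, len(text) - start), 1, -1):
            candidate = text[start:start + length]
            if candidate in dictionary:
                tokens.append((candidate, start, start + length))
    return tokens


for token in all_matches("中华人民共和国国歌", TOY_DICTIONARY):
    print(token)
```

On the bare string this lists the nine multi-character words in the same order as the response above, with offsets starting at 0 instead of 19.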
