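Chinese Word Segmentation in Elasticsearch (IK)

IK is one of the most popular Chinese analyzers for Elasticsearch. This post briefly walks through installing it and testing it.

1. Elasticsearch's built-in analyzer

The built-in analyzer handles Chinese poorly. You can try it yourself: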

[zsz@VS-zsz ~]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=standard' -d '岁月如梭'

{

    "tokens": [

        {

            "token": "岁",

            "start_offset": 0,

            "end_offset": 1,

            "type": "<IDEOGRAPHIC>",

            "position": 0

        },

        {

            "token": "月",

            "start_offset": 1,

            "end_offset": 2,

            "type": "<IDEOGRAPHIC>",

            "position": 1

        },

        {

            "token": "如",

            "start_offset": 2,

            "end_offset": 3,

            "type": "<IDEOGRAPHIC>",

            "position": 2

        },

        {

            "token": "梭",

            "start_offset": 3,

            "end_offset": 4,

            "type": "<IDEOGRAPHIC>",

            "position": 3

        }

    ]

}

[zsz@VS-zsz ~]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=standard' -d 'i am an enginner'

{

    "tokens": [

        {

            "token": "i",

            "start_offset": 0,

            "end_offset": 1,

            "type": "<ALPHANUM>",

            "position": 0

        },

        {

            "token": "am",

            "start_offset": 2,

            "end_offset": 4,

            "type": "<ALPHANUM>",

            "position": 1

        },

        {

            "token": "an",

            "start_offset": 5,

            "end_offset": 7,

            "type": "<ALPHANUM>",

            "position": 2

        },

        {

            "token": "enginner",

            "start_offset": 8,

            "end_offset": 16,

            "type": "<ALPHANUM>",

            "position": 3

        }

    ]

}

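As this shows, the built-in standard analyzer handles Chinese poorly: it splits the text into single characters, while English is split into whole words.

2. The IK Chinese analyzer

2.1. If you downloaded the IK source rather than a prebuilt package, you need to build the plugin yourself. First install Maven on Linux: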
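
The JSON responses are verbose, so when comparing analyzers it can help to print only the token strings. The following is a minimal sketch, assuming a grep that supports `-o` (GNU and BSD grep both do). `extract_tokens` is a hypothetical helper name, not part of Elasticsearch, and it relies on the `"token": "..."` layout shown above rather than on a real JSON parser.

```shell
#!/bin/sh
# extract_tokens (hypothetical helper): read an _analyze response on stdin
# and print one token per line. Pattern-based, not a full JSON parser.
extract_tokens() {
    grep -o '"token" *: *"[^"]*"' | sed 's/^"token" *: *"//; s/"$//'
}

# Sample trimmed from the response above, so the sketch runs without a server.
sample='{"tokens":[{"token":"岁","start_offset":0},{"token":"月","start_offset":1}]}'
printf '%s\n' "$sample" | extract_tokens

# Against a live node (same host and port as in this post):
# curl -s -XGET 'http://192.168.31.77:9200/_analyze?analyzer=standard' -d '岁月如梭' | extract_tokens
```

Piping both the `standard` and the `ik` responses through the same helper makes the difference in segmentation easy to see side by side.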

wget http://mirrors.cnnic.cn/apache/maven/maven-3/3.0.5/binaries/apache-maven-3.0.5-bin.tar.gz

tar zxvf apache-maven-3.0.5-bin.tar.gz

mv apache-maven-3.0.5 /usr/local/apache-maven-3.0.5

vi /etc/profile

Add:

export MAVEN_HOME=/usr/local/apache-maven-3.0.5

export PATH=$PATH:$MAVEN_HOME/bin

source /etc/profile

mvn -v

2.2. Build the source; the output is written under target/

mvn clean package

Deploy the packaged IK plugin to ES:

[zsz@VS-zsz ~]$ cd /home/zsz/elasticsearch-analysis-ik-1.10.0/target/releases/

[zsz@VS-zsz releases]$ mkdir /usr/local/elasticsearch-2.4.0/plugins/ik/

[zsz@VS-zsz releases]$ cp elasticsearch-analysis-ik-1.10.0.zip /usr/local/elasticsearch-2.4.0/plugins/ik/elasticsearch-analysis-ik-1.10.0.zip

[zsz@VS-zsz releases]$ unzip /usr/local/elasticsearch-2.4.0/plugins/ik/elasticsearch-analysis-ik-1.10.0.zip -d /usr/local/elasticsearch-2.4.0/plugins/ik/

[zsz@VS-zsz releases]$ cd /usr/local/elasticsearch-2.4.0/plugins/ik/

[zsz@VS-zsz ik]$ rm elasticsearch-analysis-ik-1.10.0.zip

[zsz@VS-zsz ik]$ mkdir /usr/local/elasticsearch-2.4.0/config/ik

Copy the IK configuration into the Elasticsearch configuration directory:

[zsz@VS-zsz ik]$ cp -r /home/zsz/elasticsearch-analysis-ik-1.10.0/config /usr/local/elasticsearch-2.4.0/config/ik

Edit the Elasticsearch configuration:

[zsz@VS-zsz ik]$ vi /usr/local/elasticsearch-2.4.0/config/elasticsearch.yml

Append the analyzer configuration at the end of the file:

index.analysis.analyzer.ik.type : "ik"

Start Elasticsearch:

[zsz@VS-zsz ik]$ cd /usr/local/elasticsearch-2.4.0/

[zsz@VS-zsz elasticsearch-2.4.0]$ ./bin/elasticsearch -d
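
Because `-d` starts the node in the background and returns immediately, a request sent right away may be refused. A minimal sketch of a readiness check follows. `wait_for_es` is a hypothetical helper, and the retry count and one-second interval are arbitrary assumptions.

```shell
#!/bin/sh
# wait_for_es URL [TRIES] (hypothetical helper): poll URL once per second
# until the node answers, or give up after TRIES attempts (default 30).
wait_for_es() {
    url=$1
    tries=${2:-30}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # curl exits non-zero while the connection is still refused
        if curl -s -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Usage, with the host and port used in this post:
# wait_for_es 'http://192.168.31.77:9200' && echo 'node is up'
```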

Test the IK analyzer:

[zsz@VS-zsz elasticsearch-2.4.0]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=ik' -d '岁月如梭'

{

    "tokens": [

        {

            "token": "岁月如梭",

            "start_offset": 0,

            "end_offset": 4,

            "type": "CN_WORD",

            "position": 0

        },

        {

            "token": "岁月",

            "start_offset": 0,

            "end_offset": 2,

            "type": "CN_WORD",

            "position": 1

        },

        {

            "token": "如梭",

            "start_offset": 2,

            "end_offset": 4,

            "type": "CN_WORD",

            "position": 2

        },

        {

            "token": "梭",

            "start_offset": 3,

            "end_offset": 4,

            "type": "CN_WORD",

            "position": 3

        }

    ]

}

[zsz@VS-zsz config]$ curl -XGET 'http://192.168.31.77:9200/_analyze?analyzer=ik' -d 'elasticsearch很受欢迎的的一款拥有活跃社区开源的搜索解决方案'

{

    "tokens": [

        {

            "token": "elasticsearch",

            "start_offset": 0,

            "end_offset": 13,

            "type": "CN_WORD",

            "position": 0

        },

        {

            "token": "elastic",

            "start_offset": 0,

            "end_offset": 7,

            "type": "CN_WORD",

            "position": 1

        },

        {

            "token": "很受",

            "start_offset": 13,

            "end_offset": 15,

            "type": "CN_WORD",

            "position": 2

        },

        {

            "token": "受欢迎",

            "start_offset": 14,

            "end_offset": 17,

            "type": "CN_WORD",

            "position": 3

        },

        {

            "token": "欢迎",

            "start_offset": 15,

            "end_offset": 17,

            "type": "CN_WORD",

            "position": 4

        },

        {

            "token": "一款",

            "start_offset": 19,

            "end_offset": 21,

            "type": "CN_WORD",

            "position": 5

        },

        {

            "token": "一",

            "start_offset": 19,

            "end_offset": 20,

            "type": "TYPE_CNUM",

            "position": 6

        },

        {

            "token": "款",

            "start_offset": 20,

            "end_offset": 21,

            "type": "COUNT",

            "position": 7

        },

        {

            "token": "拥有",

            "start_offset": 21,

            "end_offset": 23,

            "type": "CN_WORD",

            "position": 8

        },

        {

            "token": "拥",

            "start_offset": 21,

            "end_offset": 22,

            "type": "CN_WORD",

            "position": 9

        },

        {

            "token": "有",

            "start_offset": 22,

            "end_offset": 23,

            "type": "CN_CHAR",

            "position": 10

        },

        {

            "token": "活跃",

            "start_offset": 23,

            "end_offset": 25,

            "type": "CN_WORD",

            "position": 11

        },

        {

            "token": "跃",

            "start_offset": 24,

            "end_offset": 25,

            "type": "CN_WORD",

            "position": 12

        },

        {

            "token": "社区",

            "start_offset": 25,

            "end_offset": 27,

            "type": "CN_WORD",

            "position": 13

        },

        {

            "token": "开源",

            "start_offset": 27,

            "end_offset": 29,

            "type": "CN_WORD",

            "position": 14

        },

        {

            "token": "搜索",

            "start_offset": 30,

            "end_offset": 32,

            "type": "CN_WORD",

            "position": 15

        },

        {

            "token": "索解",

            "start_offset": 31,

            "end_offset": 33,

            "type": "CN_WORD",

            "position": 16

        },

        {

            "token": "索",

            "start_offset": 31,

            "end_offset": 32,

            "type": "CN_WORD",

            "position": 17

        },

        {

            "token": "解决方案",

            "start_offset": 32,

            "end_offset": 36,

            "type": "CN_WORD",

            "position": 18

        },

        {

            "token": "解决",

            "start_offset": 32,

            "end_offset": 34,

            "type": "CN_WORD",

            "position": 19

        },

        {

            "token": "方案",

            "start_offset": 34,

            "end_offset": 36,

            "type": "CN_WORD",

            "position": 20

        }

    ]

}

As the output shows, IK segments Chinese text far more sensibly.

Original post: http://www.cnblogs.com/zhongshengzhen/p/elasticsearch_ik.html
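
Registering the analyzer only makes it available. A field uses it once the index mapping names it. The sketch below assumes the ES 2.x `string` field type and the `ik` analyzer configured earlier. The index `blog`, type `article`, and field `content` are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical names: index "blog", type "article", field "content".
ES_URL='http://192.168.31.77:9200'   # same node as in the examples above

# ES 2.x mapping body: "content" is analyzed with the "ik" analyzer
# registered in elasticsearch.yml.
mapping_body='{
  "mappings": {
    "article": {
      "properties": {
        "content": { "type": "string", "analyzer": "ik" }
      }
    }
  }
}'

# create_index: PUT the mapping to a new index. Call it only once the node is running.
create_index() {
    curl -s -XPUT "$ES_URL/blog" -d "$mapping_body"
}

# Print the request body for review; nothing is sent to the node here.
printf '%s\n' "$mapping_body"
```

Calling `create_index` against a running node creates the index. Documents indexed into `content` are then tokenized by IK rather than by the standard analyzer.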
