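A Detailed Guide to Using the Elasticsearch IK Chinese Analyzer

(Based on ES 5.4.) First take a quick look at the GitHub project and install the analyzer by following its steps: link:https://github.com/medcl/elasticsearch-analysis-ik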

A quick review of some common operations:

.Check cluster health

GET /_cat/health?v&pretty

.View the mapping and settings of my_index

GET /my_index?pretty

.List all indices

GET /_cat/indices?v&pretty

.Delete my_index_new

DELETE /my_index_new?pretty

First, test the basic behavior of the IK analyzer:

GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}

Result:

{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "国歌",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

As the output shows, ik_smart segments "中华人民共和国国歌" correctly into "中华人民共和国" (People's Republic of China) and "国歌" (national anthem).
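The start_offset and end_offset fields are character positions in the original text, with the end position exclusive. A minimal Python sketch can check this by slicing the text. It assumes the offsets 0 to 7 and 7 to 9, which follow from the lengths of the two tokens; the helper name is made up for illustration.

```python
# Text that was analyzed and the two tokens ik_smart produced for it.
# The offsets are assumed from the token lengths (7 and 2 characters).
text = "中华人民共和国国歌"
tokens = [
    {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7},
    {"token": "国歌", "start_offset": 7, "end_offset": 9},
]


def slice_by_offsets(source, token):
    """Return the substring of `source` that the token's offsets point at."""
    return source[token["start_offset"]:token["end_offset"]]


# Each slice should reproduce the token text exactly.
recovered = [slice_by_offsets(text, t) for t in tokens]
print(recovered)
```

This is also how highlighting finds the span to mark: it cuts the original text at the recorded offsets instead of searching for the token again.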

Another example:

GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "王者荣耀是最好玩的游戏"
}

Result:

{
  "tokens": [
    {
      "token": "王者荣耀",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "最",
      "start_offset": 5,
      "end_offset": 6,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "好玩",
      "start_offset": 6,
      "end_offset": 8,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "游戏",
      "start_offset": 9,
      "end_offset": 11,
      "type": "CN_WORD",
      "position": 3
    }
  ]
}

If your result differs from mine, that is expected: the default IK dictionary does not contain "王者荣耀" (Honor of Kings, a game title), so it gets split apart. We want it kept as a single token, which can be configured by following the instructions on GitHub.

IKAnalyzer.cfg.xml is located in elasticsearch-5.4.0/plugins/ik/config:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionaries here -->
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <!-- Configure your own extension stopword dictionaries here -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- Configure a remote extension dictionary here; the one below is served from an nginx path -->
    <entry key="remote_ext_dict">http://tagtic-slave01:82/HotWords.php</entry>
    <!-- Configure a remote extension stopword dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
    <entry key="remote_ext_stopwords">http://tagtic-slave01:82/StopWords.php</entry>
</properties>

HotWords.php looks like this:

<?php
$s = <<<'EOF'
王者荣耀
阴阳师
EOF;
header("Content-type: text/html; charset=utf-8");
header('Last-Modified: '.gmdate('D, d M Y H:i:s', time()).' GMT', true, );
header('ETag: "5816f349-19"');
echo $s;
?>

With this configuration in place, the analyzer produces the result shown above.
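According to the IK README, a remote dictionary URL has to return Last-Modified and ETag headers plus a UTF-8 body with one word per line, and the plugin fetches the words again when either header changes. As an alternative to the PHP script, here is a minimal Python sketch of such an endpoint using only the standard library. The port 8082, the names build_response, HotWordsHandler and serve, and deriving the ETag from a hash of the body are all assumptions made for illustration, not part of the original setup.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

HOT_WORDS = ["王者荣耀", "阴阳师"]        # same words as in HotWords.php
LAST_MODIFIED = formatdate(usegmt=True)  # fixed when the module is loaded


def build_response(words, last_modified):
    """Build headers and body: one word per line, UTF-8 encoded."""
    body = ("\n".join(words) + "\n").encode("utf-8")
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),
        "Last-Modified": last_modified,
        # Hypothetical choice: the ETag changes exactly when the word list changes.
        "ETag": '"%s"' % hashlib.sha256(body).hexdigest()[:16],
    }
    return headers, body


class HotWordsHandler(BaseHTTPRequestHandler):
    def _reply(self, include_body):
        headers, body = build_response(HOT_WORDS, LAST_MODIFIED)
        self.send_response(200)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_GET(self):
        self._reply(include_body=True)

    def do_HEAD(self):
        # Answer header-only checks without sending the word list.
        self._reply(include_body=False)


def serve(port=8082):
    """Start the server; call this manually, it blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HotWordsHandler).serve_forever()
```

Calling serve() and pointing remote_ext_dict at the resulting URL would be the intended use. Unlike the PHP script, which stamps Last-Modified with the current time on every request, this sketch keeps both headers stable until the word list or the process changes.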

While we are at it, test ik_max_word:

GET /index/_analyze?pretty
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}

Just skim the result:

{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "中华人民",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "中华",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "华人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民共和国",
      "start_offset": 2,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "共和国",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "共和",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 8
    },
    {
      "token": "国歌",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 9
    }
  ]
}

Next, an example from GitHub:

POST /index/fulltext/_mapping
{
  "fulltext": {
    "_all": {
      "analyzer": "ik_smart"
    },
    "properties": {
      "content": {
        "type": "text"
      }
    }
  }
}

Index a few documents:

POST /index/fulltext/
{
  "content": "美国留给伊拉克的是个烂摊子吗"
}

POST /index/fulltext/
{
  "content": "公安部：各地校车将享最高路权"
}

POST /index/fulltext/
{
  "content": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
}

POST /index/fulltext/
{
  "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}

Query them:

POST /index/fulltext/_search
{
  "query": {
    "match": {
      "content": "中国"
    }
  }
}

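Result:

{
  "took": ,
  "timed_out": false,
  "_shards": {
    "total": ,
    "successful": ,
    "failed":
  },
  "hits": {
    "total": ,
    "max_score": 1.0869478,
    "hits": [
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "",
        "_score": 1.0869478,
        "_source": {
          "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "",
        "_score": 0.61094594,
        "_source": {
          "content": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "",
        "_score": 0.27179778,
        "_source": {
          "content": "美国留给伊拉克的是个烂摊子吗"
        }
      }
    ]
  }
}

Elasticsearch indexes each document by its analyzed tokens and returns the matches ranked by relevance score. Note that the content field in the mapping above declares no analyzer, so it presumably falls back to the standard analyzer, which splits Chinese text into single characters. The query "中国" then becomes "中" and "国", which would explain why "美国留给伊拉克的是个烂摊子吗", containing only "国", also matches with a low score.

The project page has an example worth studying: https://github.com/medcl/elasticsearch-analysis-ik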

Here is another interesting example:

PUT /index1
{
  "settings": {
     "refresh_interval": "5s",
     "number_of_shards" :   ,
     "number_of_replicas" :
  },
  "mappings": {
    "_default_":{
      "_all": { "enabled":  false }
    },
    "resource": {
      "dynamic": false,
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "cn": {
              "type": "text",
              "analyzer": "ik_smart"
            },
            "en": {
              "type": "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

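Multi-fields (the fields setting) serve two purposes:

1. A single string value can be mapped as text for full-text search and as keyword for sorting and aggregations.

2. They act like aliases for the same value, each processed by a different analyzer.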

Bulk-insert some documents:

POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id":  } }
{ "title": "周星驰最新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id":  } }
{ "title": "周星驰最好看的新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id":  } }
{ "title": "周星驰最新电影，最好，新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id":  } }
{ "title": "最最最最好的新新新新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id":  } }
{ "title": "I'm not happy about the foxes" }
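The _bulk endpoint expects newline-delimited JSON: an action line followed by a source line for each document, no blank lines in between, and a newline after the last line. A small Python sketch that generates such a body is shown here. The function name and the numeric ids starting at 1 are assumptions, since the ids are not shown in the listing above.

```python
import json


def build_bulk_body(index, doc_type, titles, first_id=1):
    """Return an NDJSON body of `create` actions for the _bulk endpoint."""
    lines = []
    for offset, title in enumerate(titles):
        # Action line: which index, type and id the next source line goes to.
        action = {"create": {"_index": index, "_type": doc_type, "_id": first_id + offset}}
        lines.append(json.dumps(action, ensure_ascii=False))
        # Source line: the document itself.
        lines.append(json.dumps({"title": title}, ensure_ascii=False))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


titles = ["周星驰最新电影", "周星驰最好看的新电影"]
body = build_bulk_body("index1", "resource", titles)
print(body, end="")
```

The resulting string can be sent as the request body of POST /_bulk with the Content-Type application/x-ndjson.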

Query:

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields",
      "query":    "fox",
      "fields": "title"
    }
  }
}

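Result:

{
  "took": ,
  "timed_out": false,
  "_shards": {
    "total": ,
    "successful": ,
    "failed":
  },
  "hits": {
    "total": ,
    "max_score": null,
    "hits": []
  }
}

The reason: the query searches title for "fox", but title uses the standard analyzer, which indexed the token "foxes" unchanged, so nothing matches. The following query does return a result: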

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields",
      "query":    "fox",
      "fields": "title.en"
    }
  }
}

The result is not listed here. It matches because title.en uses the english analyzer, which stems "foxes" to "fox" at index time.
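The difference between the two fields can be imitated offline. The following Python sketch is a toy model only: standard_tokens roughly imitates lowercasing and word splitting, and toy_stem is a crude plural stripper standing in for the real stemmer of the english analyzer, which also removes stopwords. All function names are made up for illustration.

```python
import re


def standard_tokens(text):
    """Rough stand-in for the standard analyzer on English: lowercase words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def toy_stem(token):
    """Crude plural stripping; NOT the stemmer used by the english analyzer."""
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def matches(query, document, normalize):
    """True if query and document share at least one normalized token."""
    query_terms = {normalize(t) for t in standard_tokens(query)}
    doc_terms = {normalize(t) for t in standard_tokens(document)}
    return bool(query_terms & doc_terms)


doc = "I'm not happy about the foxes"
plain_hit = matches("fox", doc, lambda t: t)  # like searching `title`
stemmed_hit = matches("fox", doc, toy_stem)   # like searching `title.en`
print(plain_hit, stemmed_hit)
```

Without stemming, "fox" and "foxes" are different terms and never meet in the index; with stemming applied on both the document and the query, they collapse to the same term.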

Compare the output of the following queries to get a feel for how fields work:

GET /index1/resource/_search
{
  "query": {
    "match": {
      "title.cn": "the最好游戏"
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields",
      "query":    "the最新游戏",
      "fields": [ "title", "title.cn", "title.en" ]
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields",
      "query":    "the最新",
      "fields": "title.cn"
    }
  }
}

Study the results to see how each field behaves.

Next, use "王者荣耀" as the test case. Here the HotWords.php configured earlier turns out to be a double-edged sword: once "王者荣耀" is in the dictionary, it is treated as a single token and is no longer split into "王者" and "荣耀". But what if you do want to search for "王者"? This is where fields show their strength, as the following demonstrates.

First index the data:

POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id":  } }
{ "title": "王者荣耀最好玩的游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id":  } }
{ "title": "王者荣耀最好玩的新游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id":  } }
{ "title": "王者荣耀最新游戏，最好玩，新游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id":  } }
{ "title": "最最最最好的新新新新游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id":  } }
{ "title": "I'm not happy about the foxes" }

Query:

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields",
      "query":    "王者荣耀",
      "fields": "title.cn"
    }
  }
}

# The next query returns no results

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields",
      "query":    "王者",
      "fields": "title.cn"
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields",
      "query":    "王者",
      "fields": "title"
    }
  }
}

Comparing the results makes the difference obvious; the output is omitted here.
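Why "王者" finds nothing in title.cn but does match in title can be reproduced with a toy model in Python. Greedy forward maximum matching stands in for ik_smart here (IK's real segmentation is more elaborate), and per-character splitting stands in for how the standard analyzer treats Chinese text. The dictionary contents and function names are assumptions for illustration.

```python
def fmm_segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def per_char(text):
    """One token per character, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


# Hypothetical dictionary that includes the custom hot word.
dictionary = {"王者", "荣耀", "王者荣耀", "好玩", "游戏"}
title = "王者荣耀最好玩的游戏"

cn_index = fmm_segment(title, dictionary)   # like the title.cn field
cn_query = fmm_segment("王者", dictionary)
std_index = per_char(title)                 # like the title field
std_query = per_char("王者")

print(cn_index, set(cn_query) & set(cn_index))
print(set(std_query) & set(std_index))
```

In the dictionary-based field, the document holds the single term "王者荣耀" while the query holds "王者", so the two never meet. In the per-character field both sides contain "王" and "者", so the document is found.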

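So you need a solid understanding of the business requirements up front in order to design a good mapping, which also saves a lot of trouble at search time.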

References:

https://github.com/medcl/elasticsearch-analysis-ik

http://keenwon.com/1404.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html#_example_output
