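Elasticsearch Analyzers

Every analyzer, whether built-in or custom, is assembled from three kinds of building blocks: character filters, tokenizers, and token filters.

The built-in analyzers pre-package these building blocks into analyzers suited to different languages and types of text.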

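Character filters

A character filter receives the original text as a stream of characters and can transform that stream by adding, removing, or changing characters.

For example, a character filter could convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789).

An analyzer may have zero or more character filters, which are applied in order.

(PS: this is similar to filters or interceptors in the Servlet world; picture a filter chain.)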
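The filter-chain idea can be sketched locally in a few lines of Python. This is only an illustration of the concept, not Elasticsearch code: the function names (map_digits, strip_dashes, apply_char_filters) are hypothetical, and the digit mapping stands in for what a mapping-style character filter would do.

```python
from typing import Callable, Iterable

# Hindu-Arabic digits U+0660..U+0669, written as escapes to avoid
# bidirectional-text rendering problems in editors.
ARABIC_INDIC_DIGITS = "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669"
LATIN_DIGITS = "0123456789"
DIGIT_TABLE = str.maketrans(ARABIC_INDIC_DIGITS, LATIN_DIGITS)


def map_digits(text: str) -> str:
    """Hypothetical character filter: replace each Hindu-Arabic digit with its Latin equivalent."""
    return text.translate(DIGIT_TABLE)


def strip_dashes(text: str) -> str:
    """Hypothetical character filter: delete dash characters."""
    return text.replace("-", "")


def apply_char_filters(text: str, filters: Iterable[Callable[[str], str]]) -> str:
    """Run the text through each character filter in order, like a filter chain."""
    for char_filter in filters:
        text = char_filter(text)
    return text


if __name__ == "__main__":
    print(apply_char_filters("\u0661-\u0662", [map_digits, strip_dashes]))
```

Each filter takes characters in and hands characters out, which is why any number of them can be chained before the tokenizer runs.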

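Tokenizer

A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For example, a whitespace tokenizer splits the text into tokens whenever it sees whitespace. It would convert the text "Quick brown fox!" into [Quick, brown, fox!].

(PS: the tokenizer is responsible for splitting text into individual tokens, which here simply means words. A piece of text is cut into several parts, much like String.split in Java.)

The tokenizer is also responsible for recording the order or position of each term, and the start and end character offsets of the original word that the term represents. (PS: the output of tokenization is an array of terms.)

An analyzer must have exactly one tokenizer.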
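To make positions and offsets concrete, here is a minimal Python sketch of a whitespace tokenizer that records them. It is an approximation written for this article, not the Elasticsearch implementation; the Token type and the whitespace_tokenize name are assumptions.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    start_offset: int  # index of the first character of the term in the text
    end_offset: int    # index one past the last character of the term
    position: int      # ordinal of the term in the token stream


def whitespace_tokenize(text: str) -> List[Token]:
    """Split on runs of whitespace, keeping offsets and positions."""
    return [
        Token(match.group(), match.start(), match.end(), position)
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


if __name__ == "__main__":
    for token in whitespace_tokenize("Quick brown fox!"):
        print(token)
```

Unlike a plain split, the result keeps enough information to map each term back to its place in the original text, which is what features such as highlighting rely on.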

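Token filters

A token filter receives the token stream and may add, remove, or change tokens.

For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words (stop words) such as the, and a synonym token filter introduces synonyms into the token stream.

Token filters are not allowed to change the position or character offsets of each token.

An analyzer may have zero or more token filters, which are applied in order.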
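The three filters mentioned above can be imitated with plain Python functions over a list of strings. This is a simplified sketch: real tokens also carry positions and offsets, which are left out here, and all names (lowercase_filter, make_stop_filter, make_synonym_filter, apply_token_filters) are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

TokenFilter = Callable[[List[str]], List[str]]


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Change tokens: convert every token to lowercase."""
    return [token.lower() for token in tokens]


def make_stop_filter(stopwords: Iterable[str]) -> TokenFilter:
    """Remove tokens: drop every token found in the stop word set."""
    stop = set(stopwords)

    def stop_filter(tokens: List[str]) -> List[str]:
        return [token for token in tokens if token not in stop]

    return stop_filter


def make_synonym_filter(synonyms: Dict[str, List[str]]) -> TokenFilter:
    """Add tokens: emit each token followed by its configured synonyms."""

    def synonym_filter(tokens: List[str]) -> List[str]:
        expanded: List[str] = []
        for token in tokens:
            expanded.append(token)
            expanded.extend(synonyms.get(token, []))
        return expanded

    return synonym_filter


def apply_token_filters(tokens: List[str], filters: Iterable[TokenFilter]) -> List[str]:
    """Run the token stream through each filter in order."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    chain = [lowercase_filter, make_stop_filter({"the"}), make_synonym_filter({"quick": ["fast"]})]
    print(apply_token_filters(["The", "Quick", "fox"], chain))
```

Order matters: with the stop filter placed before the lowercase filter, the capitalized "The" would not match the stop word "the" and would survive.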

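Summary and review

An analyzer is a package made up of three parts: character filters, a tokenizer, and token filters.

An analyzer can have zero or more character filters.

An analyzer has exactly one tokenizer.

An analyzer can have zero or more token filters.

A character filter transforms characters: it takes a character stream as input and produces a character stream as output.

A tokenizer splits text: it takes a character stream as input and produces a token stream as output (the text is split into individual words, and these words are called tokens).

A token filter filters tokens: it takes a token stream as input and produces a token stream as output.

So the whole job of an analyzer is to break text into individual words: text ----> characters ----> tokens

This resembles a chain of interceptors.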

1. Testing analyzers

The analyze API is a tool for inspecting how text is analyzed. (PS: it plays a role similar to an execution plan.)

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}
'

curl -X POST "192.168.1.134:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text":      "Is this déja vu?"
}
'

Output of the first request:

{
    "tokens":[
        {
            "token":"The",
            "start_offset":0,
            "end_offset":3,
            "type":"word",
            "position":0
        },
        {
            "token":"quick",
            "start_offset":4,
            "end_offset":9,
            "type":"word",
            "position":1
        },
        {
            "token":"brown",
            "start_offset":10,
            "end_offset":15,
            "type":"word",
            "position":2
        },
        {
            "token":"fox.",
            "start_offset":16,
            "end_offset":20,
            "type":"word",
            "position":3
        }
    ]
}

As you can see, the position and character offsets are recorded for each term.
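One way to read such a response is to slice the original text with each pair of offsets and compare the slice with the token. The Python sketch below does that for a response body shaped like the one above. The offsets in SAMPLE_RESPONSE were worked out by hand from the text rather than captured from a live cluster, and the check_offsets helper is a hypothetical name.

```python
import json
from typing import List, Tuple

TEXT = "The quick brown fox."

# Response body in the shape returned by _analyze; values computed by hand.
SAMPLE_RESPONSE = json.dumps({
    "tokens": [
        {"token": "The", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0},
        {"token": "quick", "start_offset": 4, "end_offset": 9, "type": "word", "position": 1},
        {"token": "brown", "start_offset": 10, "end_offset": 15, "type": "word", "position": 2},
        {"token": "fox.", "start_offset": 16, "end_offset": 20, "type": "word", "position": 3},
    ]
})


def check_offsets(text: str, response_body: str) -> List[Tuple[int, str, str]]:
    """Return (position, token, source slice) for every token in the response.

    Assumes the text contains only characters from the Basic Multilingual Plane,
    so that Python string indexes line up with the reported offsets.
    """
    rows = []
    for token in json.loads(response_body)["tokens"]:
        source_slice = text[token["start_offset"]:token["end_offset"]]
        rows.append((token["position"], token["token"], source_slice))
    return rows


if __name__ == "__main__":
    for position, token, source_slice in check_offsets(TEXT, SAMPLE_RESPONSE):
        print(position, repr(token), repr(source_slice))
```

With the whitespace analyzer the token and the slice are identical. With analyzers that lowercase or fold characters, the token differs from the slice while the offsets still point at the original word.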

2. Analyzer

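2.1. Configuring built-in analyzers

Built-in analyzers can be used directly without any configuration. Their default settings can still be changed, though. For example, the standard analyzer can be configured with a list of stop words:

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": {
          "type":      "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_text": {
          "type":     "text",
          "analyzer": "standard",
          "fields": {
            "english": {
              "type":     "text",
              "analyzer": "std_english"
            }
          }
        }
      }
    }
  }
}
'

In this example, we define a std_english analyzer based on the standard analyzer and configure it to remove the predefined list of English stop words. In the mapping that follows, the my_text field uses the standard analyzer, while my_text.english uses the std_english analyzer. The two requests below therefore produce the following results: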

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text",
  "text": "The old brown cow"
}
'

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "field": "my_text.english",
  "text": "The old brown cow"
}
'

The first request uses the standard analyzer, so the result is: [ the, old, brown, cow ]

The second request uses the std_english analyzer, so the result is: [ old, brown, cow ]

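2.2. Standard Analyzer (default)

The standard analyzer is the default analyzer, used when none is specified. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm) and works well for most languages.

For example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

In the example above, the text produces the following terms: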

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]

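2.2.1. Configuration

The standard analyzer accepts the following parameters:

max_token_length: the maximum token length; a token longer than this is split at max_token_length intervals. Defaults to 255
stopwords: a predefined stop word list such as _english_, or an array containing a list of stop words. Defaults to _none_
stopwords_path: the path to a file containing stop words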

2.2.2. Example configuration

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}
'

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

The above produces the following terms:

[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
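The split of "jumped" into jumpe and d follows from a maximum token length of 5: tokens longer than the limit are cut at that interval. The Python sketch below imitates the behaviour locally. It is a rough approximation under stated assumptions: the regular expression is a stand-in for the real Unicode segmentation rules, ENGLISH_STOPWORDS is meant to mirror the _english_ list, and approx_standard_analyze is a hypothetical name.

```python
import re
from typing import Iterable, List

# Intended to mirror the predefined _english_ stop word list.
ENGLISH_STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
})

# Rough stand-in for the standard tokenizer: alphanumeric runs, keeping inner apostrophes.
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*")


def split_long(token: str, max_len: int) -> List[str]:
    """Cut a token into pieces of at most max_len characters."""
    return [token[i:i + max_len] for i in range(0, len(token), max_len)]


def approx_standard_analyze(
    text: str,
    max_token_length: int = 255,
    stopwords: Iterable[str] = (),
) -> List[str]:
    """Tokenize, split over-long tokens, lowercase, then drop stop words."""
    stop = set(stopwords)
    terms: List[str] = []
    for word in WORD_PATTERN.findall(text):
        for piece in split_long(word, max_token_length):
            piece = piece.lower()
            if piece not in stop:
                terms.append(piece)
    return terms


if __name__ == "__main__":
    sentence = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    print(approx_standard_analyze(sentence, max_token_length=5, stopwords=ENGLISH_STOPWORDS))
```

Calling the function with its defaults (limit 255, no stop words) keeps both occurrences of "the" and leaves "jumped" whole, which corresponds to the unconfigured standard analyzer.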

2.2.3. Definition

The standard analyzer consists of the following two parts:

Tokenizer

Standard Tokenizer

Token Filters

Standard Token Filter
Lower Case Token Filter
Stop Token Filter (disabled by default)

You can also rebuild it as a custom analyzer:

curl -X PUT "localhost:9200/standard_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
'

2.3. Simple Analyzer

The simple analyzer breaks text into terms whenever it encounters a character that is not a letter, and lowercases every term. For example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

Output:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
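The rule "split on anything that is not a letter, then lowercase" is short enough to imitate in Python. This is a local approximation for illustration only; approx_simple_analyze is a hypothetical name, and the character class [^\W\d_] is used as a way of saying "any Unicode letter".

```python
import re
from typing import List

# [^\W\d_] matches word characters that are neither digits nor underscores, i.e. letters.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def approx_simple_analyze(text: str) -> List[str]:
    """Keep maximal runs of letters and lowercase them."""
    return [run.lower() for run in LETTER_RUN.findall(text)]


if __name__ == "__main__":
    print(approx_simple_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
```

The digit 2 disappears because it is not a letter, and dog's becomes dog and s because the apostrophe acts as a split point.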

2.3.1. Customization

curl -X PUT "localhost:9200/simple_example" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": [
          ]
        }
      }
    }
  }
}
'

2.4. Whitespace Analyzer

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.

Example:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

Output:

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

2.5. Stop Analyzer

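The stop analyzer is much like the simple analyzer; the only difference is that it adds support for removing stop words. By default it uses the _english_ stop words.

(PS: in other words, take the sentence "this is a apple" and suppose that "this" and "is" are stop words. The simple analyzer would output [ this, is, a, apple ], whereas the stop analyzer would output [ a, apple ]. That is the difference between the two: stop does not emit stop words, i.e. it does not treat a stop word as a term.)

(PS: stop words are not delimiters; they are common words that are tokenized as usual and then dropped from the token stream.)

2.5.1. Example output

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
    "analyzer": "stop",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

Output: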

[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

2.5.2. Configuration

The stop analyzer accepts the following parameters:

stopwords: a predefined stop word list (for example, _english_) or an array containing a list of stop words. Defaults to _english_
stopwords_path: the path to a file containing stop words, relative to the Elasticsearch config directory

2.5.3. Example configuration

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}
'

The configuration above defines a stop analyzer with two stop words: the and over

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

With this configuration, the request outputs:

[ quick, brown, foxes, jumped, lazy, dog, s, bone ]
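The result can be reproduced locally with a small Python sketch: split into lowercase letter runs, as the simple analyzer does, then drop the configured stop words. The function name approx_stop_analyze is hypothetical, and the letter-run pattern is only an approximation of the real tokenizer.

```python
import re
from typing import Iterable, List

LETTER_RUN = re.compile(r"[^\W\d_]+")  # maximal runs of letters


def approx_stop_analyze(text: str, stopwords: Iterable[str]) -> List[str]:
    """Lowercased letter runs with the given stop words removed."""
    stop = {word.lower() for word in stopwords}
    terms = (run.lower() for run in LETTER_RUN.findall(text))
    return [term for term in terms if term not in stop]


if __name__ == "__main__":
    sentence = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    print(approx_stop_analyze(sentence, ["the", "over"]))
```

Because the stop list here contains only the and over, a word such as "s" survives, even though the tokenizer produced it from dog's.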

2.6. Pattern Analyzer

The pattern analyzer uses a Java regular expression to split text into terms. The default regular expression is \W+ (all non-word characters).

2.6.1. Example output

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

Because the text is split on non-word characters by default, the output is:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

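2.6.2. Configuration

The pattern analyzer accepts the following parameters:

pattern: a Java regular expression. Defaults to \W+
flags: Java regular expression flags, for example CASE_INSENSITIVE or COMMENTS
lowercase: whether terms should be lowercased. Defaults to true
stopwords: a predefined stop word list, or an array containing a list of stop words. Defaults to _none_
stopwords_path: the path to a file containing stop words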

2.6.3. Example configuration

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}
'

The example above splits on non-word characters or underscores and lowercases the resulting terms

curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
'

With this configuration, the example outputs:

[ john, smith, foo, bar, com ]
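The same split can be tried locally with Python's re module. This is a sketch under assumptions: Python's regular expression dialect differs from Java's, so re.ASCII is passed to make \W behave like Java's default ASCII-only definition, and approx_pattern_analyze is a hypothetical name.

```python
import re
from typing import List


def approx_pattern_analyze(text: str, pattern: str = r"\W+", lowercase: bool = True) -> List[str]:
    """Split text on the separator pattern, drop empty pieces, optionally lowercase."""
    pieces = re.split(pattern, text, flags=re.ASCII)
    terms = [piece for piece in pieces if piece]
    return [term.lower() for term in terms] if lowercase else terms


if __name__ == "__main__":
    print(approx_pattern_analyze("John_Smith@foo-bar.com", pattern=r"\W|_"))
```

Note that the pattern describes the separators, not the terms: the underscore has to be added explicitly because \W treats it as a word character.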

2.7. Language Analyzers

Language analyzers support text analysis for specific languages. The built-in (predefined) languages are: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

2.8. Custom Analyzer

As mentioned earlier, an analyzer is made up of three parts:

zero or more character filters
a tokenizer
zero or more token filters

2.8.1. Example configuration

curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type":      "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
'
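To see what this combination does to a piece of text without a running cluster, the three stages can be approximated in Python. Everything below is a simplified stand-in, not the real components: html_strip is reduced to a tag-removing regular expression plus entity decoding, the tokenizer to runs of word characters, and asciifolding to Unicode decomposition. The sample text and the function names are assumptions made for this sketch.

```python
import html
import re
import unicodedata
from typing import List


def html_strip(text: str) -> str:
    """Character filter stage: remove tags and decode HTML entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def tokenize(text: str) -> List[str]:
    """Tokenizer stage: rough stand-in that keeps runs of word characters."""
    return re.findall(r"\w+", text)


def ascii_fold(token: str) -> str:
    """Token filter stage: strip diacritics by decomposing and keeping ASCII.

    Unlike the real filter, characters with no ASCII decomposition are dropped here.
    """
    return unicodedata.normalize("NFKD", token).encode("ascii", "ignore").decode("ascii")


def my_custom_analyze(text: str) -> List[str]:
    """char_filter -> tokenizer -> lowercase -> asciifolding, in that order."""
    stripped = html_strip(text)
    return [ascii_fold(token.lower()) for token in tokenize(stripped)]


if __name__ == "__main__":
    print(my_custom_analyze("Is this <b>d\u00e9j\u00e0 vu</b>?"))
```

The order of the stages mirrors the configuration: the character filter sees the raw text, the tokenizer sees the filtered text, and the token filters see individual tokens.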

3. Tokenizer

3.1. Standard Tokenizer

curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog\u0027s bone."
}
'

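4. Chinese analyzers

4.1. smartCN

A simple analyzer for Chinese or mixed Chinese-English text

This plugin provides the smartcn analyzer and the smartcn_tokenizer tokenizer, and needs no configuration

# Install

bin/elasticsearch-plugin install analysis-smartcn

# Uninstall

bin/elasticsearch-plugin remove analysis-smartcn

Let's test it

Analyzing "今天天气真好" with the smartcn analyzer gives:

[ 今天, 天气, 真, 好 ]

With the standard analyzer, which emits one token per Chinese character, the result would be:

[ 今, 天, 天, 气, 真, 好 ]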

4.2. IK analyzer

Download the release that matches your Elasticsearch version; here I downloaded 6.5.3

Then create an ik directory under the Elasticsearch plugins directory and unzip the downloaded file into it

Finally, restart Elasticsearch

Next, test it with the same sentence as before

The output is:

{
    "tokens": [
        {
            "token": "今天天气",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "今天",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "天天",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "天气",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "真好",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}

This is clearly somewhat better than smartcn
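The overlapping tokens in this output look like the result of emitting every dictionary word found in the sentence, which is the idea behind IK's ik_max_word mode (assumed here; the original request is not shown). The Python sketch below illustrates that idea with an exhaustive dictionary lookup. It is not IK's actual algorithm, and the five-word dictionary and the max_word_segment name are hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical mini dictionary; a real dictionary contains far more entries.
DICTIONARY = {"今天天气", "今天", "天天", "天气", "真好"}


def max_word_segment(text: str, dictionary: Set[str]) -> List[Tuple[str, int, int]]:
    """Return (word, start_offset, end_offset) for every dictionary word in the text."""
    hits = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            if candidate in dictionary:
                hits.append((candidate, start, end))
    # Order by start offset; among words starting at the same place, longer first.
    hits.sort(key=lambda hit: (hit[1], -(hit[2] - hit[1])))
    return hits


if __name__ == "__main__":
    for word, start, end in max_word_segment("今天天气真好", DICTIONARY):
        print(word, start, end)
```

Emitting all candidate words increases recall, since a search for 今天 or 天气 both match, at the cost of a larger index. The quality of the result depends entirely on the dictionary.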

5. References

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html

https://github.com/medcl/elasticsearch-analysis-ik
