Introduction to Elasticsearch (ES) analyzers

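standard: splits the text on word boundaries and lowercases each token. Punctuation is dropped and stop words are kept.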

GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
}

{

  "tokens" : [

    {

      "token" : "2",

      "start_offset" : 0,

      "end_offset" : 1,

      "type" : "<NUM>",

      "position" : 0

    },

    {

      "token" : "running",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "<ALPHANUM>",

      "position" : 1

    },

    {

      "token" : "quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "<ALPHANUM>",

      "position" : 2

    },

    {

      "token" : "brawn",

      "start_offset" : 16,

      "end_offset" : 21,

      "type" : "<ALPHANUM>",

      "position" : 3

    },

    {

      "token" : "foxes",

      "start_offset" : 22,

      "end_offset" : 27,

      "type" : "<ALPHANUM>",

      "position" : 4

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "<ALPHANUM>",

      "position" : 5

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "<ALPHANUM>",

      "position" : 6

    },

    {

      "token" : "lazy",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "<ALPHANUM>",

      "position" : 7

    },

    {

      "token" : "dogs",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "<ALPHANUM>",

      "position" : 8

    },

    {

      "token" : "in",

      "start_offset" : 48,

      "end_offset" : 50,

      "type" : "<ALPHANUM>",

      "position" : 9

    },

    {

      "token" : "the",

      "start_offset" : 51,

      "end_offset" : 54,

      "type" : "<ALPHANUM>",

      "position" : 10

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "<ALPHANUM>",

      "position" : 11

    },

    {

      "token" : "evening",

      "start_offset" : 62,

      "end_offset" : 69,

      "type" : "<ALPHANUM>",

      "position" : 12

    }

  ]

}
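
The token list above can be approximated locally. This is a minimal Python sketch, not Elasticsearch's implementation: the real standard analyzer follows Unicode text segmentation (UAX #29), while this stand-in simply treats runs of ASCII letters and digits as tokens, which is enough for this sentence. The function name approx_standard is hypothetical.

```python
import re

TEXT = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."


def approx_standard(text):
    """Rough stand-in for the standard analyzer on plain ASCII text."""
    tokens = []
    # Each run of letters/digits becomes one token; punctuation only separates.
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        tokens.append({
            "token": match.group().lower(),   # lowercase filter
            "start_offset": match.start(),    # offset of the first character
            "end_offset": match.end(),        # offset just past the last character
            "position": position,
        })
    return tokens


for tok in approx_standard(TEXT):
    print(tok)
```

For the sample sentence this produces the same tokens, offsets, and positions as the response above; the type field is left out.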

simple: splits the text at every non-letter character and lowercases each token, so the digit 2 is dropped.

GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
}

{

  "tokens" : [

    {

      "token" : "running",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "word",

      "position" : 0

    },

    {

      "token" : "quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "word",

      "position" : 1

    },

    {

      "token" : "brawn",

      "start_offset" : 16,

      "end_offset" : 21,

      "type" : "word",

      "position" : 2

    },

    {

      "token" : "foxes",

      "start_offset" : 22,

      "end_offset" : 27,

      "type" : "word",

      "position" : 3

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "word",

      "position" : 4

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "word",

      "position" : 5

    },

    {

      "token" : "lazy",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "word",

      "position" : 6

    },

    {

      "token" : "dogs",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "word",

      "position" : 7

    },

    {

      "token" : "in",

      "start_offset" : 48,

      "end_offset" : 50,

      "type" : "word",

      "position" : 8

    },

    {

      "token" : "the",

      "start_offset" : 51,

      "end_offset" : 54,

      "type" : "word",

      "position" : 9

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "word",

      "position" : 10

    },

    {

      "token" : "evening",

      "start_offset" : 62,

      "end_offset" : 69,

      "type" : "word",

      "position" : 11

    }

  ]

}
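
To see why the digit disappears, the behaviour can be imitated with a regular expression that keeps only runs of letters. This is a rough Python sketch under that assumption; approx_simple is a hypothetical name and the regex is only an approximation of the letter-based tokenization.

```python
import re

TEXT = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."


def approx_simple(text):
    """Rough stand-in for the simple analyzer: letter runs, lowercased."""
    # [^\W\d_]+ matches word characters that are neither digits nor "_",
    # i.e. runs of letters. Everything else acts as a separator.
    return [match.group().lower() for match in re.finditer(r"[^\W\d_]+", text)]


print(approx_simple(TEXT))
```

Compared with the standard analyzer, the only difference on this sentence is that "2" never becomes a token.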

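whitespace: splits the text on whitespace only and applies no further processing, so case, hyphens, and the trailing period are preserved.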

GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
}

{

  "tokens" : [

    {

      "token" : "2",

      "start_offset" : 0,

      "end_offset" : 1,

      "type" : "word",

      "position" : 0

    },

    {

      "token" : "running",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "word",

      "position" : 1

    },

    {

      "token" : "Quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "word",

      "position" : 2

    },

    {

      "token" : "brawn-foxes",

      "start_offset" : 16,

      "end_offset" : 27,

      "type" : "word",

      "position" : 3

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "word",

      "position" : 4

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "word",

      "position" : 5

    },

    {

      "token" : "lazy",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "word",

      "position" : 6

    },

    {

      "token" : "dogs",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "word",

      "position" : 7

    },

    {

      "token" : "in",

      "start_offset" : 48,

      "end_offset" : 50,

      "type" : "word",

      "position" : 8

    },

    {

      "token" : "the",

      "start_offset" : 51,

      "end_offset" : 54,

      "type" : "word",

      "position" : 9

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "word",

      "position" : 10

    },

    {

      "token" : "evening.",

      "start_offset" : 62,

      "end_offset" : 70,

      "type" : "word",

      "position" : 11

    }

  ]

}
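
The same splitting can be sketched in Python by matching runs of non-whitespace characters. This is an illustration only; approx_whitespace is a hypothetical name.

```python
import re

TEXT = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."


def approx_whitespace(text):
    """Rough stand-in for the whitespace analyzer: no lowercasing, no cleanup."""
    return [
        {
            "token": match.group(),          # kept exactly as written
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        }
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


for tok in approx_whitespace(TEXT):
    print(tok)
```

Note that "Quick" keeps its capital, "brawn-foxes" stays one token, and the last token is "evening." with an end offset of 70.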

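stop: tokenizes like simple and additionally removes stop words such as "in" and "the". The position values still count the removed words.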

GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
}

{

  "tokens" : [

    {

      "token" : "running",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "word",

      "position" : 0

    },

    {

      "token" : "quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "word",

      "position" : 1

    },

    {

      "token" : "brawn",

      "start_offset" : 16,

      "end_offset" : 21,

      "type" : "word",

      "position" : 2

    },

    {

      "token" : "foxes",

      "start_offset" : 22,

      "end_offset" : 27,

      "type" : "word",

      "position" : 3

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "word",

      "position" : 4

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "word",

      "position" : 5

    },

    {

      "token" : "lazy",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "word",

      "position" : 6

    },

    {

      "token" : "dogs",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "word",

      "position" : 7

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "word",

      "position" : 10

    },

    {

      "token" : "evening",

      "start_offset" : 62,

      "end_offset" : 69,

      "type" : "word",

      "position" : 11

    }

  ]

}
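
The jump in position from 7 to 10 in the response above comes from the removed words still consuming a position. A small Python sketch of that idea follows. The stop word set is an assumption meant to mirror the default English list, and approx_stop is a hypothetical name.

```python
import re

TEXT = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

# Assumed to mirror the default English stop word list.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def approx_stop(text):
    """Rough stand-in for the stop analyzer: letter runs, lowercased, stop words removed."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[^\W\d_]+", text)):
        word = match.group().lower()
        if word in ENGLISH_STOP_WORDS:
            continue  # dropped, but its position has already been counted
        tokens.append({"token": word, "position": position})
    return tokens


for tok in approx_stop(TEXT):
    print(tok)
```

Keeping the gap in positions is what lets phrase queries know that a word was removed between "dogs" and "summer".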

keyword: does not split the text; the whole input is emitted as a single token.

GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
}

{

  "tokens" : [

    {

      "token" : "2 running Quick brawn-foxes leap over lazy dogs in the summer evening.",

      "start_offset" : 0,

      "end_offset" : 70,

      "type" : "word",

      "position" : 0

    }

  ]

}

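pattern: splits the text with a regular expression. The default pattern is \W+ (runs of non-word characters), and tokens are lowercased.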

GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
}

{

  "tokens" : [

    {

      "token" : "2",

      "start_offset" : 0,

      "end_offset" : 1,

      "type" : "word",

      "position" : 0

    },

    {

      "token" : "running",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "word",

      "position" : 1

    },

    {

      "token" : "quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "word",

      "position" : 2

    },

    {

      "token" : "brawn",

      "start_offset" : 16,

      "end_offset" : 21,

      "type" : "word",

      "position" : 3

    },

    {

      "token" : "foxes",

      "start_offset" : 22,

      "end_offset" : 27,

      "type" : "word",

      "position" : 4

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "word",

      "position" : 5

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "word",

      "position" : 6

    },

    {

      "token" : "lazy",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "word",

      "position" : 7

    },

    {

      "token" : "dogs",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "word",

      "position" : 8

    },

    {

      "token" : "in",

      "start_offset" : 48,

      "end_offset" : 50,

      "type" : "word",

      "position" : 9

    },

    {

      "token" : "the",

      "start_offset" : 51,

      "end_offset" : 54,

      "type" : "word",

      "position" : 10

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "word",

      "position" : 11

    },

    {

      "token" : "evening",

      "start_offset" : 62,

      "end_offset" : 69,

      "type" : "word",

      "position" : 12

    }

  ]

}
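
Unlike the other analyzers, here the regular expression describes the separators, not the tokens. A minimal Python sketch of that split-based approach is shown below. approx_pattern is a hypothetical name; its pattern and lowercase arguments are modelled on the analyzer's settings of the same names, and Python's re module stands in for Java regular expressions, so edge cases may differ.

```python
import re

TEXT = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."


def approx_pattern(text, pattern=r"\W+", lowercase=True):
    """Rough stand-in for the pattern analyzer: split on the regex, then lowercase."""
    # re.ASCII keeps \W limited to non-ASCII-word characters (an assumption about the default).
    pieces = re.split(pattern, text, flags=re.ASCII)
    # Splitting leaves an empty string after the trailing "."; drop empties.
    tokens = [piece for piece in pieces if piece]
    return [token.lower() for token in tokens] if lowercase else tokens


print(approx_pattern(TEXT))
print(approx_pattern("a,b;c", pattern=r"[,;]"))
```

With the default pattern the result matches the standard analyzer on this sentence; passing a different pattern, such as the comma and semicolon class in the second call, changes what counts as a separator.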

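english: the English-language analyzer. It lowercases tokens, removes English stop words, and stems the remaining words (for example, "running" becomes "run" and "lazy" becomes "lazi").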

GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
}

{

  "tokens" : [

    {

      "token" : "2",

      "start_offset" : 0,

      "end_offset" : 1,

      "type" : "<NUM>",

      "position" : 0

    },

    {

      "token" : "run",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "<ALPHANUM>",

      "position" : 1

    },

    {

      "token" : "quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "<ALPHANUM>",

      "position" : 2

    },

    {

      "token" : "brawn",

      "start_offset" : 16,

      "end_offset" : 21,

      "type" : "<ALPHANUM>",

      "position" : 3

    },

    {

      "token" : "fox",

      "start_offset" : 22,

      "end_offset" : 27,

      "type" : "<ALPHANUM>",

      "position" : 4

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "<ALPHANUM>",

      "position" : 5

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "<ALPHANUM>",

      "position" : 6

    },

    {

      "token" : "lazi",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "<ALPHANUM>",

      "position" : 7

    },

    {

      "token" : "dog",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "<ALPHANUM>",

      "position" : 8

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "<ALPHANUM>",

      "position" : 11

    },

    {

      "token" : "even",

      "start_offset" : 62,

      "end_offset" : 69,

      "type" : "<ALPHANUM>",

      "position" : 12

    }

  ]

}
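
Stems such as "lazi" and "even" look odd but follow from Porter-style suffix rules. The sketch below is a deliberately partial Python illustration, assuming a Porter-style stemmer: it implements only steps 1a, 1b, 1c and 5a of the Porter algorithm and omits steps 2 to 4 and possessive handling, so it is not a replacement for the real analyzer. All function names are hypothetical, and the stop word set is an assumption meant to mirror the default English list.

```python
import re

TEXT = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

# Assumed to mirror the default English stop word list.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def is_consonant(word, i):
    """Porter's rule: 'y' is a consonant only at the start or after a vowel."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def has_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def measure(stem):
    """Porter's m: how many times a vowel run is followed by a consonant run."""
    shape = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    return shape.count("vc")


def ends_cvc(stem):
    """Consonant-vowel-consonant ending whose last letter is not w, x or y."""
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")


def partial_porter_stem(word):
    if len(word) <= 2:
        return word
    # Step 1a: plural endings (foxes -> foxe, dogs -> dog).
    if word.endswith("sses") or word.endswith("ies"):
        word = word[:-2]
    elif not word.endswith("ss") and word.endswith("s"):
        word = word[:-1]
    # Step 1b: -eed, -ed, -ing (running -> runn -> run).
    if word.endswith("eed"):
        if measure(word[:-3]) > 0:
            word = word[:-1]
    else:
        for suffix in ("ed", "ing"):
            stem = word[:-len(suffix)]
            if word.endswith(suffix) and has_vowel(stem):
                word = stem
                if word.endswith(("at", "bl", "iz")):
                    word += "e"
                elif (len(word) >= 2 and word[-1] == word[-2]
                      and is_consonant(word, len(word) - 1) and word[-1] not in "lsz"):
                    word = word[:-1]      # undouble the final consonant
                elif measure(word) == 1 and ends_cvc(word):
                    word += "e"
                break
    # Step 1c: final y -> i when the stem contains a vowel (lazy -> lazi).
    if word.endswith("y") and has_vowel(word[:-1]):
        word = word[:-1] + "i"
    # Step 5a: drop a final e in the cases Porter allows (foxe -> fox).
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            word = stem
    return word


def approx_english(text):
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        word = match.group().lower()
        if word not in STOP_WORDS:        # removed words still consume a position
            tokens.append({"token": partial_porter_stem(word), "position": position})
    return tokens


for tok in approx_english(TEXT):
    print(tok)
```

On the sample sentence this reproduces the tokens and positions of the response above. Because the offsets in the response still point at the original words, highlighting works even though the indexed term is the stem.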

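Chinese text with the standard analyzer: there is no word-level segmentation, so each character becomes its own token.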

POST _analyze
{
  "analyzer": "standard",
  "text": "他说的确实在理"
}

{

  "tokens" : [

    {

      "token" : "他",

      "start_offset" : 0,

      "end_offset" : 1,

      "type" : "<IDEOGRAPHIC>",

      "position" : 0

    },

    {

      "token" : "说",

      "start_offset" : 1,

      "end_offset" : 2,

      "type" : "<IDEOGRAPHIC>",

      "position" : 1

    },

    {

      "token" : "的",

      "start_offset" : 2,

      "end_offset" : 3,

      "type" : "<IDEOGRAPHIC>",

      "position" : 2

    },

    {

      "token" : "确",

      "start_offset" : 3,

      "end_offset" : 4,

      "type" : "<IDEOGRAPHIC>",

      "position" : 3

    },

    {

      "token" : "实",

      "start_offset" : 4,

      "end_offset" : 5,

      "type" : "<IDEOGRAPHIC>",

      "position" : 4

    },

    {

      "token" : "在",

      "start_offset" : 5,

      "end_offset" : 6,

      "type" : "<IDEOGRAPHIC>",

      "position" : 5

    },

    {

      "token" : "理",

      "start_offset" : 6,

      "end_offset" : 7,

      "type" : "<IDEOGRAPHIC>",

      "position" : 6

    }

  ]

}
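
The per-character result above is easy to mimic, which also shows how little linguistic work is being done. This is a minimal Python sketch assuming the input consists only of Han ideographs from the Basic Multilingual Plane (so that Python string indices match the reported offsets); approx_standard_cjk is a hypothetical name.

```python
TEXT = "他说的确实在理"


def approx_standard_cjk(text):
    """Rough stand-in for the standard analyzer on Han-only text: one token per character."""
    return [
        {"token": ch, "start_offset": i, "end_offset": i + 1, "position": i}
        for i, ch in enumerate(text)
    ]


for tok in approx_standard_cjk(TEXT):
    print(tok)
```

Words such as 确实 are never recognised as units here, which is the usual motivation for installing a dedicated Chinese analyzer plugin.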
