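Introduction to Elasticsearch (ES) analyzers

The standard analyzer splits text on word boundaries, drops most punctuation, and lowercases every token (in the output below, "Quick" becomes "quick" and "brawn-foxes" is split in two).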

GET _analyze

{

  "analyzer": "standard",

  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

}

{

  "tokens" : [

    {

      "token" : "2",

      "start_offset" : 0,

      "end_offset" : 1,

      "type" : "<NUM>",

      "position" : 0

    },

    {

      "token" : "running",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "<ALPHANUM>",

      "position" : 1

    },

    {

      "token" : "quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "<ALPHANUM>",

      "position" : 2

    },

    {

      "token" : "brawn",

      "start_offset" : 16,

      "end_offset" : 21,

      "type" : "<ALPHANUM>",

      "position" : 3

    },

    {

      "token" : "foxes",

      "start_offset" : 22,

      "end_offset" : 27,

      "type" : "<ALPHANUM>",

      "position" : 4

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "<ALPHANUM>",

      "position" : 5

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "<ALPHANUM>",

      "position" : 6

    },

    {

      "token" : "lazy",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "<ALPHANUM>",

      "position" : 7

    },

    {

      "token" : "dogs",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "<ALPHANUM>",

      "position" : 8

    },

    {

      "token" : "in",

      "start_offset" : 48,

      "end_offset" : 50,

      "type" : "<ALPHANUM>",

      "position" : 9

    },

    {

      "token" : "the",

      "start_offset" : 51,

      "end_offset" : 54,

      "type" : "<ALPHANUM>",

      "position" : 10

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "<ALPHANUM>",

      "position" : 11

    },

    {

      "token" : "evening",

      "start_offset" : 62,

      "end_offset" : 69,

      "type" : "<ALPHANUM>",

      "position" : 12

    }

  ]

}
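The response above can be approximated in a few lines of Python. This is only a sketch for ASCII input like the sample sentence: the real standard analyzer follows Unicode text segmentation rules, and the function name `standard_like` is made up for illustration.

```python
import re

def standard_like(text):
    """Rough, ASCII-only approximation of the standard analyzer."""
    tokens = []
    # Every run of letters or digits becomes one token; anything else separates tokens.
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        tokens.append({
            "token": match.group().lower(),   # tokens are lowercased
            "start_offset": match.start(),    # character offsets into the input
            "end_offset": match.end(),
            "position": position,             # ordinal of the token in the stream
        })
    return tokens

text = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
standard_tokens = standard_like(text)
print([t["token"] for t in standard_tokens])
```

The offsets it computes line up with the `start_offset` and `end_offset` values in the response, which are plain character indexes into the original text.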

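The simple analyzer splits on every character that is not a letter and lowercases the tokens, so the digit "2" is discarded.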

GET _analyze

{

  "analyzer": "simple",

  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

}

{

  "tokens" : [

    {

      "token" : "running",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "word",

      "position" : 0

    },

    {

      "token" : "quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "word",

      "position" : 1

    },

    {

      "token" : "brawn",

      "start_offset" : 16,

      "end_offset" : 21,

      "type" : "word",

      "position" : 2

    },

    {

      "token" : "foxes",

      "start_offset" : 22,

      "end_offset" : 27,

      "type" : "word",

      "position" : 3

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "word",

      "position" : 4

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "word",

      "position" : 5

    },

    {

      "token" : "lazy",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "word",

      "position" : 6

    },

    {

      "token" : "dogs",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "word",

      "position" : 7

    },

    {

      "token" : "in",

      "start_offset" : 48,

      "end_offset" : 50,

      "type" : "word",

      "position" : 8

    },

    {

      "token" : "the",

      "start_offset" : 51,

      "end_offset" : 54,

      "type" : "word",

      "position" : 9

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "word",

      "position" : 10

    },

    {

      "token" : "evening",

      "start_offset" : 62,

      "end_offset" : 69,

      "type" : "word",

      "position" : 11

    }

  ]

}
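A minimal Python sketch of the same idea, assuming ASCII letters only (the real analyzer works on Unicode letters; `simple_like` is a hypothetical name):

```python
import re

def simple_like(text):
    """Keep runs of ASCII letters and lowercase them; digits and punctuation are dropped."""
    return [word.lower() for word in re.findall(r"[A-Za-z]+", text)]

text = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
simple_tokens = simple_like(text)
print(simple_tokens)
```

Because "2" contains no letters, it never becomes a token, which is why the response starts with "running" at position 0.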

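The whitespace analyzer splits on whitespace only and does no further processing: case and punctuation are kept as-is ("Quick", "brawn-foxes", "evening.").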

GET _analyze

{

  "analyzer": "whitespace",

  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

}

{

  "tokens" : [

    {

      "token" : "2",

      "start_offset" : 0,

      "end_offset" : 1,

      "type" : "word",

      "position" : 0

    },

    {

      "token" : "running",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "word",

      "position" : 1

    },

    {

      "token" : "Quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "word",

      "position" : 2

    },

    {

      "token" : "brawn-foxes",

      "start_offset" : 16,

      "end_offset" : 27,

      "type" : "word",

      "position" : 3

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "word",

      "position" : 4

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "word",

      "position" : 5

    },

    {

      "token" : "lazy",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "word",

      "position" : 6

    },

    {

      "token" : "dogs",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "word",

      "position" : 7

    },

    {

      "token" : "in",

      "start_offset" : 48,

      "end_offset" : 50,

      "type" : "word",

      "position" : 8

    },

    {

      "token" : "the",

      "start_offset" : 51,

      "end_offset" : 54,

      "type" : "word",

      "position" : 9

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "word",

      "position" : 10

    },

    {

      "token" : "evening.",

      "start_offset" : 62,

      "end_offset" : 70,

      "type" : "word",

      "position" : 11

    }

  ]

}
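As a rough Python sketch (the helper name `whitespace_like` is an assumption for illustration), whitespace tokenization is just a match on runs of non-whitespace characters:

```python
import re

def whitespace_like(text):
    """Split on whitespace only; keep case and punctuation untouched."""
    return [
        {"token": m.group(), "start_offset": m.start(), "end_offset": m.end(), "position": i}
        for i, m in enumerate(re.finditer(r"\S+", text))
    ]

text = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
whitespace_tokens = whitespace_like(text)
print([t["token"] for t in whitespace_tokens])
```

The last token keeps its trailing period, so its `end_offset` is 70 rather than 69.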

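The stop analyzer behaves like the simple analyzer and additionally removes stop words (here "in" and "the"). The positions of removed tokens are not reused, which is why "summer" is at position 10 in the output.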

GET _analyze

{

  "analyzer": "stop",

  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

}

{

  "tokens" : [

    {

      "token" : "running",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "word",

      "position" : 0

    },

    {

      "token" : "quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "word",

      "position" : 1

    },

    {

      "token" : "brawn",

      "start_offset" : 16,

      "end_offset" : 21,

      "type" : "word",

      "position" : 2

    },

    {

      "token" : "foxes",

      "start_offset" : 22,

      "end_offset" : 27,

      "type" : "word",

      "position" : 3

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "word",

      "position" : 4

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "word",

      "position" : 5

    },

    {

      "token" : "lazy",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "word",

      "position" : 6

    },

    {

      "token" : "dogs",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "word",

      "position" : 7

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "word",

      "position" : 10

    },

    {

      "token" : "evening",

      "start_offset" : 62,

      "end_offset" : 69,

      "type" : "word",

      "position" : 11

    }

  ]

}
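The position gap can be reproduced with a small Python sketch. The stop-word set here is a short illustrative subset, not the analyzer's full default list, and `stop_like` is a hypothetical name.

```python
import re

# Illustrative subset of English stop words (an assumption, not the full default list).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def stop_like(text):
    """Letter tokenization + lowercasing, then stop-word removal that preserves positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z]+", text)):
        word = match.group().lower()
        if word in STOP_WORDS:
            # The token is dropped, but its position number is still consumed.
            continue
        tokens.append({
            "token": word,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

text = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
stop_tokens = stop_like(text)
print([(t["token"], t["position"]) for t in stop_tokens])
```

Keeping the gaps lets position-sensitive queries see that two tokens were originally separated by other words.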

The keyword analyzer does no splitting at all and emits the entire input as a single token.

GET _analyze

{

  "analyzer": "keyword",

  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

}

{

  "tokens" : [

    {

      "token" : "2 running Quick brawn-foxes leap over lazy dogs in the summer evening.",

      "start_offset" : 0,

      "end_offset" : 70,

      "type" : "word",

      "position" : 0

    }

  ]

}

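The pattern analyzer splits text with a regular expression. By default the pattern is \W+ (runs of non-word characters), and the tokens are lowercased.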

GET _analyze

{

  "analyzer": "pattern",

  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

}

{

  "tokens" : [

    {

      "token" : "2",

      "start_offset" : 0,

      "end_offset" : 1,

      "type" : "word",

      "position" : 0

    },

    {

      "token" : "running",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "word",

      "position" : 1

    },

    {

      "token" : "quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "word",

      "position" : 2

    },

    {

      "token" : "brawn",

      "start_offset" : 16,

      "end_offset" : 21,

      "type" : "word",

      "position" : 3

    },

    {

      "token" : "foxes",

      "start_offset" : 22,

      "end_offset" : 27,

      "type" : "word",

      "position" : 4

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "word",

      "position" : 5

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "word",

      "position" : 6

    },

    {

      "token" : "lazy",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "word",

      "position" : 7

    },

    {

      "token" : "dogs",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "word",

      "position" : 8

    },

    {

      "token" : "in",

      "start_offset" : 48,

      "end_offset" : 50,

      "type" : "word",

      "position" : 9

    },

    {

      "token" : "the",

      "start_offset" : 51,

      "end_offset" : 54,

      "type" : "word",

      "position" : 10

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "word",

      "position" : 11

    },

    {

      "token" : "evening",

      "start_offset" : 62,

      "end_offset" : 69,

      "type" : "word",

      "position" : 12

    }

  ]

}
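In Python terms, the default behaviour is close to a regex split followed by lowercasing. This sketch assumes the default \W+ pattern and uses Python's `re` module, whose notion of a "word character" may differ from the analyzer's for non-ASCII input; `pattern_like` is a hypothetical name.

```python
import re

def pattern_like(text, pattern=r"\W+"):
    """Split on the separator pattern, drop empty pieces, lowercase the rest."""
    return [piece.lower() for piece in re.split(pattern, text) if piece]

text = "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
pattern_tokens = pattern_like(text)
print(pattern_tokens)

# A different separator changes the result: splitting on spaces only keeps the hyphen.
print(pattern_like(text, pattern=r" +"))
```

Unlike the simple analyzer, digits count as word characters here, so "2" survives as a token.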

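The english analyzer tokenizes like the standard analyzer, lowercases, removes English stop words ("in", "the"), and stems each token, so "running" becomes "run", "foxes" becomes "fox", "lazy" becomes "lazi", and "evening" becomes "even".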

GET _analyze

{

  "analyzer": "english",

  "text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."

}

{

  "tokens" : [

    {

      "token" : "2",

      "start_offset" : 0,

      "end_offset" : 1,

      "type" : "<NUM>",

      "position" : 0

    },

    {

      "token" : "run",

      "start_offset" : 2,

      "end_offset" : 9,

      "type" : "<ALPHANUM>",

      "position" : 1

    },

    {

      "token" : "quick",

      "start_offset" : 10,

      "end_offset" : 15,

      "type" : "<ALPHANUM>",

      "position" : 2

    },

    {

      "token" : "brawn",

      "start_offset" : 16,

      "end_offset" : 21,

      "type" : "<ALPHANUM>",

      "position" : 3

    },

    {

      "token" : "fox",

      "start_offset" : 22,

      "end_offset" : 27,

      "type" : "<ALPHANUM>",

      "position" : 4

    },

    {

      "token" : "leap",

      "start_offset" : 28,

      "end_offset" : 32,

      "type" : "<ALPHANUM>",

      "position" : 5

    },

    {

      "token" : "over",

      "start_offset" : 33,

      "end_offset" : 37,

      "type" : "<ALPHANUM>",

      "position" : 6

    },

    {

      "token" : "lazi",

      "start_offset" : 38,

      "end_offset" : 42,

      "type" : "<ALPHANUM>",

      "position" : 7

    },

    {

      "token" : "dog",

      "start_offset" : 43,

      "end_offset" : 47,

      "type" : "<ALPHANUM>",

      "position" : 8

    },

    {

      "token" : "summer",

      "start_offset" : 55,

      "end_offset" : 61,

      "type" : "<ALPHANUM>",

      "position" : 11

    },

    {

      "token" : "even",

      "start_offset" : 62,

      "end_offset" : 69,

      "type" : "<ALPHANUM>",

      "position" : 12

    }

  ]

}

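Chinese text with the standard analyzer: the text is split one character at a time, so each character becomes its own token.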

POST _analyze

{

  "analyzer": "standard",

  "text": "他说的确实在理"

}

{

  "tokens" : [

    {

      "token" : "他",

      "start_offset" : 0,

      "end_offset" : 1,

      "type" : "<IDEOGRAPHIC>",

      "position" : 0

    },

    {

      "token" : "说",

      "start_offset" : 1,

      "end_offset" : 2,

      "type" : "<IDEOGRAPHIC>",

      "position" : 1

    },

    {

      "token" : "的",

      "start_offset" : 2,

      "end_offset" : 3,

      "type" : "<IDEOGRAPHIC>",

      "position" : 2

    },

    {

      "token" : "确",

      "start_offset" : 3,

      "end_offset" : 4,

      "type" : "<IDEOGRAPHIC>",

      "position" : 3

    },

    {

      "token" : "实",

      "start_offset" : 4,

      "end_offset" : 5,

      "type" : "<IDEOGRAPHIC>",

      "position" : 4

    },

    {

      "token" : "在",

      "start_offset" : 5,

      "end_offset" : 6,

      "type" : "<IDEOGRAPHIC>",

      "position" : 5

    },

    {

      "token" : "理",

      "start_offset" : 6,

      "end_offset" : 7,

      "type" : "<IDEOGRAPHIC>",

      "position" : 6

    }

  ]

}
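For input made up only of Chinese characters, as in the sample, this per-character behaviour amounts to the following sketch (`per_character` is a hypothetical helper, and mixed Chinese and Latin input would need more care):

```python
def per_character(text):
    """Emit one token per character, mirroring the response shape above."""
    return [
        {"token": ch, "start_offset": i, "end_offset": i + 1, "position": i}
        for i, ch in enumerate(text)
    ]

chinese_tokens = per_character("他说的确实在理")
print([t["token"] for t in chinese_tokens])
```

The output contains no multi-character words, which is why a dedicated Chinese analyzer is normally used when word-level matching is needed.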
