按照单词切分,不做处理

GET _analyze
{
"analyzer": "standard",
"text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
} {
"tokens" : [
{
"token" : "2",
"start_offset" : 0,分词
"end_offset" : 1,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "brawn",
"start_offset" : 16,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "foxes",
"start_offset" : 22,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "dogs",
"start_offset" : 43,
"end_offset" : 47,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "in",
"start_offset" : 48,
"end_offset" : 50,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "the",
"start_offset" : 51,
"end_offset" : 54,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "summer",
"start_offset" : 55,
"end_offset" : 61,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "evening",
"start_offset" : 62,
"end_offset" : 69,
"type" : "<ALPHANUM>",
"position" : 12
}
]
}

  按照非字母的字符切分

GET _analyze
{
"analyzer": "simple",
"text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
} {
"tokens" : [
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 1
},
{
"token" : "brawn",
"start_offset" : 16,
"end_offset" : 21,
"type" : "word",
"position" : 2
},
{
"token" : "foxes",
"start_offset" : 22,
"end_offset" : 27,
"type" : "word",
"position" : 3
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 4
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "word",
"position" : 5
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 6
},
{
"token" : "dogs",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 7
},
{
"token" : "in",
"start_offset" : 48,
"end_offset" : 50,
"type" : "word",
"position" : 8
},
{
"token" : "the",
"start_offset" : 51,
"end_offset" : 54,
"type" : "word",
"position" : 9
},
{
"token" : "summer",
"start_offset" : 55,
"end_offset" : 61,
"type" : "word",
"position" : 10
},
{
"token" : "evening",
"start_offset" : 62,
"end_offset" : 69,
"type" : "word",
"position" : 11
}
]
}

  按照空格切分不做任何处理

GET _analyze
{
"analyzer": "whitespace",
"text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
} {
"tokens" : [
{
"token" : "2",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "Quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 2
},
{
"token" : "brawn-foxes",
"start_offset" : 16,
"end_offset" : 27,
"type" : "word",
"position" : 3
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 4
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "word",
"position" : 5
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 6
},
{
"token" : "dogs",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 7
},
{
"token" : "in",
"start_offset" : 48,
"end_offset" : 50,
"type" : "word",
"position" : 8
},
{
"token" : "the",
"start_offset" : 51,
"end_offset" : 54,
"type" : "word",
"position" : 9
},
{
"token" : "summer",
"start_offset" : 55,
"end_offset" : 61,
"type" : "word",
"position" : 10
},
{
"token" : "evening.",
"start_offset" : 62,
"end_offset" : 70,
"type" : "word",
"position" : 11
}
]
}

  按词切分去掉修饰词

GET _analyze
{
"analyzer": "stop",
"text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
} {
"tokens" : [
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 1
},
{
"token" : "brawn",
"start_offset" : 16,
"end_offset" : 21,
"type" : "word",
"position" : 2
},
{
"token" : "foxes",
"start_offset" : 22,
"end_offset" : 27,
"type" : "word",
"position" : 3
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 4
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "word",
"position" : 5
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 6
},
{
"token" : "dogs",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 7
},
{
"token" : "summer",
"start_offset" : 55,
"end_offset" : 61,
"type" : "word",
"position" : 10
},
{
"token" : "evening",
"start_offset" : 62,
"end_offset" : 69,
"type" : "word",
"position" : 11
}
]
}

  不进行切分直接输出

GET _analyze
{
"analyzer": "keyword",
"text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
} {
"tokens" : [
{
"token" : "2 running Quick brawn-foxes leap over lazy dogs in the summer evening.",
"start_offset" : 0,
"end_offset" : 70,
"type" : "word",
"position" : 0
}
]
}

  通过正则表达式方式进行切割,默认非字符的方式切割

GET _analyze
{
"analyzer": "pattern",
"text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
} {
"tokens" : [
{
"token" : "2",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 2
},
{
"token" : "brawn",
"start_offset" : 16,
"end_offset" : 21,
"type" : "word",
"position" : 3
},
{
"token" : "foxes",
"start_offset" : 22,
"end_offset" : 27,
"type" : "word",
"position" : 4
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 5
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "word",
"position" : 6
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 7
},
{
"token" : "dogs",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 8
},
{
"token" : "in",
"start_offset" : 48,
"end_offset" : 50,
"type" : "word",
"position" : 9
},
{
"token" : "the",
"start_offset" : 51,
"end_offset" : 54,
"type" : "word",
"position" : 10
},
{
"token" : "summer",
"start_offset" : 55,
"end_offset" : 61,
"type" : "word",
"position" : 11
},
{
"token" : "evening",
"start_offset" : 62,
"end_offset" : 69,
"type" : "word",
"position" : 12
}
]
}

  英语分词器

GET _analyze
{
"analyzer": "english",
"text": "2 running Quick brawn-foxes leap over lazy dogs in the summer evening."
} {
"tokens" : [
{
"token" : "2",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<NUM>",
"position" : 0
},
{
"token" : "run",
"start_offset" : 2,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "brawn",
"start_offset" : 16,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "fox",
"start_offset" : 22,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "lazi",
"start_offset" : 38,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "dog",
"start_offset" : 43,
"end_offset" : 47,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "summer",
"start_offset" : 55,
"end_offset" : 61,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "even",
"start_offset" : 62,
"end_offset" : 69,
"type" : "<ALPHANUM>",
"position" : 12
}
]
}

  中文分词器,一个字符一个字符切分

POST _analyze
{
"analyzer": "standard",
"text": "他说的确实在理"
} {
"tokens" : [
{
"token" : "他",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "说",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "的",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "确",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "实",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
},
{
"token" : "在",
"start_offset" : 5,
"end_offset" : 6,
"type" : "<IDEOGRAPHIC>",
"position" : 5
},
{
"token" : "理",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 6
}
]
}

  

es 分词器介绍的更多相关文章

  1. Es学习第五课, 分词器介绍和中文分词器配置

    上课我们介绍了倒排索引,在里面提到了分词的概念,分词器就是用来分词的. 分词器是ES中专门处理分词的组件,英文为Analyzer,定义为:从一串文本中切分出一个一个的词条,并对每个词条进行标准化.它由 ...

  2. es学习(三):分词器介绍以及中文分词器ik的安装与使用

    什么是分词 把文本转换为一个个的单词,分词称之为analysis.es默认只对英文语句做分词,中文不支持,每个中文字都会被拆分为独立的个体. 示例 POST http://192.168.247.8: ...

  3. es分词器

    1.默认的分词器 standard standard tokenizer:以单词边界进行切分standard token filter:什么都不做lowercase token filter:将所有字 ...

  4. Lucene的分词_中文分词器介绍

    Paoding:庖丁解牛分词器.已经没有更新了. MMSeg:搜狗的词库. MMSeg分词器的一些截图: 步骤: 1.导入包 2.创建的时候使用MMSegAnalyzer分词器

  5. HanLP-分类模块的分词器介绍

    最近发现一个很勤快的大神在分享他的一些实操经验,看了一些他自己关于hanlp方面的文章,写的挺好的!转载过来分享给大家!以下为分享原文(无意义的内容已经做了删除) 如下图所示,HanLP的分类模块中单 ...

  6. Elasticsearch:ICU分词器介绍

    ICU Analysis插件是一组将Lucene ICU模块集成到Elasticsearch中的库. 本质上,ICU的目的是增加对Unicode和全球化的支持,以提供对亚洲语言更好的文本分割分析. 从 ...

  7. lucene-一篇分词器介绍很好理解的文章

    本文来自这里在前面的概念介绍中我们已经知道了分析器的作用,就是把句子按照语义切分成一个个词语.英文切分已经有了很成熟的分析器: StandardAnalyzer,很多情况下StandardAnalyz ...

  8. Elasticsearch(ES)分词器的那些事儿

    1. 概述 分词器是Elasticsearch中很重要的一个组件,用来将一段文本分析成一个一个的词,Elasticsearch再根据这些词去做倒排索引. 今天我们就来聊聊分词器的相关知识. 2. 内置 ...

  9. Elasticsearch系列---倒排索引原理与分词器

    概要 本篇主要讲解倒排索引的基本原理以及ES常用的几种分词器介绍. 倒排索引的建立过程 倒排索引是搜索引擎中常见的索引方法,用来存储在全文搜索下某个单词在一个文档中存储位置的映射.通过倒排索引,我们输 ...

随机推荐

  1. 命令行(二):Anaconda3

    1,进入base虚拟环境 $:activate 2,创建虚拟环境(自动下载Python3最新版本) $:conda create -n <virtual_name> python= 3,切 ...

  2. SpringBoot整合Mybatis案例

    SpringBoot整合Mybatis案例 2019/7/15以实习生身份入职公司前端做Angular ,但是感觉前途迷茫,于是乎学习一下Java的框架——SpringBooot. 参照大神博客:ht ...

  3. java基础(八)之函数的复写/重写(override)

    复写的意思就是子类对父类的修改. 复写的条件: 1.在具有父子类关系的两个类当中:2.父类和子类各有一个函数,这两个函数的定义保持一致(返回值类型.函数名.参数列表) 还是老样子,3个文件来说明. P ...

  4. Unity Coroutine详解(二)

    •        介绍•        Part 1. 同步等待•        Part 2. 异步协程•        Part 3. 同步协程•        Part 4. 并行协程 1.介绍 ...

  5. opencv:图像轮廓发现

    #include <opencv2/opencv.hpp> #include <iostream> using namespace cv; using namespace st ...

  6. 7-8 Left-pad

    思路 注意读入和输出格式 如果用fgets读入的话会带上回车,输出的时候一定不要输出了双回车 并且此时的length也会比原始长度多了一,要注意长度比较,这里容易出错 代码 #include < ...

  7. 概率dp sgu495

    题意: 有n个奖品,m个人排队来选礼物,对于每个人,他打开的盒子,可能有礼物,也有可能已经被之前的人取走了,然后把盒子放回原处.为最后m个人取走礼物的期望. 思路1: 排队取,第1个人取到1个,dp[ ...

  8. java后台接受不到vue传的参数

    @RequestMapping(value = "/delBelowImg") @Transactional public R delBelowFile(@RequestParam ...

  9. 关于MultiAutoCompleteTextView的用法:多文本匹配

  10. C语言运算符详解

    运算符是一种告诉编译器执行特定的数学或逻辑操作的符号.C 语言内置了丰富的运算符,并提供了以下类型的运算符: 算术运算符 关系运算符 逻辑运算符 位运算符 赋值运算符 杂项运算符 本章将逐一介绍算术运 ...