Elasticsearch: Customizing Tokenization Logic with Regular Expressions

1. Introduction to the Pattern Analyzer

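Elasticsearch has to tokenize input text both at index time and at search time. The pattern analyzer that Elasticsearch provides lets us define the delimiter with a regular expression, which is a simple way to customize how text is tokenized.

The built-in pattern analyzer is named pattern. Its default pattern is \W+, which matches runs of non-word characters, that is, anything other than letters, digits and the underscore. As the registration code below shows, it also lowercases tokens and uses an empty stop word set.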

analyzers.add(new PreBuiltAnalyzerProviderFactory("pattern", CachingStrategy.ELASTICSEARCH,
        () -> new PatternAnalyzer(Regex.compile("\\W+" /*PatternAnalyzer.NON_WORD_PATTERN*/, null), true,
                CharArraySet.EMPTY_SET)));

Because it is registered as a global pattern analyzer, we can use it directly:

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

{
  "tokens" : [
    { "token" : "the",    "start_offset" : 0,  "end_offset" : 3,  "type" : "word", "position" : 0 },
    { "token" : "2",      "start_offset" : 4,  "end_offset" : 5,  "type" : "word", "position" : 1 },
    { "token" : "quick",  "start_offset" : 6,  "end_offset" : 11, "type" : "word", "position" : 2 },
    { "token" : "brown",  "start_offset" : 12, "end_offset" : 17, "type" : "word", "position" : 3 },
    { "token" : "foxes",  "start_offset" : 18, "end_offset" : 23, "type" : "word", "position" : 4 },
    { "token" : "jumped", "start_offset" : 24, "end_offset" : 30, "type" : "word", "position" : 5 },
    { "token" : "over",   "start_offset" : 31, "end_offset" : 35, "type" : "word", "position" : 6 },
    { "token" : "the",    "start_offset" : 36, "end_offset" : 39, "type" : "word", "position" : 7 },
    { "token" : "lazy",   "start_offset" : 40, "end_offset" : 44, "type" : "word", "position" : 8 },
    { "token" : "dog",    "start_offset" : 45, "end_offset" : 48, "type" : "word", "position" : 9 },
    { "token" : "s",      "start_offset" : 49, "end_offset" : 50, "type" : "word", "position" : 10 },
    { "token" : "bone",   "start_offset" : 51, "end_offset" : 55, "type" : "word", "position" : 11 }
  ]
}
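
The token texts above can be approximated in plain Java. This is only a sketch, not Elasticsearch code: the class and method names are made up for illustration, and offsets and positions are not computed. It splits on \W+ and lowercases each piece.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Elasticsearch or Lucene.
class DefaultPatternSketch {

    // Same regular expression as the built-in "pattern" analyzer: runs of non-word characters.
    private static final Pattern NON_WORD = Pattern.compile("\\W+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : NON_WORD.split(text)) {
            // split() can yield an empty leading piece when the text starts with a delimiter.
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."));
    }
}
```

Running it prints the same twelve token texts as the _analyze response, including the lone "s" produced by splitting on the apostrophe.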

2. Defining a Custom Pattern Analyzer

We can define a custom pattern analyzer as follows, using any whitespace character as the delimiter:

PUT my_pattern_test_space_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_test_space_analyzer": {
          "type":      "pattern",
          "pattern":   "[\\p{Space}]",
          "lowercase": true
        }
      }
    }
  }
}

Let's test the custom pattern analyzer. This time the hyphen, the apostrophe and the trailing period stay inside their tokens:

POST my_pattern_test_space_analyzer/_analyze
{
  "analyzer": "my_pattern_test_space_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

{
  "tokens" : [
    { "token" : "the",         "start_offset" : 0,  "end_offset" : 3,  "type" : "word", "position" : 0 },
    { "token" : "2",           "start_offset" : 4,  "end_offset" : 5,  "type" : "word", "position" : 1 },
    { "token" : "quick",       "start_offset" : 6,  "end_offset" : 11, "type" : "word", "position" : 2 },
    { "token" : "brown-foxes", "start_offset" : 12, "end_offset" : 23, "type" : "word", "position" : 3 },
    { "token" : "jumped",      "start_offset" : 24, "end_offset" : 30, "type" : "word", "position" : 4 },
    { "token" : "over",        "start_offset" : 31, "end_offset" : 35, "type" : "word", "position" : 5 },
    { "token" : "the",         "start_offset" : 36, "end_offset" : 39, "type" : "word", "position" : 6 },
    { "token" : "lazy",        "start_offset" : 40, "end_offset" : 44, "type" : "word", "position" : 7 },
    { "token" : "dog's",       "start_offset" : 45, "end_offset" : 50, "type" : "word", "position" : 8 },
    { "token" : "bone.",       "start_offset" : 51, "end_offset" : 56, "type" : "word", "position" : 9 }
  ]
}
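
The whitespace-only split can also be tried in plain Java. Again this is a sketch with made-up names that mimics only the token texts, not the Lucene implementation. Note that [\p{Space}] matches one whitespace character at a time, so two consecutive spaces leave an empty piece between them; the sketch drops empty pieces, which corresponds to the tokenizer emitting only non-zero-length tokens.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Elasticsearch or Lucene.
class WhitespacePatternSketch {

    // Same expression as in the index settings above (written there with JSON escaping).
    private static final Pattern SPACE = Pattern.compile("[\\p{Space}]");

    static List<String> tokenize(String text) {
        return Arrays.stream(SPACE.split(text))
                .filter(piece -> !piece.isEmpty())             // skip empty pieces between adjacent delimiters
                .map(piece -> piece.toLowerCase(Locale.ROOT))  // "lowercase": true
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."));
        System.out.println(tokenize("two  spaces\tand a tab"));
    }
}
```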

3. Commonly Used Java Regular Expressions

Elasticsearch's pattern analyzer uses Java regular expressions, so knowing the commonly used Java regex constructs makes it much easier to write a custom pattern analyzer.

Characters

x	        The character x

\\	        The backslash character

\0n	        The character with octal value 0n (0 <= n <= 7)

\0nn	    The character with octal value 0nn (0 <= n <= 7)

\0mnn	    The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)

\xhh	    The character with hexadecimal value 0xhh

\uhhhh	    The character with hexadecimal value 0xhhhh

\x{h...h}	The character with hexadecimal value 0xh...h (Character.MIN_CODE_POINT  <= 0xh...h <=  Character.MAX_CODE_POINT)

\t	        The tab character ('\u0009')

\n	        The newline (line feed) character ('\u000A')

\r	        The carriage-return character ('\u000D')

\f	        The form-feed character ('\u000C')

\a	        The alert (bell) character ('\u0007')

\e	        The escape character ('\u001B')

\cx	        The control character corresponding to x

Character classes

[abc]	        a, b, or c (simple class)

[^abc]	        Any character except a, b, or c (negation)

[a-zA-Z]	    a through z or A through Z, inclusive (range)

[a-d[m-p]]	    a through d, or m through p: [a-dm-p] (union)

[a-z&&[def]]	d, e, or f (intersection)

[a-z&&[^bc]]	a through z, except for b and c: [ad-z] (subtraction)

[a-z&&[^m-p]]	a through z, and not m through p: [a-lq-z] (subtraction)
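
The union, intersection and subtraction forms are easy to check by running each class over the lowercase alphabet. The helper below is an illustrative sketch (the class and method names are assumptions) that keeps only the characters a given class matches.

```java
import java.util.regex.Pattern;

// Hypothetical helper for experimenting with character classes.
class CharClassSketch {

    // Returns the characters of input that are matched by the single-character class regex.
    static String keep(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String alphabet = "abcdefghijklmnopqrstuvwxyz";
        System.out.println(keep("[a-d[m-p]]", alphabet));    // union:        abcdmnop
        System.out.println(keep("[a-z&&[def]]", alphabet));  // intersection: def
        System.out.println(keep("[a-z&&[^bc]]", alphabet));  // subtraction:  adefghijklmnopqrstuvwxyz
        System.out.println(keep("[a-z&&[^m-p]]", alphabet)); // subtraction:  abcdefghijklqrstuvwxyz
    }
}
```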

Predefined character classes

.	Any character (may or may not match line terminators)

\d	A digit: [0-9]

\D	A non-digit: [^0-9]

\h	A horizontal whitespace character: [ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]

\H	A non-horizontal whitespace character: [^\h]

\s	A whitespace character: [ \t\n\x0B\f\r]

\S	A non-whitespace character: [^\s]

\v	A vertical whitespace character: [\n\x0B\f\r\x85\u2028\u2029]

\V	A non-vertical whitespace character: [^\v]

\w	A word character: [a-zA-Z_0-9]

\W	A non-word character: [^\w]
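
One practical consequence for a \W+ delimiter: by default \w covers only [a-zA-Z_0-9], so accented and non-Latin letters count as non-word characters and act as delimiters. The sketch below uses plain java.util.regex (no Elasticsearch involved; the class name is illustrative) to show the difference made by the standard Pattern.UNICODE_CHARACTER_CLASS flag.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class.
class WordClassSketch {

    static List<String> split(Pattern delimiter, String text) {
        return Arrays.asList(delimiter.split(text));
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 au lait"; // "café au lait", with a precomposed e-acute

        // Default: \w is ASCII-only, so the e-acute is treated as a delimiter.
        Pattern ascii = Pattern.compile("\\W+");
        // With the flag, \w follows Unicode definitions and the e-acute is a word character.
        Pattern unicode = Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);

        System.out.println(split(ascii, text));   // [caf, au, lait]
        System.out.println(split(unicode, text)); // [café, au, lait]
    }
}
```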

POSIX character classes (US-ASCII only)

\p{Lower}	A lower-case alphabetic character: [a-z]

\p{Upper}	An upper-case alphabetic character: [A-Z]

\p{ASCII}	All ASCII: [\x00-\x7F]

\p{Alpha}	An alphabetic character: [\p{Lower}\p{Upper}]

\p{Digit}	A decimal digit: [0-9]

\p{Alnum}	An alphanumeric character: [\p{Alpha}\p{Digit}]

\p{Punct}	Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

\p{Graph}	A visible character: [\p{Alnum}\p{Punct}]

\p{Print}	A printable character: [\p{Graph}\x20]

\p{Blank}	A space or a tab: [ \t]

\p{Cntrl}	A control character: [\x00-\x1F\x7F]

\p{XDigit}	A hexadecimal digit: [0-9a-fA-F]

\p{Space}	A whitespace character: [ \t\n\x0B\f\r]

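Below we use the regular expression [\p{Punct}|\p{Space}] to find every punctuation mark and whitespace character in a string. Inside a character class, | is a literal character rather than alternation; it is redundant here because \p{Punct} already includes it, so [\p{Punct}\p{Space}] matches the same characters.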

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        // Matches a single punctuation or whitespace character.
        Pattern p = Pattern.compile("[\\p{Punct}|\\p{Space}]");
        Matcher matcher = p.matcher("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.");
        // Print each matched character with its start (inclusive) and end (exclusive) offsets.
        while (matcher.find()) {
            System.out.println("find " + matcher.group()
                    + " position: " + matcher.start() + "-" + matcher.end());
        }
    }
}

find   position: 3-4

find   position: 5-6

find   position: 11-12

find - position: 17-18

find   position: 23-24

find   position: 30-31

find   position: 35-36

find   position: 39-40

find   position: 44-45

find ' position: 48-49

find   position: 50-51

find . position: 55-56
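
A detail about the expression used above: inside a character class, | is an ordinary character, not alternation. It is harmless here because \p{Punct} already contains |, so [\p{Punct}\p{Space}] selects exactly the same characters. The following sketch (class and method names are illustrative) checks both points.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class.
class PipeInClassSketch {

    // Collects the start offset of every match of regex in text.
    static List<Integer> starts(String regex, String text) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.";
        // Both expressions find the same delimiters.
        System.out.println(starts("[\\p{Punct}|\\p{Space}]", text)
                .equals(starts("[\\p{Punct}\\p{Space}]", text))); // true
        // Inside brackets the pipe matches itself.
        System.out.println("|".matches("[a|b]")); // true
    }
}
```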

4. How the Pattern Analyzer Is Implemented

Depending on its configuration, PatternAnalyzer combines PatternTokenizer, LowerCaseFilter and StopFilter to build its TokenStreamComponents.

PatternAnalyzer.java 

protected TokenStreamComponents createComponents(String s) {
    final Tokenizer tokenizer = new PatternTokenizer(pattern, -1);
    TokenStream stream = tokenizer;
    if (lowercase) {
        stream = new LowerCaseFilter(stream);
    }
    if (stopWords != null) {
        stream = new StopFilter(stream, stopWords);
    }
    return new TokenStreamComponents(tokenizer, stream);
}
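
The order of the stages matters: tokens are lowercased before the stop word check. A dependency-free sketch of the same chain is shown below. It does not use Lucene; the class name, constructor and analyze method are assumptions made for illustration, and token positions and offsets are not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical stand-in for the tokenizer -> lowercase -> stop filter chain.
class AnalyzerChainSketch {
    private final Pattern pattern;
    private final boolean lowercase;
    private final Set<String> stopWords; // null means "no stop filter"

    AnalyzerChainSketch(Pattern pattern, boolean lowercase, Set<String> stopWords) {
        this.pattern = pattern;
        this.lowercase = lowercase;
        this.stopWords = stopWords;
    }

    List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : pattern.split(text)) {          // tokenizer stage: split on the delimiter
            if (piece.isEmpty()) {
                continue;                                   // only non-empty tokens are emitted
            }
            String token = lowercase ? piece.toLowerCase(Locale.ROOT) : piece; // lowercase stage
            if (stopWords != null && stopWords.contains(token)) {
                continue;                                   // stop stage: drop configured stop words
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        AnalyzerChainSketch analyzer =
                new AnalyzerChainSketch(Pattern.compile("\\W+"), true, Set.of("the"));
        System.out.println(analyzer.analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."));
    }
}
```

Because lowercasing runs first, the stop word "the" also removes the capitalized "The" at the start of the sentence.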

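The incrementToken method of PatternTokenizer performs the actual tokenization of the input text. Because PatternAnalyzer passes -1 as the group when it constructs the PatternTokenizer, execution takes the else branch, which emits the text between successive delimiter matches as tokens.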

PatternTokenizer.java

  @Override
  public boolean incrementToken() {
    if (index >= str.length()) return false;
    clearAttributes();
    if (group >= 0) {
      // match a specific group
      while (matcher.find()) {
        index = matcher.start(group);
        final int endIndex = matcher.end(group);
        if (index == endIndex) continue;
        termAtt.setEmpty().append(str, index, endIndex);
        offsetAtt.setOffset(correctOffset(index), correctOffset(endIndex));
        return true;
      }
      index = Integer.MAX_VALUE; // mark exhausted
      return false;
    } else {
      // String.split() functionality
      while (matcher.find()) {
        if (matcher.start() - index > 0) {
          // found a non-zero-length token
          termAtt.setEmpty().append(str, index, matcher.start());
          offsetAtt.setOffset(correctOffset(index), correctOffset(matcher.start()));
          index = matcher.end();
          return true;
        }
        index = matcher.end();
      }
      if (str.length() - index == 0) {
        index = Integer.MAX_VALUE; // mark exhausted
        return false;
      }
      termAtt.setEmpty().append(str, index, str.length());
      offsetAtt.setOffset(correctOffset(index), correctOffset(str.length()));
      index = Integer.MAX_VALUE; // mark exhausted
      return true;
    }
  }
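
The else branch can be replayed outside Lucene to see where the offsets come from. The sketch below is a simplified, eager version under stated assumptions: it collects all tokens in a list instead of returning one per call, it ignores correctOffset, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical re-implementation of the "split" branch (group < 0).
class SplitTokenizerSketch {

    // Returns each token as text[start,end), where end is exclusive.
    static List<String> tokenize(Pattern pattern, String str) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(str);
        int index = 0; // start of the pending token, i.e. end of the previous delimiter
        while (matcher.find()) {
            if (matcher.start() - index > 0) {
                // Non-empty text between the previous delimiter and this one.
                tokens.add(describe(str, index, matcher.start()));
            }
            index = matcher.end();
        }
        if (str.length() - index > 0) {
            // Trailing text after the last delimiter.
            tokens.add(describe(str, index, str.length()));
        }
        return tokens;
    }

    private static String describe(String str, int start, int end) {
        return str.substring(start, end) + "[" + start + "," + end + ")";
    }

    public static void main(String[] args) {
        System.out.println(tokenize(Pattern.compile("\\W+"), "dog's bone."));
        // [dog[0,3), s[4,5), bone[6,10)]
    }
}
```

No lowercasing happens at this stage; in the analyzer that is the job of the LowerCaseFilter applied after the tokenizer.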
