lucene_02_IKAnalyzer

Preface

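Lucene already ships with many analyzers, such as StandardAnalyzer and CJKAnalyzer, but none of them segments Chinese into real words: StandardAnalyzer breaks Chinese text into single characters, and CJKAnalyzer breaks it into overlapping two-character pairs.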
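To make the difference concrete, here is a minimal plain-Java sketch of the two splitting strategies. It is a simplified illustration only, not Lucene's tokenizer code; the class name `NaiveCjkSplit` and both method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: imitates per-character and bigram splitting of Chinese text.
class NaiveCjkSplit {

    // One token per character; punctuation and whitespace are skipped.
    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                out.add(String.valueOf(c));
            }
        }
        return out;
    }

    // Overlapping two-character tokens; a pair is skipped if either side is punctuation.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            char a = text.charAt(i);
            char b = text.charAt(i + 1);
            if (Character.isLetterOrDigit(a) && Character.isLetterOrDigit(b)) {
                out.add("" + a + b);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(unigrams("高富帅")); // prints [高, 富, 帅]
        System.out.println(bigrams("高富帅"));  // prints [高富, 富帅]
    }
}
```

Neither output contains the word "高富帅" itself, which is the gap a dictionary-based analyzer is meant to close.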

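After all, these analyzers were not written with Chinese in mind. This post introduces a Chinese analyzer: IKAnalyzer. Its segmentation is still not perfect out of the box.

For example, it splits "高富帅，是2012年之后才有的词汇" ("'高富帅' is a term that only appeared after 2012")

as shown in the figure below:

However, it can be configured through files to add new words and to filter out unwanted ones, such as "的", "啊", "呀" and other particles and interjections that carry no concrete meaning.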

Configuring the IK analyzer

Step 1: Add IK to pom.xml. Note: this analyzer has not been updated since 2012, so it only works with older Lucene versions. This example uses Lucene 4.10.3.

    <!-- IK Chinese analyzer -->
    <!-- https://mvnrepository.com/artifact/com.janeluo/ikanalyzer -->
    <dependency>
      <groupId>com.janeluo</groupId>
      <artifactId>ikanalyzer</artifactId>
      <version>2012_u6</version>
    </dependency>

Complete pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.chen</groupId>
  <artifactId>lucene</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>lucene</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <dependencies>
    <!-- JUnit 4 provides the @Test annotation used by the test code; declared once only -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-core -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>4.10.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queryparser -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-queryparser</artifactId>
      <version>4.10.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-common -->
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-common</artifactId>
      <version>4.10.3</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
    <dependency>
      <groupId>commons-io</groupId>
      <artifactId>commons-io</artifactId>
      <version>2.6</version>
    </dependency>
    <!-- IK Chinese analyzer -->
    <!-- https://mvnrepository.com/artifact/com.janeluo/ikanalyzer -->
    <dependency>
      <groupId>com.janeluo</groupId>
      <artifactId>ikanalyzer</artifactId>
      <version>2012_u6</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.6.0</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

Step 2: Add the configuration file, the extension dictionary file, and the stop-word file to the resources directory

IKAnalyzer.cfg.xml is the analyzer's core configuration file. It registers ext.dic (the extension dictionary) and stopword.dic (the stop-word dictionary).

Its content is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionaries here -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- Configure your own extension stop-word dictionaries here -->
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>
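Because the file declares the standard Java properties DTD, it can be read with `java.util.Properties.loadFromXML`. The sketch below shows one way to inspect which dictionary files a configuration like the one above names, splitting each entry on `;`. It is an illustration under assumptions: the class `IkConfigPeek`, its `SAMPLE` constant, and the `dictFiles` method are made up for this example, and IK's own loader may differ in its details.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: lists the dictionary files named by a properties-XML config.
class IkConfigPeek {

    // Inline copy of a config with the same two keys as IKAnalyzer.cfg.xml above.
    static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
          + "<properties>\n"
          + "    <entry key=\"ext_dict\">ext.dic;</entry>\n"
          + "    <entry key=\"ext_stopwords\">stopword.dic;</entry>\n"
          + "</properties>\n";

    // Returns the non-empty, trimmed, ';'-separated values stored under the given key.
    static List<String> dictFiles(String xml, String key) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> files = new ArrayList<>();
        String value = props.getProperty(key);
        if (value == null) {
            return files;
        }
        for (String part : value.split(";")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                files.add(name);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        System.out.println(dictFiles(SAMPLE, "ext_dict"));
        System.out.println(dictFiles(SAMPLE, "ext_stopwords"));
    }
}
```

Splitting on `;` means several dictionary files can be listed in a single entry, one after another.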

Example ext.dic content (one word per line):

高富帅
白富美
java工程师

Example stopword.dic content (one word per line):

我
是
用
的
你
它
他
她
a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with
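To show why these two files change the output, here is a toy segmenter in plain Java. It uses forward maximum matching, a deliberately simple stand-in technique: at each position it takes the longest dictionary word, falls back to a single character, and then drops stop words. This is not IK's actual algorithm, which is more elaborate; the class `ToyDictSegmenter` and its `segment` method are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy segmenter: longest dictionary match first, then stop-word filtering.
class ToyDictSegmenter {

    static List<String> segment(String text, Set<String> dict, Set<String> stopWords) {
        // The longest dictionary entry bounds how far ahead we need to look.
        int maxLen = 1;
        for (String word : dict) {
            maxLen = Math.max(maxLen, word.length());
        }
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip punctuation and whitespace.
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++;
                continue;
            }
            // Default to a single character, then look for a longer dictionary word.
            int end = i + 1;
            for (int len = Math.min(maxLen, text.length() - i); len > 1; len--) {
                if (dict.contains(text.substring(i, i + len))) {
                    end = i + len;
                    break;
                }
            }
            String token = text.substring(i, end);
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("高富帅", "白富美", "java工程师"));
        Set<String> stop = new HashSet<>(Arrays.asList("她", "是", "他", "的"));
        String text = "她是白富美，他是高富帅";
        System.out.println(segment(text, new HashSet<>(), stop)); // prints [白, 富, 美, 高, 富, 帅]
        System.out.println(segment(text, dict, stop));            // prints [白富美, 高富帅]
    }
}
```

With an empty dictionary the new words fall apart into single characters; once the ext.dic entries are present they survive as whole tokens, and the stop words disappear in both runs.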

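Test code

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IKAnalyzerTest {

    // Inspect an analyzer's tokenization output
    @Test
    public void testTokenStream() throws Exception {
        // Create the analyzer. Uncomment one of the alternatives to compare results
        // (each needs its own import; SmartChineseAnalyzer also needs the lucene-analyzers-smartcn artifact).
//        Analyzer analyzer = new StandardAnalyzer();
//        Analyzer analyzer = new CJKAnalyzer();
//        Analyzer analyzer = new SmartChineseAnalyzer();
        Analyzer analyzer = new IKAnalyzer();
        // Obtain the TokenStream
        // First argument: field name, any value will do here
        // Second argument: the text to analyze
//        TokenStream tokenStream = analyzer.tokenStream("test",
//                "The Spring Framework provides a comprehensive programming and configuration model.");
        TokenStream tokenStream = analyzer.tokenStream("test",
                "高富帅，是2012年之后才有的词汇");
        // Attribute that exposes the text of each token
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // Attribute that records each token's start and end offsets
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        // Reset the stream before consuming it
        tokenStream.reset();
        // Iterate over the tokens; incrementToken() returns false at the end of the stream
        while (tokenStream.incrementToken()) {
            // Start offset of the token
            System.out.println("start->" + offsetAttribute.startOffset());
            // The token text
            System.out.println(charTermAttribute);
            // End offset of the token
            System.out.println("end->" + offsetAttribute.endOffset());
        }
        // Signal end of stream, then release resources
        tokenStream.end();
        tokenStream.close();
    }
}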

The result is shown in the figure below:
