A custom word segmentation class for Stanford CoreNLP

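Stanford CoreNLP's Chinese word segmentation does not always give the results we want. When that happens, we can implement a custom segmenter class that we fully control (for example, by steering it with various dictionaries). The previous article, "IKAnalyzer", described how flexible IKAnalyzer is; this article shows how to use IKAnalyzer as the segmenter for CoreNLP.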

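The article "Stanford CoreNLP's TokensRegex" mentioned the CoreNLP configuration file CoreNLP-chinese.properties. Its customAnnotatorClass.segment entry specifies the segmenter class. All we need to do is write our own Annotator modeled on ChineseSegmenterAnnotator and name it in the configuration file. The default setting is: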

customAnnotatorClass.segment = edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator
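To switch segmenters, the same key points at the custom class instead. The following is a minimal sketch, assuming the annotator lives in a hypothetical package `com.example` and that the annotator list refers to it as `segment`. It uses only `java.util.Properties`, so it runs without CoreNLP on the classpath; `SegmenterConfigSketch` and `buildProps` are illustrative names, not part of CoreNLP.

```java
import java.util.Properties;

final class SegmenterConfigSketch {

    // Builds pipeline properties that register a custom segmenter class.
    static Properties buildProps(String segmenterClass) {
        Properties props = new Properties();
        // Key from CoreNLP-chinese.properties: maps the annotator name "segment" to a class.
        props.setProperty("customAnnotatorClass.segment", segmenterClass);
        // Assumed annotator list; keep whatever list the existing properties file defines.
        props.setProperty("annotators", "segment, ssplit");
        return props;
    }

    public static void main(String[] args) {
        // "com.example.IKSegmenterAnnotator" is a hypothetical fully qualified name.
        Properties props = buildProps("com.example.IKSegmenterAnnotator");
        props.forEach((k, v) -> System.out.println(k + " = " + v));
        // With CoreNLP on the classpath, these properties would be passed to
        // new StanfordCoreNLP(props).
    }
}
```

CoreNLP creates custom annotators through a `(String name, Properties props)` constructor, so the custom class needs to provide one.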

Here is my implementation:

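import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import edu.stanford.nlp.ling.ChineseCoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator;
import edu.stanford.nlp.util.CoreMap;

public class IKSegmenterAnnotator extends ChineseSegmenterAnnotator {

    // The constructors only delegate to ChineseSegmenterAnnotator.
    public IKSegmenterAnnotator() {
        super();
    }

    public IKSegmenterAnnotator(boolean verbose) {
        super(verbose);
    }

    public IKSegmenterAnnotator(String segLoc, boolean verbose) {
        super(segLoc, verbose);
    }

    public IKSegmenterAnnotator(String segLoc, boolean verbose, String serDictionary, String sighanCorporaDict) {
        super(segLoc, verbose, serDictionary, sighanCorporaDict);
    }

    public IKSegmenterAnnotator(String name, Properties props) {
        super(name, props);
    }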

    // Segments str with IKAnalyzer; on an I/O error the whole string is returned as one word.
    private List<String> splitWords(String str) {
        try {
            List<String> words = new ArrayList<String>();
            // true selects IK's smart segmentation mode
            IKSegmenter ik = new IKSegmenter(new StringReader(str), true);
            Lexeme lex = null;
            while ((lex = ik.next()) != null) {
                words.add(lex.getLexemeText());
            }
            return words;
        } catch (IOException e) {
            //LOGGER.error(e.getMessage(), e);
            System.out.println(e);
            List<String> words = new ArrayList<String>();
            words.add(str);
            return words;
        }
    }

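    @Override
    public void runSegmentation(CoreMap annotation) {
        // Example: words      A  BC  D  E
        //          seg flags  1  10  1  1   (1 marks the first character of a word)
        //          char index 0  12  3  4
        String text = annotation.get(CoreAnnotations.TextAnnotation.class);
        // One CoreLabel per character, prepared by the parent class
        List<CoreLabel> sentChars = annotation.get(ChineseCoreAnnotations.CharactersAnnotation.class);
        List<CoreLabel> tokens = new ArrayList<CoreLabel>();
        annotation.set(CoreAnnotations.TokensAnnotation.class, tokens);

        // Replaces the parent's call: segmenter.segmentString(text)
        List<String> words = splitWords(text);
        // Debug output: the input text and its segmentation
        System.err.println(text);
        System.err.println("--->");
        System.err.println(words);

        // pos indexes into sentChars. This assumes the words, taken in order, cover every
        // character in sentChars; if the segmenter drops characters, the offsets will drift.
        int pos = 0;
        for (String w : words) {
            // Skip empty words before touching sentChars, so a trailing empty word
            // cannot index past the end of the list.
            if (w.length() == 0) {
                continue;
            }
            // First character of the word: mark it as a segment start
            CoreLabel fl = sentChars.get(pos);
            fl.set(CoreAnnotations.ChineseSegAnnotation.class, "1");
            CoreLabel token = new CoreLabel();
            token.setWord(w);
            token.set(CoreAnnotations.CharacterOffsetBeginAnnotation.class, fl.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class));
            pos += w.length();
            // Last character of the word supplies the end offset
            fl = sentChars.get(pos - 1);
            token.set(CoreAnnotations.CharacterOffsetEndAnnotation.class, fl.get(CoreAnnotations.CharacterOffsetEndAnnotation.class));
            tokens.add(token);
        }
    }
}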
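The offset bookkeeping in runSegmentation relies on the words, concatenated in order, covering every character it walks over. If the segmenter leaves characters out (IK may skip punctuation, for instance), `pos` drifts and later tokens get the wrong offsets. One possible way to make the alignment tolerant is to search for each word starting from the end of the previous match. The sketch below uses only the JDK and works on a plain string rather than on CoreNLP's `CoreLabel` list; `OffsetAlignSketch`, `Span` and `align` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OffsetAlignSketch {

    // A token with begin (inclusive) and end (exclusive) character offsets.
    static final class Span {
        final String word;
        final int begin;
        final int end;

        Span(String word, int begin, int end) {
            this.word = word;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return word + "[" + begin + "," + end + ")";
        }
    }

    // Locates each word in the text, scanning forward from the end of the previous match.
    // Words that are empty or cannot be found are skipped instead of shifting later offsets.
    static List<Span> align(String text, List<String> words) {
        List<Span> spans = new ArrayList<>();
        int pos = 0;
        for (String w : words) {
            if (w.isEmpty()) {
                continue;
            }
            int begin = text.indexOf(w, pos);
            if (begin < 0) {
                continue; // e.g. the segmenter changed the surface form of the word
            }
            pos = begin + w.length();
            spans.add(new Span(w, begin, pos));
        }
        return spans;
    }

    public static void main(String[] args) {
        // The comma is absent from the word list, as if the segmenter had dropped it.
        List<Span> spans = align("ab,cde", Arrays.asList("ab", "cde"));
        System.out.println(spans);
    }
}
```

An alternative worth considering is to read the offsets from IK directly: `Lexeme` exposes `getBeginPosition()` and `getEndPosition()`, which would avoid the search altogether.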

Outside the annotator, initialize IKAnalyzer's dictionary, specifying the extension dictionary and the removal dictionary:

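        // Initialize IK's dictionary, then remove interfering words
        Dictionary.initial(DefaultConfig.getInstance());
        // READ_IK_DEL_DIC names the system property holding the path of the removal dictionary
        String delDic = System.getProperty(READ_IK_DEL_DIC);
        if (delDic != null) {
            List<String> delWords = new ArrayList<>();
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader reader = new BufferedReader(new FileReader(delDic))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    delWords.add(line);
                }
            }
            Dictionary.getSingleton().disableWords(delWords);
        }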
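The removal dictionary is read as one word per line. A small loader, sketched here with JDK classes only, can trim each line and skip blanks so that no empty string reaches `disableWords`. `WordListSketch`, `parseWords` and `loadWords` are hypothetical names, and UTF-8 is an assumed file encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class WordListSketch {

    // Reads one word per line, trimming whitespace and dropping empty lines.
    static List<String> parseWords(BufferedReader reader) throws IOException {
        List<String> words = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Convenience wrapper for a dictionary file; UTF-8 is an assumption.
    static List<String> loadWords(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return parseWords(reader);
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a dictionary file.
        List<String> words = parseWords(new BufferedReader(new StringReader("foo\n\n  bar  \n")));
        System.out.println(words);
    }
}
```

The returned list could then be handed to `Dictionary.getSingleton().disableWords(...)` in place of the hand-written loop.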
