Word segmentation with IKAnalyzer: ranking the segmented words by information entropy

A project required a word-segmentation interface, so I am recording the approach here.

1. Required dependency:

         <!-- https://mvnrepository.com/artifact/com.janeluo/ikanalyzer -->
         <dependency>
             <groupId>com.janeluo</groupId>
             <artifactId>ikanalyzer</artifactId>
             <version>2012_u6</version>
         </dependency>


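2. The complete code follows. It uses IKSegmenter and Lexeme from org.wltea.analyzer.core and Lists from Guava (com.google.common.collect). The JSON library is not named in the original; the code needs a JSONArray that implements java.util.List, which fastjson's JSONArray does (an assumption). DelHtmlTagUtil is a helper that is not shown here.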
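For orientation, the score the code is meant to compute for each distinct word can be written as the standard Shannon entropy of its left and right neighbours. This is a sketch of the intended definition, not a formula taken from the original post. The symbols are introduced here for illustration: c(w) is the number of occurrences of w, c(l, w) the number of times l appears immediately to the left of w, and c(w, r) the number of times r appears immediately to its right. The logarithm is natural because the code calls Math.log.

```latex
\begin{aligned}
H_L(w) &= -\sum_{l} \frac{c(l, w)}{c(w)} \ln \frac{c(l, w)}{c(w)}, \\
H_R(w) &= -\sum_{r} \frac{c(w, r)}{c(w)} \ln \frac{c(w, r)}{c(w)}, \\
\mathrm{score}(w) &= H_L(w) + H_R(w).
\end{aligned}
```

A word that appears next to many different neighbours gets a high score; a word that always has the same neighbours scores 0.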

     /**
      * Removes HTML tags from the content, segments it, and returns the
      * distinct words sorted by entropy in descending order.
      */
     public JSONArray entropy(String content, Integer quantity) throws Exception {
         List<String> words = extract(DelHtmlTagUtil.delHTMLTag(content), quantity);
         return calculateWordEntropy(words);
     }

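     /**
      * Segments the given text and collects the extracted words into a list.
      *
      * @param content  the text to segment
      * @param quantity keep only words longer than this many characters; defaults to 1 when null
      * @return the extracted words, in order of appearance
      */
     private List<String> extract(String content, Integer quantity) throws IOException {
         // Apply the documented default instead of failing on a null quantity
         int minLength = (quantity == null) ? 1 : quantity;
         List<String> list = Lists.newArrayList();
         StringReader reader = new StringReader(content);
         // The second argument (true) enables IK's smart segmentation mode
         IKSegmenter ik = new IKSegmenter(reader, true);
         Lexeme lex = null;
         while ((lex = ik.next()) != null) {
             // lex.getLexemeTypeString() would return the lexeme type if needed
             String word = lex.getLexemeText();
             if (word.length() > minLength) { // keep only words longer than the threshold
                 list.add(word);
             }
         }
         return list;
     }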

     private JSONArray calculateWordEntropy(List<String> words) throws Exception{

         int length = words.size();

         ArrayList<String[]> wordList = new ArrayList<String[]>();

         // Store each word together with its left and right neighbours as a triple.
         // The string "null" marks a missing neighbour at either end of the text.
         // Checking both ends independently also handles a list with a single word,
         // where the original i == 0 branch read past the end of the list.
         for (int i = 0; i < length; i++) {
             String[] wordSeg = new String[3];
             wordSeg[0] = (i == 0) ? "null" : words.get(i - 1);
             wordSeg[1] = words.get(i);
             wordSeg[2] = (i == length - 1) ? "null" : words.get(i + 1);
             wordList.add(wordSeg);
         }

         // Remove duplicate words, keeping first-occurrence order
         List<String> lists = Lists.newArrayList();
         for (int l = 0; l < length; l++) {
             lists.add(words.get(l));
         }
         List<String> tempList = Lists.newArrayList();
         for (String str : lists) {
             if (!(tempList.contains(str))) {
                 tempList.add(str);
             }
         }
         String[] wordClean = new String[tempList.size()];
         for (int m = 0; m < tempList.size(); m++) {
             wordClean[m] = tempList.get(m);
         }

         // Count the frequency of each distinct word
         // (note: this array is not used in the entropy calculation below)
         int[] frequent = new int[wordClean.length];
         for (int j = 0; j < wordClean.length; j++) {
             int count = 0;
             for (int k = 0; k < words.size(); k++) {
                 if (wordClean[j].equals(words.get(k))) {
                     count++;
                 }
             }
             frequent[j] = count;
         }

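         // For each distinct word, gather the triples whose middle word equals it,
         // then compute that word's entropy
         double[] allEntropy = new double[wordClean.length];
         for (int n = 0; n < wordClean.length; n++) {
             ArrayList<String[]> wordSegList = new ArrayList<String[]>();
             // Start at 0 so that count equals the number of occurrences and the
             // neighbour probabilities below sum to 1 (the original started at 1)
             int count = 0;
             for (int p = 0; p < wordList.size(); p++) {
                 String[] wordSegStr = wordList.get(p);
                 if (wordSegStr[1].equals(wordClean[n])) {
                     count++;
                     wordSegList.add(wordSegStr);
                 }
             }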

             String[] leftword = new String[wordSegList.size()];
             String[] rightword = new String[wordSegList.size()];
             // Left entropy: collect the left neighbour of every occurrence
             for (int i = 0; i < wordSegList.size(); i++) {
                 String[] left = wordSegList.get(i);
                 leftword[i] = left[0];
             }
             // Remove duplicate left neighbours
             List<String> listsLeft = new ArrayList<String>();
             for (int l = 0; l < leftword.length; l++) {
                 listsLeft.add(leftword[l]);
             }
             List<String> tempListLeft = new ArrayList<String>();
             for (String str : listsLeft) {
                 if (!(tempListLeft.contains(str))) {
                     tempListLeft.add(str);
                 }
             }
             String[] leftWordClean = new String[tempListLeft.size()];
             for (int m = 0; m < tempListLeft.size(); m++) {
                 leftWordClean[m] = tempListLeft.get(m);
             }
             // Count the frequency of each distinct left neighbour
             int[] leftFrequent = new int[leftWordClean.length];
             for (int j = 0; j < leftWordClean.length; j++) {
                 int leftcount = 0;
                 for (int k = 0; k < leftword.length; k++) {
                     if (leftWordClean[j].equals(leftword[k])) {
                         leftcount++;
                     }
                 }
                 leftFrequent[j] = leftcount;
             }
             // Compute the left entropy; the cast to double must happen before the
             // division, otherwise integer division would truncate the ratio to 0
             double leftEntropy = 0;
             for (int i = 0; i < leftFrequent.length; i++) {
                 double a = (double) leftFrequent[i] / count;
                 double b = Math.log(a);
                 leftEntropy += -a * b;
             }

             // Right entropy: collect the right neighbour of every occurrence
             for (int i = 0; i < wordSegList.size(); i++) {
                 String[] right = wordSegList.get(i);
                 rightword[i] = right[2];
             }
             // Remove duplicate right neighbours
             List<String> listsRight = new ArrayList<String>();
             for (int l = 0; l < rightword.length; l++) {
                 listsRight.add(rightword[l]);
             }
             List<String> tempListRight = new ArrayList<String>();
             for (String str : listsRight) {
                 if (!(tempListRight.contains(str))) {
                     tempListRight.add(str);
                 }
             }
             String[] rightWordClean = new String[tempListRight.size()];
             for (int m = 0; m < tempListRight.size(); m++) {
                 rightWordClean[m] = tempListRight.get(m);
             }
             // Count the frequency of each distinct right neighbour
             int[] rightFrequent = new int[rightWordClean.length];
             for (int j = 0; j < rightWordClean.length; j++) {
                 int rightcount = 0;
                 for (int k = 0; k < rightword.length; k++) {
                     if (rightWordClean[j].equals(rightword[k])) {
                         rightcount++;
                     }
                 }
                 rightFrequent[j] = rightcount;
             }
             // Compute the right entropy (cast to double before dividing, as above)
             double rightEntropy = 0.0;
             for (int i = 0; i < rightFrequent.length; i++) {
                 double a = (double) rightFrequent[i] / count;
                 double b = Math.log(a);
                 rightEntropy += -a * b;
             }

             // The word's total entropy is the sum of its left and right entropy
             double wordEntropy = leftEntropy + rightEntropy;
             allEntropy[n] = wordEntropy;

         }

         // Package each distinct word with its entropy as a JSON object
         JSONArray list = new JSONArray();
         for (int i = 0; i < allEntropy.length; i++) {
             JSONObject obj = new JSONObject();
             obj.put("name", wordClean[i]);
             obj.put("entropy", allEntropy[i]);
             list.add(obj);
         }
         // Sort by entropy in descending order
         Collections.sort(list, (o1, o2) -> {
             Double d1 = ((JSONObject) o1).getDouble("entropy");
             Double d2 = ((JSONObject) o2).getDouble("entropy");
             return d2.compareTo(d1);
         });
         return list;

     }
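The listing above finds duplicates and counts frequencies with nested loops over arrays, which is quadratic in the number of words. As a rough alternative, the same left-plus-right neighbour entropy can be sketched with hash maps in a single pass. This is only an illustration under stated assumptions: the class NeighbourEntropy and its methods entropy and score are hypothetical names that do not appear in the original code, and the sketch returns a plain Map rather than a JSONArray so that it depends only on the JDK. It keeps the original's convention of using the string "null" for a missing neighbour at either end.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the original post. */
final class NeighbourEntropy {

    /** Shannon entropy (natural logarithm) of a table of counts. */
    static double entropy(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / total; // cast before dividing
            h -= p * Math.log(p);
        }
        return h;
    }

    /** Maps each distinct word to left entropy + right entropy, highest first. */
    static Map<String, Double> score(List<String> words) {
        // word -> (neighbour -> count); LinkedHashMap keeps first-occurrence order
        Map<String, Map<String, Integer>> left = new LinkedHashMap<>();
        Map<String, Map<String, Integer>> right = new HashMap<>();
        for (int i = 0; i < words.size(); i++) {
            String w = words.get(i);
            // "null" marks a missing neighbour, as in the original listing
            String l = (i == 0) ? "null" : words.get(i - 1);
            String r = (i == words.size() - 1) ? "null" : words.get(i + 1);
            left.computeIfAbsent(w, k -> new HashMap<>()).merge(l, 1, Integer::sum);
            right.computeIfAbsent(w, k -> new HashMap<>()).merge(r, 1, Integer::sum);
        }
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (String w : left.keySet()) {
            double total = entropy(left.get(w)) + entropy(right.get(w));
            scored.add(new AbstractMap.SimpleEntry<String, Double>(w, total));
        }
        // Stable sort, descending by entropy; ties keep first-occurrence order
        scored.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : scored) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }
}
```

For the word list a, b, a, c, the word a has two different left neighbours and two different right neighbours, each with probability 1/2, so its score is 2 ln 2; b and c always have the same neighbours and score 0. Converting the resulting map into JSON objects with "name" and "entropy" fields would reproduce the shape of the original return value.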
