[NLP] Computing Entropy and KL Distance: Recognizing Chinese Characters and English Words in Java, and Reading Variable-Length UTF-8 Characters

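Task:

1. Given a file, count the relative frequency of every character in it (the relative frequency is the probability of the character: the number of times it occurs divided by the total number of characters), then compute the entropy of the file.

2. Given a second file, compute the probability distribution of its characters in the same way, then compute the KL distance between the character distributions of the two files.

(Entropy and KL distance are NLP terms; each comes down to a formula or two and will not get in the way of reading the code, so just try!)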
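
For reference, the two quantities can be written as follows. This is the standard textbook form, given with the natural logarithm to match the Math.log calls in the code further down; P and Q stand for the relative-frequency distributions of file 1 and file 2 over the units x (Chinese characters or English words), and these symbols are introduced here only for illustration.

```latex
H(P) = -\sum_{x} p(x)\,\ln p(x)
\qquad
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} p(x)\,\ln\frac{p(x)}{q(x)}
```

The KL distance is not symmetric, so swapping the two files generally changes the value, and it is finite only when every unit that occurs in file 1 also occurs in file 2.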

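Notes:

1. The two files may both be Chinese, both be English, or both be mixed Chinese and English. For Chinese the unit counted is the character; for English it is the word.

2. Spaces, line breaks, and punctuation marks are not valid characters.

3. Chinese characters, English words, and the remaining non-valid characters are written, with their occurrence counts, to three separate files.

4. The code is written in Java.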
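
The counting rule in note 1 (one unit per Chinese character, one unit per English word) can be sketched with a single regular expression. This is a minimal illustration and not the approach used in the listing below, which decodes bytes by hand; the class MixedTokenizer and its method are hypothetical names. The escapes \u6211\u7231 and \u4e16\u754c in the sample string stand for 我爱 and 世界, and are written as escapes only so the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the article's classes.
final class MixedTokenizer {

    // One Han character, or a maximal run of ASCII letters.
    private static final Pattern UNIT = Pattern.compile("\\p{IsHan}|[A-Za-z]+");

    // Returns the countable units of the text in order of appearance.
    // Spaces, digits and punctuation match neither alternative and are skipped.
    static List<String> tokenize(String text) {
        List<String> units = new ArrayList<>();
        Matcher m = UNIT.matcher(text);
        while (m.find()) {
            units.add(m.group());
        }
        return units;
    }

    public static void main(String[] args) {
        // Two Han characters, two English words, two more Han characters.
        List<String> units = tokenize("\u6211\u7231 NLP, hello \u4e16\u754c!");
        System.out.println(units.size() + " units: " + units);
    }
}
```

For the sample string this yields six units: each Han character on its own, plus the words NLP and hello.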

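Key points of this article:

1. How to tell that a character is a Chinese character rather than ASCII, punctuation, Japanese, Arabic, and so on.

2. Understanding how Chinese characters are encoded. UTF-8 is exactly the kind of thing that will cost you a whole afternoon to figure out.

3. Regular expressions. Anyone with a computer science background should already be familiar with them, so I will not go over them here.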
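
As a small sketch of points 1 and 2, the JDK can report the Unicode script of a code point and the number of bytes a string takes in UTF-8. The class HanCheck and its methods are hypothetical names used only here. Note one limitation: Japanese kanji belong to the same Han script, so a script check separates Chinese characters from kana, ASCII and punctuation, but not from kanji.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; the names are not from the article's code.
final class HanCheck {

    // True if the code point belongs to the Unicode Han script.
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(isHan(0x4E2D));        // U+4E2D, a Chinese character
        System.out.println(isHan('A'));           // ASCII letter
        System.out.println(isHan(0x3042));        // U+3042, a Hiragana letter
        System.out.println(isHan(0x3002));        // U+3002, ideographic full stop
        System.out.println(utf8Length("A"));      // ASCII takes one byte
        System.out.println(utf8Length("\u4E2D")); // U+4E2D takes three bytes
    }
}
```

Only the first check is true; the ASCII letter, the Hiragana letter and the full stop are all reported as non-Han.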

The code is as follows:

 import java.io.BufferedReader;

 import java.io.FileInputStream;

 import java.io.FileReader;

 import java.io.FileWriter;

 import java.util.HashMap;

 import java.util.Iterator;

 import java.util.Map.Entry;

 import java.util.regex.Matcher;

 import java.util.regex.Pattern;

 public class NLPFileUnit {

     public HashMap<String, Integer> WordOccurrenceNumber;//The Occurrence Number of the single Chinese character

     //or Single English word in the file

     public HashMap<String, Float> WordProbability;//The probability of single Chinese character or English word

     public HashMap<String, Integer> Punctuations;//The punctuation that screened out from the file

     public float entropy;//entropy over the single Chinese characters and single English words of the file

     private String filePath;

     //constructor

     public NLPFileUnit(String filePath) throws Exception {

         this.filePath = filePath;

         WordOccurrenceNumber = createHash(createReader(filePath));

         Punctuations = filterPunctuation(WordOccurrenceNumber);

         WordProbability = calProbability(WordOccurrenceNumber);

         this.entropy = calEntropy(this.WordProbability);

         System.out.println("all punctuations were saved at " + filePath.replace(".", "_punctuation.") + "!");

         this.saveFile(Punctuations, filePath.replace(".", "_punctuation."));

         System.out.println("all words(En & Ch) were saved at " + filePath.replace(".", "_AllWords.") + "!");

         this.saveFile(this.WordOccurrenceNumber, filePath.replace(".", "_AllWords."));

     }

     /**

      * get the English words from the file into the HashMap

      * @param hash

      * @param path

      * @throws Exception

      */

     public void getEnWords(HashMap<String, Integer> hash, String path) throws Exception {

         FileReader fr = new FileReader(path);

         BufferedReader br = new BufferedReader(fr);

         //read all lines into content

         String content = "";

         String line = null;

         while((line = br.readLine())!=null){

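             content += line + " ";//keep a separator so the last word of a line does not merge with the first word of the next line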

         }

         br.close();

         //extract words with a regular expression

         Pattern enWordsPattern = Pattern.compile("([A-Za-z]+)");

         Matcher matcher = enWordsPattern.matcher(content);

         while (matcher.find()) {

             String word = matcher.group();

             if(hash.containsKey(word))

                 hash.put(word, 1 + hash.get(word));

             else{

                 hash.put(word, 1);

             }

         }

     }

     private boolean isPunctuation(String tmp) {

         //Punctuation should not be EN words/ Chinese

         final String cnregex = "\\p{InCJK Unified Ideographs}";

         final String enregex = "[A-Za-z]+";

         return !(tmp.matches(cnregex) || tmp.matches(enregex)) ;

     }

     /**

      * check whether the file starts with the UTF-8 byte order mark (BOM); UTF-8 files saved without a BOM are rejected.

      * @param fs

      * @return

      * @throws Exception

      */

     private boolean isUTF8(FileInputStream fs) throws Exception {

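         if (fs.read() == 0xEF && fs.read() == 0xBB && fs.read() == 0xBF)//a UTF-8 file saved with a BOM starts with the three bytes 0xEF 0xBB 0xBF; the BOM is optional, so BOM-less UTF-8 files fail this check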

             return true;

         return false;

     }

     /**

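      * In UTF-8 the bit pattern of the first byte of a character gives the length of that character (Chinese characters usually take three bytes, ASCII takes one byte)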

      * @param b

      * @return

      */

     private int getlength(byte b) {

         int v = b & 0xff;//convert the signed byte to an unsigned value in 0..255

         if (v >= 0xF0) {//11110xxx

             return 4;

         }

         // 1110xxxx

         else if (v >= 0xE0) {

             return 3;

         } else if (v >= 0xC0) {//110xxxxx

             return 2;//the character occupies 2 bytes

         }

         return 1;

     }

     /**

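      * read the first byte to find out how many bytes the character occupies, then read the whole character; for example 1110xxxx means the character takes three bytes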

      * @param fs

      * @return

      * @throws Exception

      */

     private String readUnit(FileInputStream fs) throws Exception {

         int first = fs.read();//keep the int so that end of stream (-1) is not confused with the byte 0xFF

         if (first == -1)

             return null;

         byte b = (byte) first;

         int len = getlength(b);

         byte[] units = new byte[len];

         units[0] = b;

         for (int i = 1; i < len; i++) {

             units[i] = (byte) fs.read();

         }

         String ret = new String(units, "UTF-8");

         return ret;

     }

     /**

      * read words, punctuation, Chinese characters and everything else into the HashMap

      * @param inputStream

      * @return

      * @throws Exception

      */

     private HashMap<String, Integer> createHash(FileInputStream inputStream)

             throws Exception {

         HashMap<String, Integer> hash = new HashMap<String, Integer>();

         String key = null;

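         while ((key = readUnit(inputStream)) != null) {

             if (key.matches("[A-Za-z]"))//skip single ASCII letters here; English is counted word by word in getEnWords below

                 continue;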

             if (hash.containsKey(key)) {

                 hash.put(key, 1 + (int) hash.get(key));

             } else {

                 hash.put(key, 1);

             }

         }

         inputStream.close();

         getEnWords(hash, this.filePath);

         return hash;

     }

     /**

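      * open the file with FileInputStream; throws an IOException if the file does not start with the UTF-8 BOM

      * @param path

      * @return

      * @throws Exception

      */

     private FileInputStream createReader(String path) throws Exception {

         FileInputStream br = new FileInputStream(path);

         if (!isUTF8(br)) {

             br.close();

             //returning null here would only cause a NullPointerException later in createHash

             throw new java.io.IOException(path + " does not start with the UTF-8 BOM");

         }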

         return br;

     }

     /**

      * move the punctuation found in (HashMap)hash into (HashMap)puncs

      * @param hash the punctuation entries are removed from (HashMap)hash at the same time

      * @return

      */

     private HashMap<String, Integer> filterPunctuation(

             HashMap<String, Integer> hash) {

         HashMap<String, Integer> puncs = new HashMap<String, Integer>();

         Iterator<?> iterator = hash.entrySet().iterator();

         while (iterator.hasNext()) {

             Entry<?, ?> entry = (Entry<?, ?>) iterator.next();

             String key = entry.getKey().toString();

             if (isPunctuation(key)) {

                 puncs.put(key, hash.get(key));

                 iterator.remove();

             }

         }

         return puncs;

     }

     /**

      * calculate the probability of the word in hash

      * @param hash

      * @return

      */

     private HashMap<String, Float> calProbability(HashMap<String, Integer> hash) {

         float count = countWords(hash);

         HashMap<String, Float> prob = new HashMap<String, Float>();

         Iterator<?> iterator = hash.entrySet().iterator();

         while (iterator.hasNext()) {

             Entry<?, ?> entry = (Entry<?, ?>) iterator.next();

             String key = entry.getKey().toString();

             prob.put(key, hash.get(key) / count);

         }

         return prob;

     }

     /**

      * save the content in the hash into file.txt

      * @param hash

      * @param path

      * @throws Exception

      */

     private void saveFile(HashMap<String, Integer> hash, String path)

             throws Exception {

         FileWriter fw = new FileWriter(path);

         fw.write(hash.toString());

         fw.close();

     }

     /**

      * calculate the total words in hash

      * @param hash

      * @return

      */

     private int countWords(HashMap<String, Integer> hash) {

         int count = 0;

         for (Entry<String, Integer> entry : hash.entrySet()) {

             count += entry.getValue();

         }

         return count;

     }

     /**

      * calculate the entropy of the characters (natural logarithm, so the result is in nats)

      * @param hash

      * @return

      */

     private float calEntropy(HashMap<String, Float> hash) {

         float entropy = 0;

         Iterator<Entry<String, Float>> iterator = hash.entrySet().iterator();

         while (iterator.hasNext()) {

             Entry<String, Float> entry = (Entry<String, Float>) iterator.next();

             Float prob = entry.getValue();//get the probability of the characters

             entropy += 0 - (prob * Math.log(prob));//calculate the entropy of the characters

         }

         return entropy;

     }

 }
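
The byte-level reading done by getlength and readUnit above can also be tried in isolation. The following is a stand-alone sketch under stated assumptions, not the article's code: the class Utf8Units and its methods are hypothetical names, the lead byte is classified with bit masks instead of comparisons, and a truncated sequence raises an IOException, which the original does not do. The sample string holds the letter a, U+4E2D (a Chinese character) and U+00E9 (a Latin letter with an accent), which take 1, 3 and 2 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; it mirrors getlength/readUnit but is not the article's code.
final class Utf8Units {

    // Length in bytes of a UTF-8 sequence, judged from its lead byte (0..255).
    // Returns 1 for ASCII and also for bytes that are not a valid lead byte.
    static int sequenceLength(int lead) {
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
        return 1;                            // 0xxxxxxx or malformed input
    }

    // Reads the whole stream and returns one String per encoded character.
    static List<String> readUnits(InputStream in) throws IOException {
        List<String> units = new ArrayList<>();
        int first;
        while ((first = in.read()) != -1) {
            int len = sequenceLength(first);
            byte[] buf = new byte[len];
            buf[0] = (byte) first;
            for (int i = 1; i < len; i++) {
                int next = in.read();
                if (next == -1) {
                    throw new IOException("truncated UTF-8 sequence");
                }
                buf[i] = (byte) next;
            }
            units.add(new String(buf, StandardCharsets.UTF_8));
        }
        return units;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "a\u4E2D\u00E9".getBytes(StandardCharsets.UTF_8); // 1 + 3 + 2 bytes
        System.out.println(readUnits(new ByteArrayInputStream(data)).size()); // prints 3
    }
}
```

Reading the first byte as an int keeps the end-of-stream value -1 distinct from the data byte 0xFF, and the masks classify the boundary bytes 0xC0, 0xE0 and 0xF0 as lead bytes of 2, 3 and 4 byte sequences.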

 import java.io.BufferedReader;

 import java.io.FileNotFoundException;

 import java.io.IOException;

 import java.io.InputStreamReader;

 import java.util.HashMap;

 import java.util.Iterator;

 import java.util.Map.Entry;

 public class NLPWork {

     /**

      * calculate the KL distance from file u1 to file u2; units that occur only in u1 are skipped, which keeps the sum finite

      * @param u1

      * @param u2

      * @return

      */

     public static float calKL(NLPFileUnit u1, NLPFileUnit u2) {

         HashMap<String, Float> hash1 = u1.WordProbability;

         HashMap<String, Float> hash2 = u2.WordProbability;

         float KLdistance = 0;

         Iterator<Entry<String, Float>> iterator = hash1.entrySet().iterator();

         while (iterator.hasNext()) {

             Entry<String, Float> entry = iterator.next();

             String key = entry.getKey().toString();

             if (hash2.containsKey(key)) {

                 Float value1 = entry.getValue();

                 Float value2 = hash2.get(key);

                 KLdistance += value1 * Math.log(value1 / value2);

             }

         }

         return KLdistance;

     }

     public static void main(String[] args) throws IOException, Exception {

         //all punctuation will be saved under working directory

         System.out.println("Now only UTF8 encoded file is supported!!!");

         System.out.println("PLS input file 1 path:");

         BufferedReader cin = new BufferedReader(

                 new InputStreamReader(System.in));

         String file1 = cin.readLine();

         System.out.println("PLS input file 2 path:");

         String file2 = cin.readLine();

         NLPFileUnit u1 = null;

         NLPFileUnit u2 = null;

         try{

             u1 = new NLPFileUnit(file1);//NLP:Natural Language Processing

             u2 = new NLPFileUnit(file2);

         }

         catch(FileNotFoundException e){

             System.out.println("File Not Found!!");

             e.printStackTrace();

             return;

         }

         float KLdistance = calKL(u1, u2);

         System.out.println("KLdistance is :" + KLdistance);

         System.out.println("File 1 Entropy: " + u1.entropy);

         System.out.println("File 2 Entropy: " + u2.entropy);

     }

 }
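
To sanity-check the entropy and KL computations, it helps to run them on a tiny distribution whose values can be worked out by hand. The sketch below is an assumption-laden illustration rather than the article's implementation: the class EntropyKlCheck is a hypothetical name, it uses double instead of float, and, unlike calKL above, it returns infinity when the second distribution lacks a unit that the first one has.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical check harness; the class and method names are illustrative only.
final class EntropyKlCheck {

    // Shannon entropy in nats: H(P) = -sum p ln p.
    static double entropy(Map<String, Double> p) {
        double h = 0.0;
        for (double px : p.values()) {
            if (px > 0.0) {
                h -= px * Math.log(px);
            }
        }
        return h;
    }

    // D_KL(P || Q) in nats; returns +infinity when Q lacks a unit that P has.
    static double klDivergence(Map<String, Double> p, Map<String, Double> q) {
        double d = 0.0;
        for (Map.Entry<String, Double> e : p.entrySet()) {
            double px = e.getValue();
            if (px == 0.0) {
                continue; // a zero-probability unit contributes nothing
            }
            Double qx = q.get(e.getKey());
            if (qx == null || qx == 0.0) {
                return Double.POSITIVE_INFINITY;
            }
            d += px * Math.log(px / qx);
        }
        return d;
    }

    public static void main(String[] args) {
        Map<String, Double> p = new HashMap<>();
        p.put("a", 0.5);
        p.put("b", 0.5);
        Map<String, Double> q = new HashMap<>();
        q.put("a", 0.25);
        q.put("b", 0.75);
        System.out.println(entropy(p));         // ln 2, about 0.693
        System.out.println(klDivergence(p, q)); // 0.5 * ln(4/3), about 0.144
        System.out.println(klDivergence(p, p)); // 0.0
    }
}
```

A distribution compared with itself gives 0, and swapping p and q gives a different value, which is why the order of the two files matters.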

Results:
