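Calculating TF-IDF and Cosine Similarity in Java: Learning Version
A note up front: since this is a learning version, it is not a practical engineering implementation. The whole code base uses List for matching, so you can imagine how efficient it is.
[Original article from]: http://computergodzilla.blogspot.com/2013/07/how-to-calculate-tf-idf-of-document.html, with some of its bugs fixed.
P.S.: Unless you are forced to keep everything in one language, try not to use this project to compute TF-IDF. For 20,000 short texts, a Matlab implementation takes only a few seconds, while this Java project takes a long time: half an hour, or even longer. This program is therefore a learning version and is not suited to production use (see the engineering trial version).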
Beginners doing a project in text mining are often troubled by terms such as:
- TF-IDF
- COSINE SIMILARITY
- CLUSTERING
- DOCUMENT VECTORS
In my earlier post I showed you what Cosine Similarity is. I will not explain Cosine Similarity again in this post; instead I will show a nice little piece of code that calculates it in Java.
Many of you must be familiar with TF-IDF (Term Frequency-Inverse Document Frequency).
I will explain it briefly.
Term Frequency:
Suppose a document “Tf-Idf Brief Introduction” contains 60000 words overall and the word “Term-Frequency” occurs 60 times.
Then, mathematically, its Term Frequency is TF = 60/60000 = 0.001.
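The arithmetic above can be checked with a few lines of Java. This is only a minimal sketch: the class name `TfExample` and the method `termFrequency` are hypothetical names introduced for illustration, and the counts are passed in directly instead of being read from a real document.

```java
/** Illustrative only: TfExample and termFrequency are hypothetical names. */
class TfExample {
    /** TF = occurrences of the term / total number of words in the document. */
    static double termFrequency(int occurrences, int totalWords) {
        if (totalWords == 0) {
            return 0.0; // assumption of this sketch: an empty document gives TF = 0
        }
        return (double) occurrences / totalWords; // cast avoids integer division
    }

    public static void main(String[] args) {
        // The example from the text: 60 occurrences in 60000 words.
        System.out.println(termFrequency(60, 60000)); // prints 0.001
    }
}
```

The cast to `double` matters: with two `int` operands, `60 / 60000` would be integer division and give 0.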
Inverse Document Frequency:
Suppose you bought the whole Harry Potter series. Suppose there are 7 books in the series and the word “AbraKaDabra” appears in 2 of them.
Then, mathematically, its Inverse Document Frequency is IDF = 1 + log(7/2) = ……. (calculate it, guys; don’t be lazy. I am the lazy one, not you.)
And finally, TFIDF = TF * IDF.
Having seen the math, I assume you now understand what it means in practice.
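For readers who would rather let the machine do the calculation, here is a small sketch of the IDF and TF-IDF formulas above. It assumes the logarithm is the natural logarithm, which is what Java's `Math.log` computes. The names `IdfExample`, `inverseDocumentFrequency` and `tfIdf` are hypothetical, and multiplying the TF of 0.001 from the earlier example with this IDF is purely to show the last step, since the two examples describe different documents.

```java
/** Illustrative only: the class and method names are hypothetical. */
class IdfExample {
    /** IDF = 1 + ln(total documents / documents containing the term). */
    static double inverseDocumentFrequency(int totalDocs, int docsWithTerm) {
        if (docsWithTerm == 0) {
            return 0.0; // assumption of this sketch: unseen terms get IDF = 0
        }
        return 1 + Math.log((double) totalDocs / docsWithTerm);
    }

    /** TFIDF = TF * IDF. */
    static double tfIdf(double tf, double idf) {
        return tf * idf;
    }

    public static void main(String[] args) {
        // 7 books, the word appears in 2 of them: 1 + ln(3.5), roughly 2.2528.
        double idf = inverseDocumentFrequency(7, 2);
        System.out.println("IDF    = " + idf);
        System.out.println("TF-IDF = " + tfIdf(0.001, idf));
    }
}
```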
Document Vector:
There are various ways to calculate document vectors; I am just giving you one example. Suppose I calculate the TF-IDF of every term for a document A and store the scores in an array (a list, a matrix, or any other ordered structure; you are geniuses, you know how to create a vector). Then I get a Document Vector of TF-IDF scores for document A.
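The idea can be sketched as follows: fix one ordering of the vocabulary, then fill one slot per term for each document. This is a minimal illustration, not the author's implementation. `DocumentVectorExample` and its methods are hypothetical names, the two tiny documents are made up, and matching is case-sensitive to keep the sketch short.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch; DocumentVectorExample and its methods are hypothetical names. */
class DocumentVectorExample {
    /** Fraction of the document's tokens that equal the term. */
    static double tf(String[] doc, String term) {
        if (doc.length == 0) {
            return 0.0;
        }
        int count = 0;
        for (String token : doc) {
            if (token.equals(term)) {
                count++;
            }
        }
        return (double) count / doc.length;
    }

    /** 1 + ln(N / n), where n is the number of documents containing the term. */
    static double idf(List<String[]> docs, String term) {
        int containing = 0;
        for (String[] doc : docs) {
            if (Arrays.asList(doc).contains(term)) {
                containing++;
            }
        }
        // Assumption: a term found in no document gets idf 0 instead of infinity.
        return containing == 0 ? 0.0 : 1 + Math.log((double) docs.size() / containing);
    }

    /** One TF-IDF score per vocabulary term, always in vocabulary order. */
    static double[] documentVector(String[] doc, List<String[]> docs, List<String> vocabulary) {
        double[] vector = new double[vocabulary.size()];
        for (int i = 0; i < vector.length; i++) {
            String term = vocabulary.get(i);
            vector[i] = tf(doc, term) * idf(docs, term);
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[] {"apple", "banana", "apple"},
                new String[] {"banana", "cherry"});
        List<String> vocabulary = Arrays.asList("apple", "banana", "cherry");
        for (String[] doc : docs) {
            System.out.println(Arrays.toString(documentVector(doc, docs, vocabulary)));
        }
    }
}
```

Because every document is scored against the same vocabulary in the same order, all vectors have the same length and slot i always refers to the same term, which is what makes them comparable later.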
The class shown below calculates the Term Frequency(TF) and Inverse Document Frequency(IDF).

    // TfIdf.java
    package com.computergodzilla.tfidf;

    import java.util.List;

    /**
     * Class to calculate the TF-IDF of a term.
     * @author Mubin Shrestha
     */
    public class TfIdf {
        /**
         * Calculates the tf of term termToCheck.
         * @param totalterms : array of all the words in the document being processed
         * @param termToCheck : term whose tf is to be calculated
         * @return tf (term frequency) of term termToCheck
         */
        public double tfCalculator(String[] totalterms, String termToCheck) {
            double count = 0; // number of occurrences of termToCheck in the document
            for (String s : totalterms) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                }
            }
            return count / totalterms.length;
        }

        /**
         * Calculates the idf of term termToCheck.
         * @param allTerms : the terms of every document, one array per document
         * @param termToCheck : term whose idf is to be calculated
         * @return idf (inverse document frequency) score
         */
        public double idfCalculator(List<String[]> allTerms, String termToCheck) {
            double count = 0; // number of documents that contain termToCheck
            for (String[] ss : allTerms) {
                for (String s : ss) {
                    if (s.equalsIgnoreCase(termToCheck)) {
                        count++;
                        break; // count each document at most once
                    }
                }
            }
            return 1 + Math.log(allTerms.size() / count);
        }
    }

The class shown below parses the text documents and splits them into tokens. It uses the TfIdf.java class to calculate TF-IDF, and it calls the CosineSimilarity.java class to calculate the similarity between the parsed documents.

    // DocumentParser.java
    package com.computergodzilla.tfidf;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    /**
     * Class to read documents.
     *
     * @author Mubin Shrestha
     */
    public class DocumentParser {
        // Holds the terms of each document, one array per document.
        private List<String[]> termsDocsArray = new ArrayList<String[]>();
        private List<String> allTerms = new ArrayList<String>(); // all distinct terms
        private List<double[]> tfidfDocsVector = new ArrayList<double[]>();

        /**
         * Method to read files and store their terms in arrays.
         * @param filePath : path of the source folder
         * @throws FileNotFoundException
         * @throws IOException
         */
        public void parseFiles(String filePath) throws FileNotFoundException, IOException {
            File[] allfiles = new File(filePath).listFiles();
            if (allfiles == null) { // not a directory, or it could not be read
                throw new FileNotFoundException("Cannot list files in: " + filePath);
            }
            for (File f : allfiles) {
                if (f.getName().endsWith(".txt")) {
                    StringBuilder sb = new StringBuilder();
                    // try-with-resources closes the reader even if reading fails
                    try (BufferedReader in = new BufferedReader(new FileReader(f))) {
                        String s;
                        while ((s = in.readLine()) != null) {
                            sb.append(s).append(' '); // keep words on adjacent lines apart
                        }
                    }
                    // Lower-case so the vocabulary agrees with the case-insensitive
                    // comparison in TfIdf, strip punctuation, then split into terms.
                    String[] tokenizedTerms = sb.toString().trim().toLowerCase()
                            .replaceAll("[\\W&&[^\\s]]", "").split("\\W+");
                    for (String term : tokenizedTerms) {
                        if (!allTerms.contains(term)) { // avoid duplicate entry
                            allTerms.add(term);
                        }
                    }
                    termsDocsArray.add(tokenizedTerms);
                }
            }
        }

        /**
         * Method to create a term vector for each document from its tfidf scores.
         */
        public void tfIdfCalculator() {
            double tf; // term frequency
            double idf; // inverse document frequency
            double tfidf; // term frequency inverse document frequency
            TfIdf calculator = new TfIdf(); // one instance is enough for all terms
            for (String[] docTermsArray : termsDocsArray) {
                double[] tfidfvectors = new double[allTerms.size()];
                int count = 0;
                for (String terms : allTerms) {
                    tf = calculator.tfCalculator(docTermsArray, terms);
                    idf = calculator.idfCalculator(termsDocsArray, terms);
                    tfidf = tf * idf;
                    tfidfvectors[count] = tfidf;
                    count++;
                }
                tfidfDocsVector.add(tfidfvectors); // storing document vectors
            }
        }

        /**
         * Method to calculate cosine similarity between all the documents.
         */
        public void getCosineSimilarity() {
            CosineSimilarity similarity = new CosineSimilarity();
            for (int i = 0; i < tfidfDocsVector.size(); i++) {
                for (int j = 0; j < tfidfDocsVector.size(); j++) {
                    System.out.println("between " + i + " and " + j + " = "
                            + similarity.cosineSimilarity(
                                    tfidfDocsVector.get(i),
                                    tfidfDocsVector.get(j)));
                }
            }
        }
    }

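As noted at the top of this post, matching everything through List lookups is slow: every term is searched for in every document again and again. One common way around this, shown here only as a hedged sketch and not as part of the original code, is to count each document once into hash maps and compute scores from those counts. `FastTfIdfSketch` and its methods are hypothetical names. The sketch assumes tokens are already lower-cased (matching is case-sensitive) and that the document-frequency map was built from a collection that includes the document being scored.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative alternative; FastTfIdfSketch and its methods are hypothetical names. */
class FastTfIdfSketch {
    /** For every term, the number of documents that contain it (one pass over the corpus). */
    static Map<String, Integer> documentFrequencies(List<String[]> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String[] doc : docs) {
            Set<String> distinct = new HashSet<>(Arrays.asList(doc));
            for (String term : distinct) {
                df.merge(term, 1, Integer::sum); // each document counts once per term
            }
        }
        return df;
    }

    /** Sparse TF-IDF scores for one document: only terms that occur in it are stored. */
    static Map<String, Double> tfIdf(String[] doc, Map<String, Integer> df, int totalDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : doc) {
            counts.merge(term, 1, Integer::sum);
        }
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            double tf = (double) entry.getValue() / doc.length;
            // Assumes df has an entry for every term of doc (see the lead-in).
            double idf = 1 + Math.log((double) totalDocs / df.get(entry.getKey()));
            weights.put(entry.getKey(), tf * idf);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[] {"apple", "banana", "apple"},
                new String[] {"banana", "cherry"});
        Map<String, Integer> df = documentFrequencies(docs);
        for (String[] doc : docs) {
            System.out.println(tfIdf(doc, df, docs.size()));
        }
    }
}
```

With this layout the document frequencies are computed once instead of once per term per document, and each document only stores the terms it actually contains.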
This is the class that calculates Cosine Similarity:

    // CosineSimilarity.java
    package com.computergodzilla.tfidf;

    /**
     * Cosine similarity calculator class.
     * @author Mubin Shrestha
     */
    public class CosineSimilarity {
        /**
         * Method to calculate cosine similarity between two documents.
         * @param docVector1 : document vector 1 (a)
         * @param docVector2 : document vector 2 (b)
         * @return cosine similarity of a and b, or 0.0 if either vector has zero magnitude
         */
        public double cosineSimilarity(double[] docVector1, double[] docVector2) {
            double dotProduct = 0.0;
            double magnitude1 = 0.0;
            double magnitude2 = 0.0;
            double cosineSimilarity = 0.0;
            // docVector1 and docVector2 must be of the same length
            for (int i = 0; i < docVector1.length; i++) {
                dotProduct += docVector1[i] * docVector2[i]; // a.b
                magnitude1 += Math.pow(docVector1[i], 2); // sum of a^2
                magnitude2 += Math.pow(docVector2[i], 2); // sum of b^2
            }
            magnitude1 = Math.sqrt(magnitude1); // |a|
            magnitude2 = Math.sqrt(magnitude2); // |b|
            // Both magnitudes must be non-zero, otherwise the division below is undefined.
            if (magnitude1 != 0.0 && magnitude2 != 0.0) {
                cosineSimilarity = dotProduct / (magnitude1 * magnitude2);
            } else {
                return 0.0;
            }
            return cosineSimilarity;
        }
    }

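The formula behind the class above is cos(a, b) = a.b / (|a| * |b|). A quick way to gain confidence in it is to try vectors whose answer is known in advance. The sketch below re-implements the formula so that it is self-contained. `CosineExample` and `cosine` are hypothetical names, and returning 0 when either vector is all zeros is an assumption of this sketch.

```java
/** Illustrative only: CosineExample and cosine are hypothetical names. */
class CosineExample {
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;
        double sumSquaresA = 0.0;
        double sumSquaresB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            sumSquaresA += a[i] * a[i];
            sumSquaresB += b[i] * b[i];
        }
        if (sumSquaresA == 0.0 || sumSquaresB == 0.0) {
            return 0.0; // assumption: similarity with an all-zero vector is 0
        }
        return dot / (Math.sqrt(sumSquaresA) * Math.sqrt(sumSquaresB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[] {3, 4}, new double[] {3, 4})); // identical: prints 1.0
        System.out.println(cosine(new double[] {1, 0}, new double[] {0, 1})); // orthogonal: prints 0.0
        System.out.println(cosine(new double[] {0, 0}, new double[] {1, 2})); // zero vector: prints 0.0
    }
}
```

Since TF-IDF scores are never negative, the similarity between two document vectors always falls between 0 (no shared terms) and 1 (same direction).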
Here’s the main class to run the code:

    // TfIdfMain.java
    package com.computergodzilla.tfidf;

    import java.io.FileNotFoundException;
    import java.io.IOException;

    /**
     * Entry point that runs the whole calculation.
     * @author Mubin Shrestha
     */
    public class TfIdfMain {
        /**
         * Main method
         * @param args
         * @throws FileNotFoundException
         * @throws IOException
         */
        public static void main(String args[]) throws FileNotFoundException, IOException {
            DocumentParser dp = new DocumentParser();
            dp.parseFiles("D:\\FolderToCalculateCosineSimilarityOf"); // location of the folder with the source .txt files
            dp.tfIdfCalculator(); // calculates tfidf
            dp.getCosineSimilarity(); // calculates cosine similarity
        }
    }

You can also download the whole source code from here: Download. (Google Drive)
Overall, I first calculate the TF-IDF scores of all the terms across all the documents and build a document vector for each document from them. Then I use those document vectors to calculate cosine similarity.
If you think this explanation is not clear enough, let me know.
Happy Text-Mining!!
from: http://jacoxu.com/?p=1619