1. TF-IDF

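Term frequency:

tf: term frequency. How often a term occurs in a document. The larger the tf, the more important the term is to that document.

Document frequency:

df: document frequency. The number of documents that contain the term. The larger the df, the less important (less discriminative) the term is.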

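Term weight formula:

W(t,d) = tf(t,d) * log(N / df(t))

W(t,d): the weight of term t in document d
tf(t,d): the frequency of term t in document d
N: the total number of documents
df(t): the number of documents that contain term t

The factor log(N / df(t)) is the inverse document frequency (idf) of term t.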
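To make the formula concrete, here is a minimal sketch that evaluates it directly from raw counts. The class name `TfIdfWeightSketch` and the method `weight` are hypothetical and do not appear in the original post. The sketch assumes the natural logarithm and assumes a weight of 0 when the document is empty or no document contains the term.

```java
// Hypothetical helper for illustration; not part of the original post.
final class TfIdfWeightSketch {

    /**
     * Evaluates W(t,d) = tf(t,d) * ln(N / df(t)) from raw counts.
     *
     * @param termCount    occurrences of the term in the document
     * @param docLength    number of tokens in the document
     * @param totalDocs    N, the number of documents
     * @param docsWithTerm df(t), the number of documents containing the term
     */
    static double weight(int termCount, int docLength, int totalDocs, int docsWithTerm) {
        if (docLength == 0 || docsWithTerm == 0) {
            return 0.0; // assumption: empty document or unseen term gets weight 0
        }
        double tf = (double) termCount / docLength;
        double idf = Math.log((double) totalDocs / docsWithTerm);
        return tf * idf;
    }

    public static void main(String[] args) {
        // One occurrence in a 7-token document; 2 of 4 documents contain the term.
        System.out.println(weight(1, 7, 4, 2)); // roughly 0.099
        // A term that occurs in every document has idf = ln(1) = 0.
        System.out.println(weight(1, 7, 4, 4)); // prints 0.0
    }
}
```

The second call shows why a high df lowers the weight: a term present in all documents contributes nothing, however often it occurs.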

2. Java Implementation

package com.javacore.algorithm;

import java.util.Arrays;
import java.util.List;

/**
 * Created by bee on 17/3/13.
 *
 * @version 1.0
 * @author blog.csdn.net/napoay
 */
public class TfIdfCal {

    /**
     * Calculates the term frequency of a term in one document.
     *
     * @param doc  word vector (token list) of a document
     * @param term a word
     * @return occurrences of the term divided by the document length
     */
    public double tf(List<String> doc, String term) {
        if (doc.isEmpty()) {
            return 0; // avoid 0/0 = NaN for an empty document
        }
        double termFrequency = 0;
        for (String str : doc) {
            if (str.equalsIgnoreCase(term)) {
                termFrequency++;
            }
        }
        return termFrequency / doc.size();
    }

    /**
     * Calculates the document frequency of a term.
     *
     * @param docs the set of all docs
     * @param term a word
     * @return the number of docs which contain the word
     */
    public int df(List<List<String>> docs, String term) {
        int n = 0;
        // Use isEmpty() here: `term != ""` compares references, not content.
        if (term != null && !term.isEmpty()) {
            for (List<String> doc : docs) {
                for (String word : doc) {
                    if (term.equalsIgnoreCase(word)) {
                        n++;
                        break; // count each document at most once
                    }
                }
            }
        } else {
            System.out.println("term must not be null or an empty string");
        }
        return n;
    }

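    /**
     * Calculates the inverse document frequency of a term.
     *
     * @param docs the set of all docs
     * @param term a word
     * @return idf, i.e. ln(N / df); 0 if no document contains the term
     */
    public double idf(List<List<String>> docs, String term) {
        int documentFrequency = df(docs, term); // compute once and reuse
        System.out.println("N:" + docs.size());
        System.out.println("DF:" + documentFrequency);
        if (documentFrequency == 0) {
            // Assumption: treat an unseen term as weight 0 instead of dividing by zero.
            return 0;
        }
        return Math.log(docs.size() / (double) documentFrequency);
    }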

    /**
     * Calculates the tf-idf weight of a term in a document.
     *
     * @param doc  a doc
     * @param docs document set
     * @param term a word
     * @return the tf-idf weight of the term in the doc
     */
    public double tfIdf(List<String> doc, List<List<String>> docs, String term) {
        return tf(doc, term) * idf(docs, term);
    }

    public static void main(String[] args) {
        // Four pre-segmented Chinese documents (one token per list element).
        List<String> doc1 = Arrays.asList("人工", "智能", "成为", "互联网", "大会", "焦点");
        List<String> doc2 = Arrays.asList("谷歌", "推出", "开源", "人工", "智能", "系统", "工具");
        List<String> doc3 = Arrays.asList("互联网", "的", "未来", "在", "人工", "智能");
        List<String> doc4 = Arrays.asList("谷歌", "开源", "机器", "学习", "工具");
        List<List<String>> documents = Arrays.asList(doc1, doc2, doc3, doc4);

        TfIdfCal calculator = new TfIdfCal();
        System.out.println(calculator.tf(doc2, "开源"));
        System.out.println(calculator.df(documents, "开源"));

        double tfidf = calculator.tfIdf(doc2, documents, "谷歌");
        System.out.println("TF-IDF (谷歌) = " + tfidf);

        // Manual cross-check: ln(N / df) / |doc2| with N = 4, df = 2, |doc2| = 7.
        System.out.println(Math.log(4 / 2.0) / 7);
    }

}

Output:

0.14285714285714285
2
N:4
DF:2
TF-IDF (谷歌) = 0.09902102579427789
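
The same calculation can be applied to every term of a document to see which terms stand out. The sketch below is a self-contained illustration, not the author's code. The class `TfIdfRankingSketch` and its methods are hypothetical, the English tokens are stand-ins for the segmented Chinese tokens used above, and matching is case-sensitive (`equals`) for simplicity. A term that appears in no document is assumed to get weight 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class for illustration; not part of the original post.
final class TfIdfRankingSketch {

    /** Maps each distinct term of doc to tf * ln(N / df), in order of first occurrence. */
    static Map<String, Double> weights(List<String> doc, List<List<String>> docs) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String term : doc) {
            if (result.containsKey(term)) {
                continue; // term already scored
            }
            long count = doc.stream().filter(term::equals).count();
            long df = docs.stream().filter(d -> d.contains(term)).count();
            double tf = (double) count / doc.size();
            // Assumption: a term missing from the corpus gets weight 0.
            double idf = df == 0 ? 0.0 : Math.log((double) docs.size() / df);
            result.put(term, tf * idf);
        }
        return result;
    }

    /** Returns the entries sorted by weight, highest first; ties keep their original order. */
    static List<Map.Entry<String, Double>> ranked(Map<String, Double> weights) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(weights.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        // English stand-ins for the four example documents.
        List<String> doc1 = Arrays.asList("artificial", "intelligence", "becomes", "internet", "conference", "focus");
        List<String> doc2 = Arrays.asList("google", "releases", "open-source", "artificial", "intelligence", "system", "tool");
        List<String> doc3 = Arrays.asList("internet", "of", "future", "in", "artificial", "intelligence");
        List<String> doc4 = Arrays.asList("google", "open-source", "machine", "learning", "tool");
        List<List<String>> docs = Arrays.asList(doc1, doc2, doc3, doc4);

        for (Map.Entry<String, Double> entry : ranked(weights(doc2, docs))) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

For the second document, terms that occur in only one document ("releases", "system") rank highest, while "artificial" and "intelligence", which occur in three of the four documents, rank lowest.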
