C++ implementation: computing TF-IDF values for an article

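First, choosing the keywords:

Okay, my model here is really too simple, but it is still worth walking through...

What we have on hand is a pile of DF(w, c) values for Baidu Baike entries w, where c is the whole Baike corpus. Why? Because it is convenient (and it is the only ready-made data available anyway).

There are more than 8.3 million entries, and storing all of them is clearly not sensible, not scientific, and not magical. So we pick a subset of them as keywords.

How to pick the keywords? I chose the words whose DF value falls in [100, 5000]. That is not very sensible, scientific, or magical either, but it is a lot more sensible, scientific, and magical than storing everything. Yep!

So we read everything in, find the words we need, compute their IDF values along the way, and write the result to a new file.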

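 #include <cstdio>
 #include <iostream>
 #include <iomanip>
 #include <cmath>
 #include <string>
 #include <algorithm>
 using namespace std;
 typedef double lf;
 // NOTE: several numeric literals were lost from the source of this post.
 // Each gap is marked below; the program will not compile until they are filled in.
 const int cnt_id = /* lost; must exceed the number of entries, since a[] is used 1-based */;
 const lf tot_file = /* lost; total number of documents in the corpus */;
 const lf eps = /* lost; a small constant of the form 1e-k */;
 // One dictionary entry: id, word, and the IDF computed from its DF.
 // (With C++17 and later the name `data` may clash with std::data; rename it if the compiler complains.)
 struct data {
     int id;
     lf IDF;
     string st;
     data() {}
     data(int _id, lf _IDF, string _st) : id(_id), IDF(_IDF), st(_st) {}
     // Default order: descending IDF, which is ascending DF.
     inline bool operator < (const data &a) const {
         return IDF > a.IDF;
     }
 } a[cnt_id];
 // Secondary order: ascending id, used for the final output.
 inline bool cmp_id(const data &a, const data &b) {
     return a.id < b.id;
 }
 string st;
 int id, cnt;
 int St, Ed;
 lf NUM_max, NUM_min;
 // IDF = ln(total documents / (DF + eps)); eps guards against DF = 0.
 inline lf calc(int x) {
     return (lf) log((lf) tot_file / (x + eps));
 }
 int main() {
     int i, DF;
     freopen("data", "r", stdin);
     freopen("data_new", "w", stdout);
     ios::sync_with_stdio(true);
     // Each input line: id, word, DF.
     while (cin >> id >> st >> DF)
         a[++cnt] = data(id, (lf) calc(DF), st);
     sort(a + 1, a + cnt + 1);
     // The bounds were lost in the source; 100 and 5000 come from the DF range stated above.
     NUM_max = calc(100), NUM_min = calc(5000);
     // Skip the entries whose IDF is too high (DF too small).
     for (i = 1; i <= cnt; ++i)
         if (a[i].IDF < NUM_max) break;
     St = i;
     // Stop at the first entry whose IDF is too low (DF too large).
     for ( ; i <= cnt; ++i)
         if (a[i].IDF < NUM_min) break;
     Ed = i;
     // The kept range is [St, Ed); restore id order before writing it out.
     sort(a + St, a + Ed, cmp_id);
     cout << Ed - St << endl;
     for (i = St; i < Ed; ++i)
         cout << a[i].id << ' ' << a[i].st << ' ' << setprecision(/* lost */) << a[i].IDF << endl;
     return 0;
 }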

This selects 339,896 words as keywords, which is 4.1% of all entries. Cutting the count this much should make the later programs considerably faster.
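For comparison, the same selection can also be done in a single pass without sorting, by testing DF directly. The following is only a minimal sketch, not the program above: `RawEntry`, `Keyword`, `idf_of`, `select_keywords`, and the default `eps = 1e-9` are names and values assumed here for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical input record: one dictionary line (id, word, DF).
struct RawEntry {
    int id;
    std::string word;
    int df;
};

// Hypothetical output record: a kept keyword with its IDF.
struct Keyword {
    int id;
    std::string word;
    double idf;
};

// IDF = ln(total_docs / (df + eps)); eps = 1e-9 is an assumed default.
inline double idf_of(int df, double total_docs, double eps = 1e-9) {
    return std::log(total_docs / (df + eps));
}

// Keep the entries with df in [df_min, df_max]. Input order is preserved,
// so input sorted by id stays sorted by id.
std::vector<Keyword> select_keywords(const std::vector<RawEntry>& entries,
                                     double total_docs, int df_min, int df_max) {
    assert(df_min <= df_max);
    std::vector<Keyword> kept;
    for (const RawEntry& e : entries)
        if (e.df >= df_min && e.df <= df_max)
            kept.push_back({e.id, e.word, idf_of(e.df, total_docs)});
    return kept;
}
```

Called as `select_keywords(entries, total_docs, 100, 5000)`, this keeps both boundary values of the DF range, and it needs no second sort when the input is already ordered by id.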

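(P.S. A small trick used here is setprecision(x), which controls how many digits cout prints for a floating-point number: the number of significant digits by default, or the number of digits after the decimal point when combined with fixed.)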
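The difference between the two modes of `setprecision` is easy to see in a small sketch. `format_number` is a helper name made up for this example; it writes to a string stream so the result can be inspected.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Format v with std::setprecision(digits).
// With fixed == false the default notation is used and `digits` counts
// significant digits; with fixed == true it counts digits after the point.
std::string format_number(double v, int digits, bool fixed) {
    assert(digits >= 0);
    std::ostringstream out;
    if (fixed) out << std::fixed;
    out << std::setprecision(digits) << v;
    return out.str();
}
```

For example, `format_number(3.14159265, 3, false)` gives `3.14`, while `format_number(3.14159265, 3, true)` gives `3.142`.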

Okay, keyword selection is done. Next, we read in the article (already word-segmented) and compute the TF-IDF values!

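We can do the computation while reading, which saves space and improves speed along the way. (Of the two maps, data and passage, only one needs to remain.)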
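The idea of keeping only one map can be sketched as follows: count term frequencies for the passage first, then stream the keyword file and update only the words that are already in the map. This is a simplified illustration, not the program below. It keys the map on the word itself instead of a hash pair, and `Score`, `ScoreMap`, `count_terms`, and `apply_idf` are names assumed for this sketch.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

struct Score {
    int tf = 0;
    double idf = 0.0;
    double tf_idf = 0.0;
};

using ScoreMap = std::unordered_map<std::string, Score>;

// Count raw term frequencies from a whitespace-separated, pre-segmented passage.
ScoreMap count_terms(std::istream& passage) {
    ScoreMap scores;
    std::string word;
    while (passage >> word) ++scores[word].tf;
    return scores;
}

// Stream the keyword file (a count, then "id word idf" per entry) and fill in
// IDF and TF-IDF for the words that occur in the passage. The keyword table
// itself is never stored.
void apply_idf(std::istream& keywords, ScoreMap& scores) {
    int total = 0;
    keywords >> total;
    int id;
    std::string word;
    double idf;
    for (int i = 0; i < total && (keywords >> id >> word >> idf); ++i) {
        auto it = scores.find(word);
        if (it == scores.end()) continue;   // keyword not in this passage
        assert(it->second.tf > 0);          // every stored word was seen at least once
        it->second.idf = idf;
        it->second.tf_idf = it->second.tf * idf;
    }
}
```

Words of the passage that are not keywords keep an IDF and TF-IDF of 0. Memory use is bounded by the number of distinct words in the passage, not by the size of the keyword file.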

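The actual implementation was a real pain, with all sorts of things refusing to work. In the end it worked, and I do not know how... Anyway, it has no problems for now; whether it will have any later, no idea.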

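 #include <cstdio>
 #include <iostream>
 #include <cmath>
 #include <string>
 #include <cstring>
 #include <algorithm>
 #include <map>
 using namespace std;
 typedef double lf;
 // NOTE: several numeric literals were lost from the source of this post.
 // Each gap is marked below; the program will not compile until they are filled in.
 const int mod1 = /* lost; first hash modulus */;
 const int mod2 = /* lost; second hash modulus, different from mod1 */;
 const int bin = /* lost */ << /* lost */;  // offset added to negative char values
 // Per-word statistics for the passage.
 struct TF_IDF {
     int TF;
     lf IDF, TF_IDF;
 };
 // A word identified by two polynomial hashes of its bytes.
 // Comparison uses the hashes only, so two words that collide on both are treated as one.
 struct Word {
     string st;
     int h1, h2;
     inline bool operator < (const Word &x) const {
         return h1 == x.h1 ? h2 < x.h2 : h1 < x.h1;
     }
     inline bool operator == (const Word &x) const {
         return h1 == x.h1 && h2 == x.h2;
     }
     #define x (int) st[i]
     #define Weight 3001
     inline void calc_hash() {
         int len = st.length(), i;
         // long long so that tmp * Weight cannot overflow whatever moduli are chosen.
         long long tmp;
         // Bytes of multi-byte characters can be negative as char; shift them up by bin.
         for (i = tmp = 0; i < len; ++i)
             ((tmp *= Weight) += (x < 0 ? x + bin : x)) %= mod1;
         h1 = (int) tmp;
         for (i = tmp = 0; i < len; ++i)
             ((tmp *= Weight) += (x < 0 ? x + bin : x)) %= mod2;
         h2 = (int) tmp;
     }
     #undef x
     #undef Weight
 } w;
 typedef map <Word, TF_IDF> map_for_words;
 typedef map_for_words :: iterator iter_for_words;
 map_for_words passage;
 // Read the segmented passage and count how often each word occurs.
 void read_in_passage() {
     Word w;
     freopen("E:\\test\\test.in", "r", stdin);
     while (cin >> w.st) {
         w.calc_hash();
         passage[w].TF += 1;
     }
     fclose(stdin);
 }
 // Stream the keyword file and fill in IDF and TF-IDF for words found in the passage.
 void read_in_IDF_and_work() {
     int id, tot = 0, i;
     lf IDF;
     string st;
     Word w;
     iter_for_words it;
     freopen("E:\\test\\new.dat", "r", stdin);
     ios::sync_with_stdio(false);
     cin >> tot;
     for (i = 1; i <= tot; ++i) {
         cin >> id >> w.st >> IDF;
         w.calc_hash();
         it = passage.find(w);
         if (it != passage.end()) {
             it -> second.IDF = IDF;
             it -> second.TF_IDF = (lf) it -> second.TF * it -> second.IDF;
         }
     }
     fclose(stdin);
 }
 // Output: number of distinct words, then word, TF, IDF, TF-IDF per line.
 void print() {
     iter_for_words it;
     cout << passage.size() << endl;
     for (it = passage.begin(); it != passage.end(); ++it)
         cout << it -> first.st << ' ' << it -> second.TF << ' ' << it -> second.IDF << ' ' << it -> second.TF_IDF << endl;
 }
 int main() {
     freopen("E:\\test\\test.out", "w", stdout);
     read_in_passage();
     read_in_IDF_and_work();
     print();
     return 0;
 }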

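The gotcha that nearly killed me:

The first time, when opening test.in, you must not add "ios::sync_with_stdio(false);", but the second time you must add "ios::sync_with_stdio(false);".

Otherwise the second file opens fine, but nothing can be read from it = =

Can anyone tell me what kind of trap this is? Begging the experts for an answer...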
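One likely explanation, offered as a guess because the original setup was not reproduced here: after the first read loop hits end of file, cin has both eofbit and failbit set, and freopen only swaps the file underneath stdin without touching those flags. Every later extraction then fails immediately until the flags are cleared. Calling sync_with_stdio after I/O has already happened has implementation-defined behavior, and on some standard libraries it may reset the stream state as a side effect, which would explain why it appeared to help. The sketch below shows the same effect with a string stream standing in for cin; `count_tokens` and `reread` are names made up for this example.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Drain `in` token by token. When the loop ends at end of input,
// both eofbit and failbit are set on the stream.
int count_tokens(std::istream& in) {
    int n = 0;
    std::string tok;
    while (in >> tok) ++n;
    return n;
}

// Give the stream new contents, similar to reopening the file under cin.
// The state flags are left untouched unless `reset` is true.
int reread(std::istringstream& in, const std::string& text, bool reset) {
    in.str(text);
    if (reset) {
        in.clear();          // drop eofbit/failbit so extraction works again
        assert(in.good());
    }
    return count_tokens(in);
}
```

If this is indeed the cause, a `cin.clear();` right after the second freopen should be enough, and opening each file with its own std::ifstream would sidestep the problem entirely.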
