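Introduction to TF-IDF

TF-IDF is a statistical method commonly used in NLP to evaluate how important a word is to one document within a document collection or corpus. It is typically used to extract text features, that is, keywords. The importance of a word increases in proportion to the number of times it appears in the document, and decreases in inverse proportion to how frequently it appears across the corpus.

In NLP, TF-IDF is computed as follows: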

\[tfidf = tf*idf.
\]

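Here tf is the term frequency and idf is the inverse document frequency.

tf measures how often a word occurs in a document. If a word occurs i times in a document and the document contains N words in total, then tf = i/N.

idf is the inverse document frequency. If the corpus contains n documents and a word appears in k of them, then idf is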

\[idf=\log_{2}(\frac{n}{k}).
\]

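The idf formula varies slightly from one source to another. Some add 1 to k in the denominator to avoid division by zero, and others add 1 to both the numerator and the denominator, which is a smoothing technique. This article uses the most basic idf formula, because it matches the one used in gensim.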
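The three idf variants mentioned above can be compared side by side. The following is a minimal sketch; the function names `idf_plain`, `idf_plus_one` and `idf_smoothed` are made up for illustration and do not come from any library.

```python
import math

def idf_plain(n, k):
    """Basic idf used in this article: log2(n / k). Requires k >= 1."""
    return math.log2(n / k)

def idf_plus_one(n, k):
    """Variant with 1 added to the denominator, so k = 0 does not divide by zero."""
    return math.log2(n / (k + 1))

def idf_smoothed(n, k):
    """Variant with 1 added to both numerator and denominator (smoothing)."""
    return math.log2((n + 1) / (k + 1))

n = 3  # number of documents in the corpus
for k in (1, 2, 3):
    print(k,
          round(idf_plain(n, k), 5),
          round(idf_plus_one(n, k), 5),
          round(idf_smoothed(n, k), 5))
```

Note that with the basic formula a word that appears in every document gets idf = 0, while the "+1 in the denominator" variant gives such a word a negative value.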

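Suppose the corpus contains D documents. Then the tfidf value of word i in document j is

\[tfidf_{i,j} = tf_{i,j} * idf_{i} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} * \log_{2}(\frac{D}{d_{i}}),
\]

where n_{i,j} is the number of times word i occurs in document j, and d_{i} is the number of documents that contain word i.

That is how TF-IDF is computed.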
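To make the arithmetic concrete, here is a small worked sketch on a toy corpus. The three tiny token lists are invented for illustration and are not the sample texts used later in this article.

```python
import math
from collections import Counter

# Toy corpus: three tiny, already tokenized "documents" (made up for illustration)
docs = [
    ["ball", "goal", "ball", "team"],
    ["ball", "hoop", "team", "court"],
    ["net", "ball", "net", "team"],
]
counts = [Counter(d) for d in docs]

def tfidf(word, count, counts):
    tf = count[word] / sum(count.values())    # i / N
    k = sum(1 for c in counts if word in c)   # documents containing the word
    idf = math.log2(len(counts) / k)          # log2(n / k)
    return tf * idf

for j, count in enumerate(counts, start=1):
    scores = {w: round(tfidf(w, count, counts), 5) for w in count}
    print("document", j, scores)
```

In this toy corpus "ball" and "team" occur in all three documents, so their idf is log2(3/3) = 0 and their tfidf is 0 everywhere, whereas "net" occurs twice among the four words of the third document only, giving 0.5 * log2(3).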

Sample Texts and Preprocessing

We will use the following three sample texts:

text1 ="""
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.
Unqualified, the word football is understood to refer to whichever form of football is the most popular
in the regional context in which the word appears. Sports commonly called football in certain places
include association football (known as soccer in some countries); gridiron football (specifically American
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union);
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court,
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter)
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

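These three articles introduce football, basketball and volleyball respectively; together they form our corpus.

Next comes the text preprocessing.

We first remove the line breaks from the text, then split it into sentences, tokenize each sentence, and strip the punctuation. The complete Python code is below; the input parameter is the article text: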

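import nltk
import string

# Text preprocessing
# Requires the NLTK tokenizer data, e.g. nltk.download('punkt')
# Split the text into sentences, tokenize, and strip punctuation
def get_tokens(text):
    # Join lines with a space so that words on adjacent lines are not glued together
    text = text.replace('\n', ' ')
    sents = nltk.sent_tokenize(text)  # sentence splitting
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # tokenization
            if word not in string.punctuation:  # drop punctuation
                tokens.append(word)
    return tokens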

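Next, we remove the stopwords from the article and count the occurrences of each remaining word. The complete Python code is below; the input parameter is the article text: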

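from collections import Counter  # counting
from nltk.corpus import stopwords  # stopwords; requires nltk.download('stopwords')

# Remove the stopwords from the raw text
# and build the count dictionary, i.e. the number of occurrences of each word
def make_count(text):
    tokens = get_tokens(text)
    stop = set(stopwords.words('english'))
    filtered = [w for w in tokens if w not in stop]  # drop stopwords
    count = Counter(filtered)
    return count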

Taking text3 as an example, the resulting count dictionary is:

Counter({'ball': 4, 'net': 4, 'teammate': 3, 'returned': 2, 'bat': 2, 'court': 2, 'team': 2, 'across': 2, 'touches': 2, 'back': 2, 'players': 2, 'touch': 1, 'must': 1, 'usually': 1, 'side': 1, 'player': 1, 'area': 1, 'Volleyball': 1, 'hands': 1, 'may': 1, 'toward': 1, 'A': 1, 'third': 1, 'two': 1, 'six': 1, 'opposing': 1, 'within': 1, 'prevent': 1, 'allowed': 1, '’': 1, 'playing': 1, 'played': 1, 'volley': 1, 'surface—that': 1, 'volleys': 1, 'opponents': 1, 'use': 1, 'high': 1, 'teams': 1, 'bats': 1, 'To': 1, 'game': 1, 'make': 1, 'forth': 1, 'three': 1, 'trying': 1})
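The counts above still contain 'A' and 'To'. A likely explanation is that the NLTK English stopword list is all lowercase while the membership test is case-sensitive. The sketch below illustrates the effect with a tiny hand-picked stopword set (an assumption standing in for `stopwords.words('english')`) and a made-up token list.

```python
from collections import Counter

# Tiny hand-picked stopword set, standing in for stopwords.words('english')
STOPWORDS = {"a", "to", "the", "is", "of"}

# Made-up tokens for illustration
tokens = ["A", "team", "is", "allowed", "To", "prevent", "the", "ball", "a", "ball"]

# Case-sensitive filtering, as in make_count: 'A' and 'To' slip through
case_sensitive = Counter(w for w in tokens if w not in STOPWORDS)

# Lowercasing first removes them as well
case_folded = Counter(w.lower() for w in tokens if w.lower() not in STOPWORDS)

print(case_sensitive)
print(case_folded)
```

Whether to lowercase before filtering is a design choice; doing so would also merge words such as 'Football' and 'football', which changes the counts.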

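TF-IDF in Gensim

After preprocessing, we obtain a count dictionary for each of the three sample texts, holding the number of occurrences of each word in that text. Next, we use the TF-IDF model already implemented in gensim to print the three words with the highest TF-IDF in each article, together with their tfidf values. The complete code is as follows: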

from nltk.corpus import stopwords  # stopwords
from gensim import corpora, models, matutils

# Training with gensim's Tfidf model
def get_words(text):
    tokens = get_tokens(text)
    stop = set(stopwords.words('english'))
    filtered = [w for w in tokens if w not in stop]
    return filtered

# Get the filtered token lists for the three texts
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]

# Train a TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v: k for k, v in dictionary.token2id.items()}  # id -> word
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# Output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d" % (i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)  # list of (id, score)
    for num, score in sorted_words[:3]:
        print(" Word: %s, TF-IDF: %s" % (new_dict[num], round(score, 5)))

The output is:

Training by gensim Tfidf Model.......

Top words in document 1
Word: football, TF-IDF: 0.84766
Word: rugby, TF-IDF: 0.21192
Word: known, TF-IDF: 0.14128
Top words in document 2
Word: play, TF-IDF: 0.29872
Word: cm, TF-IDF: 0.19915
Word: diameter, TF-IDF: 0.19915
Top words in document 3
Word: net, TF-IDF: 0.45775
Word: teammate, TF-IDF: 0.34331
Word: across, TF-IDF: 0.22888

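The results are largely in line with expectations: the keywords football and rugby are extracted from the football article, play and cm from the basketball article, and net and teammate from the volleyball article.

Implementing the TF-IDF Model by Hand

With the understanding of TF-IDF developed above, we can implement it ourselves, which is the best way to learn an algorithm!

Below is my own implementation of TF-IDF (it continues from the preprocessing code):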

import math

# Compute tf
def tf(word, count):
    return count[word] / sum(count.values())

# Count how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# Compute idf
def idf(word, count_list):
    return math.log2(len(count_list) / n_containing(word, count_list))  # log base 2

# Compute tf-idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d" % (i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)  # list of (word, score)
    # sorted_words = matutils.unitvec(sorted_words)
    for word, score in sorted_words[:3]:
        print(" Word: %s, TF-IDF: %s" % (word, round(score, 5)))

The output is:

Training by original algorithm......

Top words in document 1
Word: football, TF-IDF: 0.30677
Word: rugby, TF-IDF: 0.07669
Word: known, TF-IDF: 0.05113
Top words in document 2
Word: play, TF-IDF: 0.05283
Word: inches, TF-IDF: 0.03522
Word: worth, TF-IDF: 0.03522
Top words in document 3
Word: net, TF-IDF: 0.10226
Word: teammate, TF-IDF: 0.07669
Word: across, TF-IDF: 0.05113
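Several words in document 2 share the same score (0.03522), so which of them make the top three depends on how ties are ordered. Python's `sorted` is stable, so tied items keep the order in which they were fed in. The sketch below uses four scores copied from the output above, in an arbitrary insertion order chosen for illustration, and shows one possible way to make the tie-break deterministic (alphabetical order, which is my own choice, not something the original code does).

```python
# Scores copied from the output above; insertion order is arbitrary here
scores = {"inches": 0.03522, "cm": 0.03522, "worth": 0.03522, "play": 0.05283}

# Stable sort: tied words keep their insertion order
by_insertion = sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Deterministic alternative: sort by descending score, then alphabetically
deterministic = sorted(scores.items(), key=lambda x: (-x[1], x[0]))

print([w for w, _ in by_insertion])   # ['play', 'inches', 'cm', 'worth']
print([w for w, _ in deterministic])  # ['play', 'cm', 'inches', 'worth']
```

On older Python versions the iteration order of a dictionary could also vary between runs, which would be consistent with different tied words showing up in different runs.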

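As we can see, the keywords extracted by the hand-written TF-IDF model agree with those from gensim. The last two words for the basketball article differ only because these words have identical tfidf values, so which of the tied words is picked differs. One problem remains, though: the computed tfidf values are different. Why is that?

The answer is in the gensim source code that computes the tf-idf values (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/tfidfmodel.py):

In other words, gensim normalizes the resulting tf-idf vector, converting it into a unit vector. We therefore need to add this normalization step to the code above, as follows: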

import numpy as np

# Normalize a list of (word, score) pairs so the scores form a unit vector
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst) * np.array(lst)))
    unit_vector = [(item[0], item[1] / L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d" % (i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)  # list of (word, score)
    sorted_words = unitvec(sorted_words)  # normalize
    for word, score in sorted_words[:3]:
        print(" Word: %s, TF-IDF: %s" % (word, round(score, 5)))

The output is:

Training by original algorithm......

Top words in document 1
Word: football, TF-IDF: 0.84766
Word: rugby, TF-IDF: 0.21192
Word: known, TF-IDF: 0.14128
Top words in document 2
Word: play, TF-IDF: 0.29872
Word: shooting, TF-IDF: 0.19915
Word: diameter, TF-IDF: 0.19915
Top words in document 3
Word: net, TF-IDF: 0.45775
Word: teammate, TF-IDF: 0.34331
Word: back, TF-IDF: 0.22888

The tfidf values now agree with those obtained from gensim!
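One reason normalization is enough to reconcile the two results: dividing every term count by the document length only rescales the whole vector, and rescaling disappears once the vector is normalized to unit length. As far as I know, gensim's default local weight is the raw term count rather than count divided by length; treating that as an assumption, the sketch below checks that both choices give the same unit vector. All numbers are made up for illustration.

```python
import math

def unit(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

counts = [4, 3, 2]                  # raw term counts in one document (made up)
idfs = [1.58496, 1.58496, 0.58496]  # idf values for the same terms (made up)
total = 20                          # document length (made up)

raw = [c * i for c, i in zip(counts, idfs)]          # tf taken as the raw count
rel = [c / total * i for c, i in zip(counts, idfs)]  # tf taken as count / length

print([round(v, 5) for v in unit(raw)])
print([round(v, 5) for v in unit(rel)])
```

Both printed vectors are the same up to floating-point error, because `rel` is just `raw` divided by the constant `total`.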

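Summary

Gensim is a renowned Python module for NLP, and its source code is well worth reading when you have the time! In future posts we will continue to introduce other applications of TF-IDF. Discussion is welcome.

Note: I have now opened a WeChat official account: Python爬虫与算法 (WeChat ID: easy_web_scrape). You are welcome to follow it.

The complete code for this article is as follows: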

import nltk
import math
import string
from nltk.corpus import stopwords  # stopwords
from collections import Counter  # counting
from gensim import corpora, models, matutils

text1 = """
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.
Unqualified, the word football is understood to refer to whichever form of football is the most popular
in the regional context in which the word appears. Sports commonly called football in certain places
include association football (known as soccer in some countries); gridiron football (specifically American
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union);
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court,
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter)
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

# Text preprocessing
# Split the text into sentences, tokenize, and strip punctuation
def get_tokens(text):
    # Join lines with a space so that words on adjacent lines are not glued together
    text = text.replace('\n', ' ')
    sents = nltk.sent_tokenize(text)  # sentence splitting
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # tokenization
            if word not in string.punctuation:  # drop punctuation
                tokens.append(word)
    return tokens

# Remove the stopwords from the raw text
# and build the count dictionary, i.e. the number of occurrences of each word
def make_count(text):
    tokens = get_tokens(text)
    stop = set(stopwords.words('english'))
    filtered = [w for w in tokens if w not in stop]  # drop stopwords
    count = Counter(filtered)
    return count

# Compute tf
def tf(word, count):
    return count[word] / sum(count.values())

# Count how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# Compute idf
def idf(word, count_list):
    return math.log2(len(count_list) / n_containing(word, count_list))  # log base 2

# Compute tf-idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

import numpy as np

# Normalize a list of (word, score) pairs so the scores form a unit vector
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst) * np.array(lst)))
    unit_vector = [(item[0], item[1] / L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d" % (i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)  # list of (word, score)
    sorted_words = unitvec(sorted_words)  # normalize
    for word, score in sorted_words[:3]:
        print(" Word: %s, TF-IDF: %s" % (word, round(score, 5)))

# Training with gensim's Tfidf model
def get_words(text):
    tokens = get_tokens(text)
    stop = set(stopwords.words('english'))
    filtered = [w for w in tokens if w not in stop]
    return filtered

# Get the filtered token lists for the three texts
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]

# Train a TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v: k for k, v in dictionary.token2id.items()}  # id -> word
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# Output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d" % (i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)  # list of (id, score)
    for num, score in sorted_words[:3]:
        print(" Word: %s, TF-IDF: %s" % (new_dict[num], round(score, 5)))

"""
Output:

Training by original algorithm......

Top words in document 1
Word: football, TF-IDF: 0.84766
Word: rugby, TF-IDF: 0.21192
Word: word, TF-IDF: 0.14128
Top words in document 2
Word: play, TF-IDF: 0.29872
Word: inches, TF-IDF: 0.19915
Word: points, TF-IDF: 0.19915
Top words in document 3
Word: net, TF-IDF: 0.45775
Word: teammate, TF-IDF: 0.34331
Word: bat, TF-IDF: 0.22888

Training by gensim Tfidf Model.......

Top words in document 1
Word: football, TF-IDF: 0.84766
Word: rugby, TF-IDF: 0.21192
Word: known, TF-IDF: 0.14128
Top words in document 2
Word: play, TF-IDF: 0.29872
Word: cm, TF-IDF: 0.19915
Word: diameter, TF-IDF: 0.19915
Top words in document 3
Word: net, TF-IDF: 0.45775
Word: teammate, TF-IDF: 0.34331
Word: across, TF-IDF: 0.22888
"""
