【346】TF-IDF

>>> from sklearn.feature_extraction.text import TfidfTransformer

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> corpus=["I come to China to travel",

    "This is a car polupar in China",

    "I love tea and Apple ",

    "The work is to write some papers in science"]

>>> vectorizer=CountVectorizer()

>>> transformer = TfidfTransformer()

>>> tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))

>>> print(tfidf)

  (0, 16)	0.4424621378947393

  (0, 15)	0.697684463383976

  (0, 4)	0.4424621378947393

  (0, 3)	0.348842231691988

  (1, 14)	0.45338639737285463

  (1, 9)	0.45338639737285463

  (1, 6)	0.3574550433419527

  (1, 5)	0.3574550433419527

  (1, 3)	0.3574550433419527

  (1, 2)	0.45338639737285463

  (2, 12)	0.5

  (2, 7)	0.5

  (2, 1)	0.5

  (2, 0)	0.5

  (3, 18)	0.3565798233381452

  (3, 17)	0.3565798233381452

  (3, 15)	0.2811316284405006

  (3, 13)	0.3565798233381452

  (3, 11)	0.3565798233381452

  (3, 10)	0.3565798233381452

  (3, 8)	0.3565798233381452

  (3, 6)	0.2811316284405006

  (3, 5)	0.2811316284405006

>>> print(vectorizer.get_feature_names())    # removed in scikit-learn 1.2; use get_feature_names_out() on newer versions

['and', 'apple', 'car', 'china', 'come', 'in', 'is', 'love', 'papers', 'polupar', 'science', 'some', 'tea', 'the', 'this', 'to', 'travel', 'work', 'write']

Note: (0, 16) refers to the first text (row 0) and the term at index 16, which is "travel"; the other entries read the same way.
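To make the sparse printout easier to read, the (row, column) pairs can be mapped back to their terms. The following is a minimal sketch that hard-codes the first row of the output and the feature list shown above; `decode_entries` is a hypothetical helper name, not part of scikit-learn.

```python
# Feature names as printed by vectorizer.get_feature_names() above.
feature_names = ['and', 'apple', 'car', 'china', 'come', 'in', 'is', 'love',
                 'papers', 'polupar', 'science', 'some', 'tea', 'the', 'this',
                 'to', 'travel', 'work', 'write']

# Non-zero entries of document 0, copied from print(tfidf): (row, column, value).
entries = [(0, 16, 0.4424621378947393),
           (0, 15, 0.697684463383976),
           (0, 4, 0.4424621378947393),
           (0, 3, 0.348842231691988)]


def decode_entries(entries, feature_names):
    """Hypothetical helper: replace each column index with its term."""
    return [(row, feature_names[col], value) for row, col, value in entries]


decoded = decode_entries(entries, feature_names)
for row, term, value in decoded:
    print(row, term, value)
```

With a live `vectorizer`, the same mapping is available in the other direction through its `vocabulary_` attribute, which maps each term to its column index.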

Continuing from the above, we now retrieve the tf-idf value of each term. The tfidf variable holds a (4, 19) matrix: one row per sentence and one column per term.

>>> tfidf_array = tfidf.toarray()    # dense array; each row is converted to a list in the loop below

>>> names_list = vectorizer.get_feature_names()    # list of feature names (get_feature_names_out() in scikit-learn >= 1.2)

>>> for i in range(0, len(corpus)):

	print(corpus[i],'\n')

	tmp_list = tfidf_array[i].tolist()

	for j in range(0, len(names_list)):

		if tmp_list[j] != 0:

			if len(names_list[j])>=7:

				print(names_list[j],'\t',tmp_list[j])

			else:

				print(names_list[j],'\t\t',tmp_list[j])

	print('')

I come to China to travel 

china 		 0.348842231691988

come 		 0.4424621378947393

to 		 0.697684463383976

travel 		 0.4424621378947393

This is a car polupar in China 

car 		 0.45338639737285463

china 		 0.3574550433419527

in 		 0.3574550433419527

is 		 0.3574550433419527

polupar 	 0.45338639737285463

this 		 0.45338639737285463

I love tea and Apple  

and 		 0.5

apple 		 0.5

love 		 0.5

tea 		 0.5

The work is to write some papers in science 

in 		 0.2811316284405006

is 		 0.2811316284405006

papers 		 0.3565798233381452

science 	 0.3565798233381452

some 		 0.3565798233381452

the 		 0.3565798233381452

to 		 0.2811316284405006

work 		 0.3565798233381452

write 		 0.3565798233381452

>>>
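The check on term length that switches between one and two tabs in the loop above can be avoided by padding each term to a fixed width. A minimal sketch, assuming the non-zero values of one row are already available as (term, value) pairs (hard-coded here from the second document's output); `format_row` is a hypothetical helper name and the width of 12 is an arbitrary choice.

```python
def format_row(pairs, width=12):
    """Return one line per (term, value) pair, with the term left-padded to `width`."""
    return [f"{term:<{width}}{value}" for term, value in pairs]


# A few non-zero entries of the second document, copied from the output above.
row = [("car", 0.45338639737285463),
       ("china", 0.3574550433419527),
       ("polupar", 0.45338639737285463)]

lines = format_row(row)
print("\n".join(lines))
```

Every value then starts in the same column regardless of term length, as long as no term is longer than the chosen width.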

Getting the TF (Term Frequency), i.e. the raw term counts produced by CountVectorizer

>>> X = vectorizer.fit_transform(corpus)

>>> X.toarray()

array([[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0],

       [0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],

       [1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],

       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1]],

      dtype=int64)

>>> vector_array = X.toarray()

>>> for i in range(0, len(corpus)):

	print(corpus[i],'\n')

	tmp_list = vector_array[i].tolist()

	for j in range(0, len(names_list)):

		if tmp_list[j] != 0:

			if len(names_list[j])>=7:

				print(names_list[j],'\t',tmp_list[j])

			else:

				print(names_list[j],'\t\t',tmp_list[j])

	print('')

I come to China to travel 

china 		 1

come 		 1

to 		 2

travel 		 1

This is a car polupar in China 

car 		 1

china 		 1

in 		 1

is 		 1

polupar 	 1

this 		 1

I love tea and Apple  

and 		 1

apple 		 1

love 		 1

tea 		 1

The work is to write some papers in science 

in 		 1

is 		 1

papers 		 1

science 	 1

some 		 1

the 		 1

to 		 1

work 		 1

write 		 1

>>>
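The counts above and the tf-idf values printed earlier are connected by the default settings of TfidfTransformer (smoothed idf and L2 row normalization). The sketch below reproduces that calculation in plain Python from the count matrix. It assumes the defaults `smooth_idf=True`, `norm='l2'` and `sublinear_tf=False`; `tfidf_row` and `manual` are names introduced only for this illustration.

```python
import math

# Raw term counts for the four documents, copied from X.toarray() above.
counts = [
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf (assumed default smooth_idf=True): ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_row(row):
    """Weight the counts by idf, then scale the row to unit L2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]


manual = [tfidf_row(row) for row in counts]

# Compare with the values printed by print(tfidf) earlier.
print(manual[0][15])  # "to" in the first document
print(manual[3][5])   # "in" in the fourth document
```

Because "to" occurs twice in the first document, its weight is doubled before normalization, which is why it has the largest value in that row.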
