Topic extraction with NMF (non-negative matrix factorization) and LDA (latent Dirichlet allocation) in scikit-learn

Original example (English): http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html

This is an example of using NMF and LDA to extract topics from a corpus.

The inputs are a tf-idf matrix (for NMF) and a tf matrix of raw term counts (for LDA).
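To make the difference between the two inputs concrete, here is a minimal dependency-free sketch that builds both representations for a toy corpus. The three documents and the helper names are made up for illustration. The smoothed idf formula and the L2 row normalization are meant to mirror TfidfVectorizer's default settings as I understand them; the whitespace tokenization is a simplification (no stop-word removal, no min_df/max_df filtering).

```python
import math
from collections import Counter

# Hypothetical toy corpus, not part of the 20 newsgroups data.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]


def term_counts(doc):
    """Raw term counts (the 'tf' representation fed to LDA)."""
    return Counter(doc.split())


def tf_and_tfidf(docs):
    """Return per-document raw counts and L2-normalized tf-idf weights."""
    counts = [term_counts(d) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in df}

    rows = []
    for c in counts:
        weighted = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weighted.values()))
        rows.append({t: w / norm for t, w in weighted.items()})
    return counts, rows


tf_rows, tfidf_rows = tf_and_tfidf(docs)
print(tf_rows[0])
print(tfidf_rows[0])
```

In the first document "cat" and "sat" both occur once, so their raw counts are equal, but "cat" gets the larger tf-idf weight because it appears in fewer documents.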

The output is a list of topics, each represented by a list of words.

With the default parameters (n_samples/n_features/n_topics), the example takes a few tens of seconds to run.

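You can try changing the size of the problem, but keep in mind that the time complexity of NMF is polynomial, while the time complexity of LDA is proportional to (n_samples * iterations).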
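As a rough illustration of where the NMF cost comes from, the sketch below factorizes a tiny non-negative matrix V into W and H with the classic Lee and Seung multiplicative update rules. This is a swapped-in textbook algorithm chosen for brevity, not necessarily the solver scikit-learn runs internally, and the matrix and function names are hypothetical. Every iteration multiplies matrices whose sizes depend on the number of documents, the number of terms and the number of topics, which is why enlarging the problem quickly increases the run time.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    ) ** 0.5


def nmf_multiplicative(v, n_components, n_iter=200, seed=0, eps=1e-9):
    """Approximate V (docs x terms) as W (docs x topics) times H (topics x terms)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_components)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(n_components)]
    errors = [frobenius_error(v, w, h)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(n_components)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_components)]
             for i in range(n_rows)]
        errors.append(frobenius_error(v, w, h))
    return w, h, errors


# Hypothetical 4-document x 5-term count matrix with two loose "topics".
V = [
    [3, 2, 0, 0, 1],
    [2, 3, 0, 0, 0],
    [0, 0, 4, 1, 0],
    [0, 1, 3, 2, 0],
]
W, H, errors = nmf_multiplicative(V, n_components=2)
print("initial error: %.3f, final error: %.3f" % (errors[0], errors[-1]))
```

Each row of H plays the role of one row of model.components_ in the full example: a vector of non-negative term weights for one topic.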

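Notes:

(1) The exit() call that follows the NMF fit in the code (line 61 of the original listing) has to be commented out; otherwise the script stops before printing any results.

(2) The first time the code runs, it downloads the newsgroups data from the internet and saves it in a cache directory. Later runs reuse the cached copy and do not download it again.

(3) For the parameter settings of NMF and LDA, see the official NMF and LatentDirichletAllocation documentation on the scikit-learn website.

(4) The code was written for scikit-learn 0.17.1. Later releases renamed the n_topics argument of LatentDirichletAllocation to n_components, replaced get_feature_names() with get_feature_names_out(), and split the alpha argument of NMF into alpha_W and alpha_H, so the listing needs small adjustments on a current installation.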

Code:

 # Author: Olivier Grisel <olivier.grisel@ensta.org>

 #         Lars Buitinck <L.J.Buitinck@uva.nl>

 #         Chyi-Kwei Yau <chyikwei.yau@gmail.com>

 # License: BSD 3 clause

 from __future__ import print_function

 from time import time

 from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

 from sklearn.decomposition import NMF, LatentDirichletAllocation

 from sklearn.datasets import fetch_20newsgroups

 n_samples = 2000

 n_features = 1000

 n_topics = 10

 n_top_words = 20

 def print_top_words(model, feature_names, n_top_words):

     for topic_idx, topic in enumerate(model.components_):

         print("Topic #%d:" % topic_idx)

         print(" ".join([feature_names[i]

                         for i in topic.argsort()[:-n_top_words - 1:-1]]))

     print()

 # Load the 20 newsgroups dataset and vectorize it. We use a few heuristics

 # to filter out useless terms early on: the posts are stripped of headers,

 # footers and quoted replies, and common English words, words occurring in

 # only one document or in at least 95% of the documents are removed.

 print("Loading dataset...")

 t0 = time()

 dataset = fetch_20newsgroups(shuffle=True, random_state=1,

                              remove=('headers', 'footers', 'quotes'))

 data_samples = dataset.data  # all posts are used; n_samples only appears in the log messages below

 print("done in %0.3fs." % (time() - t0))

 # Use tf-idf features for NMF.

 print("Extracting tf-idf features for NMF...")

 # Note: max_features is commented out below, so the tf-idf vocabulary is not capped at n_features.

 tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, #max_features=n_features,

                                    stop_words='english')

 t0 = time()

 tfidf = tfidf_vectorizer.fit_transform(data_samples)

 print("done in %0.3fs." % (time() - t0))

 # Use tf (raw term count) features for LDA.

 print("Extracting tf features for LDA...")

 tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,

                                 stop_words='english')

 t0 = time()

 tf = tf_vectorizer.fit_transform(data_samples)

 print("done in %0.3fs." % (time() - t0))

 # Fit the NMF model

 print("Fitting the NMF model with tf-idf features,"

       "n_samples=%d and n_features=%d..."

       % (n_samples, n_features))

 t0 = time()

 nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)

 # exit()  # disabled: an active exit() here ends the script before any topics are printed

 print("done in %0.3fs." % (time() - t0))

 print("\nTopics in NMF model:")

 tfidf_feature_names = tfidf_vectorizer.get_feature_names()

 print_top_words(nmf, tfidf_feature_names, n_top_words)

 print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."

       % (n_samples, n_features))

 lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,

                                 learning_method='online', learning_offset=50.,

                                 random_state=0)

 t0 = time()

 lda.fit(tf)

 print("done in %0.3fs." % (time() - t0))

 print("\nTopics in LDA model:")

 tf_feature_names = tf_vectorizer.get_feature_names()

 print_top_words(lda, tf_feature_names, n_top_words)
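
The print_top_words helper in the listing above picks the highest-weighted words of a topic with the slice argsort()[:-n_top_words - 1:-1]. The sketch below reproduces that idiom without NumPy, using sorted() in place of argsort(), so the indexing can be followed step by step. The feature names and weights are invented for illustration.

```python
def top_word_indices(topic_weights, n_top_words):
    """Indices of the n_top_words largest weights, largest first."""
    # Equivalent of numpy's argsort(): indices ordered by ascending weight.
    ascending = sorted(range(len(topic_weights)), key=lambda i: topic_weights[i])
    # Walk backwards from the last (largest) index and stop after n_top_words items.
    return ascending[:-n_top_words - 1:-1]


# Hypothetical vocabulary and one topic's weight vector.
feature_names = ["drive", "god", "space", "windows", "key"]
topic = [0.1, 2.5, 0.0, 0.7, 1.3]

top_words = [feature_names[i] for i in top_word_indices(topic, 3)]
print(" ".join(top_words))  # god key windows
```

The negative step reverses the ascending order, and the stop value -n_top_words - 1 is exclusive, so exactly n_top_words indices are returned (or all of them when the vocabulary is smaller).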

Output:

Loading dataset...

done in 2.222s.

Extracting tf-idf features for NMF...

done in 2.730s.

Extracting tf features for LDA...

done in 2.702s.

Fitting the NMF model with tf-idf features,n_samples=2000 and n_features=1000...

done in 1.904s.

Topics in NMF model:

Topic #0:

don just people think like know good time right ve say did make really way want going new year ll

Topic #1:

windows thanks file card does dos mail files know program use advance hi window help software looking ftp video pc

Topic #2:

drive scsi ide drives disk controller hard floppy bus hd cd boot mac cable card isa rom motherboard mb internal

Topic #3:

key chip encryption clipper keys escrow government algorithm security secure encrypted public nsa des enforcement law privacy bit use secret

Topic #4:

00 sale 50 shipping 20 10 price 15 new 25 30 dos offer condition 40 cover asking 75 01 interested

Topic #5:

armenian armenians turkish genocide armenia turks turkey soviet people muslim azerbaijan russian greek argic government serdar kurds population ottoman million

Topic #6:

god jesus bible christ faith believe christians christian heaven sin life hell church truth lord does say belief people existence

Topic #7:

mouse driver keyboard serial com1 port bus com3 irq button com sys microsoft ball problem modem adb drivers card com2

Topic #8:

space nasa shuttle launch station sci gov orbit moon earth lunar satellite program mission center cost research data solar mars

Topic #9:

msg food chinese flavor eat glutamate restaurant foods reaction taste restaurants salt effects carl brain people ingredients natural causes olney

Fitting LDA models with tf features, n_samples=2000 and n_features=1000...

done in 22.548s.

Topics in LDA model:

Topic #0:

government people mr law gun state president states public use right rights national new control american security encryption health united

Topic #1:

drive card disk bit scsi use mac memory thanks pc does video hard speed apple problem used data monitor software

Topic #2:

said people armenian armenians turkish did saw went came women killed children turkey told dead didn left started greek war

Topic #3:

year good just time game car team years like think don got new play games ago did season better ll

Topic #4:

10 00 15 25 12 11 20 14 17 16 db 13 18 24 30 19 27 50 21 40

Topic #5:

windows window program version file dos use files available display server using application set edu motif package code ms software

Topic #6:

edu file space com information mail data send available program ftp email entry info list output nasa address anonymous internet

Topic #7:

ax max b8f g9v a86 pl 145 1d9 0t 34u 1t 3t giz bhj wm 2di 75u 2tm bxn 7ey

Topic #8:

god people jesus believe does say think israel christian true life jews did bible don just know world way church

Topic #9:

don know like just think ve want does use good people key time way make problem really work say need
