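English original: http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html

This example uses NMF (non-negative matrix factorization) and LDA (latent Dirichlet allocation) to extract topics from a corpus of documents.

The input is a tf-idf matrix for NMF and a tf (raw term count) matrix for LDA.

The output is a list of topics, each represented as a list of words.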
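
To make the two input representations concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. The documents, the whitespace tokenization, and the helper name l2_normalize are assumptions for illustration only. The idf formula is the smoothed variant documented for TfidfVectorizer's default settings, but this is not the library's implementation and it skips stop-word and max_df/min_df filtering.

```python
import math
from collections import Counter

# Toy corpus; the documents and the whitespace tokenization are made up.
docs = [
    "space shuttle launch",
    "space station orbit",
    "disk drive controller",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for doc in tokenized for word in doc})
counts = [Counter(doc) for doc in tokenized]

# tf matrix: raw term counts per document (the kind of input LDA gets).
tf = [[c[word] for word in vocabulary] for c in counts]

# Document frequency: number of documents that contain each word.
n_docs = len(docs)
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed idf: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def l2_normalize(row):
    """Scale a row to unit Euclidean length (all-zero rows are left alone)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else row


# tf-idf matrix: counts reweighted by idf, each row scaled to unit length
# (the kind of input NMF gets).
tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in tf]

space = vocabulary.index("space")
shuttle = vocabulary.index("shuttle")
print(tf[0][space], tf[0][shuttle])         # identical raw counts
print(tfidf[0][space] < tfidf[0][shuttle])  # "space" occurs in two documents
```

Both words occur once in the first document, so their tf entries are equal, while tf-idf gives the lower weight to "space" because it appears in more documents.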

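With the default parameters (n_samples / n_features / n_topics), the example runs in a few tens of seconds.

You can try changing the problem size, but keep in mind that the time complexity of NMF is polynomial, while that of LDA is proportional to (n_samples * iterations).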
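
For intuition about what NMF computes and why each iteration gets more expensive as the matrices grow, the sketch below factors a tiny non-negative matrix V into W (documents x topics) and H (topics x words). It uses the classic multiplicative update rule, chosen here only because it is short. It is an illustrative stand-in and not necessarily what scikit-learn's solver does, and the matrix, function names, and iteration count are all made up.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf_multiplicative(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative V into W and H with multiplicative updates."""
    rng = random.Random(seed)
    n_docs, n_words = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_words)] for _ in range(n_topics)]
    errors = [squared_error(v, w, h)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps)
              for j in range(n_words)] for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps)
              for k in range(n_topics)] for i in range(n_docs)]
        errors.append(squared_error(v, w, h))
    return w, h, errors


# Made-up term matrix: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
v = [[2, 1, 0, 0],
     [1, 2, 0, 0],
     [0, 0, 3, 1],
     [0, 0, 1, 2]]
w, h, errors = nmf_multiplicative(v, n_topics=2)
print("error before: %.3f, after: %.3f" % (errors[0], errors[-1]))
for k, row in enumerate(h):
    print("topic %d word weights: %s" % (k, [round(x, 2) for x in row]))
```

Each row of H plays the role of one topic: its largest entries mark the words that the topic weights most heavily. Every iteration is a handful of matrix products, so the cost per iteration grows with the number of documents, words, and topics.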

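A few notes:

(1) The code on line 61 (the exit() call right after the NMF fit) must be commented out, otherwise the script stops before printing any results.

(2) On the first run, the program downloads the newsgroup data from the internet and saves it in a cache directory; later runs reuse the cache instead of downloading again.

(3) For the NMF and LDA parameter settings, see the scikit-learn website: [NMF official documentation] [LDA official documentation].

(4) This code targets scikit-learn 0.17.1.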
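
The download-once behaviour described in note (2) follows a common caching pattern, sketched below in plain Python. This is not scikit-learn's implementation: fetch_with_cache, fake_download, and the file name are hypothetical, and a temporary directory stands in for the real cache. In scikit-learn itself, fetch_20newsgroups accepts a data_home argument for choosing the cache location.

```python
import os
import tempfile


def fetch_with_cache(cache_dir, filename, download):
    """Return the bytes for `filename`, calling `download` only on a cache miss."""
    path = os.path.join(cache_dir, filename)
    if os.path.exists(path):
        # Cache hit: read the stored copy, no download needed.
        with open(path, "rb") as f:
            return f.read()
    # Cache miss: download once and store the result for later runs.
    data = download()
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return data


calls = []


def fake_download():
    """Stand-in for a network download; records how often it is called."""
    calls.append(1)
    return b"newsgroup posts"


with tempfile.TemporaryDirectory() as cache_dir:
    first = fetch_with_cache(cache_dir, "20news.bin", fake_download)
    second = fetch_with_cache(cache_dir, "20news.bin", fake_download)

print(len(calls))  # 1: the second call was served from the cache
```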

Code:

# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Lars Buitinck <L.J.Buitinck@uva.nl>
#         Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

# Written against scikit-learn 0.17.1. Later releases renamed some of the
# calls used below (for example, LatentDirichletAllocation's n_topics became
# n_components), so check the documentation for your installed version.

from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000
n_topics = 10
n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
    # Each row of model.components_ holds one topic's weight for every word.
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        # argsort() is ascending, so the reversed slice yields the indices of
        # the n_top_words highest-weighted words, largest first.
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()


# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.

print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
# Note: every loaded post is used; n_samples only appears in the progress
# messages printed below.
data_samples = dataset.data
print("done in %0.3fs." % (time() - t0))

# Use tf-idf features for NMF.
# max_features is commented out here, so the tf-idf vocabulary is not capped
# at n_features.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,  # max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

# Fit the NMF model
print("Fitting the NMF model with tf-idf features,"
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
# exit()  # keep this commented out, otherwise the script stops before printing any topics
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

# Fit the LDA model
print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online', learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
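
The slice used in print_top_words, argsort()[:-n_top_words - 1:-1], is easy to misread, so here is a small pure-Python sketch of what it selects. The vocabulary and topic weights are made up, and sorted(range(...)) stands in for NumPy's argsort().

```python
def top_word_indices(weights, n_top_words):
    """Indices of the n_top_words largest weights, largest first."""
    # Indices sorted by ascending weight, like numpy's argsort().
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the last index and stop after n_top_words items.
    return ascending[:-n_top_words - 1:-1]


feature_names = ["drive", "god", "scsi", "space", "disk"]  # made-up vocabulary
topic = [0.9, 0.0, 0.7, 0.1, 0.8]                          # made-up topic weights

top = top_word_indices(topic, 3)
print(" ".join(feature_names[i] for i in top))  # drive disk scsi
```

The negative step reverses the ascending order, and the stop value -n_top_words - 1 cuts the walk off after exactly n_top_words entries.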

Results:

Loading dataset...
done in 2.222s.
Extracting tf-idf features for NMF...
done in 2.730s.
Extracting tf features for LDA...
done in 2.702s.
Fitting the NMF model with tf-idf features,n_samples=2000 and n_features=1000...
done in 1.904s.

Topics in NMF model:
Topic #0:
don just people think like know good time right ve say did make really way want going new year ll
Topic #1:
windows thanks file card does dos mail files know program use advance hi window help software looking ftp video pc
Topic #2:
drive scsi ide drives disk controller hard floppy bus hd cd boot mac cable card isa rom motherboard mb internal
Topic #3:
key chip encryption clipper keys escrow government algorithm security secure encrypted public nsa des enforcement law privacy bit use secret
Topic #4:
00 sale 50 shipping 20 10 price 15 new 25 30 dos offer condition 40 cover asking 75 01 interested
Topic #5:
armenian armenians turkish genocide armenia turks turkey soviet people muslim azerbaijan russian greek argic government serdar kurds population ottoman million
Topic #6:
god jesus bible christ faith believe christians christian heaven sin life hell church truth lord does say belief people existence
Topic #7:
mouse driver keyboard serial com1 port bus com3 irq button com sys microsoft ball problem modem adb drivers card com2
Topic #8:
space nasa shuttle launch station sci gov orbit moon earth lunar satellite program mission center cost research data solar mars
Topic #9:
msg food chinese flavor eat glutamate restaurant foods reaction taste restaurants salt effects carl brain people ingredients natural causes olney

Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
done in 22.548s.

Topics in LDA model:
Topic #0:
government people mr law gun state president states public use right rights national new control american security encryption health united
Topic #1:
drive card disk bit scsi use mac memory thanks pc does video hard speed apple problem used data monitor software
Topic #2:
said people armenian armenians turkish did saw went came women killed children turkey told dead didn left started greek war
Topic #3:
year good just time game car team years like think don got new play games ago did season better ll
Topic #4:
10 00 15 25 12 11 20 14 17 16 db 13 18 24 30 19 27 50 21 40
Topic #5:
windows window program version file dos use files available display server using application set edu motif package code ms software
Topic #6:
edu file space com information mail data send available program ftp email entry info list output nasa address anonymous internet
Topic #7:
ax max b8f g9v a86 pl 145 1d9 0t 34u 1t 3t giz bhj wm 2di 75u 2tm bxn 7ey
Topic #8:
god people jesus believe does say think israel christian true life jews did bible don just know world way church
Topic #9:
don know like just think ve want does use good people key time way make problem really work say need
