"""
=======================================================================================
Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
=======================================================================================

This is an example of applying Non-negative Matrix Factorization
and Latent Dirichlet Allocation on a corpus of documents to
extract additive models of the topic structure of the corpus.
The output is a list of topics, each represented as a list of terms
(weights are not shown).

The default parameters (n_samples / n_features / n_topics) should make
the example runnable in a couple of tens of seconds. You can try to
increase the dimensions of the problem, but be aware that the time
complexity is polynomial in NMF. In LDA, the time complexity is
proportional to (n_samples * iterations).
""" # Author: Olivier Grisel <olivier.grisel@ensta.org>
# Lars Buitinck <L.J.Buitinck@uva.nl>
# Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause

from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000
n_topics = 10
n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
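    # model.components_ holds one row of term weights per topic;
    # argsort() sorts ascending, so the reversed slice
    # [:-n_top_words - 1:-1] picks the n_top_words highest-weighted
    # terms in descending order.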
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()


# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.

print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]
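# fetch_20newsgroups downloads the corpus on first use and caches it locally
# (by default under ~/scikit_learn_data), so the first run takes longer.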
print("done in %0.3fs." % (time() - t0)) # Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0)) # Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0)) # Fit the NMF model
print("Fitting the NMF model with tf-idf features,"
"n_samples=%d and n_features=%d..."
% (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
exit()
print("done in %0.3fs." % (time() - t0)) print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
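# learning_method='online' fits LDA with online variational Bayes on
# mini-batches, which scales better than 'batch' for large corpora;
# learning_offset downweights the earliest iterations.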
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0)) print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
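
Once the two models are fitted, their transform() methods give per-document
topic weights, which is usually the next thing you want from a topic model.
The sketch below is an addition to the original example (the normalization
line is defensive: depending on the scikit-learn version, lda.transform()
may or may not return rows that already sum to one):

import numpy as np

# W in the factorization X ~ WH: unnormalized per-document topic activations.
nmf_doc_topic = nmf.transform(tfidf)
# Variational estimate of each document's topic distribution.
lda_doc_topic = lda.transform(tf)
lda_doc_topic = lda_doc_topic / lda_doc_topic.sum(axis=1, keepdims=True)

# Dominant topic of the first document under each model.
print("NMF: doc 0 -> topic %d" % nmf_doc_topic[0].argmax())
print("LDA: doc 0 -> topic %d" % lda_doc_topic[0].argmax())

# LDA also exposes perplexity (lower is better) for model comparison.
print("LDA perplexity: %.1f" % lda.perplexity(tf))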
