"""
=======================================================================================
Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
======================================================================================= This is an example of applying Non-negative Matrix Factorization
and Latent Dirichlet Allocation on a corpus of documents and
extract additive models of the topic structure of the corpus.
The output is a list of topics, each represented as a list of terms
(weights are not shown). The default parameters (n_samples / n_features / n_topics) should make
the example runnable in a couple of tens of seconds. You can try to
increase the dimensions of the problem, but be aware that the time
complexity is polynomial in NMF. In LDA, the time complexity is
proportional to (n_samples * iterations).
""" # Author: Olivier Grisel <olivier.grisel@ensta.org>
# Lars Buitinck <L.J.Buitinck@uva.nl>
# Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause from __future__ import print_function
from time import time from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups n_samples = 2000
n_features = 1000
n_topics = 10
n_top_words = 20 def print_top_words(model, feature_names, n_top_words):
for topic_idx, topic in enumerate(model.components_):
print("Topic #%d:" % topic_idx)
print(" ".join([feature_names[i]
for i in topic.argsort()[:-n_top_words - 1:-1]]))
print() # Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed. print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data
print("done in %0.3fs." % (time() - t0)) # Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, #max_features=n_features,
stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0)) # Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,
stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0)) # Fit the NMF model
print("Fitting the NMF model with tf-idf features,"
"n_samples=%d and n_features=%d..."
% (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
exit()
print("done in %0.3fs." % (time() - t0)) print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words) print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
% (n_samples, n_features))
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
learning_method='online', learning_offset=50.,
random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0)) print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

TopicsExtraction with NMF & LDA的更多相关文章

  1. 机器学习SVD笔记

    机器学习中SVD总结 矩阵分解的方法 特征值分解. PCA(Principal Component Analysis)分解,作用:降维.压缩. SVD(Singular Value Decomposi ...

  2. SK-Learn使用NMF(非负矩阵分解)和LDA(隐含狄利克雷分布)进行话题抽取

    英文链接:http://scikit-learn.org/stable/auto_examples/applications/topics_extraction_with_nmf_lda.html 这 ...

  3. 文本主题模型之LDA(一) LDA基础

    文本主题模型之LDA(一) LDA基础 文本主题模型之LDA(二) LDA求解之Gibbs采样算法 文本主题模型之LDA(三) LDA求解之变分推断EM算法(TODO) 在前面我们讲到了基于矩阵分解的 ...

  4. 文本主题模型之非负矩阵分解(NMF)

    在文本主题模型之潜在语义索引(LSI)中,我们讲到LSI主题模型使用了奇异值分解,面临着高维度计算量太大的问题.这里我们就介绍另一种基于矩阵分解的主题模型:非负矩阵分解(NMF),它同样使用了矩阵分解 ...

  5. KNN PCA LDA

    http://blog.csdn.net/scyscyao/article/details/5987581 这学期选了门模式识别的课.发现最常见的一种情况就是,书上写的老师ppt上写的都看不懂,然后绕 ...

  6. LDA主题模型评估方法–Perplexity

    在LDA主题模型之后,需要对模型的好坏进行评估,以此依据,判断改进的参数或者算法的建模能力. Blei先生在论文<Latent Dirichlet Allocation>实验中用的是Per ...

  7. 用scikit-learn进行LDA降维

    在线性判别分析LDA原理总结中,我们对LDA降维的原理做了总结,这里我们就对scikit-learn中LDA的降维使用做一个总结. 1. 对scikit-learn中LDA类概述 在scikit-le ...

  8. 线性判别分析LDA原理总结

    在主成分分析(PCA)原理总结中,我们对降维算法PCA做了总结.这里我们就对另外一种经典的降维方法线性判别分析(Linear Discriminant Analysis, 以下简称LDA)做一个总结. ...

  9. word2vec参数调整 及lda调参

     一.word2vec调参   ./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -neg ...

随机推荐

  1. js,jquery备忘录

    1.var s = str.charCodeAt();转ASCII码 2.String.fromCharCode(65);转字母 3.es6 ... (扩展运算符),将一个数组转化成由逗号分割的队列. ...

  2. 25. Reverse Nodes in k-Group (JAVA)

    Given a linked list, reverse the nodes of a linked list k at a time and return its modified list. k  ...

  3. mac相关功能

    打开和关闭索引功能 打开:sudo mdutil -a -i on 关闭:sudo mdutil -a -i off 关闭后则无法搜

  4. JS继承(一)

    突然发现自己很久没写过什么东西了 其实从博客更新的速度上就可以看出一个人近期有没有成长 对 …… 我没有成长 也可以由此看出自己选择的企业是不是对的 对 …… 我不会离职…… 略略略 来咬我啊…… 于 ...

  5. netty服务器端启动

    package com.imooc.netty.ch3; import com.imooc.netty.ch6.AuthHandler; import io.netty.bootstrap.Serve ...

  6. Openssl asn1parse命令

    一.简介 asn1parse命令是一种用来诊断ASN.1结构的工具,也能用于从ASN1.1数据中提取数据 二.语法 openssl asn1parse [-inform PEM|DER] [-in f ...

  7. Samtools在Linux上非root权限的安装

    第一次在Linux上不用root权限安装软件,查看了很多博客,并实践安装成功.大致总结了一下samtools的安装过程,仅供大家参考,如有不对的地方,欢迎指正~ samtools安装过程中依赖于lzm ...

  8. mysql备份最近8天的数据库,老的自动删除方案

    服务器上的处理脚本记录: [root@mysql01 test]# crontab -l0 2 * * * /bin/sh /script/sqlbackup.sh >/dev/null 2&g ...

  9. 服务管理之openssh

    1. 使用 SSH 访问远程命令行 1.1 OpenSSH 简介 OpenSSH这一术语指系统中使用的Secure Shell软件的软件实施.用于在远程系统上安全运行shell.如果您在可提供ssh服 ...

  10. 关于签名sign的坑

    在有些支付文档或其他文档都会要求签名验证 /** * description:签名 * function name:sign * @param $data * @return string */fun ...