Topics Extraction with NMF & LDA
"""
=======================================================================================
Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
=======================================================================================
This is an example of applying Non-negative Matrix Factorization
and Latent Dirichlet Allocation on a corpus of documents to
extract additive models of the topic structure of the corpus.
The output is a list of topics, each represented as a list of terms
(weights are not shown).
The default parameters (n_samples / n_features / n_topics) should make
the example runnable in a couple of tens of seconds. You can try to
increase the dimensions of the problem, but be aware that the time
complexity is polynomial in NMF. In LDA, the time complexity is
proportional to (n_samples * iterations).
"""
# Author: Olivier Grisel <olivier.grisel@ensta.org>
# Lars Buitinck <L.J.Buitinck@uva.nl>
# Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause
from __future__ import print_function
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
n_samples = 2000
n_features = 1000
n_topics = 10
n_top_words = 20
def print_top_words(model, feature_names, n_top_words):
    # Each row of components_ holds the term weights for one topic; the
    # reversed argsort slice picks the n_top_words highest-weighted terms.
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.
print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
# Keep only the first n_samples posts so the run matches the printed settings.
data_samples = dataset.data[:n_samples]
print("done in %0.3fs." % (time() - t0))
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
# Fit the NMF model
print("Fitting the NMF model with tf-idf features,"
"n_samples=%d and n_features=%d..."
% (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))
print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)
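# Illustrative addition (not in the original example): besides components_,
# the fitted NMF model maps documents to topic weights via transform(); row i
# of the result is the non-negative topic mixture of document i.
nmf_doc_topics = nmf.transform(tfidf)
print("NMF document-topic matrix shape: %s x %s" % nmf_doc_topics.shape)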
print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
% (n_samples, n_features))
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online', learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
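# Illustrative addition (not in the original example): LDA can likewise assign
# per-document topic distributions via transform(), and its perplexity() score
# offers a rough way to compare models (lower is better).
lda_doc_topics = lda.transform(tf)
print("LDA document-topic matrix shape: %s x %s" % lda_doc_topics.shape)
print("Perplexity on the training data: %0.1f" % lda.perplexity(tf))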