Topic extraction with NMF & LDA
"""
=======================================================================================
Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
=======================================================================================
This is an example of applying Non-negative Matrix Factorization
and Latent Dirichlet Allocation on a corpus of documents to
extract additive models of the topic structure of the corpus.
The output is a list of topics, each represented as a list of terms
(weights are not shown).
The default parameters (n_samples / n_features / n_topics) should make
the example runnable in a couple of tens of seconds. You can try to
increase the dimensions of the problem, but be aware that the time
complexity is polynomial in NMF. In LDA, the time complexity is
proportional to (n_samples * iterations).
"""
# Author: Olivier Grisel <olivier.grisel@ensta.org>
# Lars Buitinck <L.J.Buitinck@uva.nl>
# Chyi-Kwei Yau <chyikwei.yau@gmail.com>
# License: BSD 3 clause
from __future__ import print_function
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
n_samples = 2000
n_features = 1000
n_topics = 10
n_top_words = 20
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.
print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]
print("done in %0.3fs." % (time() - t0))
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
# Fit the NMF model
print("Fitting the NMF model with tf-idf features,"
"n_samples=%d and n_features=%d..."
% (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_topics, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))
print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)
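# (Added sketch, not in the upstream example.) The fitted NMF model can
# also project documents onto the learned topics: transform returns the
# document-topic weight matrix W (n_samples x n_topics). The variable
# name doc_topic_nmf is ours, chosen for illustration.
doc_topic_nmf = nmf.transform(tfidf)
print("Strongest NMF topic for first document: #%d"
      % doc_topic_nmf[0].argmax())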
print("Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
% (n_samples, n_features))
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
learning_method='online', learning_offset=50.,
random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
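# (Added sketch, not in the upstream example.) LDA likewise exposes
# per-document topic distributions via transform, and a perplexity score
# via perplexity (lower is better). Scoring on the training tf matrix,
# as done here for brevity, is optimistic; a held-out set is preferable.
doc_topic_lda = lda.transform(tf)
print("Strongest LDA topic for first document: #%d"
      % doc_topic_lda[0].argmax())
print("Perplexity on the training tf matrix: %.1f" % lda.perplexity(tf))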