Gensim LDA Topic Model Experiments

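This post runs LDA topic model experiments with gensim. The first part uses the wiki corpus from the previous post; the second part uses the Sogou news corpus.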

1. LDA experiment on the wiki corpus

The previous post produced the segmented plain-text wiki corpus wiki.zh.seg.utf.txt. After removing stopwords, it can be used for LDA.

import codecs

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Load the stopword list into a set for fast membership tests
with codecs.open('stopwords.txt', 'r', encoding='utf8') as f:
    stopwords = set(w.strip() for w in f)

# Each line of the segmented corpus is one document of space-separated tokens
train = []
with codecs.open('wiki.zh.seg.utf.txt', 'r', encoding='utf8') as fp:
    for line in fp:
        train.append([w for w in line.split() if w not in stopwords])

# Build the dictionary and the bag-of-words corpus, then train LDA
dictionary = Dictionary(train)
corpus = [dictionary.doc2bow(text) for text in train]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=100)
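To make the dictionary and bag-of-words step concrete, here is a minimal standard-library sketch of the idea behind Dictionary and doc2bow. The helper names build_token2id and doc2bow are hypothetical, and the id assignment (order of first appearance) is an assumption for illustration; gensim's own id assignment may differ.

```python
from collections import Counter


def build_token2id(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Two toy documents that are already segmented and stopword-filtered
docs = [["奥运", "北京", "奥运"], ["经济", "北京"]]
token2id = build_token2id(docs)
bows = [doc2bow(d, token2id) for d in docs]
print(bows)
```

Each document becomes a sparse list of (id, count) pairs, which is the shape of input that LdaModel consumes as corpus.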

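gensim also ships a script, make_wiki, that extracts text directly from the compressed wiki dump and saves it as sparse matrices. Run the following commands in bash to see its usage and to process the dump.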

python -m gensim.scripts.make_wiki

#USAGE: make_wiki.py WIKI_XML_DUMP OUTPUT_PREFIX [VOCABULARY_SIZE]

python -m gensim.scripts.make_wiki zhwiki-latest-pages-articles.xml.bz2 zhwiki

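The script runs for a long time; see the gensim website for details. The output is shown below. The .mm suffix denotes a sparse matrix stored in Matrix Market format: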

-rw-r--r--  chenbingjin data 172M  Jul  zhwiki_bow.mm
-rw-r--r--  chenbingjin data 1.3M  Jul  zhwiki_bow.mm.index
-rw-r--r--  chenbingjin data 333M  Jul  zhwiki_tfidf.mm
-rw-r--r--  chenbingjin data 1.3M  Jul  zhwiki_tfidf.mm.index
-rw-r--r--  chenbingjin data 1.9M  Jul  zhwiki_wordids.txt
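For readers unfamiliar with Matrix Market files, the following sketch parses a tiny coordinate-format sample by hand. It assumes the common layout (a %%MatrixMarket header, a "rows cols nonzeros" line, then 1-based "row col value" triples) and that rows correspond to documents and columns to term ids; parse_matrix_market is a hypothetical helper, not part of gensim, which reads such files through corpora.MmCorpus.

```python
def parse_matrix_market(text):
    """Parse a coordinate-format Matrix Market string into {row: [(col, value), ...]}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines[0].startswith("%%MatrixMarket"):
        raise ValueError("missing Matrix Market header")
    # Skip the header and any further comment lines starting with '%'
    body = [ln for ln in lines[1:] if not ln.startswith("%")]
    n_rows, n_cols, n_nonzero = (int(x) for x in body[0].split())
    rows = {}
    for ln in body[1:]:
        r, c, v = ln.split()
        # Indices in the file are 1-based; convert them to 0-based
        rows.setdefault(int(r) - 1, []).append((int(c) - 1, float(v)))
    return (n_rows, n_cols, n_nonzero), rows


# A made-up sample: 2 documents, 3 terms, 3 non-zero entries
sample = """%%MatrixMarket matrix coordinate real general
2 3 3
1 1 2.0
1 3 1.0
2 2 0.5
"""
shape, rows = parse_matrix_market(sample)
print(shape, rows)
```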

Train the LDA model from zhwiki_tfidf.mm and zhwiki_wordids.txt:

# -*- coding: utf-8 -*-
from gensim import corpora, models

# Load the dictionary and the TF-IDF corpus produced by make_wiki
id2word = corpora.Dictionary.load_from_text('zhwiki_wordids.txt')
mm = corpora.MmCorpus('zhwiki_tfidf.mm')

# Train the model (took about 28 minutes)
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100)

Model results

Training was run with num_topics=100, that is, 100 topics. print_topics() and print_topic() show the word distribution of each topic, and save/load store and restore the model.

# Print the word distributions of 20 topics
lda.print_topics(20)

# Print the word distribution of the topic with id 20
lda.print_topic(20)

# Save / load the model
lda.save('zhwiki_lda.model')
lda = models.ldamodel.LdaModel.load('zhwiki_lda.model')

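How the model turns out depends mainly on the number of topics and on the corpus itself. The wiki corpus covers every domain, so its topics are not sharply separated, and the make_wiki corpus used here was not stopword-filtered, so the results are only passable.

For a new document, convert it to bag-of-words and the model can then predict its topics.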

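import jieba

# test_doc is assumed to hold the raw text of the new document
test_doc = list(jieba.cut(test_doc))      # segment the new document
doc_bow = id2word.doc2bow(test_doc)       # convert the document to bag-of-words
doc_lda = lda[doc_bow]                    # topic distribution of the new document

# Print the topic distribution of the new document
print(doc_lda)
for topic_id, prob in doc_lda:
    print("%s\t%f\n" % (lda.print_topic(topic_id), prob))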
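The value returned by lda[doc_bow] is a list of (topic_id, probability) pairs, so picking the dominant topics is a plain sort. The sketch below assumes that shape; top_topics is a hypothetical helper and the numbers are made up for illustration, not real model output.

```python
def top_topics(doc_topics, k=3):
    """Return the k (topic_id, probability) pairs with the highest probability."""
    return sorted(doc_topics, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical stand-in for lda[doc_bow]; the values are invented
doc_lda = [(3, 0.12), (17, 0.55), (42, 0.08), (64, 0.25)]
best = top_topics(doc_lda, k=2)
print(best)
```

The ids in the result can then be passed to lda.print_topic() to inspect the words behind each dominant topic.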

2. LDA experiment on the Sogou news corpus

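Sogou Labs offers many Chinese corpora for download. The full-web news data set (SogouCA) contains news from 18 channels (domestic, international, sports, society, entertainment, and others) of several news sites, collected in June and July 2012, with the URL and body text of each item.

This experiment uses the 2008 reduced version (one month of data, 437 MB).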

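Transcoding the data: the raw files are not UTF-8 (the script below converts them from GB2312), so they easily show up as garbled text; iconv converts them to UTF-8. XML processing also requires a single top-level tag, so sed is used to insert <root> before the first line of each file and </root> after the last line.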

#!/bin/bash
# Convert the GB2312 files in a directory to UTF-8
# Usage: ./iconv_encode.sh indir outdir
# @chenbingjin --

function conv_encode() {
    # Glob the input directory instead of parsing ls output
    for ifile in "${indir%/}"/*
    do
        ofile="${outdir%/}/$(basename "$ifile")"
        echo "iconv $ifile to $ofile"
        # -c drops characters that cannot be converted
        iconv -c -f gb2312 -t utf8 "$ifile" > "$ofile"
        # Wrap the content in one top-level element (GNU sed syntax)
        sed -i '1i <root>' "$ofile"
        sed -i '$a </root>' "$ofile"
    done
}

if [ $# -ne 2 ]; then
    echo "Usage: ./iconv_encode.sh indir outdir"
    exit 1
fi

indir=$1
outdir=$2

if [ ! -d "$outdir" ]; then
    echo "mkdir ${outdir}"
    mkdir "$outdir"
fi

time conv_encode

iconv_encode.sh

There are 128 files in total, stored under Sogou_data/. Running iconv_encode.sh on them writes the converted files to the out directory:

$ ./iconv_encode.sh Sogou_data/ out/

mkdir out/

iconv Sogou_data/news.allsites..txt to out/news.allsites..txt

iconv Sogou_data/news.allsites..txt to out/news.allsites..txt

iconv Sogou_data/news.allsites..txt to out/news.allsites..txt

iconv Sogou_data/news.allsites..txt to out/news.allsites..txt

......

real    0m27.255s

user    0m6.720s

sys    0m8.924s
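The same transcode-and-wrap step can also be sketched in Python. This is an illustration under assumptions, not the author's method: gb2312_to_wrapped_utf8 is a hypothetical helper, and errors="ignore" is used as a rough counterpart of iconv -c (silently dropping bytes that cannot be decoded).

```python
def gb2312_to_wrapped_utf8(raw_bytes):
    """Decode GB2312 bytes, dropping undecodable bytes, and wrap the text in <root>...</root>."""
    text = raw_bytes.decode("gb2312", errors="ignore")
    if not text.endswith("\n"):
        text += "\n"
    return "<root>\n" + text + "</root>\n"


# A made-up one-document sample, encoded the way the raw corpus is assumed to be
sample = "<doc><content>北京奥运</content></doc>\n".encode("gb2312")
wrapped = gb2312_to_wrapped_utf8(sample)
utf8_bytes = wrapped.encode("utf8")  # bytes ready to be written to the output file
print(wrapped)
```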

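Next, the XML data needs preprocessing. This uses lxml.etree; lxml is a Python library that parses HTML/XML and builds a DOM-like tree, and it is faster than Python's built-in XML parsers.

To avoid the exception XMLSyntaxError: internal error: Huge input lookup, set the XMLParser parameter huge_tree=True. See the code for details: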

# -*- coding: utf-8 -*-
'''
Sogou news corpus preprocessing
@chenbingjin 2016-07-01
'''
import os
import codecs
import logging

from lxml import etree

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

train = []

# huge_tree=True avoids "XMLSyntaxError: internal error: Huge input lookup" on large files
parser = etree.XMLParser(encoding='utf8', huge_tree=True)

def load_data(dirname):
    global train
    for fi in os.listdir(dirname):
        logging.info("deal with " + fi)
        with codecs.open(os.path.join(dirname, fi), 'r', encoding='utf8') as f:
            text = f.read()
        # The raw data contains & characters, which break XML parsing; replace them with &amp;
        text = text.replace('&', '&amp;')
        # Parse the XML (as bytes, matching the parser's encoding) and extract each title and body
        root = etree.fromstring(text.encode('utf8'), parser=parser)
        for doc in root.findall('doc'):
            tmp = ""
            for chi in doc:
                if chi.tag in ("contenttitle", "content") and chi.text:
                    tmp += chi.text
            if tmp != "":
                train.append(tmp)

preprocess.py
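One caveat with a blanket replace of & is that any entity already present in the data (for example &amp; or &lt;) would be escaped a second time. A hedged alternative is to escape only bare ampersands. The sketch below uses the standard-library xml.etree.ElementTree rather than lxml so that it runs without extra packages; escape_bare_ampersands and the sample string are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Match '&' unless it already starts a named or numeric character entity
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Replace only unescaped '&' characters with '&amp;'."""
    return BARE_AMP.sub("&amp;", text)


# Made-up sample: one bare '&' in the URL, one already-escaped '&amp;' in the content
raw = "<root><doc><url>http://a.com/?x=1&y=2</url><content>A &amp; B</content></doc></root>"
fixed = escape_bare_ampersands(raw)
root = ET.fromstring(fixed)
print(root.find("doc/url").text)
print(root.find("doc/content").text)
```

Whether this matters depends on whether the corpus actually contains pre-escaped entities; if it never does, the simple replace is sufficient.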

Once the train corpus has been built, segment it and remove stopwords; LDA can then be run.

import codecs

import jieba
from gensim.corpora import Dictionary
from gensim.models import LdaModel

with codecs.open('stopwords.txt', 'r', encoding='utf8') as f:
    stopwords = set(w.strip() for w in f)

# train comes from preprocess.py (filled by load_data); segment each document and drop stopwords
train_set = []
for line in train:
    train_set.append([w for w in jieba.cut(line) if w not in stopwords])

# Build the training corpus
dictionary = Dictionary(train_set)
corpus = [dictionary.doc2bow(text) for text in train_set]

# Train the LDA model
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
lda.print_topics(20)

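Results: training takes a long time, and gensim's multicore implementation (ldamulticore) can be used to speed it up. Overall the topics look reasonable; they show that the main news themes of 2008 were the Olympics, the earthquake, and the economy.

The resulting LDA model can be used for topic prediction: given a new document, it infers the document's topic distribution, which can in turn be used for classification. During training every word in the training documents is assigned a topic, and there is a paper that uses this topic information to build Topic Word Embeddings, which to some extent addresses polysemy.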
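For the multicore variant mentioned above, a minimal sketch could look like the following. The wrapper train_lda_multicore and the default workers=3 are assumptions for illustration; a sensible worker count depends on the machine (the gensim documentation suggests roughly the number of physical cores minus one).

```python
def train_lda_multicore(corpus, dictionary, num_topics=20, workers=3):
    """Train LDA with gensim's multiprocessing implementation (LdaMulticore)."""
    # Imported here so the sketch can be loaded even where gensim is not installed
    from gensim.models import LdaMulticore

    return LdaMulticore(corpus=corpus, id2word=dictionary,
                        num_topics=num_topics, workers=workers)
```

With the corpus and dictionary built earlier, usage would be lda = train_lda_multicore(corpus, dictionary), after which print_topics() and lda[doc_bow] work as with LdaModel.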

References

1. gensim: Experiments on the English Wikipedia

2. Sogou: full-web news data (SogouCA)
