Hi Vikas --

the optimum number of topics (K in LDA) is dependent on a at least two factors:
Firstly, your data set may have an intrinsic number of topics, i.e., may derive
from some natural clusters that your data have. This number will in the best
case make your ppx minimal. A non-parametric approach like HDP would ideally
result in the same K as the one that minimises ppx for LDA. The second type of
influence is that of the hyperparameters. If you fix the Dirichlet parameters
alpha and beta (for LDA's Dirichlet-multinomial "levels" (theta | alpha) and
(phi | beta)), you bias the optimum K. For instance, larger alpha will force
more " "decisive" choices of z for each token, leading to a concentration of
theta to fewer weights, which influences K.

Trouble minimizing perplexity in LDA

I am running LDA from Mark Steyver's MATLAB Topic Modelling toolkit on a few Apache Java open source projects. I have taken care of stop word removal (for e.g. words such Apache, java keywords are marked as stopwords) and tokenization. I find that perplexity on test data always decreases with increasing number of topics. I tried different values of ALPHA but no difference.

I need to find optimal number of topics and for that perplexity plot should reach a minimum. Please suggest what may be wrong.

Definition and details regarding calculation of perplexity of a topic model is explained in this post.

Edit: I played with hyperparameters alpha and beta and now perplexity seems to reach a minimum. It is not clear to me as to how these hyperparameters affect perplexity. Initially I was plotting results till 200 topics without any success. Now on the same range minimum is reached at around 50-60 topics (which was my intuition) after modifying hyperparameters. Also, as this postnotes, you bias optimal number of topics according to specific values of hyperparameters.

asked Sep 14 '12 at 5:22
 
1  
Many of us probably don't know what perplexity means and what aperplexity plot shows. I know I don't. Could you enlighten me (us)? – Michael Chernick Sep 14 '12 at 15:54
1  
@MichaelChernick: I edited post to include a link detailing perplexity of a topic model. – abhinavkulkarni Sep 14 '12 at 22:27
1  
Thanks for doing that. – Michael Chernick Sep 14 '12 at 22:52
 
How many topics have you tried so far (on what size corpus)? Maybe you just haven't yet hit the right number of topics? Also, for inferring the number of topics from data you may want to look into the Hierarchical Dirichlet Process (HDP) with code on David Blei's site: cs.princeton.edu/~blei/topicmodeling.html – Nick Sep 14 '12 at 23:22
 
@Nick: Indeep HDP, a nonparametric topic modelling algorithm is an alternative to LDA, wherein you don't have to tune hyperparameters. However at this point I would like to stick to LDA and know how and why perplexity behaviour changes drastically with regards to small adjustments in hyperparameters. Also, my corpus size is quite large. For e.g. I have tokenized Apache Lucene source code with ~1800 java files and 367K source code lines. So that's a pretty big corpus I guess. – abhinavkulkarni Sep 15 '12 at 2:21

You might want to have a look at the implementation of LDA in Mallet, which can do hyperparameter optimization as part of the training. Mallet also uses asymmetric priors by default, which according to this paper, leads to the model being much more robust against setting the number of topics too high. In practice this means you don't have to specify the hyperparameters, and can set number of topics pretty high without negatively affecting results.

In my experience hyperparameter optimization and asymmetric priors gave significantly better topics than without it, but I haven't tried the Matlab Topic Modelling toolkit.

 

lda topic number的更多相关文章

  1. 如何确定LDA的主题个数

    本文参考自:https://www.zhihu.com/question/32286630 LDA中topic个数的确定是一个困难的问题. 当各个topic之间的相似度的最小的时候,就可以算是找到了合 ...

  2. (转) Parameter estimation for text analysis 暨LDA学习小结

    Reading Note : Parameter estimation for text analysis 暨LDA学习小结 原文:http://www.xperseverance.net/blogs ...

  3. Spark MLlib LDA 源代码解析

    1.Spark MLlib LDA源代码解析 http://blog.csdn.net/sunbow0 Spark MLlib LDA 应该算是比較难理解的,当中涉及到大量的概率与统计的相关知识,并且 ...

  4. 【转】LDA数学八卦

    转自LDA数学八卦 在 Machine Learning 中,LDA 是两个常用模型的简称: Linear Discriminant Analysis 和 Latent Dirichlet Alloc ...

  5. cvpr2015papers

    @http://www-cs-faculty.stanford.edu/people/karpathy/cvpr2015papers/ CVPR 2015 papers (in nicer forma ...

  6. 用python+selenium抓取微博24小时热门话题的前15个并保存到txt中

    抓取微博24小时热门话题的前15个,抓取的内容请保存至txt文件中,需要抓取排行.话题和阅读数 #coding=utf-8 from selenium import webdriver import ...

  7. 【原创】Kakfa api包源代码分析

    既然包名是api,说明里面肯定都是一些常用的Kafka API了. 一.ApiUtils.scala 顾名思义,就是一些常见的api辅助类,定义的方法包括: 1. readShortString: 从 ...

  8. Machine and Deep Learning with Python

    Machine and Deep Learning with Python Education Tutorials and courses Supervised learning superstiti ...

  9. Citect:How do I translate Citect error messages?

    http://www.opcsupport.com/link/portal/4164/4590/ArticleFolder/51/Citect   To decode the error messag ...

随机推荐

  1. python的json模块的dumps,loads,dump,load方法介绍

    dumps和loads方法都在内存中转换, dump和load的方法会多一个步骤,dump是把序列化后的字符串写到一个文件中,而load是从一个文件中读取字符串 将列表转为字符串 >>&g ...

  2. java 之2D过气游戏类的写法

    2D游戏中各对象的父类 package cn.littlepage.game; import java.awt.Graphics; import java.awt.Image; import java ...

  3. matplotlib.transforms

    来自:龙哥盟飞龙 变换教程 像任何图形包一样,matplotlib建立在变换框架之上,以便在坐标系,用户数据坐标系,轴域者坐标系,图形坐标系和显示坐标系之间轻易变换.在95%的绘图中,你不需要考虑这一 ...

  4. Linux下的压缩解压缩命令详解及实例

    实例:压缩服务器上当前目录的内容为xxx.zip文件 zip -r xxx.zip ./* 解压zip文件到当前目录 unzip filename.zip ====================== ...

  5. DRF中的APIView、GenericAPIView、ViewSet

    1.APIView(rest_framework.views import APIView),是REST framework提供的所有视图的基类,继承自Django的View. 传入到视图方法中的是R ...

  6. redux与redux-react使用示例

    redux使用 <script type="text/babel"> var Counter=React.createClass({ incrementIfOdd:fu ...

  7. 原生JS操作iframe里的dom

    转:http://www.css88.com/archives/2343 一.父级窗口操作iframe里的dom JS操作iframe里的dom可是使用contentWindow属性,contentW ...

  8. TP3.2.3框架与已有模板做结合

    具体实现步骤: a.  复制模板文件到View指定目录 b. 复制到css.img.js静态资源文件到系统指定目录 c. 把静态资源(css.img.js)文件的路径设置为"常量" ...

  9. Program Option Modifiers

    Some option are 'boolean' and control behavior that can be turned on or off. --column-names option d ...

  10. 雷林鹏分享:C# 程序结构

    C# 程序结构 在我们学习 C# 编程语言的基础构件块之前,让我们先看一下 C# 的最小的程序结构,以便作为接下来章节的参考. C# Hello World 实例 一个 C# 程序主要包括以下部分: ...