Hi Vikas --

the optimum number of topics (K in LDA) depends on at least two factors:
Firstly, your data set may have an intrinsic number of topics, i.e., may derive
from some natural clusters that your data have. This number will in the best
case make your ppx minimal. A non-parametric approach like HDP would ideally
result in the same K as the one that minimises ppx for LDA. The second type of
influence is that of the hyperparameters. If you fix the Dirichlet parameters
alpha and beta (for LDA's Dirichlet-multinomial "levels" (theta | alpha) and
(phi | beta)), you bias the optimum K. For instance, smaller alpha will force
more "decisive" choices of z for each token, leading to a concentration of
theta on fewer weights, which influences K.
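
To get a feel for how alpha shapes theta, here is a minimal sketch that samples theta from a symmetric Dirichlet for a few alpha values and reports the average size of the largest weight. It is only an illustration of the prior, not part of any LDA toolkit; the function names, the choice of 20 topics, and the sample size are assumptions made for this example.

```python
import random


def sample_theta(alpha, num_topics, rng):
    """Draw theta ~ Dirichlet(alpha, ..., alpha) by normalising Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_weight(alpha, num_topics=20, num_docs=2000, seed=0):
    """Average of max(theta) over many sampled documents (hypothetical helper)."""
    rng = random.Random(seed)
    largest = (max(sample_theta(alpha, num_topics, rng)) for _ in range(num_docs))
    return sum(largest) / num_docs


if __name__ == "__main__":
    for alpha in (0.1, 1.0, 10.0):
        # A larger value means theta puts most of its mass on fewer topics.
        print(f"alpha={alpha:<5} mean largest theta weight = {mean_largest_weight(alpha):.3f}")
```

The smaller alpha is, the more of each document's mass the prior pushes onto a handful of topics, so fewer topics are effectively in use per document.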

Trouble minimizing perplexity in LDA

I am running LDA from Mark Steyvers' MATLAB Topic Modelling toolkit on a few Apache Java open source projects. I have taken care of stop word removal (e.g., words such as Apache and Java keywords are marked as stopwords) and tokenization. I find that perplexity on test data always decreases as the number of topics increases. I tried different values of ALPHA, but it made no difference.

I need to find the optimal number of topics, and for that the perplexity plot should reach a minimum. Please suggest what may be wrong.

The definition of perplexity for a topic model, and details of how it is calculated, are explained in this post.
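
For readers without the linked post at hand, a common definition is perplexity = exp(-(sum of log p(w) over all held-out tokens) / (number of held-out tokens)), where p(w) for a token in document d is the sum over topics k of theta[d][k] * phi[k][w]. The sketch below computes exactly that. It assumes theta for the held-out documents has already been estimated somehow (toolkits differ in how they do this), and the function and argument names are made up for illustration rather than taken from the MATLAB toolkit.

```python
import math


def perplexity(docs, theta, phi):
    """Held-out perplexity of a topic model.

    docs  : list of documents, each a list of word ids
    theta : theta[d][k] = p(topic k | document d)
    phi   : phi[k][w]   = p(word w | topic k)
    """
    log_lik = 0.0
    num_tokens = 0
    for d, words in enumerate(docs):
        for w in words:
            # Marginalise over topics to get the probability of this token.
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p_w)
            num_tokens += 1
    return math.exp(-log_lik / num_tokens)


if __name__ == "__main__":
    # Two topics, both uniform over a 4-word vocabulary.
    phi = [[0.25] * 4, [0.25] * 4]
    theta = [[0.5, 0.5], [0.9, 0.1]]
    docs = [[0, 1, 2], [3, 3]]
    print(perplexity(docs, theta, phi))
```

A model that spreads probability uniformly over V words has perplexity V, which is a handy sanity check; lower values mean the model predicts the held-out tokens better.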

Edit: I played with the hyperparameters alpha and beta, and now perplexity seems to reach a minimum. It is not clear to me how these hyperparameters affect perplexity. Initially I was plotting results up to 200 topics without any success. Now, over the same range, the minimum is reached at around 50-60 topics (which was my intuition) after modifying the hyperparameters. Also, as this post notes, specific values of the hyperparameters bias the optimal number of topics.

asked Sep 14 '12 at 5:22
 
Many of us probably don't know what perplexity means and what a perplexity plot shows. I know I don't. Could you enlighten me (us)? – Michael Chernick Sep 14 '12 at 15:54

@MichaelChernick: I edited the post to include a link detailing perplexity of a topic model. – abhinavkulkarni Sep 14 '12 at 22:27

Thanks for doing that. – Michael Chernick Sep 14 '12 at 22:52
 
How many topics have you tried so far (on what size corpus)? Maybe you just haven't yet hit the right number of topics? Also, for inferring the number of topics from data you may want to look into the Hierarchical Dirichlet Process (HDP) with code on David Blei's site: cs.princeton.edu/~blei/topicmodeling.html – Nick Sep 14 '12 at 23:22
 
@Nick: Indeed, HDP, a nonparametric topic modelling algorithm, is an alternative to LDA in which you don't have to tune hyperparameters. However, at this point I would like to stick to LDA and understand how and why the perplexity behaviour changes so drastically in response to small adjustments in the hyperparameters. Also, my corpus is quite large. For example, I have tokenized the Apache Lucene source code, with ~1800 Java files and 367K source code lines. So that's a pretty big corpus I guess. – abhinavkulkarni Sep 15 '12 at 2:21

You might want to have a look at the implementation of LDA in Mallet, which can do hyperparameter optimization as part of the training. Mallet also uses asymmetric priors by default, which, according to this paper, makes the model much more robust against setting the number of topics too high. In practice this means you don't have to specify the hyperparameters, and you can set the number of topics pretty high without negatively affecting the results.

In my experience, hyperparameter optimization and asymmetric priors gave significantly better topics than I got without them, but I haven't tried the Matlab Topic Modelling toolkit.

 
