Hi Vikas --

the optimum number of topics (K in LDA) depends on at least two factors:
Firstly, your data set may have an intrinsic number of topics, i.e., may derive
from some natural clusters that your data have. In the best case, this number
minimises your perplexity (ppx). A non-parametric approach like HDP would ideally
result in the same K as the one that minimises ppx for LDA. The second type of
influence is that of the hyperparameters. If you fix the Dirichlet parameters
alpha and beta (for LDA's Dirichlet-multinomial "levels" (theta | alpha) and
(phi | beta)), you bias the optimum K. For instance, smaller alpha will force
more "decisive" choices of z for each token, leading to a concentration of
theta to fewer weights, which influences K.
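
The effect of alpha on theta can be checked empirically. The following is only a minimal sketch, not part of any LDA toolkit: it draws theta from a symmetric Dirichlet(alpha) prior, ignoring the data entirely, and reports how many topics effectively carry the weight, measured with the inverse Simpson index. The function names, K = 50, and the alpha grid are all assumptions chosen for illustration.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one theta from a symmetric Dirichlet(alpha) over k topics."""
    # A Dirichlet draw is a vector of independent Gamma(alpha, 1) draws,
    # normalised so that the entries sum to one.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def effective_topics(theta):
    """Inverse Simpson index: 1 if one topic has all the weight, k if uniform."""
    return 1.0 / sum(p * p for p in theta)


def mean_effective_topics(alpha, k=50, n_docs=500, seed=0):
    """Average effective number of topics over n_docs prior draws of theta."""
    rng = random.Random(seed)
    total = sum(
        effective_topics(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs)
    )
    return total / n_docs


if __name__ == "__main__":
    # Hypothetical alpha grid, for illustration only.
    for alpha in (0.05, 0.1, 1.0, 10.0):
        print(alpha, round(mean_effective_topics(alpha), 1))
```

Running the script prints one line per alpha value, so the way the effective number of topics moves with alpha can be read off directly. This only looks at the prior; in a fitted model the data pull theta as well.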

Trouble minimizing perplexity in LDA

I am running LDA from Mark Steyvers' MATLAB Topic Modelling toolkit on a few Apache Java open source projects. I have taken care of stop word removal (e.g. words such as "Apache" and Java keywords are marked as stop words) and tokenization. I find that perplexity on test data always decreases as the number of topics increases. I tried different values of ALPHA, but it made no difference.

I need to find the optimal number of topics, and for that the perplexity plot should reach a minimum. Please suggest what may be wrong.

The definition of the perplexity of a topic model, and details of how it is calculated, are explained in this post.
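
For readers who have not met the term, here is a minimal sketch of one common definition of held-out perplexity: the exponential of the negative average log-likelihood per token. It is not the toolkit's code and not necessarily the exact formula in the linked post. It assumes the per-document topic mixtures theta for the test documents and the topic-word distributions phi are already available; the function name and argument layout are made up for illustration.

```python
import math


def perplexity(docs, theta, phi):
    """Held-out perplexity of a topic model.

    docs  : list of documents, each a list of integer word ids
    theta : theta[d][k] = weight of topic k in document d (assumed given)
    phi   : phi[k][w]   = probability of word w under topic k
    """
    log_lik = 0.0
    n_tokens = 0
    n_topics = len(phi)
    for d, tokens in enumerate(docs):
        for w in tokens:
            # p(w | d) is the topic mixture for document d applied to word w.
            p_w = sum(theta[d][k] * phi[k][w] for k in range(n_topics))
            log_lik += math.log(p_w)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)
```

A model that spreads probability uniformly over a vocabulary of V words has perplexity V, so lower values mean the model is less "surprised" by the test tokens. To look for an optimal K, this quantity would be computed on the same test set for each candidate number of topics and then plotted.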

Edit: I played with the hyperparameters alpha and beta, and perplexity now seems to reach a minimum. It is not clear to me how these hyperparameters affect perplexity. Initially I was plotting results up to 200 topics without any success. Now, over the same range, a minimum is reached at around 50-60 topics (which matched my intuition) after modifying the hyperparameters. Also, as this post notes, specific values of the hyperparameters bias the optimal number of topics.

asked Sep 14 '12 at 5:22
 
Many of us probably don't know what perplexity means and what a perplexity plot shows. I know I don't. Could you enlighten me (us)? – Michael Chernick Sep 14 '12 at 15:54
@MichaelChernick: I edited the post to include a link detailing the perplexity of a topic model. – abhinavkulkarni Sep 14 '12 at 22:27
Thanks for doing that. – Michael Chernick Sep 14 '12 at 22:52
 
How many topics have you tried so far (on what size corpus)? Maybe you just haven't yet hit the right number of topics? Also, for inferring the number of topics from data you may want to look into the Hierarchical Dirichlet Process (HDP) with code on David Blei's site: cs.princeton.edu/~blei/topicmodeling.html – Nick Sep 14 '12 at 23:22
 
@Nick: Indeed, HDP, a nonparametric topic modelling algorithm, is an alternative to LDA wherein you don't have to tune hyperparameters. However, at this point I would like to stick to LDA and understand how and why the behaviour of perplexity changes so drastically with small adjustments to the hyperparameters. Also, my corpus is quite large. For example, I have tokenized the Apache Lucene source code, with ~1800 Java files and 367K lines of source code. So that's a pretty big corpus, I guess. – abhinavkulkarni Sep 15 '12 at 2:21

You might want to have a look at the implementation of LDA in Mallet, which can do hyperparameter optimization as part of the training. Mallet also uses asymmetric priors by default, which, according to this paper, makes the model much more robust against setting the number of topics too high. In practice this means you don't have to specify the hyperparameters and can set the number of topics pretty high without negatively affecting results.

In my experience, hyperparameter optimization and asymmetric priors gave significantly better topics than going without them, but I haven't tried the Matlab Topic Modelling toolkit.

 
