lda topic number
Hi Vikas -- the optimum number of topics (K in LDA) is dependent on a at least two factors:
Firstly, your data set may have an intrinsic number of topics, i.e., may derive
from some natural clusters that your data have. This number will in the best
case make your ppx minimal. A non-parametric approach like HDP would ideally
result in the same K as the one that minimises ppx for LDA. The second type of
influence is that of the hyperparameters. If you fix the Dirichlet parameters
alpha and beta (for LDA's Dirichlet-multinomial "levels" (theta | alpha) and
(phi | beta)), you bias the optimum K. For instance, larger alpha will force
more " "decisive" choices of z for each token, leading to a concentration of
theta to fewer weights, which influences K.
Trouble minimizing perplexity in LDA
|
I am running LDA from Mark Steyver's MATLAB Topic Modelling toolkit on a few Apache Java open source projects. I have taken care of stop word removal (for e.g. words such Apache, java keywords are marked as stopwords) and tokenization. I find that perplexity on test data always decreases with increasing number of topics. I tried different values of I need to find optimal number of topics and for that perplexity plot should reach a minimum. Please suggest what may be wrong. Definition and details regarding calculation of perplexity of a topic model is explained in this post. Edit: I played with hyperparameters alpha and beta and now perplexity seems to reach a minimum. It is not clear to me as to how these hyperparameters affect perplexity. Initially I was plotting results till 200 topics without any success. Now on the same range minimum is reached at around 50-60 topics (which was my intuition) after modifying hyperparameters. Also, as this postnotes, you bias optimal number of topics according to specific values of hyperparameters. |
|||||||||||||||||
|
|
You might want to have a look at the implementation of LDA in Mallet, which can do hyperparameter optimization as part of the training. Mallet also uses asymmetric priors by default, which according to this paper, leads to the model being much more robust against setting the number of topics too high. In practice this means you don't have to specify the hyperparameters, and can set number of topics pretty high without negatively affecting results. In my experience hyperparameter optimization and asymmetric priors gave significantly better topics than without it, but I haven't tried the Matlab Topic Modelling toolkit. |
lda topic number的更多相关文章
- 如何确定LDA的主题个数
本文参考自:https://www.zhihu.com/question/32286630 LDA中topic个数的确定是一个困难的问题. 当各个topic之间的相似度的最小的时候,就可以算是找到了合 ...
- (转) Parameter estimation for text analysis 暨LDA学习小结
Reading Note : Parameter estimation for text analysis 暨LDA学习小结 原文:http://www.xperseverance.net/blogs ...
- Spark MLlib LDA 源代码解析
1.Spark MLlib LDA源代码解析 http://blog.csdn.net/sunbow0 Spark MLlib LDA 应该算是比較难理解的,当中涉及到大量的概率与统计的相关知识,并且 ...
- 【转】LDA数学八卦
转自LDA数学八卦 在 Machine Learning 中,LDA 是两个常用模型的简称: Linear Discriminant Analysis 和 Latent Dirichlet Alloc ...
- cvpr2015papers
@http://www-cs-faculty.stanford.edu/people/karpathy/cvpr2015papers/ CVPR 2015 papers (in nicer forma ...
- 用python+selenium抓取微博24小时热门话题的前15个并保存到txt中
抓取微博24小时热门话题的前15个,抓取的内容请保存至txt文件中,需要抓取排行.话题和阅读数 #coding=utf-8 from selenium import webdriver import ...
- 【原创】Kakfa api包源代码分析
既然包名是api,说明里面肯定都是一些常用的Kafka API了. 一.ApiUtils.scala 顾名思义,就是一些常见的api辅助类,定义的方法包括: 1. readShortString: 从 ...
- Machine and Deep Learning with Python
Machine and Deep Learning with Python Education Tutorials and courses Supervised learning superstiti ...
- Citect:How do I translate Citect error messages?
http://www.opcsupport.com/link/portal/4164/4590/ArticleFolder/51/Citect To decode the error messag ...
随机推荐
- CentOS6.5下搭建ftp服务器(三种认证模式:匿名用户、本地用户、虚拟用户)
CentOS 6.5下搭建ftp服务器 vsftpd(very secure ftp daemon,非常安全的FTP守护进程)是一款运行在Linux操作系统上的FTP服务程序,不仅完全开源而且免费,此 ...
- Eclipse调试DEBUG时快速查看某个变量的值的快捷键、快速跳转到某行的快捷键
Eclipse调试DEBUG时快速查看某个变量的值的快捷键 Ctrl + Shift + i
- [原][osg][oe]分析一块倾斜摄影瓦片的数据
RangeMode PIXEL_SIZE_ON_SCREEN 首先我们看看原始数据的构成: 第12层:(第一层) 第23层:(最后一层) pagelod下面有N多的pagelod一层包裹一层 通过os ...
- mysql 清空表——truncate 与delete的区别
清空表 truncate table [表名]: delete from [表名]: 注: truncate是整体删除(速度较快), delete是逐条删除(速度较慢). truncate不写服务器l ...
- 如何备份MySQL数据库
在MySQL中进行数据备份的方法有两种: 1. mysqlhotcopy 这个命令会在拷贝文件之前会把表锁住,并把数据同步到数据文件中,以避免拷贝到不完整的数据文件,是最安全快捷的备份方法. 命令的使 ...
- leecode第五题(最长回文子串)
class Solution { public: string longestPalindrome(string s) { int len = s.length(); || len == ) retu ...
- spring cloud: zuul(四): 正则表达式匹配其他微服务(给其他微服务加版本号)
spring cloud: zuul(四): 正则表达式匹配其他微服务(给其他微服务加版本号) 比如我原来有,spring-boot-user微服务,后台进行迭代更新,另外其了一个微服务: sprin ...
- 2017-2018-2 20165303 实验三《Java面向对象程序设计》实验报告
实验三 敏捷开发与XP实践-1 实验要求 实验三 敏捷开发与XP实践 http://www.cnblogs.com/rocedu/p/4795776.html, Eclipse的内容替换成IDEA 参 ...
- 20180428 xlVBA自动设置成绩条行高
'自动设置行高 传入工作表Sht 和 每页打印的行数RowsInOnePage Public Sub AutoSetRowHeight(ByVal Sht As Worksheet, Optional ...
- codeforces708b// Recover the String //AIM Tech Round 3 (Div. 1)
题意:有一个01组成的串,告知所有长度为2的子序列中,即00,01,10,11,的个数a,b,c,d.输出一种可能的串. 先求串中0,1的数目x,y. 首先,如果00的个数a不是0的话,设串中有x个0 ...