Another attempt about LSI
Last week I was at the Natural Language Processing in NZ meetup.
Someone asked whether there is an existing library or solution that can extract certain information from a natural language dataset on a specific topic, for example, finding objective facts and subjective sentiment in a bunch of customer complaints.
I have been doing something similar in this area, and I think LSI (Latent Semantic Indexing) or Word2Vec is the best solution to this problem right now. But my poor spoken English could match neither my confidence nor LSI's power. My explanation of LSI was so terrible that someone gave me a few hints, very politely and cryptically, and I could tell some of you were far from convinced.
I felt sorry for LSI; it was not its fault that I didn't present it properly. Thanks to Alyona and everyone at the meetup: I appreciate the opportunity to meet nice people and hear interesting ideas there. So, here are my two cents.
We have been trying to build different models of natural language. Querying a natural language dataset comes down to building a model that can turn a query into results. The question raised at the last meetup was how to build a model that can extract information from a natural language dataset based on a specific syntactic structure and some specific semantics.
We have plenty of natural language parsers now, so I guess syntactic structure is not the problem here.
As for semantics, there are some rule-based approaches, such as WordNet and FrameNet. However, I found it difficult to map a word to semantically similar words, because a word usually has multiple meanings and, as far as I can tell, these approaches give no way to pick the right sense of a word in a specific context. You can end up with different results depending on which meaning of a word you follow. Furthermore, there is no threshold that determines how far you should go in the WordNet graph, at least not in a mathematically decidable way.
LSI is a tool for finding similarity among documents, and it is built on SVD (singular value decomposition). After mapping all documents into a vector space, we can find similarities between documents along different dimensions. And of course we can also get a distance between any two words along a specific dimension.
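To make the SVD step a bit more concrete, here is a minimal sketch in Python with NumPy. It is not taken from any particular LSI library: the toy corpus, the helper names (`term_document_matrix`, `cosine`) and the choice of `k = 2` are all assumptions made up for illustration, and a real setup would add tokenization, stop-word removal and term weighting.

```python
import numpy as np

# Toy corpus (made up for illustration). Documents 0-2 are about delivery,
# documents 3-4 are about staff. Documents 0 and 2 share no word at all.
docs = [
    "parcel late",
    "late delivery",
    "delivery damaged",
    "staff friendly",
    "friendly helpful",
]

def term_document_matrix(documents):
    """Return (vocabulary, matrix) with one row per term, one column per document."""
    vocab = sorted({word for doc in documents for word in doc.split()})
    row = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(documents)))
    for col, doc in enumerate(documents):
        for word in doc.split():
            matrix[row[word], col] += 1.0
    return vocab, matrix

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

vocab, A = term_document_matrix(docs)

# SVD: A = U @ diag(S) @ Vt. Keeping only the k largest singular values
# gives the reduced "latent" space (k = 2 is an arbitrary choice here).
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (S[:k, None] * Vt[:k, :]).T  # one k-dimensional row per document

# Compare plain word overlap with similarity in the latent space.
raw_similarity = cosine(A[:, 0], A[:, 2])
latent_similarity = cosine(doc_vectors[0], doc_vectors[2])
cross_topic_similarity = cosine(doc_vectors[0], doc_vectors[3])

print(f"word overlap, doc 0 vs doc 2:  {raw_similarity:.2f}")
print(f"latent space, doc 0 vs doc 2:  {latent_similarity:.2f}")
print(f"latent space, doc 0 vs doc 3:  {cross_topic_similarity:.2f}")
```

In this toy setup "parcel late" and "delivery damaged" have no word in common, yet they end up close together in the reduced space because both are linked through "late delivery". That indirect link is the part that a plain keyword match misses.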
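Word2Vec is based on the same idea of mapping words into a vector space, but the algorithm is different: it learns the vectors with a shallow neural network instead of SVD.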
With these tools, by training a model on a dataset covering some topics, we can determine the distance between any two words within that specific dataset (or topic). Then we can map a query to a set of sentences using this similarity or distance with respect to the specific topic.
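The query-to-sentences mapping could look roughly like the sketch below, again in Python with NumPy. Everything here is an assumption for illustration: the five toy sentences, the `search` function, the `threshold` of 0.5 and `k = 2` are made up, and the query is simply projected into the same reduced space as the sentences before comparing by cosine similarity.

```python
import numpy as np

# Toy "complaint" sentences (made up for illustration).
sentences = [
    "refund delayed",
    "delayed shipment",
    "shipment broken",
    "agent polite",
    "polite helpful",
]

# Term-sentence count matrix: one row per term, one column per sentence.
vocab = sorted({word for s in sentences for word in s.split()})
index = {word: i for i, word in enumerate(vocab)}
A = np.zeros((len(vocab), len(sentences)))
for col, s in enumerate(sentences):
    for word in s.split():
        A[index[word], col] += 1.0

# "Train" the model: keep the k strongest latent dimensions of the SVD.
k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

def search(query, threshold=0.5):
    """Return (sentence, score) pairs whose latent similarity to the query
    is at least `threshold`, best match first. Unknown words are ignored."""
    q = np.zeros(len(vocab))
    for word in query.split():
        if word in index:
            q[index[word]] += 1.0
    q_k = U_k.T @ q  # project the query into the latent space
    hits = []
    for col, s in enumerate(sentences):
        score = cosine(q_k, U_k.T @ A[:, col])  # same projection per sentence
        if score >= threshold:
            hits.append((s, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

for sentence, score in search("refund"):
    print(f"{score:.2f}  {sentence}")
```

With this toy data the query "refund" also brings back "shipment broken", which does not contain the query word, while the sentences about the agent stay out. The threshold is picked by hand here; choosing it sensibly on real data is a separate question.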
LSI, or a similar algorithm, is really just another piece of the puzzle.
'''Note'''
We still have to deal with the problem of how to model the multiple meanings of a word.