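Essential Knowledge for Information Retrieval and Data Mining (DM): Lucene

Original article: http://blog.csdn.net/htw2012/article/details/17734529

Lightly edited. If you have questions, please contact the original author.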

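Part 1: The information retrieval field:

A summary of the models and techniques commonly used in papers on information retrieval and web data (WWW, SIGIR, CIKM, WSDM, ACL, EMNLP, etc.). (Why is probability reliable? Probability hides most of the facts and gives us the part we can see.)

Preface: For a PhD student in this field, being able to read papers is the foundation for getting started and for understanding what everyone is working on. Usually we would go and read a book. Reading a book is fine, but it has one big drawback: a book is a self-contained system, so it covers too much, and a lot of what you read you never actually use. That is not exactly a waste, but it does not spend limited effort where it counts.

My own area is web data processing (international conferences such as WWW, SIGIR, CIKM, WSDM, ACL, and EMNLP).

Below is my own list of the models and techniques that come up most often in our field. I hope it saves you some time: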
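1. Basic probability theory
    The concepts used most often: the three conditions (axioms) in the elementary definition of probability, the law of total probability, Bayes' rule, the chain rule, and the common distributions (Dirichlet, Gaussian, multinomial, Poisson).
Probability theory covers a lot of ground, but in practice these few concepts are what you mainly use. Measure-theoretic advanced probability almost never appears in papers at the major conferences (WWW, SIGIR, etc.).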
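As a small illustration of the law of total probability and Bayes' rule listed above, here is a minimal sketch over a finite set of hypotheses. The hypothesis names and all numbers are made up for the example.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a finite set of hypotheses.

    prior[h]      = P(h)
    likelihood[h] = P(evidence | h)
    """
    # Law of total probability: P(evidence) = sum_h P(h) * P(evidence | h)
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    # Bayes' rule: P(h | evidence) = P(h) * P(evidence | h) / P(evidence)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}


# Hypothetical numbers: 20% of messages are spam, and a given word
# appears in 60% of spam messages but only 5% of non-spam messages.
prior = {"spam": 0.2, "ham": 0.8}
likelihood = {"spam": 0.6, "ham": 0.05}
post = posterior(prior, likelihood)
```

With these numbers the evidence term is 0.2 * 0.6 + 0.8 * 0.05 = 0.16, so the posterior probability of "spam" is 0.12 / 0.16 = 0.75.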
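2. Foundations of information theory
    The concepts used most often: entropy, conditional entropy, KL divergence, and the relationships among the three; the maximum entropy principle; information gain.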
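The relationship among these three quantities can be checked numerically. The sketch below, on a made-up 2x2 joint distribution, computes information gain as H(Y) - H(Y|X) and compares it with the KL divergence between the joint distribution and the product of its marginals; the two should agree, since both equal the mutual information.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in bits; zero-probability terms contribute 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)


def conditional_entropy(joint):
    """H(Y | X) for a joint table with joint[x][y] = P(X=x, Y=y)."""
    total = 0.0
    for row in joint:
        p_x = sum(row)
        if p_x > 0:
            # Weight the entropy of P(Y | X=x) by P(X=x).
            total += p_x * entropy([v / p_x for v in row])
    return total


# Made-up joint distribution over two binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

# Information gain about Y from observing X.
info_gain = entropy(p_y) - conditional_entropy(joint)

# Mutual information as KL(joint || product of marginals), same cell order.
flat_joint = [v for row in joint for v in row]
flat_indep = [a * b for a in p_x for b in p_y]
mutual_info = kl_divergence(flat_joint, flat_indep)
```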
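3. Classification
    Naive Bayes, KNN, support vector machines, maximum entropy models, and decision trees: their basic principles, their strengths and weaknesses, and the commonly used software packages.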
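4. Clustering
    The non-hierarchical K-means algorithm; the types of hierarchical clustering and how they differ; how distances are computed (for example, the difference between single and complete linkage); and the commonly used software packages.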
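A minimal sketch of both ideas follows: a plain K-means loop, and the single-linkage versus complete-linkage distance between two clusters. Initializing the centers from the first k points is an assumption made here to keep the example deterministic; real packages usually use randomized or smarter initialization.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)),
                key=lambda c: squared_distance(p, centers[c]))
            for p in points]


def kmeans(points, k, iters=20):
    centers = [tuple(p) for p in points[:k]]  # assumption: first k points
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # New center is the coordinate-wise mean of its members.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, assign(points, centers)


def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of points across two clusters."""
    return min(squared_distance(a, b) ** 0.5
               for a in cluster_a for b in cluster_b)


def complete_linkage(cluster_a, cluster_b):
    """Distance between the farthest pair of points across two clusters."""
    return max(squared_distance(a, b) ** 0.5
               for a in cluster_a for b in cluster_b)


points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
```

Single linkage tends to chain clusters together through their nearest points, while complete linkage favors compact clusters; the two functions above differ only in `min` versus `max`.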
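5. The EM algorithm
    Understand why inference with incomplete data is hard, and understand the principle of EM and its derivation.
6. Monte Carlo methods (especially Gibbs sampling)

    Know the basic principle of Monte Carlo methods, and in particular the sampling procedure of the Gibbs algorithm; also Markov stochastic processes, Markov chains, and so on.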
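To make the Gibbs sampling procedure concrete, here is a minimal sketch for a toy target chosen for this example: a standard bivariate normal with correlation rho. Its full conditionals are known in closed form (x given y is normal with mean rho * y and variance 1 - rho^2, and symmetrically for y), so each step simply resamples one coordinate given the other. The burn-in length and seed are arbitrary choices.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # std. dev. of each full conditional
    x, y = 0.0, 0.0
    samples = []
    for step in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_sd)  # sample x | y
        y = rng.gauss(rho * x, cond_sd)  # sample y | x (uses the new x)
        if step >= burn_in:
            samples.append((x, y))
    return samples


def correlation(samples):
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples) / n
    var_y = sum((y - mean_y) ** 2 for _, y in samples) / n
    return cov / math.sqrt(var_x * var_y)


samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
estimated_rho = correlation(samples)
```

The successive samples form a Markov chain whose stationary distribution is the target, which is why the empirical correlation of the draws should land close to 0.8.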
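7. Graphical models

     Graphical models have been very popular in recent years and are also very important, because they subsume much of the earlier work while remaining intuitive. CRF, HMM, and topic models are all applications and special cases of graphical models.

    a. Understand the general representation of graphical models (directed and undirected), and the general-purpose learning and inference algorithms, such as the sum-product algorithm and propagation algorithms;

    b. Be familiar with the HMM, including its assumptions and the forward and backward algorithms;

    c. Be familiar with the LDA model, including its graphical-model representation and its Gibbs sampling inference algorithm; the variational inference algorithm is not required;

    d. Understand the CRF model, mainly its graphical-model representation; if you have the time and interest, look at its inference algorithms;

    e. Understand how HMM, LDA, and CRF relate to and differ from the general representation of graphical models and the general learning and inference algorithms;

    f. Get to know Markov logic networks (MLN), a language built on graphical models and first-order logic that can describe many real-world problems; a basic acquaintance helps in understanding graphical models;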
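For the HMM in item b, the forward algorithm can be sketched in a few lines. The two-state model and its probabilities below are invented for illustration; states and observation symbols are encoded as integer indices, and the observation sequence is assumed to be non-empty.

```python
def forward(obs, init, trans, emit):
    """Probability of an observation sequence under an HMM.

    init[s]     = P(first state = s)
    trans[r][s] = P(next state = s | current state = r)
    emit[s][o]  = P(observation = o | state = s)
    """
    n_states = len(init)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state HMM with two observation symbols.
init = [0.6, 0.4]
trans = [[0.7, 0.3],
         [0.4, 0.6]]
emit = [[0.9, 0.1],
        [0.2, 0.8]]

p_single = forward([0], init, trans, emit)
p_sequence = forward([0, 1, 0], init, trans, emit)
```

Because the recursion sums over the previous state at each step, the cost is linear in the sequence length rather than exponential in it. A quick sanity check is that the probabilities of all observation sequences of a fixed length sum to 1.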
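8. Topic models

    The ideas behind this family of models are widely applied. Reading everything is neither necessary nor feasible, so I recommend the following:

    a. Understand pLSA and LDA in depth, along with the connections and differences between them. Once you understand these two models, you can follow most topic model papers, especially those applying topic models to NLP. You can also design the non-hierarchical topic model you need yourself.

    b. To go deeper, study the hLDA model, and especially the mathematics behind it, the Dirichlet process; then you can design your own hierarchical topic models;

    c. For supervised topic models, be sure to understand the s-LDA and LLDA models. They embody completely different design philosophies; study them carefully, then design the topic model you need;

    d. For understanding all of these models, Gibbs sampling is a hurdle you cannot avoid;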
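9. Optimization and stochastic processes

a. Understand optimization problems with equality constraints and how to solve them with the method of Lagrange multipliers;

b. Understand convex optimization problems with inequality constraints, and understand the simplex method;

c. Understand gradient descent and simulated annealing;

d. Understand the ideas behind optimization methods such as hill climbing;

e. For stochastic processes, know the basic ones such as random walks and queueing theory (they appear in papers occasionally, but not very often), and understand Markov processes (very important; used heavily in sampling theory);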
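As an illustration of item c, here is a minimal gradient descent sketch on a simple convex function chosen for this example, f(x, y) = (x - 3)^2 + 2(y + 1)^2, whose minimum is at (3, -1). The learning rate and step count are arbitrary.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient from a starting point."""
    point = list(start)
    for _ in range(steps):
        g = grad(point)
        point = [p - learning_rate * gi for p, gi in zip(point, g)]
    return point


def grad_f(point):
    """Gradient of f(x, y) = (x - 3)^2 + 2 * (y + 1)^2."""
    x, y = point
    return [2 * (x - 3), 4 * (y + 1)]


minimum = gradient_descent(grad_f, start=[0.0, 0.0])
```

On this function each step shrinks the error in x by a factor of 0.8 and the error in y by a factor of 0.6, so the iterate converges to (3, -1). Hill climbing and simulated annealing follow the same "improve the current point" idea without using gradients.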
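10. Bayesian learning

   More and more methods and models now take a Bayesian approach to data, so it is well worth knowing the related material.

   a. Understand how the Bayesian and frequentist schools differ and relate in their ideas and principles;

   b. Understand loss functions and their role in Bayesian learning; memorize the common loss functions;

   c. Understand the concept of a Bayesian prior and the four common ways of choosing one;

   d. Understand the concepts of parameters and hyperparameters, and how they differ;

   e. Use the choice of prior in LDA (or in another model) to understand the Bayesian way of handling data;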
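The parameter versus hyperparameter distinction in items d and e can be seen in the Dirichlet-multinomial setting that LDA builds on. In the sketch below, the multinomial probabilities are the parameters and the Dirichlet concentration values `alpha` are the hyperparameters; the counts and alpha values are made up. With a Dirichlet prior, the posterior mean of each probability is (count + alpha) / (total count + sum of alpha).

```python
def dirichlet_posterior_mean(counts, alpha):
    """Posterior mean of multinomial probabilities under a Dirichlet prior.

    counts[i] = observed occurrences of outcome i
    alpha[i]  = Dirichlet hyperparameter (pseudo-count) for outcome i
    """
    total = sum(counts) + sum(alpha)
    return [(c + a) / total for c, a in zip(counts, alpha)]


counts = [3, 0, 1]          # hypothetical word counts for one topic
weak_prior = [0.5, 0.5, 0.5]
strong_prior = [50.0, 50.0, 50.0]

estimate_weak = dirichlet_posterior_mean(counts, weak_prior)
estimate_strong = dirichlet_posterior_mean(counts, strong_prior)
```

The unseen outcome gets a non-zero probability in both cases, and the stronger prior pulls the estimate much closer to uniform, which is the kind of effect the choice of prior has on the result.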
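11. Information retrieval models and tools

a. Understand the commonly used retrieval models;

    b. Know the commonly used open-source tools (Lemur, Lucene, etc.)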
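The text does not name specific retrieval models. As one example of a widely used one, here is a minimal sketch of BM25 scoring over a toy tokenized collection. The documents, the query, and the parameter values k1 = 1.2 and b = 0.75 are assumptions for the example, and the IDF form used is one common variant.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """BM25 score of every document (a list of tokens) for a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["lucene", "search", "index"],
    ["gibbs", "sampling", "topic", "model"],
    ["search", "engine", "search", "speed"],
]
scores = bm25_scores(["search"], docs)
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
```

Documents that do not contain any query term score zero, and repeated occurrences of a term raise the score with diminishing returns.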

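12. Model selection and feature selection

a. Understand the common feature selection methods, so that you can pick effective features for training a model;

b. Look at a few examples of model selection to understand how to choose a suitable model (this can only be absorbed through examples);

13. Tricks in paper writing

There are many tricks; they are omitted here.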

Part 2: Speeding up Lucene search:

Here are some things to try to speed up the searching speed of your Lucene application. Please see ImproveIndexingSpeed for how to speed up indexing.

Be sure you really need to speed things up.
Many of the ideas here are simple to try, but others will necessarily add some complexity to your application. So be sure your searching speed is indeed too slow and the slowness is indeed within Lucene.

Make sure you are using the latest version of Lucene.

Use a local filesystem.
Remote filesystems are typically quite a bit slower for searching. If the index must be remote, try to mount the remote filesystem as a "readonly" mount. In some cases this could improve performance.

Get faster hardware, especially a faster IO system.
Flash-based Solid State Drives work very well for Lucene searches. As seek times for SSDs are about 100 times faster than for traditional platter-based hard drives, the usual penalty for seeking is virtually eliminated. This means that SSD-equipped machines need less RAM for file caching and that searchers require less warm-up time before they respond quickly.
Tune the OS.
One tunable that stands out on Linux is swappiness (http://kerneltrap.org/node/3000), which controls how aggressively the OS will swap out RAM used by processes in favor of the IO cache. Most Linux distros default this to a highish number (meaning aggressive), but this can easily cause horrible search latency, especially if you are searching a large index with a low query rate. Experiment by turning swappiness down or off entirely (by setting it to 0). Windows also has a checkbox, under My Computer -> Properties -> Advanced -> Performance Settings -> Advanced -> Memory Usage, that lets you favor Programs or System Cache; it is likely doing something similar.
Open the IndexReader with readOnly=true.
This makes a big difference when multiple threads are sharing the same reader, as it removes certain sources of thread contention.

On non-Windows platforms, use NIOFSDirectory instead of FSDirectory.
This also removes sources of contention when accessing the underlying files. Unfortunately, due to a longstanding bug on Windows in Sun's JRE (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734 -- feel particularly free to go vote for it), NIOFSDirectory gets poor performance on Windows.

Add RAM to your hardware and/or increase the heap size for the JVM.
For a large index, searching can use a lot of RAM. If you don't have enough RAM, or your JVM is not running with a large enough heap size, the JVM can hit swapping and thrashing, at which point everything will run slowly.

Use one instance of IndexSearcher.
Share a single IndexSearcher across queries and across threads in your application.
When measuring performance, disregard the first query.
The first query to a searcher pays the price of initializing caches (especially when sorting by fields) and thus will skew your results (assuming you re-use the searcher for many queries). On the other hand, if you re-run the same query again and again, results won't be realistic either, because the operating system will use its cache to speed up IO operations. On Linux (kernel 2.6.16 and later) you can clean the disk cache using sync ; echo 3 > /proc/sys/vm/drop_caches. See http://linux-mm.org/Drop_Caches for details.
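The measurement advice above can be sketched as a small timing harness that runs a warm-up query untimed and then times a set of different queries. This is a language-neutral illustration written in Python, not Lucene code: `fake_search` is a hypothetical stand-in for whatever function calls your searcher.

```python
import time


def benchmark(search, queries, warmup=1):
    """Time each query after discarding the first `warmup` queries."""
    for query in queries[:warmup]:
        search(query)  # pays cache-initialization cost; not timed
    timings = []
    for query in queries[warmup:]:
        start = time.perf_counter()
        search(query)
        timings.append(time.perf_counter() - start)
    return timings


# Hypothetical stand-in for a real search call.
calls = []


def fake_search(query):
    calls.append(query)
    return [query.upper()]


# Use distinct queries so repeated runs are not served from the same cache.
timings = benchmark(fake_search, ["a", "b", "c", "d"])
```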
Re-open the IndexSearcher only when necessary.
You must re-open the IndexSearcher in order to make newly committed changes visible to searching. However, re-opening the searcher has a certain overhead (noticeable mostly with large indexes and with sorting turned on) and should thus be minimized. Consider using a so-called warming technique, which allows the searcher to warm up its caches before the first query hits.

Decrease mergeFactor.
Smaller mergeFactors mean fewer segments, and searching will be faster. However, this will slow down indexing speed, so you should test values to strike an appropriate balance for your application.

Limit usage of stored fields and term vectors.
Retrieving these from the index is quite costly. Typically you should only retrieve these for the current "page" the user will see, not for all documents in the full result set. For each document retrieved, Lucene must seek to a different location in various files. Try sorting the documents you need to retrieve by docID order first.

Use FieldSelector to carefully pick which fields are loaded, and how they are loaded, when you retrieve a document.
Don't iterate over more hits than needed.
Iterating over all hits is slow for two reasons. Firstly, the search() method that returns a Hits object re-executes the search internally when you need more than 100 hits. Solution: use the search method that takes a HitCollector instead. Secondly, the hits will probably be spread over the disk, so accessing them all requires much I/O activity. This cannot easily be avoided unless the index is small enough to be loaded into RAM. If you don't need the complete documents but only one (small) field, you could also use the FieldCache class to cache that one field and have fast access to it.
When using fuzzy queries, use a minimum prefix length.
Fuzzy queries perform CPU-intensive string comparisons. Avoid comparing all unique terms with the user input by only examining terms starting with the first "N" characters. This prefix length is a property on both QueryParser and FuzzyQuery; the default is zero, so ALL terms are compared.
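The effect of a prefix length can be illustrated outside Lucene with a small Python sketch. It is not Lucene's implementation: the term list is made up, and matching by a maximum edit distance (`max_edits`) is an assumption of this example. The point is only that a non-zero prefix length skips the expensive comparison for most terms.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, row by row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                        # deletion
                           cur[j - 1] + 1,                     # insertion
                           prev[j - 1] + (char_a != char_b)))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_match(query, terms, max_edits=2, prefix_length=0):
    """Return matching terms and how many terms were actually compared."""
    prefix = query[:prefix_length]
    matches = []
    compared = 0
    for term in terms:
        if not term.startswith(prefix):
            continue  # cheap check; skips the costly distance computation
        compared += 1
        if edit_distance(query, term) <= max_edits:
            matches.append(term)
    return matches, compared


terms = ["lucene", "lucent", "lucerne", "search", "segment", "luggage"]
all_matches, all_compared = fuzzy_match("lucene", terms, prefix_length=0)
pre_matches, pre_compared = fuzzy_match("lucene", terms, prefix_length=3)
```

With a prefix length of 3, only the three terms starting with "luc" are compared, and the matches are the same as with no prefix. The trade-off is that a misspelling inside the prefix itself will no longer match.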
Consider using filters.
It can be much more efficient to restrict results to a part of the index using a cached bit set filter rather than using a query clause. This is especially true for restrictions that match a great number of documents of a large index. Filters are typically used to restrict the results to a category but could in many cases be used to replace any query clause. One difference between using a Query and a Filter is that the Query has an impact on the score while a Filter does not.
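The idea of a cached bit set filter can be sketched in Python as follows. This illustrates the concept only and is not Lucene's Filter API: the class name, the category data, and the hit list are all hypothetical. A Python integer serves as the bit set, with bit i set when document i passes the filter.

```python
class CachedCategoryFilter:
    """Builds one bit set per category on first use and reuses it afterwards."""

    def __init__(self, doc_categories):
        self.doc_categories = doc_categories  # doc_categories[doc_id] = category
        self._cache = {}
        self.builds = 0  # how many bit sets were actually computed

    def bits(self, category):
        if category not in self._cache:
            self.builds += 1
            mask = 0
            for doc_id, doc_category in enumerate(self.doc_categories):
                if doc_category == category:
                    mask |= 1 << doc_id
            self._cache[category] = mask
        return self._cache[category]


def apply_filter(scored_hits, mask):
    """Keep hits whose document bit is set; scores are left untouched."""
    return [(doc_id, score) for doc_id, score in scored_hits
            if (mask >> doc_id) & 1]


category_filter = CachedCategoryFilter(["book", "music", "book", "film"])
hits = [(0, 1.5), (1, 1.2), (2, 0.7)]  # hypothetical (doc_id, score) pairs

first = apply_filter(hits, category_filter.bits("book"))
second = apply_filter(hits, category_filter.bits("book"))  # served from cache
```

The scores of the surviving hits are unchanged, which mirrors the difference noted above: a filter restricts the result set without contributing to the score.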
Find the bottleneck.
Complex query analysis or heavy post-processing of results are examples of hidden bottlenecks for searches. Profiling with a tool such as VisualVM helps locate the problem.
