Lucene suggest [转]

The Big Data Zone is presented by Splunk, the maker of data analysis solutions such as Hunk, an analytics tool for Hadoop, and the Splunk Web Framework.

Live suggestions as you type into a search box, sometimes called suggest or autocomplete, is now a standard, essential search feature ever since Google set a high bar after going live just over four years ago.

In Lucene we have several different suggest implementations, under the suggest module; today I'm describing the new AnalyzingSuggester (to be committed soon; it should be available in 4.1).

To use it, you provide the set of suggest targets, which is the full set of strings and weights that may be suggested. The targets can come from anywhere; typically you'd process your query logs to create the targets, giving a higher weight to those queries that appear more frequently. If you sell movies you might use all movie titles with a weight according to sales popularity.

（提示词库，根据用户日志进行提取）

You also provide an analyzer, which is used to process each target into analyzed form. Under the hood, the analyzed form is indexed into an FST. At lookup time, the incoming query is processed by the same analyzer and the FST is searched for all completions sharing the analyzed form as a prefix.

Even though the matching is performed on the analyzed form, what's suggested is the original target (i.e., the unanalyzed input). Because Lucene has such a rich set of analyzer components, this can be used to create some useful suggesters:

With an analyzer that folds or normalizes case, accents, etc. (e.g., using ICUFoldingFilter), the suggestions will match irrespective of case and accents. For example, the query "ame..." would suggest Amélie.
With an analyzer that removes stopwords and normalizes case, the query "ghost..." would suggest "The Ghost of Christmas Past".
Even graph TokenStreams, such as SynonymFilter, will work: in such cases we enumerate and index all analyzed paths into the FST. If the analyzer recognizes "wifi" and "wireless network" as synonyms, and you have the suggest target "wifi router" then the user query "wire..." would suggest "wifi router".
Japanese suggesters may now be possible, with an analyzer that copies the reading (ReadingAttribute in the Kuromoji analyzer) as its output.

Given the diversity of analyzers, and the easy extensibility for applications to create their own analyzers, I'm sure there are many interesting use cases for this new AnalyzingSuggester: if you have an example please share with us on Lucene's user list (java-user@lucene.apache.org).

While this is a great step forward, there's still plenty to do with Lucene's suggesters. We need to allow for fuzzy matching on the query so we're more robust to typos (there's a rough prototype patch onLUCENE-3846). We need to predict based on only part of the query, instead of insisting on a full prefix match. There are a number of interesting elements to Google's autosuggest that we could draw inspiration from. As always, patches welcome!

FST

---http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html

That FST maps the sorted words mop, moth, pop, star, stop and top to their ordinal number (0, 1, 2, ...). As you traverse the arcs, you sum up the outputs, so stop hits 3 on the s and 1 on the o, so its output ordinal is 4. The outputs can be arbitrary numbers or byte sequences, or combinations, etc. -- it's pluggable.

Essentially, an FST is a SortedMap<ByteSequence,SomeOutput>, if the arcs are in sorted order. With the right representation, it requires far less RAM than other SortedMap implementations, but has a higher CPU cost during lookup. The low memory footprint is vital for Lucene since an index can easily have many millions (sometimes, billions!) of unique terms.

There's a great deal of theory behind FSTs. They generally support the same operations asFSMs (determinize, minimize, union, intersect, etc.). You can also compose them, where the outputs of one FST are intersected with the inputs of the next, resulting in a new FST.

There are some nice general-purpose FST toolkits (OpenFst looks great) that support all these operations, but for Lucene I decided to implement this neat algorithm which incrementally builds up the minimal unweighted FST from pre-sorted inputs. This is a perfect fit for Lucene since we already store all our terms in sorted (unicode) order.

The resulting implementation (currently a patch on LUCENE-2792) is fast and memory efficient: it builds the 9.8 million terms in a 10 million Wikipedia index in ~8 seconds (on a fast computer), requiring less than 256 MB heap. The resulting FST is 69 MB. It can also build a prefix trie, pruning by how many terms come through each node, with even less memory.

Note that because addition is commutative, an FST with numeric outputs is not guaranteed to be minimal in my implementation; perhaps if I could generalize the algorithm to a weighted FST instead, which also stores a weight on each arc, that would yield the minimal FST. But I don't expect this will be a problem in practice for Lucene.

In the patch I modified the SimpleText codec, which was loading all terms into a TreeMap mapping the BytesRef term to an int docFreq and long filePointer, to use an FST instead, and all tests pass!

There are lots of other potential places in Lucene where we could use FSTs, since we often need map the index terms to "something". For example, the terms index maps to a long file position; the field cache maps to ordinals; the terms dictionary maps to codec-specific metadata, etc. We also have multi-term queries (eg Prefix, Wildcard, Fuzzy, Regexp) that need to test a large number of terms, that could work directly via intersection with the FST instead (many apps could easily fit their entire terms dict in RAM as an FST since the format is so compact). The FST could be used for a key/value store. Lots of fun things to try!

参考资料：

lucene (42)源代码学习之FST(Finite State Transducer)在SynonymFilter中的实现思想

http://sbp810050504.blog.51cto.com/2799422/1361551

Finite State Transducer（FST）in NLP

http://www.douban.com/note/326761737/

Lucene suggest [转]的更多相关文章

lucene的suggest(搜索提示功能的实现）
1.首先引入依赖  <!-- ...
Solr Suggest组件的使用
使用suggest的原因,最主要就是相比于search速度快,In general, we need the autosuggest feature to satisfy two main requi ...
elasticsearch suggest 的几种使用-completion 的基本使用
在lucene里面,suggest 的支持非常完善,可以随心所欲的定制: 但是在es中使用起来就没有那么方便了. es给suggest 分类4类:term :phrase: completion: c ...
Lucene 4.x Spellcheck使用说明
Spellcheck是Lucene新版本的功能,在介绍spellcheck之前,我们需要弄清楚Spellcheck支持几种数据源.Spellcheck构造函数需要传入Dictionary接口: pac ...
谈谈个人网站的建立（二）—— lucene的使用
首先,帮忙点击一下我的网站http://www.wenzhihuai.com/ .谢谢啊,如果可以,GitHub上麻烦给个star,以后面试能讲讲这个项目,GitHub地址https://github ...
学习笔记（二）--Lucene简介
Lucene简介最受欢迎的java开源全文搜索引擎开发工具包.提供了完整的查询引擎和索引引擎,部分文本分词引擎(英文与德文两种西方语言).Lucene的目的是为软件开发人员提供一个简单易用的工具包, ...
在64位平台上的Lucene，应该使用MMapDirectory[转]
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html 从3.1版本开始,Lucene和Solr开始在64位的W ...
8 个基于 Lucene 的开源搜索引擎推荐
Lucene是一种功能强大且被广泛使用的搜索引擎,以下列出了8种基于Lucene的搜索引擎,你可以想象它们有多么强大. 1. Apache Solr Solr 是一个高性能,采用Java5开发,基于L ...
Lucene系列二：Lucene（Lucene介绍、Lucene架构、Lucene集成）
一.Lucene介绍 1. Lucene简介最受欢迎的java开源全文搜索引擎开发工具包.提供了完整的查询引擎和索引引擎,部分文本分词引擎(英文与德文两种西方语言).Lucene的目的是为软件开发人 ...

随机推荐

Gym - 101002D：Programming Team （01分数规划+树上依赖背包）
题意:给定一棵大小为N的点权树(si,pi),现在让你选敲好K个点,需要满足如果如果u被选了,那么fa[u]一定被选,现在要求他们的平均值(pi之和/si之和)最大. 思路:均值最大,显然需要01分数 ...
CDN是如何工作的？
CDN的原理非常简单.当浏览器请求一资源时,第一步是做DNS解析,DNS解析就像是从通讯录根据姓名找号码,浏览器发送域名,然后得到DNS服务器返回的IP地址.浏览器通过IP地址和服务器连接并获取资源( ...
UV纹理+修改器:VertexWeightEdit+修改器:Mask遮罩
UV纹理+修改器: VertexWeightEdit+修改器: Mask遮罩基本流程, 如下图,准备地图一份, 黑白色即可. 纹理使用颜色绘制权重. 白色为1, 黑色为0. 新增球体, 细分多次, ...
Singer 学习三使用Singer进行mongodb 2 postgres 数据转换
Singer 可以方便的进行数据的etl 处理,我们可以处理的数据可以是api 接口,也可以是数据库数据,或者是文件备注: 测试使用docker-compose 运行&&提供数据库 ...
围棋术语 & 中英文。
https://senseis.xmp.net/?ChineseGoTerms 一字二字三字四字一字长(nobi,solid extension),是指仅靠着自己的棋盘上已有棋子继续向前延伸 ...
vue的watcher 关于数组和对象
数组不能被监听到的情况 1.直接下标赋值(但对象直接修改原有属性值可以渲染视图,虽然也监听不到) 2.修改数组length 解决方法: this.$set(this.arr,index,val) p ...
：first ：first-child .first()和.get() .eq()
:first .first()只匹配一个元素,而 :first-child 将为每个父元素匹配一个子元素 .get()得到的是dom 元素 $('li').get()没有参数返回一个数组 .eq() ...
openstack--5--控制节点和计算节点安装配置nova
Nova相关介绍目前的Nova主要由API,Compute,Conductor,Scheduler组成 Compute:用来交互并管理虚拟机的生命周期: Scheduler:从可用池中根据各种策略选 ...
[转]JDK动态代理
代理模式代理模式是常用的java设计模式,他的特征是代理类与委托类有同样的接口,代理类主要负责为委托类预处理消息.过滤消息.把消息转发给委托类,以及事后处理消息等.代理类与委托类之间 ...
php+js实现重定向跳转并post传参
页面重定向跳转并post传参 $mdata=json_encode($mdata);//如果是字符串无需使用json echo " <form style='display:none; ...

Lucene suggest [转]

Lucene suggest [转]的更多相关文章

随机推荐

热门专题