Suggester

Suggester - a flexible "autocomplete" component (search suggestions).

A common need in search applications is suggesting query terms or phrases based on incomplete user input. These completions may come from a dictionary that is based upon the main index or upon any other arbitrary dictionary. It's often useful to be able to provide only top-N suggestions, ranked either alphabetically or according to their usefulness for an average user (e.g. popularity, or the number of returned results). Put differently, suggestions can be ordered by popularity, alphabetically, or by a custom measure such as how often users search for a term (trending searches).

Solr 3.1 includes a component called Suggester that provides this functionality.

Suggester reuses much of the SpellCheckComponent infrastructure, so it also reuses many common SpellCheck parameters, such as spellcheck=true or spellcheck.build=true, etc. The way this component is configured in solrconfig.xml is also very similar:

  <searchComponent class="solr.SpellCheckComponent" name="suggest">
    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookupFactory</str>
      <!-- Alternatives to lookupImpl:
           org.apache.solr.spelling.suggest.fst.FSTLookupFactory [finite state automaton]
           org.apache.solr.spelling.suggest.fst.WFSTLookupFactory [weighted finite state automaton]
           org.apache.solr.spelling.suggest.jaspell.JaspellLookupFactory [default, jaspell-based]
           org.apache.solr.spelling.suggest.tst.TSTLookupFactory [ternary trees]
      -->
      <str name="field">name</str> <!-- the indexed field to derive suggestions from -->
      <float name="threshold">0.005</float>
      <str name="buildOnCommit">true</str>
      <!--
      <str name="sourceLocation">american-english</str>
      -->
    </lst>
  </searchComponent>
  <requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">suggest</str>
      <str name="spellcheck.onlyMorePopular">true</str>
      <str name="spellcheck.count">5</str>
      <str name="spellcheck.collate">true</str>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>

The lookup of matching suggestions in a dictionary is implemented by subclasses of the Lookup class. The implementations included in Solr are:

  • JaspellLookup - tree-based representation based on Jaspell (a more complex lookup built on a ternary tree),
  • TSTLookup - ternary tree based representation, capable of immediate data structure updates,
  • FSTLookup - automaton based representation; slower to build, but consumes far less memory at runtime (see performance notes below).
  • WFSTLookup - weighted automaton representation: an alternative to FSTLookup for more fine-grained ranking. Solr 3.6+

For practical purposes all of the above implementations will most likely run at similar speed when requests are made via the HTTP stack (which will become the bottleneck). Direct benchmarks of these classes indicate that (W)FSTLookup provides better performance than the other two methods (JaspellLookup and TSTLookup), at a much lower memory cost. JaspellLookup can provide "fuzzy" suggestions, though this functionality is not currently exposed (it's a one line change in JaspellLookup). Support for infix-suggestions is planned for FSTLookup (which would be the only structure to support these).

An example of an autosuggest request:

http://localhost:8983/solr/suggest?q=ac

And the corresponding response:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="ac">
        <int name="numFound">2</int>
        <int name="startOffset">0</int>
        <int name="endOffset">2</int>
        <arr name="suggestion">
          <str>acquire</str>
          <str>accommodate</str>
        </arr>
      </lst>
      <str name="collation">acquire</str>
    </lst>
  </lst>
</response>
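
As a rough illustration of how a client might use this handler, the Python sketch below builds the request URL and pulls the suggestions out of a response. The helper names build_suggest_url and parse_suggestions are made up for this example, the sample XML is simply the response shown above, and no network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# The response shown above, embedded so the sketch runs without a Solr server.
SAMPLE_RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="spellcheck">
<lst name="suggestions">
<lst name="ac">
<int name="numFound">2</int>
<int name="startOffset">0</int>
<int name="endOffset">2</int>
<arr name="suggestion">
<str>acquire</str>
<str>accommodate</str>
</arr>
</lst>
<str name="collation">acquire</str>
</lst>
</lst>
</response>"""


def build_suggest_url(prefix, params=None,
                      base="http://localhost:8983/solr/suggest"):
    """Build a request URL (hypothetical helper).

    `params` may override handler defaults, e.g. {"spellcheck.count": 3}.
    """
    query = {"q": prefix}
    query.update(params or {})
    return base + "?" + urlencode(query)


def parse_suggestions(xml_text):
    """Return {input token: [suggested terms]} from a response like the one above."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    suggestions = root.find("lst[@name='spellcheck']/lst[@name='suggestions']")
    result = {}
    if suggestions is None:
        return result
    # Only the nested <lst> elements are per-token entries; the collation
    # is a <str> sibling and is skipped here.
    for entry in suggestions.findall("lst"):
        terms = entry.findall("arr[@name='suggestion']/str")
        result[entry.get("name")] = [t.text for t in terms]
    return result


print(build_suggest_url("ac", {"spellcheck.count": 3}))
print(parse_suggestions(SAMPLE_RESPONSE))
```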

Configuration

The configuration snippet above shows a few common configuration parameters. A complete list of them is best found in the source code of each Lookup class, but here is an overview:

SpellCheckComponent configuration

searchComponent/@name - arbitrary name for this component

spellchecker list:

  • name - a symbolic name of this spellchecker (can be later referred to in URL parameters and in SearchHandler configuration - see the section below)

  • classname - Suggester, to provide the autocomplete functionality

  • lookupImpl - Lookup implementation. These in-memory implementations are available:

    • org.apache.solr.spelling.suggest.tst.TSTLookupFactory - a simple compact ternary trie based lookup

    • org.apache.solr.spelling.suggest.jaspell.JaspellLookupFactory - a more complex lookup based on a ternary trie from the JaSpell project.

    • org.apache.solr.spelling.suggest.fst.FSTLookupFactory - automaton-based lookup

    • org.apache.solr.spelling.suggest.fst.WFSTLookupFactory - weighted automaton-based lookup

  • buildOnCommit - if set to true then the Lookup data structure will be rebuilt after commit. If false (default) then the Lookup data will be built only when requested (by URL parameter spellcheck.build=true). NOTE: the currently implemented Lookup classes keep their data in memory, so unlike spellchecker data this data is discarded on core reload and is not available until you invoke the build command, either explicitly or implicitly via commit.

  • sourceLocation - location of the dictionary file. If not empty then this is a path to a dictionary file (see below). If this value is empty then the main index will be used as a source of terms and weights.

  • field - if sourceLocation is empty then terms from this field in the index will be used when building the trie.

  • threshold - threshold is a value in [0..1] representing the minimum fraction of documents (of the total) where a term should appear, in order to be added to the lookup dictionary.

  • storeDir - where to store the index data on the disk (else use in-memory).

Dictionary

When a file-based dictionary is used (non-empty sourceLocation parameter above) then it's expected to be a plain text file in UTF-8 encoding. Blank lines and lines that start with a '#' are ignored. Each remaining line must consist of either a string without a literal TAB (\u0009) character, or a string followed by a TAB and a floating-point weight.

Example:

# This is a sample dictionary file.
acquire
accidentally\t2.0
accommodate\t3.0

If the weight is missing it's assumed to be equal to 1.0. Weights affect the sorting of matching suggestions when spellcheck.onlyMorePopular=true is selected - weights are treated as a "popularity" score, with higher-weighted suggestions preferred over lower-weighted ones.
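
The file format described above can be read with a few lines of Python. This is only a sketch of the stated rules (skip blank and '#' lines, split on TAB, default the weight to 1.0), not the parser Solr itself uses; parse_dictionary is a name invented for this example.

```python
def parse_dictionary(text, default_weight=1.0):
    """Parse dictionary text into a list of (term, weight) pairs."""
    entries = []
    for line in text.splitlines():
        # Blank lines and comment lines are ignored.
        if not line.strip() or line.startswith("#"):
            continue
        # Everything before the first TAB is the term (which may be a phrase);
        # anything after it is the weight.
        term, _, weight = line.partition("\t")
        entries.append((term, float(weight) if weight.strip() else default_weight))
    return entries


# "\t" inside a normal Python string literal is a real TAB character.
SAMPLE = (
    "# This is a sample dictionary file.\n"
    "acquire\n"
    "accidentally\t2.0\n"
    "accommodate\t3.0\n"
)

print(parse_dictionary(SAMPLE))
```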

Please note that the format of the file is not limited to single terms but can also contain phrases - which is an improvement over the TermsComponent that you could also use for a simple version of autocomplete functionality.

FSTLookup has a built-in mechanism to discretize weights into a fixed set of buckets (to speed up suggestions). The number of buckets is configurable.

WFSTLookup does not use buckets, but instead a shortest path algorithm. Note that it expects weights to be whole numbers.
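
To make the idea of bucketing concrete, here is a small Python sketch that maps weights onto a fixed number of buckets with simple linear scaling. The linear scheme and the function name discretize are assumptions made for illustration; the text does not say how FSTLookup actually assigns buckets.

```python
def discretize(weights, buckets=10):
    """Map each weight to a bucket index in [0, buckets - 1] (linear scaling)."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # All weights are equal, so everything lands in the same bucket.
        return [0 for _ in weights]
    span = hi - lo
    # The largest weight would scale to `buckets`, so clamp it into the top bucket.
    return [min(buckets - 1, int((w - lo) / span * buckets)) for w in weights]


print(discretize([1.0, 2.0, 3.0], buckets=4))
```

Once weights are reduced to a handful of buckets, suggestions in the same bucket are no longer distinguished by weight, which is the trade-off WFSTLookup avoids.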

Threshold parameter

As mentioned above, if the sourceLocation parameter is empty then the terms from a field indicated by the field parameter are used. It's often the case that due to imperfect source data there are many uncommon or invalid terms that occur only once in the whole corpus (e.g. OCR errors, typos, etc). According to Zipf's law these actually form the majority of terms, which means that a dictionary built indiscriminately from a real-life index would consist mostly of uncommon terms, and its size would be enormous. To avoid this and to reduce the size of the in-memory structures it's best to set the threshold parameter to a value slightly above zero (0.5% in the example above). This already vastly reduces the size of the dictionary by skipping "hapax legomena" while still preserving most of the common terms. This parameter has no effect when using a file-based dictionary - it's assumed that only useful terms are found there.
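
The effect of the threshold can be sketched in Python as a filter on document frequency. The term counts below are invented, the helper name filter_by_threshold is hypothetical, and treating the threshold as an inclusive lower bound is an assumption based on the "minimum fraction" wording.

```python
def filter_by_threshold(doc_freqs, num_docs, threshold):
    """Keep terms that appear in at least `threshold` (a fraction) of all documents."""
    return {
        term: df
        for term, df in doc_freqs.items()
        if df / num_docs >= threshold
    }


# Invented document frequencies for an index of 1000 documents.
doc_freqs = {"solr": 120, "search": 40, "lucene": 8, "serach": 1, "xqzv": 1}

# With threshold=0.005 a term must occur in at least 5 of the 1000 documents,
# so the two one-off terms (a typo and an OCR-style error) are dropped.
print(filter_by_threshold(doc_freqs, num_docs=1000, threshold=0.005))
```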

SearchHandler configuration

In the example above we add a new handler that uses SearchHandler with a single SearchComponent that we just defined, namely the suggest component. Then we define a few defaults for this component (that can be overridden with URL parameters):

  • spellcheck=true - because we always want to run the Suggester for queries submitted to this handler.

  • spellcheck.dictionary=suggest - this is the name of the dictionary component that we configured above.

  • spellcheck.onlyMorePopular=true - if this parameter is set to true then the suggestions will be sorted by weight ("popularity") - the count parameter will effectively limit this to a top-N list of best suggestions. If this is set to false then suggestions are sorted alphabetically.

  • spellcheck.count=5 - specifies to return up to 5 suggestions.

  • spellcheck.collate=true - to provide a query collated with the first matching suggestion.
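
The interplay of spellcheck.onlyMorePopular and spellcheck.count can be pictured with a small Python sketch over the terms and weights from the sample dictionary file. It imitates the described behaviour only; the function top_suggestions and the alphabetical tie-break among equal weights are assumptions, not Solr's implementation.

```python
def top_suggestions(entries, prefix, count=5, only_more_popular=True):
    """Return up to `count` terms starting with `prefix`.

    entries: iterable of (term, weight) pairs.
    """
    matches = [(term, weight) for term, weight in entries if term.startswith(prefix)]
    if only_more_popular:
        # Highest weight first; ties broken alphabetically (an assumption).
        matches.sort(key=lambda tw: (-tw[1], tw[0]))
    else:
        matches.sort(key=lambda tw: tw[0])
    return [term for term, _ in matches[:count]]


entries = [("acquire", 1.0), ("accidentally", 2.0), ("accommodate", 3.0)]

print(top_suggestions(entries, "ac", count=2, only_more_popular=True))
print(top_suggestions(entries, "ac", count=2, only_more_popular=False))
```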

Tips and tricks

* Use (W)FSTLookup to conserve memory (unless you need a more sophisticated matching, then look at JaspellLookup). There are some benchmarks of all four implementations: SOLR-1316 (outdated) and a bit newer here: SOLR-2378, and here: LUCENE-3714. The class to perform these benchmarks is in the source tree and is called LookupBenchmarkTest.

* Use threshold parameter to limit the size of the trie, to reduce the build time and to remove invalid/uncommon terms. Values below 0.01 should be sufficient, greater values can be used to limit the impact of terms that occur in a larger portion of documents. Values above 0.5 probably don't make much sense.

* Don't forget to invoke spellcheck.build=true after core reload. Or extend the Lookup class to do this on init(), or implement the load/save methods in Lookup to persist this data across core reloads.

* If you want to use a dictionary file that contains phrases (actually, strings that can be split into multiple tokens by the default QueryConverter) then define a different QueryConverter like this:

  <!--
The SpellingQueryConverter to convert raw (CommonParams.Q) queries into tokens. Define a simple regular expression
in your QueryAnalyzer chain if you want to strip off field markup, boosts, ranges, etc.
-->
<queryConverter name="queryConverter" class="org.apache.solr.spelling.SuggestQueryConverter"/>

An example for setting up a typical case of auto-suggesting phrases (e.g. previous queries from query logs with associated score) is here:

The next blog post will describe the core algorithm behind search suggestions in detail.
