Facets with Lucene

Posted on August 1, 2014 by Pascal Dimassimo in Latest Articles

During the development of our latest product, Norconex Content Analytics, we decided to add facets to the search interface. They allow for exploring the indexed content easily. Solr and Elasticsearch both have facet implementations that work on top of Lucene. But Lucene also offers simple facet implementations that can be used out of the box. And because Norconex Content Analytics is based on Lucene, we decided to go with those implementations.

We’ll look at those facet implementations in this blog post, but first, let’s talk about a new feature of Lucene 4 that is used by all of them.

DOCVALUES

DocValues are a new addition to Lucene 4. What are they? Simply put, they are a means to associate a value with each document. We can have multiple DocValues per document. Later on, we can retrieve the value associated with a specific document.

You may wonder how a DocValue differs from a stored field. The difference is that they are not optimized for the same usage. Whereas all stored fields of a single document are meant to be loaded together (for example, when we need to display the document as a search result), the DocValues of a field are meant to be loaded at once for all documents. For example, if you need to retrieve all the values of a field for all documents, then iterating over all documents and retrieving a stored field would be slow. But if a DocValue is used, loading all of them for all documents is easy and efficient. A DocValue is stored in a column-stride fashion, so all the values of a field are kept together for easy access. You can learn more about DocValues in multiple places on the Web.
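To make the contrast concrete, here is a toy sketch in plain Java. It uses no Lucene classes and does not reflect how Lucene lays out its files; the names ColumnStrideSketch, sumFromStoredFields and sumFromColumn are hypothetical. It only compares reading one field through per-document records with reading it from a single per-field array.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy contrast between row-oriented stored fields and a column-stride value array. */
final class ColumnStrideSketch {

    /** Row-oriented: every document carries all of its fields, like stored fields. */
    static long sumFromStoredFields(List<Map<String, Object>> storedDocs, String field) {
        long sum = 0;
        for (Map<String, Object> doc : storedDocs) {
            // The whole record has to be fetched just to read one field.
            sum += (Long) doc.get(field);
        }
        return sum;
    }

    /** Column-oriented: one array per field, indexed by document ID. */
    static long sumFromColumn(long[] column) {
        long sum = 0;
        for (long value : column) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] prices = {100L, 25L, 7L};
        List<Map<String, Object>> storedDocs = new ArrayList<>();
        long[] priceColumn = new long[prices.length];
        for (int docId = 0; docId < prices.length; docId++) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("title", "doc-" + docId);
            doc.put("price", prices[docId]);
            storedDocs.add(doc);
            priceColumn[docId] = prices[docId];
        }
        // Both calls compute the same total; only the access pattern differs.
        System.out.println(sumFromStoredFields(storedDocs, "price"));
        System.out.println(sumFromColumn(priceColumn));
    }
}
```

Both methods return the same total. The point of the second one is that it touches a single contiguous array instead of one record per document.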

Lucene allows different kinds of DocValues. We have numeric, binary, sorted (single string) and sorted set (multiple strings) DocValues. For example, a DocValue to store a “price” value should probably be a numeric DocValue. But if you want to save an alphanumeric identifier for each document, a sorted DocValue should be used.

Why is this important for faceting? Before DocValues, when an application wanted to do faceting, a common approach to building the facet values was field uninverting, that is, going over the terms of an indexed field and rebuilding the original association between documents and their values. This process needed to be redone every time new documents were indexed. But with DocValues, since the association between a document and a value is maintained in the index, the work needed to build facets is much simpler.
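As a rough illustration of what uninverting means, here is a minimal sketch in plain Java. It assumes a single-valued field and models the inverted index as a map from term to document IDs; UninvertSketch and uninvert are hypothetical names, not Lucene APIs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy illustration of field uninverting: term -> documents becomes document -> term. */
final class UninvertSketch {

    /**
     * Walks every posting list of a single-valued field and rebuilds
     * the per-document value array that DocValues keep ready-made.
     */
    static String[] uninvert(Map<String, List<Integer>> postings, int maxDoc) {
        String[] valueByDoc = new String[maxDoc];
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                valueByDoc[docId] = entry.getKey();
            }
        }
        return valueByDoc;
    }

    public static void main(String[] args) {
        // Hypothetical inverted index for an "author" field over 4 documents.
        Map<String, List<Integer>> postings = new TreeMap<>();
        postings.put("Douglas Adams", Arrays.asList(0, 2));
        postings.put("Terry Pratchett", Arrays.asList(1, 3));
        System.out.println(Arrays.toString(uninvert(postings, 4)));
    }
}
```

The loop has to visit every term and every posting, which is why repeating it after each index update was expensive.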

Let’s now look at the facet implementations in Lucene.

STRING FACET

The first facet implementation available in Lucene that we will look at is what we expect when we think of facets. It allows for counting the documents that share the same string value.

When indexing a document, you have to use a SortedSetDocValuesFacetField. Here is an example:

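// Map the "author" dimension to the "facet_author" index field.
FacetsConfig config = new FacetsConfig();
config.setIndexFieldName("author", "facet_author");

Document doc = new Document();
doc.add(new SortedSetDocValuesFacetField("author", "Douglas Adams"));
// config.build() translates the facet field into the actual indexed fields.
writer.addDocument(config.build(doc));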

With this, Lucene will create a “facet_author” field with the author value indexed in it. But Lucene will also create a DocValue named “facet_author” containing the value. When building the facets at search-time, this DocValue will be used.

You’ve probably also noticed the FacetsConfig object. It allows us to associate a dimension name (“author”) with a field name (“facet_author”). Actually, when Lucene indexes the value in the “facet_author” field and DocValue, it also prefixes the value with the dimension name. This allows us to have different facets (dimensions) indexed in the same field and DocValue. If we had omitted the call to setIndexFieldName, the facets would have been indexed in a field called “$facets” (with the same name for the DocValue).

At search time, here is the code we would use to gather the author facets:

// The reader state is costly to build; create it once and reuse it.
SortedSetDocValuesReaderState state =
    new DefaultSortedSetDocValuesReaderState(reader, "facet_author");
FacetsCollector fc = new FacetsCollector();
// Run the query, keep the top 10 hits and collect the matches for faceting.
FacetsCollector.search(searcher, query, 10, fc);
Facets facets = new SortedSetDocValuesFacetCounts(state, fc);
FacetResult result = facets.getTopChildren(10, "author");
// Iterate over labelValues: childCount can exceed the number of labels returned.
for (LabelAndValue lv : result.labelValues) {
    System.out.println(String.format("%s (%s)", lv.label, lv.value));
}

Here, the DefaultSortedSetDocValuesReaderState will be responsible for loading all the dimensions from the specified DocValue (facet_author). Note that this “state” object is costly to build, so it should be re-used if possible. Then, SortedSetDocValuesFacetCounts will be able to load the values of a specific dimension using the “state” object and to compute the count for each distinct value.
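The counting step can be pictured with a small sketch in plain Java. This is not Lucene's implementation: StringFacetSketch, count and topChildren are hypothetical names, the “|” separator is an arbitrary choice for the sketch, and the matching documents are passed in as a plain array standing in for what a collector would gather.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy string-facet counter over per-document values prefixed with a dimension name. */
final class StringFacetSketch {

    // Arbitrary separator between dimension and label, chosen for this sketch only.
    static final char SEP = '|';

    /** Counts, for one dimension, how many matching documents carry each label. */
    static Map<String, Integer> count(String[][] valuesByDoc, int[] matchingDocs, String dim) {
        String prefix = dim + SEP;
        Map<String, Integer> counts = new HashMap<>();
        for (int docId : matchingDocs) {
            for (String value : valuesByDoc[docId]) {
                // Values of other dimensions share the same field and are skipped.
                if (value.startsWith(prefix)) {
                    counts.merge(value.substring(prefix.length()), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Returns at most topN "label (count)" strings, highest count first. */
    static List<String> topChildren(Map<String, Integer> counts, int topN) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Ties are broken alphabetically to keep the order stable.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, entries.size()); i++) {
            result.add(entries.get(i).getKey() + " (" + entries.get(i).getValue() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] valuesByDoc = {
            {"author" + SEP + "Douglas Adams", "year" + SEP + "1979"},
            {"author" + SEP + "Douglas Adams", "year" + SEP + "1980"},
            {"author" + SEP + "Terry Pratchett", "year" + SEP + "1983"},
        };
        int[] matchingDocs = {0, 1, 2}; // stand-in for the hits gathered by a collector
        for (String line : topChildren(count(valuesByDoc, matchingDocs, "author"), 10)) {
            System.out.println(line);
        }
    }
}
```

The dimension prefix is what lets several facets share one field: the counter simply ignores values that belong to another dimension.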

You can find more code examples in the file SimpleSortedSetFacetsExample.java in the Lucene sources.

NUMERIC RANGE FACET

This next facet implementation is to be used with numbers to build range facets. For example, it would group documents of the same price range together.

When indexing a document, you have to add a numeric DocValue for each document. Like this:

doc.add(new NumericDocValuesField("price", 100L));

In that case, we only need to use a standard NumericDocValuesField and not a specialized FacetField.

When searching, we need to first define the set of ranges that we want. Here is how it could be built:

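// Arguments: label, min, minInclusive, max, maxInclusive.
LongRange[] ranges = new LongRange[3];
ranges[0] = new LongRange("0-10", 0, true, 10, false);
ranges[1] = new LongRange("10-100", 10, true, 100, false);
// The lower bound is inclusive, so the label says ">=100"; the upper bound is open-ended.
ranges[2] = new LongRange(">=100", 100, true, Long.MAX_VALUE, true);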

With those ranges, we can build the facets:

FacetsCollector fc = new FacetsCollector();
FacetsCollector.search(searcher, query, 10, fc);
LongRangeFacetCounts facets = new LongRangeFacetCounts("price", fc, ranges);
// Ask for as many children as there are ranges so that every range is reported.
FacetResult result = facets.getTopChildren(ranges.length, "price");
for (LabelAndValue lv : result.labelValues) {
    System.out.println(String.format("%s (%s)", lv.label, lv.value));
}

Lucene will calculate the count for each range.
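Conceptually, the counting amounts to testing each matching document's value against every range. The following plain-Java sketch shows that idea under simple assumptions (one value per document, a linear scan over the ranges); RangeFacetSketch and its Range class are hypothetical and are not the Lucene classes of similar name.

```java
/** Toy range-facet counter: one numeric value per document, tested against each range. */
final class RangeFacetSketch {

    /** A labelled range whose bounds can each be inclusive or exclusive. */
    static final class Range {
        final String label;
        final long min;
        final boolean minInclusive;
        final long max;
        final boolean maxInclusive;

        Range(String label, long min, boolean minInclusive, long max, boolean maxInclusive) {
            this.label = label;
            this.min = min;
            this.minInclusive = minInclusive;
            this.max = max;
            this.maxInclusive = maxInclusive;
        }

        boolean accept(long value) {
            boolean aboveMin = minInclusive ? value >= min : value > min;
            boolean belowMax = maxInclusive ? value <= max : value < max;
            return aboveMin && belowMax;
        }
    }

    /** Returns one count per range, in the same order as the ranges array. */
    static int[] count(long[] valueByDoc, int[] matchingDocs, Range[] ranges) {
        int[] counts = new int[ranges.length];
        for (int docId : matchingDocs) {
            for (int i = 0; i < ranges.length; i++) {
                if (ranges[i].accept(valueByDoc[docId])) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Range[] ranges = {
            new Range("0-10", 0, true, 10, false),
            new Range("10-100", 10, true, 100, false),
            new Range(">=100", 100, true, Long.MAX_VALUE, true),
        };
        long[] priceByDoc = {5, 10, 99, 100, 250};
        int[] matchingDocs = {0, 1, 2, 3, 4};
        int[] counts = count(priceByDoc, matchingDocs, ranges);
        for (int i = 0; i < ranges.length; i++) {
            System.out.println(ranges[i].label + " (" + counts[i] + ")");
        }
    }
}
```

Note how the inclusive and exclusive flags decide where a boundary value such as 10 or 100 lands, which is why the labels should match the bounds.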

For a code sample, see RangeFacetsExample.java in the Lucene sources.

TAXONOMY FACET

This was the first facet implementation, and it was actually available before Lucene 4. This implementation differs from the others in several respects. First, all the unique values for a taxonomy facet are stored in a separate Lucene index (often called the sidecar index). Second, this implementation supports hierarchical facets.

For example, imagine a “path” facet where “path” represents where a file was on a filesystem (or the Web). Imagine the file “/home/mike/work/report.txt”. If we were to store the path (“/home/mike/work”) as a taxonomy facet, it would actually be split into 3 unique values: “/home”, “/home/mike” and “/home/mike/work”. Those 3 values are stored in the sidecar index, each being assigned a unique ID. In the main index, a binary DocValue is created so that each document is assigned the ID of its corresponding path value (the ID from the sidecar index). In this example, if “/home/mike/work” was assigned ID 3 in the sidecar index, the DocValue for the document “/home/mike/work/report.txt” would be 3 in the main index. In the sidecar index, all values are linked together, so it is easy later on to retrieve the parents and children of each value. For example, “/home” would be the parent of “/home/mike”, which would be the parent of “/home/mike/work”. We’ll see how this information is used.
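The bookkeeping described above can be sketched in a few lines of plain Java. This is only a toy model of a sidecar taxonomy, not Lucene's data structure: TaxonomySketch and its methods are hypothetical, and ID 0 is reserved for a root entry purely so that the IDs line up with the example (“/home/mike/work” gets ID 3).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy sidecar taxonomy: every distinct path prefix gets an ID and a parent link. */
final class TaxonomySketch {
    private final Map<String, Integer> idByPath = new HashMap<>();
    private final List<String> pathById = new ArrayList<>();
    private final List<Integer> parentById = new ArrayList<>();

    TaxonomySketch() {
        // ID 0 is a root entry with no parent (-1 marks "no parent" in this sketch).
        pathById.add("/");
        parentById.add(-1);
    }

    /** Registers the path and all of its ancestors; returns the ID of the full path. */
    int add(String... components) {
        int parent = 0;
        StringBuilder path = new StringBuilder();
        for (String component : components) {
            path.append('/').append(component);
            String key = path.toString();
            Integer id = idByPath.get(key);
            if (id == null) {
                // First time this prefix is seen: assign the next free ID.
                id = pathById.size();
                idByPath.put(key, id);
                pathById.add(key);
                parentById.add(parent);
            }
            parent = id;
        }
        return parent;
    }

    String path(int id) {
        return pathById.get(id);
    }

    int parent(int id) {
        return parentById.get(id);
    }

    public static void main(String[] args) {
        TaxonomySketch taxo = new TaxonomySketch();
        int id = taxo.add("home", "mike", "work");
        // Walk from the ID stored with the document back up to the root.
        for (int cur = id; cur != -1; cur = taxo.parent(cur)) {
            System.out.println(cur + " -> " + taxo.path(cur));
        }
    }
}
```

Only the ID returned by add would need to be stored with the document; the labels and the parent links live in the taxonomy.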

Here is some code to index the path facet of a file under “/home/mike/work”:

// The taxonomy writer maintains the sidecar index.
Directory dirTaxo = FSDirectory.open(pathTaxo);
DirectoryTaxonomyWriter taxo = new DirectoryTaxonomyWriter(dirTaxo);

FacetsConfig config = new FacetsConfig();
config.setIndexFieldName("path", "facet_path");
config.setHierarchical("path", true);

Document doc = new Document();
// StringField requires an explicit Field.Store argument.
doc.add(new StringField("filename", "/home/mike/work/report.txt", Field.Store.YES));
doc.add(new FacetField("path", "home", "mike", "work"));
writer.addDocument(config.build(taxo, doc));

Notice here that we need to create a taxonomy writer, which is used to write to the sidecar index. After that, we can add the actual facets. As with SortedSetDocValuesFacetField, we need to define the configuration of the facet field (dimension name and field name). We also have to indicate that the facets will be hierarchical. Once that is set, we can use FacetField with the dimension name and the full hierarchy of values for the facet. Finally, we add the document to the main index (via the writer object), but we also need to pass the taxo writer object so that the sidecar index is updated as well.

Here is some code to retrieve those facets:

DirectoryTaxonomyReader taxoReader =
    new DirectoryTaxonomyReader(dirTaxo);
FacetsCollector fc = new FacetsCollector();
FacetsCollector.search(searcher, q, 10, fc);
Facets facetsFolder = new FastTaxonomyFacetCounts(
    "facet_path", taxoReader, config, fc);
FacetResult result = facetsFolder.getTopChildren(10, "path");

For each matching document, the ID of the facet value is retrieved (via the DocValues). Lucene computes the count for each unique facet value by counting how many documents are assigned to each ID. After that, it can fetch the actual facet values from the sidecar index using those IDs.

In the last example, we did not specify a path, so the result holds the counts for the values at the top of the “path” dimension (here, “home”). But we could pass a more specific path to get only the facets directly underneath it, for example “/home/mike/work”:

FacetResult result = facetsFolder.getTopChildren(
    10, "path", "home", "mike", "work");

This is where the hierarchical aspect of the taxonomy facets gets interesting. Because of the relations kept between the facets in the sidecar index, Lucene is able to count the documents for the facets at different levels in the hierarchy.
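One way to picture counting at several levels is to credit each matching document to its own value and to every ancestor of that value. The plain-Java sketch below does exactly that with a parent array; it is an illustration under simplifying assumptions (one value ID per document, hypothetical names TaxonomyCountSketch and countWithAncestors), and Lucene's actual strategy may differ depending on the facet configuration.

```java
import java.util.Arrays;

/** Toy hierarchical counting: per-document value IDs are rolled up through parent links. */
final class TaxonomyCountSketch {

    /**
     * parents[id] is the parent ID of a value, or -1 for the root.
     * valueIdByDoc[docId] is the single value ID stored for that document.
     */
    static int[] countWithAncestors(int[] parents, int[] valueIdByDoc, int[] matchingDocs) {
        int[] counts = new int[parents.length];
        for (int docId : matchingDocs) {
            // Credit the document's own value and every ancestor above it.
            for (int id = valueIdByDoc[docId]; id != -1; id = parents[id]) {
                counts[id]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // IDs: 0 = root, 1 = /home, 2 = /home/mike, 3 = /home/mike/work, 4 = /home/anna
        int[] parents = {-1, 0, 1, 2, 1};
        int[] valueIdByDoc = {3, 3, 2, 4};
        int[] matchingDocs = {0, 1, 2, 3};
        System.out.println(
            Arrays.toString(countWithAncestors(parents, valueIdByDoc, matchingDocs)));
    }
}
```

With these counts in hand, the children of any node can be listed with their totals, which is the kind of drill-down the hierarchy makes possible.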

Again, for more code examples about taxonomy facets, see MultiCategoryListsFacetsExample.java in the Lucene sources.

CONCLUSION

So we’ve seen that Lucene offers facet implementations out of the box. A lot of interesting features can be built on top of them! For more info, refer to the Lucene sources and javadoc.

Reposted from:

http://www.norconex.com/facets-with-lucene/
