As you know, I've been playing with Solr lately, trying to see how feasible it would be to customize it for our needs. We have been a Lucene shop for a while, and we've built our own search framework around it, which has served us well so far. The rationale for moving to Solr is driven primarily by the need to expose our search tier as a service for our internal applications. While it would have been relatively simple (probably simpler) to slap on an HTTP interface over our current search tier, we also want to use the other Solr features such as incremental indexing and replication.

One of our challenges to using Solr is that the way we do search is quite different from the way Solr does search. A query string passed to the default Solr search handler is parsed into a Lucene query and a single search call is made on the underlying index. In our case, the query string is passed to our taxonomy, and depending on the type of query (as identified by the taxonomy), it is sent through one or more sub-handlers. Each sub-handler converts the query into a (different) Lucene query and executes the search against the underlying index. The results from each sub-handler are then layered together to present the final search result.

Conceptually, the customization is quite simple - simply create a custom subclass of RequestHandlerBase (as advised on this wiki page) and override the handleRequestBody(SolrQueryRequest, SolrQueryResponse) method. In reality, I had quite a tough time doing this, admittedly caused (at least partly) by my ignorance of Solr internals. However, I did succeed, so, in this post, I outline my solution, along with some advice I feel would be useful to others embarking on a similar route.

Configuration and Code

The handler is configured to trigger in response to a /solr/mysearch request. Here is the (rewritten for readability) XML snippet from my solrconfig.xml file. I used the "invariants" block to pass in configuration parameters for the handler.

 1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
  ...
<requestHandler name="/mysearch"
class="org.apache.solr.handler.ext.MyRequestHAndler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="fl">*,score</str>
<str name="wt">xml</str>
</lst>
<lst name="invariants">
<str name="prop1">value1</str>
<int name="prop2">value2</int>
<!-- ... more config items here ... -->
</lst>
</requestHandler>
...

And here is the (also rewritten for readability) code for the custom handler. I used the SearchHandler and MoreLikeThisHandler as my templates, but diverged from it in several ways in order to accomodate my requirements. I will describe them below.

  1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
package org.apache.solr.handler.ext;

// imports omitted

public class MyRequestHandler extends RequestHandlerBase {

  private String prop1;
private String prop2;
...
private TaxoService taxoService; @Override
public void init(NamedList args) {
super.init(args);
this.prop1 = invariants.get("prop1");
this.prop2 = Integer.valueOf(invariants.get("prop2"));
...
this.taxoService = new TaxoService(prop1);
} @Override
public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
throws Exception { // extract params from request
SolrParams params = req.getParams();
String q = params.get(CommonParams.Q);
String[] fqs = params.getParams(CommonParams.FQ);
int start = 0;
try { start = Integer.parseInt(params.get(CommonParams.START)); }
catch (Exception e) { /* default */ }
int rows = 0;
try { rows = Integer.parseInt(params.get(CommonParams.ROWS)); }
catch (Exception e) { /* default */ }
SolrPluginUtils.setReturnFields(req, rsp); // build initial data structures
TaxoResult taxoResult = taxoService.getResult(q);
SolrDocumentList results = new SolrDocumentList();
SolrIndexSearcher searcher = req.getSearcher();
Map<String,SchemaField> fields = req.getSchema().getFields();
int ndocs = start + rows;
Filter filter = buildFilter(fqs, req);
Set<Integer> alreadyFound = new HashSet<Integer>(); // invoke the various sub-handlers in turn and return results
doSearch1(results, searcher, q, filter, taxoResult, ndocs, req,
fields, alreadyFound);
doSearch2(results, searcher, q, filter, taxoResult, ndocs, req,
fields, alreadyFound);
// ... more sub-handler calls here ... // build and write response
float maxScore = 0.0F;
int numFound = 0;
List<SolrDocument> slice = new ArrayList<SolrDocument>();
for (Iterator<SolrDocument> it = results.iterator(); it.hasNext(); ) {
SolrDocument sdoc = it.next();
Float score = (Float) sdoc.getFieldValue("score");
if (maxScore < score) {
maxScore = score;
}
if (numFound >= start && numFound < start + rows) {
slice.add(sdoc);
}
numFound++;
}
results.clear();
results.addAll(slice);
results.setNumFound(numFound);
results.setMaxScore(maxScore);
results.setStart(start);
rsp.add("response", results); } private Filter buildFilter(String[] fqs, SolrQueryRequest req)
throws IOException, ParseException {
if (fqs != null && fqs.length > 0) {
BooleanQuery fquery = new BooleanQuery();
for (int i = 0; i < fqs.length; i++) {
QParser parser = QParser.getParser(fqs[i], null, req);
fquery.add(parser.getQuery(), Occur.MUST);
}
return new CachingWrapperFilter(new QueryWrapperFilter(fquery));
}
return null;
} private void doSearch1(SolrDocumentList results,
SolrIndexSearcher searcher, String q, Filter filter,
TaxoResult taxoResult, int ndocs, SolrQueryRequest req,
Map<String,SchemaField> fields, Set<Integer> alreadyFound)
throws IOException {
// check entry condition
if (! canEnterSearch1(q, filter, taxoResult)) {
return;
}
// build custom query and extra fields
Query query = buildCustomQuery1(q, taxoResult);
Map<String,Object> extraFields = new HashMap<String,Object>();
extraFields.put("search_type", "search1");
boolean includeScore =
req.getParams().get(CommonParams.FL).contains("score"));
append(results, searcher.search(
query, filter, maxDocsPerSearcherType).scoreDocs,
alreadyFound, fields, extraFields, maprelScoreCutoff,
searcher.getReader(), includeScore);
} // ... more doSearchXXX() calls here ... private void append(SolrDocumentList results, ScoreDoc[] more,
Set<Integer> alreadyFound, Map<String,SchemaField> fields,
Map<String,Object> extraFields, float scoreCutoff,
SolrIndexReader reader, boolean includeScore) throws IOException {
for (ScoreDoc hit : more) {
if (alreadyFound.contains(hit.doc)) {
continue;
}
Document doc = reader.document(hit.doc);
SolrDocument sdoc = new SolrDocument();
for (String fieldname : fields.keySet()) {
SchemaField sf = fields.get(fieldname);
if (sf.stored()) {
sdoc.addField(fieldname, doc.get(fieldname));
}
}
for (String extraField : extraFields.keySet()) {
sdoc.addField(extraField, extraFields.get(extraField));
}
if (includeScore) {
sdoc.addField("score", hit.score);
}
results.add(sdoc);
alreadyFound.add(hit.doc);
}
} //////////////////////// SolrInfoMBeans methods ////////////////////// @Override
public String getDescription() {
return "My Search Handler";
} @Override
public String getSource() {
return "$Source$";
} @Override
public String getSourceId() {
return "$Id$";
} @Override
public String getVersion() {
return "$Revision$";
}
}

Configuration Parameters - I started out baking most of my "configuration" parameters as constants within the handler code, but later moved them into the invariants block in the XML declaration. Not ideal, since we still need to touch the solrconfig.xml file (which is regarded as application code in our environment) to change behavior. The ideal solution, given the circumstances, would probably be to use JNDI to hold the configuration parameters and have the handler connect to the JNDI to pull the properties it needs.

Using Filter - The MoreLikeThis handler converts the fq (filter query) parameter into a List of Query objects, because this is what is needed to pass into a searcher.getDocList(). In my case, I couldn't use DocListAndSet because DocList is unmodifiable (ie, DocList.add() throws an UnsupportedOperationException). So I fell back to the pattern I am used to, which is getting the ScoreDoc[] array from a standard searcher.search(Query,Filter,numDocs) call. That is why the buildFilter() above returns a Filter and not a List<Query>.

Connect to external services - My handler needs to connect to the taxonomy service. Our taxonomy exposes an RMI service with a very rich and fine-grained API. I tried to use this at first, but ran into problems because it needs access to configuration files on the local system, and Jetty couldn't see these files because it was not within its context. I ended up solving for this by exposing a coarse grained JSON service over HTTP on the taxonomy service. The handler calls it once per query and gets back all the information that it needs in a single call. Probably not ideal, since now the logic is spread out in two places - I will probably revisit the RMI client integration again in the future.

Layer multiple resultsets - This is the main reason for writing the custom handler. Most of the work happens in the append() method above. Each sub-handler calls SolrSearcher.search(Query, Filter, numDocs) and populates its resulting ScoreDocs array into a List<SolrDocument>. Since previous sub-handlers may have already returned a result, subsequent sub-handlers check against a Set of docIds.

Add a pseudo-field to the Document - There are currently two competing initiatives in Solr (SOLR-1566 and SOLR-1298) on how to handle this situation. Since I was populating SolrDocument objects (this was one of the reasons I started using SolrDocumentList), it was relatively simple for me to pass in a Map of extra fields which are just tacked on to the end of the SolrDocument.

Some Miscellaneous advice

Here is some advice and tips which I wish someone had told me before I started out on this.

For your own sanity, standardize on a Solr release. I chose 1.4.1 which is the latest at the time of writing this. Prior to that, I was developing within the Solr trunk. One day (after about 60-70% of my code was working), I decided to do an svn update, and all of a sudden there was a huge bunch of compile failures (in my code as well as the Solr code). Some of them were probably caused by missing/out-of-date JARs in my .classpath. But the point is that Solr code is being actively developed, and there is quite a bit of code churn, and if you really want to work on the trunk (or a pre-release branch), you should be ready to deal with these situtations.

Solr is well designed (so the flow is kind of intuitive) and reasonably well documented, but there are some places where you will probably need to step through the code in a debugger to figure out what's going on. I am still using the Jetty container in the examples subdirectory. This page on Lucid Imagination outlines the steps you need to run Solr within Eclipse using the Jetty plugin, but thanks to the information on this StackOverlow page, all I did was add some command-line parameters to the java call, like so:

1
2
3
sujit@cyclone:example$ java -Dsolr.solr.home=my_schema \
-agentlib:jdwp=transport=dt_socket,server=y,address=8883,suspend=n \
-jar start.jar

and then set up an external debug configuration for localhost:8883 in Eclipse, and I could step through the code just fine.

Solr has very aggressive caching (which is great for a production environment), but for development, you need to disable it. I did this by commenting out all the cache references for filterCache, queryResultCache and documentCache in solrconfig.xml, and changed the httpCaching to use never304="true". All these are in the solrconfig.xml file.

Conclusion

The approach I described here is not as performant as the "standard" flow. Because I have to do multiple searches in a single request, I am doing more I/O. I am also consuming more CPU cycles since I have to dedup documents across each layer. I am also consuming more memory per request because I populate the SolrDocument inline rather than just pass the DocListAndSet to the ResponseBuilder. I don't see a way around it, though, given the nature of my requirements.

If you are a Solr expert, or someone who is familiar with the internals, I would appreciate hearing your thoughts about this approach - criticisms and suggestions are welcome.

http://sujitpal.blogspot.com/2011/02/solr-custom-search-requesthandler.html

Solr: a custom Search RequestHandler的更多相关文章

  1. 通过Google Custom Search API 进行站内搜索

    今天突然想把博客的搜索改为google的站内搜索,印象中google adsense中好像提高这个站内搜索的代码,但苦逼的是google adsense帐号一直审核不通过,所以只能通过google c ...

  2. [Angular 2] Filter items with a custom search Pipe in Angular 2

    This lessons implements the Search Pipe with a new SearchBox component so you can search through eac ...

  3. Custom SOLR Search Components - 2 Dev Tricks

    I've been building some custom search components for SOLR lately, so wanted to share a couple of thi ...

  4. Solr调研总结

    http://wiki.apache.org/solr/ Solr调研总结 开发类型 全文检索相关开发 Solr版本 4.2 文件内容 本文介绍solr的功能使用及相关注意事项;主要包括以下内容:环境 ...

  5. solr教程,值得刚接触搜索开发人员一看

    http://blog.csdn.net/awj3584/article/details/16963525 Solr调研总结 开发类型 全文检索相关开发 Solr版本 4.2 文件内容 本文介绍sol ...

  6. Solr总结

    http://www.cnblogs.com/guozk/p/3498831.html Solr调研总结 开发类型 全文检索相关开发 Solr版本 4.2 文件内容 本文介绍solr的功能使用及相关注 ...

  7. 【转载】solr教程,值得刚接触搜索开发人员一看

    转载:http://blog.csdn.net/awj3584/article/details/16963525 Solr调研总结 开发类型 全文检索相关开发 Solr版本 4.2 文件内容 本文介绍 ...

  8. Solr+Tomcat+zookeeper部署实战

    一 .安装solr 环境说明:centos 7.3,solr 6.6,zookeeper3.4,Tomcat8.5,jdk1.8 zookeeper的部署请参考:http://www.cnblogs. ...

  9. 全文搜索引擎——Solr

    1.部署solr a.下载并解压Solr b.导入项目(独立项目): 将解压后的 server\solr-webapp 下的 webapp文件夹 拷贝到tomcat的webapps下,并重命名为 so ...

随机推荐

  1. webDriver对element进行操作

    非常感谢原作者:eastmount,原地址:http://blog.csdn.net/eastmount/article/details/48108259     感谢感谢 这篇文章主要Seleniu ...

  2. 分类和逻辑回归(Classification and logistic regression)

    分类问题和线性回归问题问题很像,只是在分类问题中,我们预测的y值包含在一个小的离散数据集里.首先,认识一下二元分类(binary classification),在二元分类中,y的取值只能是0和1.例 ...

  3. 解决windows上安装TortoiseSVN后不能使用命令行问题

    一般我们安装TortoiseSVN的时候都是一路next安装好之后就右键开始使用.但是有时候我们需要在windows的命令窗口下执行SVN命令.这时候我们就会发现svn help之后显示没svn这个命 ...

  4. Python与快速排序

    这个算法系列主要是自己学习算法过程中动手实践一下,写这个文章作为笔记和分享个人心得,如有错误请各位提出. 注:转载请说明出处 问题提出: 将以下数据升序排列:5, 2, 8, 6, 4, 9, 7, ...

  5. IE6中PNG图片背景无法透明显示的最佳解决方案

    我想,对于像我这样的年轻的程序员来说,做网页开发时用chrome.firefox或者ie10什么的大约是被宠坏了.所以当最近做的项目不得不在恐龙化石般的ie6上运行时,ie6种种诡异的行径简直让我发指 ...

  6. MySQL查看用户权限的两种方法

    http://yanue.net/post-96.html MySQL查看用户权限命令的两方法: 一. 使用MySQL grants MySQL grant详细用法见:http://yanue.net ...

  7. ror笔记2

    在rails app的 config 文件夹中新建unicorn.rb内容如下 worker_processes 2 working_directory "/home/mage/boleht ...

  8. Struts2结果页面配置(Result)

    1.全局页面配置 全局结果页面:针对一个包下的所有Action的页面跳转都可生效 <global-results> <result name="xxx">/ ...

  9. 最流行的JavaScript代码规范

    什么是最佳的JavaScript代码编程规范?这可能是一个众口难调的问题.那么,不妨换个问题,什么代码规范最流行? sideeffect.kr通过分析GitHub上托管的开源代码,得出了一些有趣的结果 ...

  10. 服务器意外重启导致storm报错的问题处理

    解决方法 cat /opt/storm-0.8.2/conf/storm.yaml中找到storm.local.dir设定的目录,备份supervisor和workers两个文件夹,#nohup su ...