Creating an Index from Files in Solr 4.7

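Index data does not always come from structured sources such as databases, XML, JSON, or CSV. It often comes from unstructured files such as PDF, Word, HTML, or MP3. Solr supports building an index from such files well, and it relies on Apache Tika to do so.

Let's look at how to create an index from a PDF file in Solr 4.7.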

I. Configuring the file index core

1. Create a new core

We create a new Solr core to hold the file-based index. For the steps to create a core, see:

http://blog.csdn.net/clj198606061111/article/details/21288499

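2. Prepare the jars

Under $solr_home, create a new folder named extract to hold the Solr extension jars.

From the Solr 4.7 distribution, copy solr-cell-4.7.0.jar from solr-4.7.0\dist into the new extract folder. Then copy all jars under solr-4.7.0\contrib\extraction\lib from the distribution into the extract folder as well.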
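The copy step above can also be scripted. The following is a minimal sketch, not part of the original setup: the class name CopyExtractionJars, the helper copyJars, and the default paths (solr-4.7.0 and solr_home) are assumptions chosen for illustration, so adjust them to your own layout.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class CopyExtractionJars {

    /** Copies every *.jar directly under sourceDir into targetDir and returns how many were copied. */
    static int copyJars(Path sourceDir, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        int copied = 0;
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(sourceDir, "*.jar")) {
            for (Path jar : jars) {
                Files.copy(jar, targetDir.resolve(jar.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
                copied++;
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        // Assumed locations: the unpacked distribution and $solr_home.
        Path dist = Paths.get(args.length > 0 ? args[0] : "solr-4.7.0");
        Path extract = Paths.get(args.length > 1 ? args[1] : "solr_home").resolve("extract");

        // solr-cell itself lives in dist; the Tika dependencies live in contrib/extraction/lib.
        Files.createDirectories(extract);
        Files.copy(dist.resolve("dist").resolve("solr-cell-4.7.0.jar"),
                extract.resolve("solr-cell-4.7.0.jar"), StandardCopyOption.REPLACE_EXISTING);
        int n = copyJars(dist.resolve("contrib").resolve("extraction").resolve("lib"), extract);
        System.out.println("Copied " + n + " dependency jars to " + extract);
    }
}
```

Copying by hand works just as well; the sketch only makes the two source directories explicit.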

3. Configure solrconfig.xml

Add the request handler configuration:


<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">attr_</str>
<str name="captureAttr">true</str>
</lst>
</requestHandler>

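Specify the location of the dependency jars:

Note that this relative path is resolved against the core's home directory, not against the folder that contains the configuration file. For example, my configuration file is in solr_home\core1\conf and my jars are in solr_home\extract, so the relative path is ../extract rather than ../../extract.
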

<lib dir="../extract" regex=".*\.jar" />
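To see why ../extract is the right value, the resolution can be reproduced with plain path arithmetic. This is only an illustrative sketch: the class LibDirResolution and its resolve helper are hypothetical names, and the code mimics the resolution with java.nio.file rather than calling any Solr API.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class LibDirResolution {

    /** Resolves a lib dir value against a base directory and collapses ".." segments. */
    static Path resolve(Path baseDir, String libDir) {
        return baseDir.resolve(libDir).normalize();
    }

    public static void main(String[] args) {
        Path coreDir = Paths.get("solr_home", "core1"); // core home directory
        Path confDir = coreDir.resolve("conf");         // where solrconfig.xml lives

        // Resolved against the core directory: the extract folder next to core1.
        System.out.println(resolve(coreDir, "../extract"));
        // Resolved against conf (which is NOT what Solr does): a folder inside core1.
        System.out.println(resolve(confDir, "../extract"));
    }
}
```

The first result points at the extract folder directly under solr_home, which is where the jars were placed.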

4. Configure schema.xml

4.1 Configure the types of the index fields, that is, the field types.

The text_general type uses two text files, stopwords.txt and synonyms.txt. Both ship with the example core in the distribution, under solr-4.7.0\example\solr\collection1\conf. Copy them into the conf directory of the new core under $solr_home, alongside schema.xml.


<types>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
</types>

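4.2 Configure the index fields, that is, the fields.

One of them is a dynamic field, attr_*. What is it for? When Solr parses a file, the file carries many attributes of its own, and which ones are present is not known in advance. Solr extracts all of them, and with uprefix set to attr_, any attribute that does not match a field defined in the schema is stored under a field named attr_ plus the attribute name.
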

<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
<field name="text" type="text_general" indexed="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
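How lowernames, fmap and uprefix combine can be modelled in a few lines. The sketch below is a simplified model based on the documented behaviour of these parameters, not Solr's actual code; the class FieldNameMapping, the helper mapName, and the sample attribute names are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

class FieldNameMapping {

    // Explicit fields from the schema above; attr_* is treated as the dynamic field.
    private static final Set<String> EXPLICIT =
            new HashSet<>(Arrays.asList("id", "text", "_version_"));

    static boolean inExampleSchema(String name) {
        return EXPLICIT.contains(name) || name.startsWith("attr_");
    }

    /** Simplified model: lowernames first, then fmap, then uprefix for unknown fields. */
    static String mapName(String extracted, boolean lowerNames, Map<String, String> fmap,
                          String uprefix, Predicate<String> inSchema) {
        String name = extracted;
        if (lowerNames) {
            StringBuilder sb = new StringBuilder();
            for (char c : name.toCharArray()) {
                // Letters and digits are lowercased; everything else becomes '_'.
                sb.append(Character.isLetterOrDigit(c) ? Character.toLowerCase(c) : '_');
            }
            name = sb.toString();
        }
        name = fmap.getOrDefault(name, name);            // fmap.<source>=<target>
        if (!inSchema.test(name) && !uprefix.isEmpty()) {
            name = uprefix + name;                       // unknown field gets the prefix
        }
        return name;
    }

    public static void main(String[] args) {
        Map<String, String> fmap = Collections.singletonMap("content", "text");
        for (String n : new String[] {"content", "Content-Type", "Author"}) {
            System.out.println(n + " -> "
                    + mapName(n, true, fmap, "attr_", FieldNameMapping::inExampleSchema));
        }
    }
}
```

Under this model, content maps to text, while Content-Type and Author end up as attr_content_type and attr_author, both of which match the attr_* dynamic field.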

At this point the Solr server-side configuration is complete.

II. Testing with SolrJ

1. Required jars

Maven configuration


<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solr-solrj</artifactId>
<version>4.7.0</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.3.2</version>
<scope>test</scope>
</dependency>

2. Test class CreateIndexFromPDF.java

In SolrJ 4.7, the addFile method of ContentStreamUpdateRequest takes an additional contentType parameter that specifies the content type of the file. For background on content types, see: http://baike.baidu.com/link?url=panQQa04z0gc4-gQRnIoUhwOQPABfG6unIqE1-7SEe5ZMygYxWT2lkvoKlQmTEYIZDNhntB4T9aGQM5KhevKDa

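When indexing more than one kind of file, the contentType argument has to be chosen per file. The following is a minimal sketch of one way to do that with a small extension lookup table. The class ContentTypes and its forFileName helper are hypothetical, and the table covers only a handful of common types.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class ContentTypes {

    private static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("doc", "application/msword");
        BY_EXTENSION.put("html", "text/html");
        BY_EXTENSION.put("htm", "text/html");
        BY_EXTENSION.put("txt", "text/plain");
        BY_EXTENSION.put("mp3", "audio/mpeg");
    }

    /** Returns a MIME type for the file name's extension, or a generic binary type if unknown. */
    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, "application/octet-stream");
    }

    public static void main(String[] args) {
        System.out.println(forFileName("e:/MyBatis3用户指南中文版.pdf"));
    }
}
```

The returned string could then be passed as the second argument of addFile instead of a hard-coded value.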

package com.clj.test.solr.solrj;

import java.io.File;
import java.io.IOException;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

/**
 * Creates an index from a PDF file.
 *
 * @author Administrator
 * @version 2014-03-18
 */
public class CreateIndexFromPDF
{
    public static void main(String[] args)
    {
        String fileName = "e:/MyBatis3用户指南中文版.pdf";
        // The file name doubles as the unique document id.
        String solrId = "MyBatis3用户指南中文版.pdf";
        try
        {
            indexFilesSolrCell(fileName, solrId);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        catch (SolrServerException e)
        {
            e.printStackTrace();
        }
    }

    /**
     * Creates an index from a file by sending it to the /update/extract handler.
     *
     * @param fileName path of the file to upload
     * @param solrId value stored in the document's id field
     */
    public static void indexFilesSolrCell(String fileName, String solrId)
        throws IOException, SolrServerException
    {
        String urlString = "http://localhost:8080/solr/core1";
        SolrServer solr = new HttpSolrServer(urlString);

        ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
        String contentType = "application/pdf";
        up.addFile(new File(fileName), contentType);

        // literal.id supplies the required id field.
        up.setParam("literal.id", solrId);
        // Fields not defined in the schema get the attr_ prefix.
        up.setParam("uprefix", "attr_");
        // Overrides the handler default (text): the body text goes to attr_content.
        up.setParam("fmap.content", "attr_content");
        // Commit right away so the document is visible to the query below.
        up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        solr.request(up);

        QueryResponse rsp = solr.query(new SolrQuery("*:*"));
        System.out.println(rsp);

        // Release the underlying HTTP client resources.
        solr.shutdown();
    }
}

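Running the code above uploads the PDF file to the Solr server, where it is parsed and indexed. The solr.query call that follows runs a query to show the result of parsing and indexing. After parsing, the PDF has become plain text, and the console output also shows a lot of other information about the document.

Once Solr has parsed the PDF and created the index, the result can also be inspected in the Solr admin UI. core1 is the file index core we created, as shown in the figure below.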
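The same check can be made over plain HTTP, for example from a browser. The sketch below only builds the select URL and does not contact the server. The class SelectUrlBuilder, its selectUrl helper, and the search term mybatis are assumptions for illustration, while q, fl and wt are standard Solr query parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SelectUrlBuilder {

    /** Builds a /select URL that searches with the given query and returns only the listed fields as JSON. */
    static String selectUrl(String coreUrl, String query, String fields)
            throws UnsupportedEncodingException {
        return coreUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8")
                + "&fl=" + URLEncoder.encode(fields, "UTF-8")
                + "&wt=json";
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Search the extracted body text (mapped to attr_content by the test class).
        System.out.println(selectUrl("http://localhost:8080/solr/core1",
                "attr_content:mybatis", "id"));
    }
}
```

Restricting fl to id keeps the response small, since attr_content holds the full text of the PDF.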
