[Solr Basics Tutorial, Part 2] Indexing

I. Ways to Submit Documents to Solr for Indexing

1. Indexing with post.jar

(1) Create the document XML file

<add>

    <doc>

        <field name="id">test4</field>

        <field name="title">testagain</field>

        <field name="url">http://www.163.com</field>

    </doc>

</add>
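The same update message can also be generated from code rather than written by hand. The following is a minimal sketch in plain Java; the class name AddMessageSketch and its methods are hypothetical helpers for illustration, not part of Solr or post.jar. It builds the add/doc message from name/value pairs and escapes XML special characters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only (not a Solr class).
class AddMessageSketch {

    // Escape the characters that must not appear raw in XML text or attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Build an <add><doc>...</doc></add> message from field name/value pairs.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue()))
              .append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "test4");
        doc.put("title", "testagain");
        doc.put("url", "http://www.163.com");
        System.out.println(toAddXml(doc));
    }
}
```

The output could be saved to a file and posted with post.jar in the same way as a hand-written file.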

(2) Run java -jar post.jar

[root@jediael44 exampledocs]# java -Durl=http://ip:8080/solr/update -jar post.jar test.xml

SimplePostTool version 1.5

Posting files to base url http://ip:8080/solr/update using content-type application/xml..

POSTing file test.xml

1 files indexed.

COMMITting Solr index changes to http://localhost:8080/solr/update..

Time spent: 0:00:00.135

(3) View the usage of post.jar

[root@jediael44 exampledocs]# java -jar post.jar --help

SimplePostTool version 1.5

Usage: java [SystemProperties] -jar post.jar [-h|-] [<file|folder|url|arg> [<file|folder|url|arg>...]]

Supported System Properties and their defaults:

  -Ddata=files|web|args|stdin (default=files)

  -Dtype=<content-type> (default=application/xml)

  -Durl=<solr-update-url> (default=http://localhost:8983/solr/update)

  -Dauto=yes|no (default=no)

  -Drecursive=yes|no|<depth> (default=0)

  -Ddelay=<seconds> (default=0 for files, 10 for web)

  -Dfiletypes=<type>[,<type>,...] (default=xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)

  -Dparams="<key>=<value>[&<key>=<value>...]" (values must be URL-encoded)

  -Dcommit=yes|no (default=yes)

  -Doptimize=yes|no (default=no)

  -Dout=yes|no (default=no)

This is a simple command line tool for POSTing raw data to a Solr port. Data can be read from files specified as commandline args, URLs specified as args, as raw commandline arg strings or via STDIN.

Examples:

  java -jar post.jar *.xml

  java -Ddata=args -jar post.jar '<delete><id>42</id></delete>'

  java -Ddata=stdin -jar post.jar < hd.xml

  java -Ddata=web -jar post.jar http://example.com/

  java -Dtype=text/csv -jar post.jar *.csv

  java -Dtype=application/json -jar post.jar *.json

  java -Durl=http://localhost:8983/solr/update/extract -Dparams=literal.id=a -Dtype=application/pdf -jar post.jar a.pdf

  java -Dauto -jar post.jar *

  java -Dauto -Drecursive -jar post.jar afolder

  java -Dauto -Dfiletypes=ppt,html -jar post.jar afolder

The options controlled by System Properties include the Solr URL to POST to, the Content-Type of the data, whether a commit or optimize should be executed, and whether the response should be written to STDOUT. If auto=yes the tool will try to set type and url automatically from file name. When posting rich documents the file name will be propagated as "resource.name" and also used as "literal.id". You may override these or any other request parameter

through the -Dparams property. To do a commit only, use "-" as argument. The web mode is a simple crawler following links within domain, default delay=10s.

(4) By default, XML files are used as the data source. To post another format, set the content type, for example:

java -Dtype=application/json -jar post.jar *.json

2. Submitting through the Documents page of the admin UI

3. Indexing with SolrJ

(1) Simple indexing with SolrJ

package org.ljh.test.solr;

import org.apache.solr.client.solrj.SolrServer;

import org.apache.solr.client.solrj.impl.HttpSolrServer;

import org.apache.solr.common.SolrInputDocument;

public class BasicSolrJIndexDemo {

	public static void main(String[] args) throws Exception {

		/*

		 * Note: although the admin page is reached at http://ip:8080/solr/#/collection1,

		 * documents must be submitted to http://ip:8080/solr/collection1.

		 */

		String serverUrl = (args != null && args.length > 0) ? args[0]

				: "http://localhost:8080/solr/collection1";

		SolrServer solrServer = new HttpSolrServer(serverUrl);

		SolrInputDocument doc1 = new SolrInputDocument();

		doc1.setField("id", "solrJTest3");

		doc1.setField("url", "http://www.163.com/");

		solrServer.add(doc1);

		SolrInputDocument doc2 = new SolrInputDocument();

		doc2.setField("id", "solrJTest4");

		doc2.setField("url", "http://www.sina.com/");

		solrServer.add(doc2);

		solrServer.commit(true,true);

	}

}

(2) Simple querying with SolrJ

package org.ljh.test.solr;

import org.apache.solr.client.solrj.SolrQuery;

import org.apache.solr.client.solrj.SolrServer;

import org.apache.solr.client.solrj.impl.HttpSolrServer;

import org.apache.solr.client.solrj.response.QueryResponse;

import org.apache.solr.common.SolrDocument;

import org.apache.solr.common.SolrDocumentList;

public class BasicSolrJSearchDemo {

	public static void main(String[] args) throws Exception {

		String serverUrl = (args != null && args.length > 0) ? args[0]

				: "http://localhost:8080/solr/collection1";

		SolrServer solrServer = new HttpSolrServer(serverUrl);

		// Use the second argument as the query string; if none is given, fall back to "url:163".

		String queryString = (args != null && args.length > 1) ? args[1] : "url:163";

		SolrQuery solrQuery = new SolrQuery(queryString);

		solrQuery.setRows(5);

		QueryResponse resp = solrServer.query(solrQuery);

		SolrDocumentList hits = resp.getResults();

		for(SolrDocument doc : hits ){

			System.out.println(doc.getFieldValue("id").toString() + " : " + doc.getFieldValue("url"));

		}

	}

}

4. Using third-party tools

（1）DIH

（2）ExtractingRequestHandler, aka Solr Cell

（3）Nutch

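II. schema.xml: Defining the Document Format

schema.xml defines which fields an indexed document should contain, the type of each field, and other related information.

1. Example

The schema.xml that Nutch provides for Solr is as follows: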

<?xml version="1.0" encoding="UTF-8" ?>

<schema name="nutch" version="1.5">

    <types>

        <fieldType name="string" class="solr.StrField" sortMissingLast="true"

            omitNorms="true"/>

        <fieldType name="long" class="solr.TrieLongField" precisionStep="0"

            omitNorms="true" positionIncrementGap="0"/>

        <fieldType name="float" class="solr.TrieFloatField" precisionStep="0"

            omitNorms="true" positionIncrementGap="0"/>

        <fieldType name="date" class="solr.TrieDateField" precisionStep="0"

            omitNorms="true" positionIncrementGap="0"/>

        <fieldType name="text" class="solr.TextField"

            positionIncrementGap="100">

            <analyzer>

                <tokenizer class="solr.WhitespaceTokenizerFactory"/>

                <filter class="solr.StopFilterFactory"

                    ignoreCase="true" words="stopwords.txt"/>

                <filter class="solr.WordDelimiterFilterFactory"

                    generateWordParts="1" generateNumberParts="1"

                    catenateWords="1" catenateNumbers="1" catenateAll="0"

                    splitOnCaseChange="1"/>

                <filter class="solr.LowerCaseFilterFactory"/>

                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

            </analyzer>

        </fieldType>

        <fieldType name="url" class="solr.TextField"

            positionIncrementGap="100">

            <analyzer>

                <tokenizer class="solr.StandardTokenizerFactory"/>

                <filter class="solr.LowerCaseFilterFactory"/>

                <filter class="solr.WordDelimiterFilterFactory"

                    generateWordParts="1" generateNumberParts="1"/>

            </analyzer>

        </fieldType>

    </types>

    <fields>

        <field name="id" type="string" stored="true" indexed="true"/>

        <!-- core fields -->

        <field name="batchId" type="string" stored="true" indexed="false"/>

        <field name="digest" type="string" stored="true" indexed="false"/>

        <field name="boost" type="float" stored="true" indexed="false"/>

        <!-- fields for index-basic plugin -->

        <field name="host" type="url" stored="false" indexed="true"/>

        <field name="url" type="url" stored="true" indexed="true"

            required="true"/>

        <field name="content" type="text" stored="false" indexed="true"/>

        <field name="title" type="text" stored="true" indexed="true"/>

        <field name="cache" type="string" stored="true" indexed="false"/>

        <field name="tstamp" type="date" stored="true" indexed="false"/>

        <field name="_version_" type="long" indexed="true" stored="true"/>

        <!-- fields for index-anchor plugin -->

        <field name="anchor" type="string" stored="true" indexed="true"

            multiValued="true"/>

        <!-- fields for index-more plugin -->

        <field name="type" type="string" stored="true" indexed="true"

            multiValued="true"/>

        <field name="contentLength" type="long" stored="true"

            indexed="false"/>

        <field name="lastModified" type="date" stored="true"

            indexed="false"/>

        <field name="date" type="date" stored="true" indexed="true"/>

        <!-- fields for languageidentifier plugin -->

        <field name="lang" type="string" stored="true" indexed="true"/>

        <!-- fields for subcollection plugin -->

        <field name="subcollection" type="string" stored="true"

            indexed="true" multiValued="true"/>

        <!-- fields for feed plugin (tag is also used by microformats-reltag)-->

        <field name="author" type="string" stored="true" indexed="true"/>

        <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>

        <field name="feed" type="string" stored="true" indexed="true"/>

        <field name="publishedDate" type="date" stored="true"

            indexed="true"/>

        <field name="updatedDate" type="date" stored="true"

            indexed="true"/>

        <!-- fields for creativecommons plugin -->

        <field name="cc" type="string" stored="true" indexed="true"

            multiValued="true"/>

        <!-- fields for tld plugin -->

        <field name="tld" type="string" stored="false" indexed="false"/>

    </fields>

    <uniqueKey>id</uniqueKey>

    <defaultSearchField>content</defaultSearchField>

    <solrQueryParser defaultOperator="OR"/>

</schema>

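The file above has five parts:

(1) fieldType: the types that fields can have.

(2) field: which fields are indexed, stored, and so on, and the type of each field.

(3) uniqueKey: which field serves as the id, that is, the unique identifier of a document.

(4) defaultSearchField: the field searched by default.

(5) solrQueryParser: defaultOperator="OR" means that query terms are combined with OR.

2. The field element

One or more field elements make up a fields element. Nutch uses this structure, but the Solr example schema has no fields element and places the field elements directly under the schema element. fieldType is handled the same way.

A field example: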

<field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>

The basic attributes of a field are:

(1) name

The name of the field.

(2) type

The type of the field; it refers to a fieldType by name.

(3) stored

Whether the field's original value is stored. Only a stored field can be returned in full in search results.

(4) indexed

Whether the field is indexed. An indexed field can be searched. Even if you never search on a field, it must still be indexed if you need to sort, group, facet, provide query suggestions, or run function queries on its values.

For example, when looking for a book you would rarely search by the number of copies sold, but you would often sort by it.

(5) multiValued

If a field may hold more than one value, set multiValued to true.

<field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>

A document being indexed may then contain several field elements with the same name.

<add>

	<doc>

	............

		<field name="tag">lucene</field>

		<field name="tag">solr</field>

	</doc>

</add>

With SolrJ, use addField instead of setField, because setField replaces any value the field already has.

doc.addField("tag","lucene");

doc.addField("tag","solr");
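The difference between the two calls can be illustrated without a Solr dependency. The sketch below is a plain-Java model, assuming a hypothetical class MultiValuedModel that only mimics the replace-versus-append behaviour of setField and addField; it is not SolrJ's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a document with multi-valued fields (not a SolrJ class).
class MultiValuedModel {
    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    // Like setField: discard any existing values and keep only the new one.
    void setField(String name, Object value) {
        List<Object> values = new ArrayList<>();
        values.add(value);
        fields.put(name, values);
    }

    // Like addField: append to whatever values the field already has.
    void addField(String name, Object value) {
        List<Object> values = fields.get(name);
        if (values == null) {
            values = new ArrayList<>();
            fields.put(name, values);
        }
        values.add(value);
    }

    List<Object> getFieldValues(String name) {
        return fields.get(name);
    }

    public static void main(String[] args) {
        MultiValuedModel doc = new MultiValuedModel();
        doc.addField("tag", "lucene");
        doc.addField("tag", "solr");
        System.out.println(doc.getFieldValues("tag")); // both values are kept

        doc.setField("tag", "nutch");
        System.out.println(doc.getFieldValues("tag")); // only the last value remains
    }
}
```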

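(6) required

The required attribute states that every submitted document must supply this field. Note that the field named in the uniqueKey element is implicitly required="true".

3. The dynamicField element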

<dynamicField name="*_ti" type="tint" indexed="true" stored="true"/>

(1) As a rule, avoid dynamic fields unless one of the following three situations applies:

Dynamic fields help address common problems that occur when building search applications, including

■ Modeling documents with many fields

■ Supporting documents from diverse sources

■ Adding new document sources

See Section 5.3.3 of Solr in Action for details.

4. copyField

copyField is used in the following two situations:
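How a pattern such as *_ti is matched against incoming field names can be sketched in a few lines. This is an illustration only, assuming the pattern has a single * at either the start or the end of the name; the class DynamicFieldMatchSketch is hypothetical and does not reproduce Solr's own matching code.

```java
// Hypothetical illustration of dynamic field name matching (not Solr's code).
class DynamicFieldMatchSketch {

    // A leading '*' matches by suffix, a trailing '*' matches by prefix,
    // and a pattern without '*' must match the field name exactly.
    static boolean matches(String pattern, String fieldName) {
        if (pattern.startsWith("*")) {
            return fieldName.endsWith(pattern.substring(1));
        }
        if (pattern.endsWith("*")) {
            return fieldName.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(fieldName);
    }

    public static void main(String[] args) {
        System.out.println(matches("*_ti", "count_ti"));     // true
        System.out.println(matches("*_ti", "count_s"));      // false
        System.out.println(matches("attr_*", "attr_color")); // true
    }
}
```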

copy fields support two use cases that are common in most search applications:

■ Populate a single catch-all field with the contents of multiple fields.

■ Apply different text analysis to the same field content to create a new searchable field.

That is:

(1) Copy several fields into a single field to make searching easier, for example:

<copyField source="title" dest="text"/>

<copyField source="author" dest="text"/>

<copyField source="description" dest="text"/>

<copyField source="keywords" dest="text"/>

<copyField source="content" dest="text"/>

<copyField source="content_type" dest="text"/>

<copyField source="resourcename" dest="text"/>

<copyField source="url" dest="text"/>

A search then only needs to target the text field.

(2) Apply different analysis to the same field content, for example:
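The effect of the catch-all copyField rules can be modelled in plain Java. The sketch below is an assumption-laden illustration: CopyFieldModel and copyInto are hypothetical names, and the code only shows that source values are duplicated into the destination while the sources stay untouched, which is also why the destination generally has to be multiValued.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of catch-all copyField rules (not Solr's implementation).
class CopyFieldModel {

    // Copy the values of each source field into the destination field,
    // leaving the source fields themselves unchanged.
    static Map<String, List<String>> copyInto(Map<String, List<String>> doc,
                                              List<String> sources, String dest) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : doc.entrySet()) {
            result.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        List<String> target = result.get(dest);
        if (target == null) {
            target = new ArrayList<>();
            result.put(dest, target);
        }
        for (String source : sources) {
            List<String> values = doc.get(source);
            if (values != null) { // a document need not contain every source field
                target.addAll(values);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("title", Arrays.asList("testagain"));
        doc.put("url", Arrays.asList("http://www.163.com"));
        System.out.println(copyInto(doc, Arrays.asList("title", "author", "url"), "text"));
    }
}
```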

<field name="text"

type="stemmed_text"

indexed="true"

stored="true"/>

<field name="auto_suggest"

type="unstemmed_text"

indexed="true"

stored="false"

multiValued="true"/>

...

<copyField source="text" dest="auto_suggest" />

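In this example, terms in the text field are stemmed at index time, while the auto_suggest copy used for query suggestions is left unstemmed.

5. The fieldType element

(1) A fieldType defines the type of a field; fields refer to it through their type attribute.

(2) Solr ships with a number of built-in field types.

(3) Field types fall into two broad categories:

The first is for unstructured data that must be analyzed before it is indexed, such as the body of an article. This is mainly TextField.

The second is for structured data that is indexed as-is without analysis, such as a URL, an id, or a person's name. Examples are StrField and TrieLongField.

(4) In schema.xml, the solr. prefix on a class name is shorthand that Solr expands to one of its own packages; for field types that package is org.apache.solr.schema. For example,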

        <fieldType name="string" class="solr.StrField" sortMissingLast="true"  omitNorms="true"/>

refers to the class org.apache.solr.schema.StrField.

(5) String fields

The content of a string field is not analyzed; it holds structured data.

String fields are implemented by the class org.apache.solr.schema.StrField.

(6) Date fields

Date fields are usually declared with TrieDateField; the trie encoding makes range searches efficient.

The default date format: in general, Solr expects your dates to be in the ISO-8601 Date/Time format (yyyy-MM-ddTHH:mm:ssZ); the date in our tweet (2012-05-22T09:30:22Z) breaks down to

yyyy = 2012

MM = 05

dd = 22

HH = 09 (24-hr clock)

mm = 30

ss = 22

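Z = UTC Timezone (Z is for Zulu)

A date can be rounded down with date math by appending a unit to the value, for example 2012-05-22T09:30:22Z/HOUR.

This rounds down to hour granularity, giving the value 2012-05-22T09:00:00Z.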
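The same rounding can be reproduced on the client side with java.time, which is handy for checking what value to expect. This is a minimal sketch; the class DateRoundingSketch is a hypothetical name and the code does not call Solr.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical client-side check of hour rounding (does not involve Solr).
class DateRoundingSketch {

    // Parse an ISO-8601 UTC instant and drop everything below the hour.
    static String roundDownToHour(String isoInstant) {
        return Instant.parse(isoInstant).truncatedTo(ChronoUnit.HOURS).toString();
    }

    public static void main(String[] args) {
        // prints 2012-05-22T09:00:00Z
        System.out.println(roundDownToHour("2012-05-22T09:30:22Z"));
    }
}
```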

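(7) Numeric fields

There are several implementations, such as TrieDoubleField, TrieFloatField, TrieIntField, and TrieLongField.

(8) A fieldType has several attributes, mainly:

sortMissingFirst: when sorting on a field of this type, documents that have no value for the field are placed first.

sortMissingLast: when sorting on a field of this type, documents that have no value for the field are placed last.

precisionStep: used by the Trie types; smaller positive values index more terms per value to speed up range queries, and 0 turns this extra indexing off.

positionIncrementGap: see Section 5.4.4 of Solr in Action.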
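The ordering that sortMissingFirst and sortMissingLast describe is analogous to null handling in a Java comparator. The sketch below uses that analogy only; SortMissingSketch is a hypothetical class, a null stands in for a document with no value in the sort field, and ascending order is assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical analogy for sortMissingFirst / sortMissingLast (not Solr code).
class SortMissingSketch {

    // Ascending sort with missing (null) values placed at the end.
    static List<Integer> sortMissingLast(List<Integer> values) {
        List<Integer> copy = new ArrayList<>(values);
        copy.sort(Comparator.nullsLast(Comparator.<Integer>naturalOrder()));
        return copy;
    }

    // Ascending sort with missing (null) values placed at the start.
    static List<Integer> sortMissingFirst(List<Integer> values) {
        List<Integer> copy = new ArrayList<>(values);
        copy.sort(Comparator.nullsFirst(Comparator.<Integer>naturalOrder()));
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> sales = Arrays.asList(3, null, 1);
        System.out.println(sortMissingLast(sales));  // [1, 3, null]
        System.out.println(sortMissingFirst(sales)); // [null, 1, 3]
    }
}
```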

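6. The uniqueKey element

(1) Solr uses the <uniqueKey> element to name the field that uniquely identifies a document, similar to the primary key of a database table. For example:

<uniqueKey>id</uniqueKey>

Once a field is declared as the uniqueKey, every document submitted for indexing must supply it.

(2) Solr does not strictly require a schema to declare a unique identifier, but declaring one is recommended, for instance to avoid duplicate documents.

(3) When a document is submitted whose id already exists in the index, it overwrites the existing document.

(4) If Solr is deployed across multiple servers, a uniqueKey must be provided.

(5) Use a primitive type for the uniqueKey, not a complex one. One thing to note is that it’s best to use a primitive field type, such as string or long, for the field you indicate as being the <uniqueKey/> as that ensures Solr doesn’t make any changes to the value during indexing.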
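The overwrite behaviour in point (3) resembles putting entries into a map keyed by id. The following plain-Java model is only an analogy under that assumption; UniqueKeyModel is a hypothetical class and says nothing about how Solr stores documents internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of uniqueKey overwrite semantics (not Solr's storage).
class UniqueKeyModel {
    private final Map<String, Map<String, String>> docsById = new LinkedHashMap<>();

    // Add a document; one with the same id replaces the earlier document.
    void add(Map<String, String> doc) {
        String id = doc.get("id");
        if (id == null) {
            // The uniqueKey field is implicitly required.
            throw new IllegalArgumentException("missing required field: id");
        }
        docsById.put(id, doc);
    }

    int size() {
        return docsById.size();
    }

    Map<String, String> get(String id) {
        return docsById.get(id);
    }

    public static void main(String[] args) {
        UniqueKeyModel index = new UniqueKeyModel();

        Map<String, String> first = new LinkedHashMap<>();
        first.put("id", "solrJTest3");
        first.put("url", "http://www.163.com/");
        index.add(first);

        Map<String, String> second = new LinkedHashMap<>();
        second.put("id", "solrJTest3");
        second.put("url", "http://www.sina.com/");
        index.add(second);

        // Still one document; its url is the one submitted last.
        System.out.println(index.size() + " " + index.get("solrJTest3").get("url"));
    }
}
```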

III. Index-Related Settings in solrconfig.xml

An example follows:

<!--  The default high-performance update handler  -->

<updateHandler class="solr.DirectUpdateHandler2">

<!--

 Enables a transaction log, used for real-time get, durability, and

         solr cloud replica recovery.  The log can grow as big as

         uncommitted changes to the index, so use of a hard autoCommit

         is recommended (see below).

         "dir" - the target directory for transaction logs, defaults to the

                solr data directory.

-->

<updateLog>

<str name="dir">${solr.ulog.dir:}</str>

</updateLog>

<!--

 AutoCommit

         Perform a hard commit automatically under certain conditions.

         Instead of enabling autoCommit, consider using "commitWithin"

         when adding documents. 

         http://wiki.apache.org/solr/UpdateXmlMessages

         maxDocs - Maximum number of documents to add since the last

                   commit before automatically triggering a new commit.

         maxTime - Maximum amount of time in ms that is allowed to pass

                   since a document was added before automatically

                   triggering a new commit.

         openSearcher - if false, the commit causes recent index changes

           to be flushed to stable storage, but does not cause a new

           searcher to be opened to make those changes visible.

         If the updateLog is enabled, then it's highly recommended to

         have some sort of hard autoCommit to limit the log size.

-->

<autoCommit>

<maxTime>${solr.autoCommit.maxTime:15000}</maxTime>

<openSearcher>false</openSearcher>

</autoCommit>

<!--

 softAutoCommit is like autoCommit except it causes a

         'soft' commit which only ensures that changes are visible

         but does not ensure that data is synced to disk.  This is

         faster and more near-realtime friendly than a hard commit.

-->

<autoSoftCommit>

<maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>

</autoSoftCommit>

<!--

 Update Related Event Listeners

         Various IndexWriter related events can trigger Listeners to

         take actions.

         postCommit - fired after every commit or optimize command

         postOptimize - fired after every optimize command

-->

<!--

 The RunExecutableListener executes an external command from a

         hook such as postCommit or postOptimize.

         exe - the name of the executable to run

         dir - dir to use as the current working directory. (default=".")

         wait - the calling thread waits until the executable returns.

                (default="true")

         args - the arguments to pass to the program.  (default is none)

         env - environment variables to set.  (default is none)

-->

<!--

 This example shows how RunExecutableListener could be used

         with the script based replication...

         http://wiki.apache.org/solr/CollectionDistribution

-->

<!--

       <listener event="postCommit" class="solr.RunExecutableListener">

         <str name="exe">solr/bin/snapshooter</str>

         <str name="dir">.</str>

         <bool name="wait">true</bool>

         <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>

         <arr name="env"> <str>MYVAR=val1</str> </arr>

       </listener>

-->

</updateHandler>
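The configuration comments above suggest commitWithin as an alternative to autoCommit. In the XML update format this is an attribute on the add element, given in milliseconds. The sketch below builds such a message in plain Java; CommitWithinSketch and wrap are hypothetical names used for illustration, and the 10000 ms value is an arbitrary example.

```java
// Hypothetical helper that wraps a <doc> element in an <add commitWithin="..."> message.
class CommitWithinSketch {

    // docXml is expected to be an already well-formed <doc>...</doc> element.
    static String wrap(String docXml, int commitWithinMillis) {
        if (commitWithinMillis < 0) {
            throw new IllegalArgumentException("commitWithin must not be negative");
        }
        return "<add commitWithin=\"" + commitWithinMillis + "\">" + docXml + "</add>";
    }

    public static void main(String[] args) {
        String doc = "<doc><field name=\"id\">test4</field></doc>";
        // Ask Solr to commit this document within 10 seconds of receiving it.
        System.out.println(wrap(doc, 10000));
    }
}
```

A message like this lets the client state how soon a document should be committed without issuing an explicit commit after every add.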
