【详细】Lucene使用案例

Lucene是apache软件基金会4 jakarta项目组的一个子项目，是一个开放源代码的全文检索引擎工具包，但它不是一个完整的全文检索引擎，而是一个全文检索引擎的架构，提供了完整的查询引擎和索引引擎，部分文本分析引擎（英文与德文两种西方语言）。Lucene的目的是为软件开发人员提供一个简单易用的工具包，以方便的在目标系统中实现全文检索的功能，或者是以此为基础建立起完整的全文检索引擎。Lucene是一套用于全文检索和搜寻的开源程式库，由Apache软件基金会支持和提供。Lucene提供了一个简单却强大的应用程式接口，能够做全文索引和搜寻。在Java开发环境里Lucene是一个成熟的免费开源工具。就其本身而言，Lucene是当前以及最近几年最受欢迎的免费Java信息检索程序库。人们经常提到信息检索程序库，虽然与搜索引擎有关，但不应该将信息检索程序库与搜索引擎相混淆(来自百度百科)

点击查看百度百科：https://baike.baidu.com/item/Lucene

CSDN一篇文章介绍：https://blog.csdn.net/regan_hoo/article/details/78802897

关于Lucene具体介绍不多说，这里写一个应用场景来学会使用Lucene：

我在一个文件夹里面存了一堆txt文本，大小不一，名字不同，里面的内容不同，我要做一个功能实现：

比如我输入一个java，只要名字，路径，内容等等出现了java这个词汇，就返回结果给我

1.建一个java工程，导包（开发中通常是使用老版本的）：

2.一步一步来：

我在D盘666文件夹放入一堆文本文件

然后在D盘的temp的index下创建索引库

创建索引：

package lucene;

import java.io.File;

import org.apache.commons.io.FileUtils;

import org.apache.lucene.analysis.Analyzer;

import org.apache.lucene.analysis.standard.StandardAnalyzer;

import org.apache.lucene.document.Document;

import org.apache.lucene.document.Field;

import org.apache.lucene.document.TextField;

import org.apache.lucene.document.Field.Store;

import org.apache.lucene.document.LongField;

import org.apache.lucene.document.StoredField;

import org.apache.lucene.index.IndexWriter;

import org.apache.lucene.index.IndexWriterConfig;

import org.apache.lucene.store.Directory;

import org.apache.lucene.store.FSDirectory;

import org.apache.lucene.util.Version;

import org.junit.Test;

public class MyFirstLucene {

    // 创建索引

    @Test

    public void testIndex() throws Exception {

        // 新建一个索引库（我放在D盘某文件夹内）

        Directory directory = FSDirectory.open(new File("D:\\temp\\index"));

        // 新建分析器对象

        Analyzer analyzer = new StandardAnalyzer();

        // 新建配置对象

        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);

        // 创建一个IndexWriter对象（参数一个索引库，一个配置）

        IndexWriter indexWriter = new IndexWriter(directory, config);

        // 创建域对象

        File f = new File("D:\\666");

        File[] list = f.listFiles();

        for (File file : list) {

            // 创建一个文档对象

            Document document = new Document();

            // 文件名称

            String file_name = file.getName();

            Field fileNameField = new TextField("fileName", file_name, Store.YES);

            // 文件大小

            long file_size = FileUtils.sizeOf(file);

            Field fileSizeField = new LongField("fileSize", file_size, Store.YES);

            // 文件路径

            String file_path = file.getPath();

            Field filePathField = new StoredField("filePath", file_path);

            // 文件内容

            String file_content = FileUtils.readFileToString(file);

            Field fileContentField = new TextField("fileContent", file_content, Store.YES);

            // 添加到document

            document.add(fileNameField);

            document.add(fileSizeField);

            document.add(filePathField);

            document.add(fileContentField);

            // 创建索引

            indexWriter.addDocument(document);

        }

        // 关闭资源

        indexWriter.close();

    }

}

好的，运行成功，我打开D盘的temp文件夹中的Index文件夹：存入了一堆看不懂的东西，这就代表创建索引库成功：

3.查询索引：

    // 搜索索引

    @Test

    public void testSearch() throws Exception {

        // 第一步：创建一个Directory对象，也就是索引库存放的位置。

        Directory directory = FSDirectory.open(new File("D:\\temp\\index"));

        // 第二步：创建一个indexReader对象，需要指定Directory对象。

        IndexReader indexReader = DirectoryReader.open(directory);

        // 第三步：创建一个indexSearcher对象，需要指定IndexReader对象

        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // 第四步：创建一个TermQuery对象，指定查询的域和查询的关键词。

        Query query = new TermQuery(new Term("fileName", "spring"));

        // 第五步：执行查询（显示条数）

        TopDocs topDocs = indexSearcher.search(query, 10);

        // 第六步：返回查询结果。遍历查询结果并输出。

        ScoreDoc[] scoreDocs = topDocs.scoreDocs;

        for (ScoreDoc scoreDoc : scoreDocs) {

            int doc = scoreDoc.doc;

            Document document = indexSearcher.doc(doc);

            // 文件名称

            String fileName = document.get("fileName");

            System.out.println(fileName);

            // 文件内容

            String fileContent = document.get("fileContent");

            System.out.println(fileContent);

            // 文件大小

            String fileSize = document.get("fileSize");

            System.out.println(fileSize);

            // 文件路径

            String filePath = document.get("filePath");

            System.out.println(filePath);

            System.out.println("------------");

        }

        // 第七步：关闭IndexReader对象

        indexReader.close();

    }

实现了基础功能

但是，这里有很大的一个问题：

无法处理中文

4.所以，接下来就处理中文问题：中文分析器（上边的示例采用的是标准分析器，即处理英文的分析器）

首先，我们看看用标准分析器分析中文的结果：

    // 查看分析器的分词效果

    @Test

    public void testTokenStream() throws Exception {

        // 创建一个分析器对象

        Analyzer analyzer = new StandardAnalyzer();// 获得tokenStream对象

        // 第一个参数：域名，可以随便给一个

        // 第二个参数：要分析的文本内容

        TokenStream tokenStream = analyzer.tokenStream("test",

                "高富帅可以用二维表结构来逻辑表达实现的数据");

        // 添加一个引用，可以获得每个关键词

        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

        // 添加一个偏移量的引用，记录了关键词的开始位置以及结束位置

        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);

        // 将指针调整到列表的头部

        tokenStream.reset();

        // 遍历关键词列表，通过incrementToken方法判断列表是否结束

        while (tokenStream.incrementToken()) {

            // 关键词的起始位置

            System.out.println("start->" + offsetAttribute.startOffset());

            // 取关键词

            System.out.println(charTermAttribute);

            // 结束位置

            System.out.println("end->" + offsetAttribute.endOffset());

        }

        tokenStream.close();

    }

结果：

这里截取了一部分，发现如果采用了标准分析器，每一个中文都分隔开了，显然有问题

于是不能采用标准分析器：

用一个SmartChinese分析器：

    // 查看分析器的分词效果

    @Test

    public void testTokenStream() throws Exception {

        Analyzer analyzer = new SmartChineseAnalyzer();

        TokenStream tokenStream = analyzer.tokenStream("test",

                "高富帅可以用二维表结构来逻辑表达实现的数据");

        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);

        tokenStream.reset();

        while (tokenStream.incrementToken()) {

            System.out.println("start->" + offsetAttribute.startOffset());

            System.out.println(charTermAttribute);

            System.out.println("end->" + offsetAttribute.endOffset());

        }

        tokenStream.close();

    }

效果有提升：

不过有一个问题：对于新词汇如高富帅这样的，它不识别

如果追求更好的效果：可以采用其他的第三方分析器

比如我这里采用一个IK分析器，可以自己添加进去“高富帅”这种词汇

    @Test

    public void testTokenStream() throws Exception {

        Analyzer analyzer = new IKAnalyzer();

        TokenStream tokenStream = analyzer.tokenStream("test",

                "高富帅可以用二维表结构来逻辑表达实现的数据");

        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);

        tokenStream.reset();

        while (tokenStream.incrementToken()) {

            System.out.println("start->" + offsetAttribute.startOffset());

            System.out.println(charTermAttribute);

            System.out.println("end->" + offsetAttribute.endOffset());

        }

        tokenStream.close();

    }

添加配置文件

IKAnalyzer.cfg.xml：

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">

<properties>

    <comment>IK Analyzer 扩展配置</comment>

    <!--用户可以在这里配置自己的扩展字典 -->

    <entry key="ext_dict">ext.dic;</entry> 

    <!--用户可以在这里配置自己的扩展停止词字典-->

    <entry key="ext_stopwords">stopword.dic;</entry> 

</properties>

ext.doc:

高富帅

二维表

stopword.dic:

我

是

用

的

二

维

表

来

a

an

and

are

as

at

be

but

by

for

if

in

into

is

it

no

not

of

on

or

such

that

the

their

then

there

these

they

this

to

was

will

with

这时候效果：

解决了中文问题，并且可以扩展

上边对于索引库操作只有添加，我们还可以对索引库做其他操作：

查询的时候也可以有多种方式，下面代码示例：

package lucene;

import java.io.File;

import org.apache.lucene.analysis.Analyzer;

import org.apache.lucene.analysis.standard.StandardAnalyzer;

import org.apache.lucene.document.Document;

import org.apache.lucene.document.Field.Store;

import org.apache.lucene.document.TextField;

import org.apache.lucene.index.DirectoryReader;

import org.apache.lucene.index.IndexReader;

import org.apache.lucene.index.IndexWriter;

import org.apache.lucene.index.IndexWriterConfig;

import org.apache.lucene.index.Term;

import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;

import org.apache.lucene.queryparser.classic.QueryParser;

import org.apache.lucene.search.BooleanClause.Occur;

import org.apache.lucene.search.BooleanQuery;

import org.apache.lucene.search.IndexSearcher;

import org.apache.lucene.search.MatchAllDocsQuery;

import org.apache.lucene.search.NumericRangeQuery;

import org.apache.lucene.search.Query;

import org.apache.lucene.search.ScoreDoc;

import org.apache.lucene.search.TermQuery;

import org.apache.lucene.search.TopDocs;

import org.apache.lucene.store.Directory;

import org.apache.lucene.store.FSDirectory;

import org.apache.lucene.util.Version;

import org.junit.Test;

import org.wltea.analyzer.lucene.IKAnalyzer;

/**

 * 索引维护 添加 （上边已完成） 删除 修改 查询

 */

public class LuceneManager {

    public IndexWriter getIndexWriter() throws Exception {

        Directory directory = FSDirectory.open(new File("D:\\temp\\index"));

        Analyzer analyzer = new StandardAnalyzer();

        IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);

        return new IndexWriter(directory, config);

    }

    // 全删除

    @Test

    public void testAllDelete() throws Exception {

        IndexWriter indexWriter = getIndexWriter();

        indexWriter.deleteAll();

        indexWriter.close();

    }

    // 根据条件删除

    @Test

    public void testDelete() throws Exception {

        IndexWriter indexWriter = getIndexWriter();

        Query query = new TermQuery(new Term("fileName", "apache"));

        indexWriter.deleteDocuments(query);

        indexWriter.close();

    }

    // 修改

    @Test

    public void testUpdate() throws Exception {

        IndexWriter indexWriter = getIndexWriter();

        Document doc = new Document();

        doc.add(new TextField("fileN", "测试文件名", Store.YES));

        doc.add(new TextField("fileC", "测试文件内容", Store.YES));

        indexWriter.updateDocument(new Term("fileName", "lucene"), doc, new IKAnalyzer());

        indexWriter.close();

    }

    public IndexSearcher getIndexSearcher() throws Exception {

        Directory directory = FSDirectory.open(new File("D:\\temp\\index"));

        IndexReader indexReader = DirectoryReader.open(directory);

        return new IndexSearcher(indexReader);

    }

    // 执行查询的结果

    public void printResult(IndexSearcher indexSearcher, Query query) throws Exception {

        TopDocs topDocs = indexSearcher.search(query, 10);

        ScoreDoc[] scoreDocs = topDocs.scoreDocs;

        for (ScoreDoc scoreDoc : scoreDocs) {

            int doc = scoreDoc.doc;

            Document document = indexSearcher.doc(doc);

            // 文件名称

            String fileName = document.get("fileName");

            System.out.println(fileName);

            // 文件内容

            String fileContent = document.get("fileContent");

            System.out.println(fileContent);

            // 文件大小

            String fileSize = document.get("fileSize");

            System.out.println(fileSize);

            // 文件路径

            String filePath = document.get("filePath");

            System.out.println(filePath);

            System.out.println("------------");

        }

    }

    // 查询所有

    @Test

    public void testMatchAllDocsQuery() throws Exception {

        IndexSearcher indexSearcher = getIndexSearcher();

        Query query = new MatchAllDocsQuery();

        System.out.println(query);

        printResult(indexSearcher, query);

        // 关闭资源

        indexSearcher.getIndexReader().close();

    }

    // 根据数值范围查询

    @Test

    public void testNumericRangeQuery() throws Exception {

        IndexSearcher indexSearcher = getIndexSearcher();

        // 参数的意思：文本大小在100到200字节之间，不包含100，包含200

        Query query = NumericRangeQuery.newLongRange("fileSize", 100L, 200L, false, true);

        System.out.println(query);

        printResult(indexSearcher, query);

        // 关闭资源

        indexSearcher.getIndexReader().close();

    }

    // 可以组合查询条件

    @Test

    public void testBooleanQuery() throws Exception {

        IndexSearcher indexSearcher = getIndexSearcher();

        BooleanQuery booleanQuery = new BooleanQuery();

        Query query1 = new TermQuery(new Term("fileName", "apache"));

        Query query2 = new TermQuery(new Term("fileName", "lucene"));

        // 类似 select * from user where id = ? or/and name = ?

        booleanQuery.add(query1, Occur.MUST);// and

        booleanQuery.add(query2, Occur.SHOULD);// or

        System.out.println(booleanQuery);

        printResult(indexSearcher, booleanQuery);

        // 关闭资源

        indexSearcher.getIndexReader().close();

    }

    // 条件解释的对象查询(上边的查询和这种掌握一种即可)

    @Test

    public void testQueryParser() throws Exception {

        IndexSearcher indexSearcher = getIndexSearcher();

        QueryParser queryParser = new QueryParser("fileName", new IKAnalyzer());

        // *:* 域：值

        Query query = queryParser.parse("fileName:lucene OR fileContent:apache");

        printResult(indexSearcher, query);

        // 关闭资源

        indexSearcher.getIndexReader().close();

    }

    // 条件解析的对象查询 多个默认域

    @Test

    public void testMultiFieldQueryParser() throws Exception {

        IndexSearcher indexSearcher = getIndexSearcher();

        String[] fields = { "fileName", "fileContent" };

        MultiFieldQueryParser queryParser = new MultiFieldQueryParser(fields, new IKAnalyzer());

        Query query = queryParser.parse("lucene");

        printResult(indexSearcher, query);

        // 关闭资源

        indexSearcher.getIndexReader().close();

    }

}

【详细】Lucene使用案例的更多相关文章

基于lucene的案例开发：纵横小说分布式採集
转载请注明出处:http://blog.csdn.net/xiaojimanman/article/details/46812645 http://www.llwjy.com/blogdetail/9 ...
基于lucene的案例开发：查询语句创建PackQuery
转载请注明出处:http://blog.csdn.net/xiaojimanman/article/details/44656141 http://www.llwjy.com/blogdetail/1 ...
JQuery中Ajax详细参数使用案例
JQuery中Ajax详细参数使用案例参考文档:http://www.jb51.net/shouce/jquery1.82/ 参考文档:http://jquery.cuishifeng.cn/jQu ...
css伪类选择器详细解析及案例使用-----伪元素
伪元素:(css3中将所有伪元素前变成了两个冒号,即::first-letter.::first-line.::before.::after.::selection.目的是为了区分伪元素与伪类.对于I ...
Lucene入门案例一
1. 配置开发环境官方网站:http://lucene.apache.org/ Jdk要求:1.7以上创建索引库必须的jar包(lucene-core-4.10.3.jar,lucene-anal ...
Lucene使用案例
Lucene是apache软件基金会4 jakarta项目组的一个子项目,是一个开放源代码的全文检索引擎工具包,但它不是一个完整的全文检索引擎,而是一个全文检索引擎的架构,提供了完整的查询引擎和索引引 ...
java8新特性-函数式接口详细讲解及案例
一.函数式接口 1.1 概念函数式接口在Java中是指:有且仅有一个抽象方法的接口.函数式接口,即适用于函数式编程场景的接口.而Java中的函数式编程体现就是Lambda,所以函数式接口就是可以适 ...
css伪类选择器详细解析及案例使用-----伪类选择器（2）
结构伪类选择器: <div> <ul> /*ul:only-of-type*/ <li>one</li> /*li:first-child li:nth ...
css伪类选择器详细解析及案例使用-----伪类选择器（1）
动态伪类选择器:E:link :选择匹配的E元素,并且匹配元素被定义了超链接并未被访问过.E:visited :选择匹配的E元素,而且匹配的元素被定义了连接并已被访问过.E:active :选择匹配的 ...

随机推荐

【SHOI2012】魔法树（树链剖分，线段树）
[SHOI2012]魔法树题面 BZOJ上找不到这道题目只有洛谷上有.. 所以粘贴洛谷的题面题解树链剖分之后直接维护线段树就可以了树链剖分良心模板题 #include<iostream ...
【POJ 3401】Asteroids
题面 Bessie wants to navigate her spaceship through a dangerous asteroid field in the shape of an N x ...
LightOJ1282 Leading and Trailing
题面给定两个数n,k 求n^k的前三位和最后三位 Input Input starts with an integer T (≤ 1000), denoting the number of test ...
freemind中内容变成html转义字符解决方法
在使用freemind的时候,没有正常关闭,导致原来的内容变成下面这样: <html> <body> <p> <b>查询所有</b> < ...
sudo解决方案企业级应用实战
visudo 编辑sudo配置文件 which 查找命令所在路径例:touch /etc/oldboy.txt 没证件 sudo touch /etc/oldboy.txt 可以内置命令没路 ...
3.2.2 break 与 continue 语句
break 语句和 continue语句在while循环和for循环中都可以使用,并且一般常与选择结构结合使用.一旦break语句被执行,将使得break语句所属层次的循环提前结束.continue语 ...
面向对象设计模式_命令模式(Command)解读
在.Net框架中很多对象的方法中都会有Invoke方法,这种方法的设计实际是用了设计模式的命令模式, 模式图如下其核心思路是将Client 向Receiver发送的命令行为进行抽象(ICommand ...
redis五种基本类型CRUD操作
1.String 增:set key1 value1 改:set key1 new-value.自增 incr key1.按照特定值递增:increby key1 inrevalue 删:del ke ...
11 个简单的 Java 性能调优技巧
大多数开发人员理所当然地以为性能优化很复杂,需要大量的经验和知识.好吧,不能说这是完全错误的.优化应用程序以获得最佳性能不是一件容易的事情.但是,这并不意味着如果你不具备这些知识,就不能做任何事情.这 ...
笔记：I/O流-字符集
Java 库的 java.nio 包用 Charset 类统一了对字符集的转换,支付姐建立了两个字节Unicode码元序列与使用本地字符编码方式的字节序列之间的映射,Charset类使用的时由IANA ...

【详细】Lucene使用案例

【详细】Lucene使用案例的更多相关文章

随机推荐

热门专题