An investigation into using Filters to cut Lucene's TF-IDF scoring cost

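The idea is to turn queries into filters, which only decide whether a document matches and skip the TF-IDF score computation. Lucene ships a QueryWrapperFilter, but its performance is rather poor, so in practice you have to write the filters yourself. The ones covered here are TermFilter, ExactPhraseFilter, ConjunctionFilter and DisjunctionFilter.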
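To make concrete what a filter skips, here is a minimal sketch of the per-hit arithmetic behind a scored term match. It assumes the formulas of Lucene's classic TF-IDF similarity (DefaultSimilarity in 4.x) and deliberately leaves out norms, boosts and queryNorm; the class and method names are hypothetical and not part of Lucene.

```java
// Simplified sketch of TF-IDF term scoring; not Lucene code.
// Norms, boosts and queryNorm are intentionally omitted.
class TfIdfSketch {
    // Term-frequency factor: square root of the in-document frequency.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Inverse document frequency: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Roughly what a scoring query evaluates for every matching document:
    // idf enters once through the query weight and once through the document side.
    static double score(int freq, long docFreq, long numDocs) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf;
    }
}
```

A filter wrapped in ConstantScoreQuery does none of this per document; every hit receives the same score.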

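After a few days of testing, OR shows the clearest improvement: with 4 TermFilters and 4508 results, performance on my machine improved by about one third. ExactPhraseFilter also shows a small gain (5%-10%).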
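The DisjunctionFilter source is not included in this post, so the following is only a sketch of the general idea under simple assumptions: each term's postings are modelled as a plain array of document IDs, and the OR is a union into a bit set with no score bookkeeping at all. The class name and the array representation are hypothetical.

```java
import java.util.BitSet;

// Sketch of an unscored OR over several posting lists; not the author's filter.
class DisjunctionSketch {
    // postings[i] holds the doc IDs that contain the i-th term.
    static BitSet union(int[][] postings, int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        for (int[] list : postings) {
            for (int doc : list) {
                // Matching is just setting a bit; no tf, idf or coord is tracked.
                result.set(doc);
            }
        }
        return result;
    }
}
```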

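The most puzzling case is AND. I had assumed the effect would depend on the number of results and the number of sub-queries, but in several test runs performance basically got worse.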
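For reference when reasoning about the AND case, this is a minimal sketch of how a conjunction over sorted doc ID lists is typically evaluated: the first list leads and the others are advanced to each candidate. It is not the ConjunctionFilter measured above (that source is not shown); the names are hypothetical, and the sketch assumes postings[0] is the shortest list.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of an unscored AND over sorted posting lists; not the author's filter.
class ConjunctionSketch {
    // Linear stand-in for advance(): first index >= from whose doc ID is >= target.
    static int advance(int[] list, int from, int target) {
        int i = from;
        while (i < list.length && list[i] < target) {
            i++;
        }
        return i;
    }

    static List<Integer> intersect(int[][] postings) {
        List<Integer> hits = new ArrayList<Integer>();
        if (postings.length == 0) {
            return hits;
        }
        int[] cursor = new int[postings.length];
        for (int doc : postings[0]) {
            boolean inAll = true;
            for (int i = 1; i < postings.length; i++) {
                cursor[i] = advance(postings[i], cursor[i], doc);
                if (cursor[i] == postings[i].length) {
                    return hits; // one list is exhausted, no further matches
                }
                if (postings[i][cursor[i]] != doc) {
                    inAll = false;
                    break;
                }
            }
            if (inAll) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```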

The ExactPhraseFilter source and its unit test are attached below:

import java.io.IOException;
import java.util.ArrayList;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.Bits;

// A much-simplified imitation of Lucene's PhraseQuery, implemented as a Filter:
// it matches documents that contain the added terms as an exact phrase, without scoring.
public class ExactPhraseFilter extends Filter {
    protected final ArrayList<Term> terms = new ArrayList<Term>();
    protected final ArrayList<Integer> positions = new ArrayList<Integer>();
    protected String fieldName;

    // Appends the next term of the phrase; all terms must belong to the same field.
    public void add(Term term) {
        if (terms.size() == 0) {
            fieldName = term.field();
        } else {
            // Compare by value; == would only compare references.
            assert fieldName.equals(term.field());
        }
        positions.add(Integer.valueOf(terms.size()));
        terms.add(term);
    }

    @Override
    public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
        return new ExactPhraseDocIdSet(context, acceptDocs);
    }

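    // Per-term iteration state: the postings enum plus bookkeeping for position matching.
    static class PostingAndFreq implements Comparable<PostingAndFreq> {
        DocsAndPositionsEnum posEnum;
        int docFreq;          // number of documents containing the term
        int position;         // offset of the term inside the phrase
        boolean useAdvance;   // whether to skip with advance() instead of nextDoc()
        int posFreq = 0;      // occurrences of the term in the current document
        int pos = -1;         // current position, shifted by the phrase offset
        int posTime = 0;      // how many positions have been read so far

        public PostingAndFreq(DocsAndPositionsEnum posEnum, int docFreq, int position, boolean useAdvance) {
            this.posEnum = posEnum;
            this.docFreq = docFreq;
            this.position = position;
            this.useAdvance = useAdvance;
        }

        // Orders by ascending docFreq (rarest first), then by phrase offset.
        @Override
        public int compareTo(PostingAndFreq other) {
            if (docFreq != other.docFreq) {
                return docFreq - other.docFreq;
            }
            if (position != other.position) {
                return position - other.position;
            }
            return 0;
        }
    }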

    protected class ExactPhraseDocIdSet extends DocIdSet {
        protected final AtomicReaderContext context;
        protected final Bits acceptDocs;
        protected final PostingAndFreq[] postings;
        protected boolean noDocs = false;

        public ExactPhraseDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {
            this.context = context;
            this.acceptDocs = acceptDocs;
            postings = new PostingAndFreq[terms.size()];

            // terms() returns null when this segment has no postings for the field.
            Terms fieldTerms = context.reader().terms(fieldName);
            if (fieldTerms == null) {
                noDocs = true;
                return;
            }
            TermsEnum te = fieldTerms.iterator(null);
            for (int i = 0; i < terms.size(); ++i) {
                final Term t = terms.get(i);
                if (!te.seekExact(t.bytes(), true)) {
                    // A phrase term is missing from this segment, so nothing can match.
                    noDocs = true;
                    return;
                }
                postings[i] = new PostingAndFreq(te.docsAndPositions(acceptDocs, null, 0), te.docFreq(), positions.get(i), false);
            }

            // Sort so that postings[0] is the rarest term and leads the iteration.
            ArrayUtil.mergeSort(postings);
            for (int i = 1; i < terms.size(); ++i) {
                // Decide after sorting: use advance() for terms far more frequent than the rarest one.
                postings[i].useAdvance = postings[i].docFreq > 5 * postings[0].docFreq;
                postings[i].posEnum.nextDoc();
            }
        }

        @Override
        public DocIdSetIterator iterator() throws IOException {
            if (noDocs) {
                return EMPTY_DOCIDSET.iterator();
            } else {
                return new ExactPhraseDocIdSetIterator(context, acceptDocs);
            }
        }

        protected class ExactPhraseDocIdSetIterator extends DocIdSetIterator {
            protected int docID = -1;

            public ExactPhraseDocIdSetIterator(AtomicReaderContext context, Bits acceptDocs) throws IOException {
            }

            @Override
            public int nextDoc() throws IOException {
                while (true) {
                    // first (rarest) term
                    final int doc = postings[0].posEnum.nextDoc();
                    if (doc == DocIdSetIterator.NO_MORE_DOCS) {
                        return docID = doc;
                    }
                    // non-first terms: move each one up to the candidate document
                    int i = 1;
                    while (i < postings.length) {
                        final PostingAndFreq pf = postings[i];
                        int doc2 = pf.posEnum.docID();
                        if (pf.useAdvance) {
                            if (doc2 < doc) {
                                doc2 = pf.posEnum.advance(doc);
                            }
                        } else {
                            // Step with nextDoc(), falling back to advance() after 50 steps.
                            int iter = 0;
                            while (doc2 < doc) {
                                if (++iter == 50) {
                                    doc2 = pf.posEnum.advance(doc);
                                } else {
                                    doc2 = pf.posEnum.nextDoc();
                                }
                            }
                        }
                        if (doc2 > doc) {
                            // This term skips the candidate, so it cannot match.
                            break;
                        }
                        ++i;
                    }
                    if (i == postings.length) {
                        // Every term occurs in this document; now verify the positions.
                        docID = doc;
                        if (containsPhrase()) {
                            return docID;
                        }
                    }
                }
            }

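            @Override
            public int advance(int target) throws IOException {
                // Linear fallback instead of throwing: step with nextDoc() until target is reached.
                int doc;
                while ((doc = nextDoc()) < target) {
                    // keep scanning
                }
                return doc;
            }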

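            // Checks whether the terms occur at consecutive positions in the current document.
            // Each position is shifted by the term's phrase offset, so a match means that
            // all terms agree on the same shifted position (the phrase start).
            private boolean containsPhrase() throws IOException {
                int index = -1;
                int i = 0;
                PostingAndFreq pf;

                // init: read the first position of every term.
                for (i = 0; i < postings.length; ++i) {
                    postings[i].posFreq = postings[i].posEnum.freq();
                    postings[i].pos = postings[i].posEnum.nextPosition() - postings[i].position;
                    postings[i].posTime = 1;
                }

                while (true) {
                    pf = postings[0];
                    // first term: catch up to the current candidate start.
                    while (pf.pos < index && pf.posTime < pf.posFreq) {
                        pf.pos = pf.posEnum.nextPosition() - pf.position;
                        ++pf.posTime;
                    }
                    if (pf.pos >= index) {
                        index = pf.pos;
                    } else if (pf.posTime == pf.posFreq) {
                        // Positions exhausted before reaching the candidate.
                        return false;
                    }

                    // other terms: each must land exactly on the candidate start.
                    for (i = 1; i < postings.length; ++i) {
                        pf = postings[i];
                        while (pf.pos < index && pf.posTime < pf.posFreq) {
                            pf.pos = pf.posEnum.nextPosition() - pf.position;
                            ++pf.posTime;
                        }
                        if (pf.pos > index) {
                            // Overshot: raise the candidate and start over from the first term.
                            index = pf.pos;
                            break;
                        }
                        if (pf.pos == index) {
                            continue;
                        }
                        if (pf.posTime == pf.posFreq) {
                            return false;
                        }
                    }
                    if (i == postings.length) {
                        return true;
                    }
                }
            }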

            @Override
            public int docID() {
                return docID;
            }
        }
    }
}
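The position check in containsPhrase() can be tried without an index. The sketch below mirrors the same idea on plain arrays, under the assumption that positions[i] holds the sorted positions of the i-th phrase term within one document; the class name and the array representation are hypothetical.

```java
// Stand-alone sketch of exact-phrase position matching; not Lucene code.
class PhrasePositionSketch {
    // positions[i]: sorted positions of the i-th phrase term in a single document.
    static boolean containsPhrase(int[][] positions) {
        int n = positions.length;
        if (n == 0) {
            return false;
        }
        int[] cursor = new int[n];
        int candidate = Integer.MIN_VALUE; // phrase start currently being tested
        while (true) {
            boolean raised = false;
            for (int i = 0; i < n; i++) {
                // Skip occurrences that would place the phrase start before the candidate.
                while (cursor[i] < positions[i].length && positions[i][cursor[i]] - i < candidate) {
                    cursor[i]++;
                }
                if (cursor[i] == positions[i].length) {
                    return false; // this term has no occurrence left
                }
                int start = positions[i][cursor[i]] - i;
                if (start > candidate) {
                    candidate = start;
                    if (i > 0) {
                        raised = true; // earlier terms must be re-checked
                        break;
                    }
                }
            }
            if (!raised) {
                return true; // every term agrees on the same phrase start
            }
        }
    }
}
```

With single-character tokens, the text 烤烧中华烧烤 gives 烧 the positions {1, 4} and 烤 the positions {0, 5}; the phrase 烧烤 is found at start position 4, which is why the unit test expects document 4 to match.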

Unit test:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.testng.Assert;
import org.testng.annotations.AfterTest;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

// Custom codec from the author's own code base; it is not part of Lucene.
import com.dp.arts.lucenex.codec.Dp10Codec;

public class ExactPhraseFilterTest {
    final Directory dir = new RAMDirectory();

    @BeforeTest
    public void setUp() throws IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
        iwc.setOpenMode(OpenMode.CREATE);
        iwc.setCodec(Codec.forName(Dp10Codec.DP10_CODEC_NAME));
        IndexWriter writer = new IndexWriter(dir, iwc);
        addDocument(writer, "新疆烧烤");  // 0
        addDocument(writer, "啤酒");  // 1
        addDocument(writer, "烤烧");  // 2
        addDocument(writer, "烧烧烧");  // 3
        addDocument(writer, "烤烧中华烧烤"); // 4
        writer.close();
    }

    private void addDocument(IndexWriter writer, String str) throws IOException {
        Document doc = new Document();
        doc.add(new TextField("searchkeywords", str, Store.YES));
        writer.addDocument(doc, new StandardAnalyzer(Version.LUCENE_40));
    }

    @AfterTest
    public void tearDown() throws IOException {
        this.dir.close();
    }

    @Test
    public void test1() throws IOException {
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        ExactPhraseFilter pf = new ExactPhraseFilter();
        pf.add(new Term("searchkeywords", "烧"));
        pf.add(new Term("searchkeywords", "烤"));
        // Wrap the filter so it can be run as a query; every hit gets the same score.
        Query query = new ConstantScoreQuery(pf);
        TopDocs results = searcher.search(query, 20);
        // Use TestNG assertions: the Java assert keyword is a no-op unless -ea is set.
        Assert.assertEquals(results.totalHits, 2);
        Assert.assertEquals(results.scoreDocs[0].doc, 0);
        Assert.assertEquals(results.scoreDocs[1].doc, 4);
        searcher.getIndexReader().close();
    }
}
