WAND (weak AND) algorithm

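Search queries are usually short. When the query is long, for example a whole passage of text for which similar texts must be found, the WAND algorithm is typically used. It has mature applications in advertising systems, mainly in AdSense-style scenarios where ads similar to the content of a page have to be retrieved.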

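In short: text relevance is normally computed by querying an inverted index. That already saves a great deal of time compared with scanning every document, but it can still be slow.
The reason is that we often want only the top n results, yet documents that are obviously poor matches still go through the expensive relevance computation. The weak-AND algorithm computes an upper bound on each term's contribution, uses these bounds to estimate an upper bound on each document's relevance, and applies a threshold to prune entries from the inverted lists, which is where the speedup comes from.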

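WAND first needs an upper bound on each term's contribution to relevance. The simplest relevance measure is TF*IDF. The TF of a term within the query is usually 1 and its IDF is fixed, so the task reduces to bounding the term's TF within a document. TF is normally normalized, i.e. divided by the total number of terms in the document, so what has to be estimated is the largest share of a document that the term can occupy. This can be computed offline.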
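The offline step can be sketched as follows. This is only an illustration under stated assumptions: the function name `term_upper_bounds`, the toy corpus `DOCS`, and the smoothed IDF `log(1 + N/df)` are choices made for this example, not something the original post prescribes.

```python
import math
from collections import Counter


def term_upper_bounds(docs):
    """Return {term: IDF * max normalized TF over all docs}.

    docs maps doc id -> list of tokens (hypothetical input format).
    """
    n_docs = len(docs)
    doc_freq = Counter()   # number of docs containing each term
    max_tf = {}            # largest normalized TF seen for each term
    for tokens in docs.values():
        for term, count in Counter(tokens).items():
            doc_freq[term] += 1
            tf = count / len(tokens)  # normalize by document length
            max_tf[term] = max(max_tf.get(term, 0.0), tf)
    # Assumed IDF variant: log(1 + N / df); any fixed IDF works the same way.
    return {t: math.log(1 + n_docs / doc_freq[t]) * max_tf[t] for t in max_tf}


DOCS = {
    1: ["a", "b", "a"],
    2: ["b", "c"],
    3: ["a", "c", "c", "c"],
}

if __name__ == "__main__":
    for term, bound in sorted(term_upper_bounds(DOCS).items()):
        print(term, round(bound, 4))
```

By construction, the TF*IDF of a term in any single document of the corpus never exceeds the bound returned for that term.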

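Once each term's upper bound is known, the upper bound on the relevance between a query and a document follows directly: it is the sum of the upper bounds of the terms they have in common.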

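So for a query, look up the contribution upper bound of each of its terms. For a document, take the terms that appear in both the document and the query and sum their upper bounds. Compare this sum with a preset threshold: if it reaches the threshold, the document moves on to the next step, the full relevance computation; otherwise it is discarded.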
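A minimal sketch of this check, assuming the per-term upper bounds are already available in a dictionary. The names `score_upper_bound` and `passes_threshold` and the example values are hypothetical, chosen only for illustration.

```python
def score_upper_bound(query_terms, doc_terms, ub):
    """Sum the upper bounds of the terms shared by query and document."""
    shared = set(query_terms) & set(doc_terms)
    return sum(ub[t] for t in shared if t in ub)


def passes_threshold(query_terms, doc_terms, ub, theta):
    """True if the document's upper bound reaches theta, i.e. it is worth
    running the full relevance computation on it."""
    return score_upper_bound(query_terms, doc_terms, ub) >= theta


# Illustrative values (assumptions, not measured data).
EXAMPLE_UB = {"t0": 0.5, "t1": 1, "t2": 2}
EXAMPLE_QUERY = ["t0", "t1", "t2"]

if __name__ == "__main__":
    # Shares t0 and t1 only: bound 1.5 is below theta = 2, so it is dropped.
    print(passes_threshold(EXAMPLE_QUERY, ["t0", "t1", "x"], EXAMPLE_UB, 2))
    # Shares t0 and t2: bound 2.5 reaches theta = 2, so it is kept.
    print(passes_threshold(EXAMPLE_QUERY, ["t0", "t2"], EXAMPLE_UB, 2))
```

The check is cheap because it needs only term membership, not the actual term frequencies in the document.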

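Finding the n most similar documents this way means fetching every document, pre-computing its upper bound, comparing it with the threshold, and only then deciding whether it belongs in the top n. That works, but it can be optimized, and the aim of the optimization is to do as little of this pre-computation as possible. The algorithm described in the WAND paper is as follows: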

http://wulc.me/2018/03/18/Wand%20%E7%AE%97%E6%B3%95%E4%BB%8B%E7%BB%8D%E4%B8%8E%E5%AE%9E%E7%8E%B0/

import heapq

UB = {"t0": 0.5, "t1": 1, "t2": 2, "t3": 3, "t4": 4}  # upper bound of each term's contribution
LAST_ID = 999999999999  # sentinel: larger than every doc id in the inverted index
THETA = 2  # threshold: a doc is fully scored only if its upper bound reaches THETA
TOPN = 3  # maximum number of results


class WAND:

    def __init__(self, InvertIndex):
        """init inverted index and necessary variables"""
        self.result_list = []  # min-heap of (score, doc_id)
        self.inverted_index = InvertIndex  # term -> [docid1, docid2, ..., LAST_ID]
        self.current_doc = 0
        self.current_inverted_index = {}  # term -> [current doc id, position in posting list]
        self.query_terms = []
        self.sort_terms = []  # [current doc id, term] pairs
        self.threshold = THETA
        self.last_id = LAST_ID

    def __init_query(self, query_terms):
        """reset the per-query state"""
        self.result_list = []  # do not carry results over from an earlier query
        self.current_doc = 0
        self.current_inverted_index = {}
        self.query_terms = []
        self.sort_terms = []
        for term in query_terms:
            if term in self.inverted_index:  # a query term may be missing from the index
                doc_id = self.inverted_index[term][0]
                self.query_terms.append(term)
                self.current_inverted_index[term] = [doc_id, 0]  # [docid, position]
                self.sort_terms.append([doc_id, term])

    def __pick_term(self, pivot_index):
        """select a term that precedes pivot_index in the sorted term list

        the paper recommends picking the term with the max idf; here we simply
        pick the first term, and return its index rather than the term itself"""
        return 0

    def __find_pivot_term(self):
        """find the first term at which the accumulated upper bound reaches the threshold"""
        score = 0
        for i in range(len(self.sort_terms)):
            score += UB[self.sort_terms[i][1]]
            if score >= self.threshold:
                return [self.sort_terms[i][1], i]  # [term, index]
        return [None, len(self.sort_terms)]

    def __iterator_invert_index(self, change_term, docid, pos):
        """find the first new_doc_id >= docid in the posting list of change_term,
        scanning from pos; fall back to self.last_id if there is none"""
        doc_list = self.inverted_index[change_term]
        # defaults, in case nothing matches (posting lists normally end with
        # LAST_ID, so the loop below is expected to find an entry)
        new_doc_id, new_pos = self.last_id, len(doc_list) - 1
        for i in range(pos, len(doc_list)):
            if doc_list[i] >= docid:
                new_pos = i
                new_doc_id = doc_list[i]
                break
        return [new_doc_id, new_pos]

    def __advance_term(self, change_index, doc_id):
        """move the cursor of term self.sort_terms[change_index] forward to the
        first doc id >= doc_id"""
        change_term = self.sort_terms[change_index][1]
        pos = self.current_inverted_index[change_term][1]
        new_doc_id, new_pos = self.__iterator_invert_index(change_term, doc_id, pos)
        self.current_inverted_index[change_term] = [new_doc_id, new_pos]
        self.sort_terms[change_index][0] = new_doc_id

    def __next(self):
        """return the next candidate doc id for full scoring, or None"""
        while True:
            self.sort_terms.sort()  # sort terms by current doc id
            pivot_term, pivot_index = self.__find_pivot_term()  # accumulated UB >= threshold
            if pivot_term is None:  # no more candidates
                return None
            pivot_doc_id = self.current_inverted_index[pivot_term][0]
            if pivot_doc_id == self.last_id:  # no more candidates
                return None
            if pivot_doc_id <= self.current_doc:
                # pivot doc was already considered: move one preceding term past it
                change_index = self.__pick_term(pivot_index)
                self.__advance_term(change_index, self.current_doc + 1)
            else:
                first_doc_id = self.sort_terms[0][0]
                if pivot_doc_id == first_doc_id:
                    # all preceding terms point at the pivot doc
                    self.current_doc = pivot_doc_id
                    return self.current_doc  # return the doc for full scoring
                else:
                    # advance all preceding terms (instead of just one) to the pivot doc
                    change_index = 0
                    while change_index < pivot_index:
                        self.__advance_term(change_index, pivot_doc_id)
                        change_index += 1

    def __insert_heap(self, doc_id, score):
        """keep the top N results in a min-heap"""
        if len(self.result_list) < TOPN:
            heapq.heappush(self.result_list, (score, doc_id))
        else:
            heapq.heappushpop(self.result_list, (score, doc_id))

    def __calculate_doc_relevence(self, docid):
        """fully calculate the relevance between doc and query
        (this demo just sums UB of the matching terms as a stand-in for a real score)"""
        score = 0
        for term in self.query_terms:
            if docid in self.inverted_index[term]:
                score += UB[term]
        return score

    def perform_query(self, query_terms):
        self.__init_query(query_terms)
        while True:
            candidate_docid = self.__next()
            if candidate_docid is None:
                break
            print("candidate doc", candidate_docid)
            full_doc_score = self.__calculate_doc_relevence(candidate_docid)
            self.__insert_heap(candidate_docid, full_doc_score)
            print("result list ", self.result_list)
        return self.result_list


if __name__ == "__main__":
    testIndex = {}
    testIndex["t0"] = [1, 3, 26, LAST_ID]
    testIndex["t1"] = [1, 2, 4, 10, 100, LAST_ID]
    testIndex["t2"] = [2, 3, 6, 34, 56, LAST_ID]
    testIndex["t3"] = [1, 4, 5, 23, 70, 200, LAST_ID]
    testIndex["t4"] = [5, 14, 78, LAST_ID]

    w = WAND(testIndex)
    final_result = w.perform_query(["t0", "t1", "t2", "t3", "t4"])
    print("=================final result=======================")
    # the heap is not stored in sorted order, so sort before printing
    for score, doc_id in sorted(final_result, reverse=True):
        print("doc {0}, relevance score {1}".format(doc_id, score))
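The pivot selection in `__find_pivot_term` is the step that makes skipping possible, so it may help to look at it in isolation. The sketch below is a simplified stand-alone version: the function name `find_pivot` and the `cursors` dictionary (term -> doc id the term's cursor currently points at) are assumptions made for this illustration.

```python
def find_pivot(cursors, ub, theta):
    """Sort terms by their current doc id and accumulate upper bounds;
    the first term at which the sum reaches theta is the pivot.

    Returns (pivot_term, pivot_doc_id), or None if the sum never reaches theta.
    """
    acc = 0.0
    # Same ordering as sorting [doc_id, term] pairs: doc id first, then term.
    for term, doc_id in sorted(cursors.items(), key=lambda kv: (kv[1], kv[0])):
        acc += ub[term]
        if acc >= theta:
            return term, doc_id
    return None


EXAMPLE_UB = {"t0": 0.5, "t1": 1, "t2": 2}

if __name__ == "__main__":
    # Cursors at docs 3, 4 and 6: the bounds of t0 and t1 sum to only 1.5,
    # so the pivot is t2 at doc 6.
    print(find_pivot({"t0": 3, "t1": 4, "t2": 6}, EXAMPLE_UB, 2))
```

Any document with an id smaller than the pivot's doc id can contain only terms that precede the pivot, and their upper bounds sum to less than the threshold. In the example, docs 3, 4 and 5 can therefore be skipped without scoring, and the cursors of `t0` and `t1` can jump straight to the first doc id that is at least 6.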
