[IR] Ranking - top k
PageRanking 通过:
- Input degree of link
- "Flow" model - 流量判断喜好度
传统的方式又是什么呢?

Every term在某个doc中的权重(地位)。

公共的terms在Query与Doc中对应的的地位(单位化后)直接相乘,然后全部加起来,构成了cosin相似度。

Efficient cosine ranking
传统放入堆的模式:n * log(k)
使用Quick Select:n + k * log(k) : "find top k" + "sort top k"
Threshold Methods

Solution:

也可以采取非精确的方式,为什么一定要绝对准确的top k呢?
Index Elimination (heuristic function)
- idf低,很可能是停用词
- 只考虑包含了多个term的doc。但有risk,return的文档数小于k
3 of 4 query terms
故意抽样只关注一部分满足一定人为定制条件的docs。
Champion List
Term 1 R个最高权重的docs
Term 2 R个最高权重的docs
Term 3 R个最高权重的docs
以上的result求并集,得到champion Set,然后在此内求Cosine Similarity.
Cluster Pruning Method
Can you propose some modification to this method such that it guarantees returning
the closest vector for any query? Describe your method and illustrate it with a small
example.
Step 1: Sort leaders.
Step 2: In the high dimensionality, check whether the query is surrounded by the top k leaders. The
initial value of k > 1.
Step 3: If the query is surrounded by top k leaders, we retrieve all the followers around top k
leaders.
Step 4: If not, k = k+1 and goto Step 2.
Let's illustrate it in 2D space.

When k = 3, Q1 is not surrounded by top 3 leaders (A1, A2, A3). Then, k = 4, Q1 is surrounded by
top 4 leaders. We retrieve all the followers around top 4 leaders and get the result. In this case, the
followers around other leaders cannot be closer than this result. This guarantees returning
the closest vector for any query.
This method depends on how do we define the “surround” for high-dimensional space. Normally, at
least k+1 points are needed in k-demensional space to surround one point.
If Q1 (query terms: a, b, c) is surrounded by 4 leaders, as following,
Query (a, b, c)
leader 1: (A1, B1, C1)
leader 2: (A2, B2, C2)
leader 3: (A3, B3, C3)
leader 4: (A4, B4, C4)
a must be between min(A1, A2, A3, A4) and max(A1, A2, A3, A4).
b must be between min(B1, B2, B3, B4) and max(B1, B2, B3, B4).
c must be between min(C1, C2, C3, C4) and max(C1, C2, C3, C4).
[IR] Ranking - top k的更多相关文章
- [LeetCode] Top K Frequent Elements 前K个高频元素
Given a non-empty array of integers, return the k most frequent elements. For example,Given [1,1,1,2 ...
- Leetcode 347. Top K Frequent Elements
Given a non-empty array of integers, return the k most frequent elements. For example,Given [1,1,1,2 ...
- 大数据热点问题TOP K
1单节点上的topK (1)批量数据 数据结构:HashMap, PriorityQueue 步骤:(1)数据预处理:遍历整个数据集,hash表记录词频 (2)构建最小堆:最小堆只存k个数据. 时间复 ...
- LeetCode "Top K Frequent Elements"
A typical solution is heap based - "top K". Complexity is O(nlgk). typedef pair<int, un ...
- 347. Top K Frequent Elements
Given a non-empty array of integers, return the k most frequent elements. For example,Given [1,1,1,2 ...
- 面试题:m个长度为n的ordered array,求top k 个 数字
package com.sinaWeibo.interview; import java.util.Comparator; import java.util.Iterator; import java ...
- get top k elements of the same key in hive
key points: 1. group by key and sort by using distribute by and sort by. 2. get top k elements by a ...
- Top k问题(线性时间选择算法)
问题描述:给定n个整数,求其中第k小的数. 分析:显然,对所有的数据进行排序,即很容易找到第k小的数.但是排序的时间复杂度较高,很难达到线性时间,哈希排序可以实现,但是需要另外的辅助空间. 这里我提供 ...
- pig询问top k,每个返回hour和ad_network_id最大的两个记录(SUBSTRING,order,COUNT_STAR,limit)
pig里面有一个TOP功能.我不知道为什么用不了.有时间去看看pig源代码. SET job.name 'top_k'; SET job.priority HIGH; --REGISTER piggy ...
随机推荐
- 爬虫神器xpath的用法(二)
爬取网页内容的时候,往往网页标签比较复杂,对于这种情况,需要用xpath的starts-with和string(.)功能属性来处理,具体看事例 #encoding=utf-8 from lxml im ...
- 在C#中如何读取枚举值的描述属性
在C#中,有时候我们需要读取枚举值的描述属性,也就是说这个枚举值代表了什么意思.比如本文中枚举值 Chinese ,我们希望知道它代表意思的说明(即“中文”). 有下面的枚举: 1 2 3 4 5 6 ...
- 【Android】android中Invalidate和postInvalidate的区别
Android中实现view的更新有两组方法,一组是invalidate,另一组是postInvalidate,其中前者是在UI线程自身中使用,而后者在非UI线程中使用. Android提供了Inva ...
- MiniDao-PE精简版
https://github.com/zhangdaiscott/minidao-pe MiniDao-PE精简版 MiniDao-PE 简介及特征 MiniDao-PE 是一种持久化解决方案,类似m ...
- ubuntu 16.04 samba 文件共享
生成samba用户名密码修改配置文件重启samba服务使之生效 以前在ubuntu 14.04的时候,很方便的通过几行命令和一个GUI界面就可以配置好samba共享文件给windows了: Ubunt ...
- http协议读书笔记1-概述
1.http协议在网络中的位置: http协议位于TCP协议的上层,http试用tcp来传输其报文数据,tcp在ip的上层. 2.浏览器发起连接的过程 上述图的过程是: 浏览器从url中解析出服务区的 ...
- git的一些相关知识
1.配置多个git远程仓库的ssh-Key切换(转自) 目前的git仓库如github都是通过使用SSH与客户端连接,如果只是固定使用单个git仓库的单个用户 (first),生成生成密钥对后,将公钥 ...
- VC 2010的重大变化
auto 关键字具有新的默认含义.由于使用旧含义的情况很少见,因此大多数应用程序都不会受此更改影响. 引入了新的 static_assert 关键字,如果代码中已经存在具有某个名称的标识符,则此关键字 ...
- Show Linux Package Sort By Size
ArchLinux: ~ $ pacsysclean Debian: ~ $ sudo apt-get install debian-goodies ~ $ dpigs -H
- bit操作 转
http://www.catonmat.net/blog/low-level-bit-hacks-you-absolutely-must-know/ Bit Hack #6. Turn off the ...