[IR] Ranking - top k
PageRanking 通过:
- Input degree of link
- "Flow" model - 流量判断喜好度
传统的方式又是什么呢?

Every term在某个doc中的权重(地位)。

公共的terms在Query与Doc中对应的的地位(单位化后)直接相乘,然后全部加起来,构成了cosin相似度。

Efficient cosine ranking
传统放入堆的模式:n * log(k)
使用Quick Select:n + k * log(k) : "find top k" + "sort top k"
Threshold Methods

Solution:

也可以采取非精确的方式,为什么一定要绝对准确的top k呢?
Index Elimination (heuristic function)
- idf低,很可能是停用词
- 只考虑包含了多个term的doc。但有risk,return的文档数小于k
3 of 4 query terms
故意抽样只关注一部分满足一定人为定制条件的docs。
Champion List
Term 1 R个最高权重的docs
Term 2 R个最高权重的docs
Term 3 R个最高权重的docs
以上的result求并集,得到champion Set,然后在此内求Cosine Similarity.
Cluster Pruning Method
Can you propose some modification to this method such that it guarantees returning
the closest vector for any query? Describe your method and illustrate it with a small
example.
Step 1: Sort leaders.
Step 2: In the high dimensionality, check whether the query is surrounded by the top k leaders. The
initial value of k > 1.
Step 3: If the query is surrounded by top k leaders, we retrieve all the followers around top k
leaders.
Step 4: If not, k = k+1 and goto Step 2.
Let's illustrate it in 2D space.

When k = 3, Q1 is not surrounded by top 3 leaders (A1, A2, A3). Then, k = 4, Q1 is surrounded by
top 4 leaders. We retrieve all the followers around top 4 leaders and get the result. In this case, the
followers around other leaders cannot be closer than this result. This guarantees returning
the closest vector for any query.
This method depends on how do we define the “surround” for high-dimensional space. Normally, at
least k+1 points are needed in k-demensional space to surround one point.
If Q1 (query terms: a, b, c) is surrounded by 4 leaders, as following,
Query (a, b, c)
leader 1: (A1, B1, C1)
leader 2: (A2, B2, C2)
leader 3: (A3, B3, C3)
leader 4: (A4, B4, C4)
a must be between min(A1, A2, A3, A4) and max(A1, A2, A3, A4).
b must be between min(B1, B2, B3, B4) and max(B1, B2, B3, B4).
c must be between min(C1, C2, C3, C4) and max(C1, C2, C3, C4).
[IR] Ranking - top k的更多相关文章
- [LeetCode] Top K Frequent Elements 前K个高频元素
Given a non-empty array of integers, return the k most frequent elements. For example,Given [1,1,1,2 ...
- Leetcode 347. Top K Frequent Elements
Given a non-empty array of integers, return the k most frequent elements. For example,Given [1,1,1,2 ...
- 大数据热点问题TOP K
1单节点上的topK (1)批量数据 数据结构:HashMap, PriorityQueue 步骤:(1)数据预处理:遍历整个数据集,hash表记录词频 (2)构建最小堆:最小堆只存k个数据. 时间复 ...
- LeetCode "Top K Frequent Elements"
A typical solution is heap based - "top K". Complexity is O(nlgk). typedef pair<int, un ...
- 347. Top K Frequent Elements
Given a non-empty array of integers, return the k most frequent elements. For example,Given [1,1,1,2 ...
- 面试题:m个长度为n的ordered array,求top k 个 数字
package com.sinaWeibo.interview; import java.util.Comparator; import java.util.Iterator; import java ...
- get top k elements of the same key in hive
key points: 1. group by key and sort by using distribute by and sort by. 2. get top k elements by a ...
- Top k问题(线性时间选择算法)
问题描述:给定n个整数,求其中第k小的数. 分析:显然,对所有的数据进行排序,即很容易找到第k小的数.但是排序的时间复杂度较高,很难达到线性时间,哈希排序可以实现,但是需要另外的辅助空间. 这里我提供 ...
- pig询问top k,每个返回hour和ad_network_id最大的两个记录(SUBSTRING,order,COUNT_STAR,limit)
pig里面有一个TOP功能.我不知道为什么用不了.有时间去看看pig源代码. SET job.name 'top_k'; SET job.priority HIGH; --REGISTER piggy ...
随机推荐
- Memcached常规应用与分布式部署方案
1.Memcached常规应用 $mc = new Memcache(); $mc->conncet('127.0.0.1', 11211); $sql = sprintf("SELE ...
- GO語言基礎教程:Hello world!
首先簡單地說一下GO語言的環境安裝,從 http://golang.org/dl/ 針對自己的操作系統選擇合適的安裝包,然後下載安裝即可,下載的時候注意別選錯了的操作系統,例如go1.3.1.darw ...
- 努力学习 HTML5 (4)—— 浏览器对语义元素的支持情况
经过上一节学习,我们已经建立一个结构良好的页面,如果在旧版的 IE 浏览器中浏览可能这些语义元素无法显示. 毕竟这些语义元素什么也不做,要支持它们,只要让浏览器把它们当做普通的 <div> ...
- Android 的EditText实现不可编辑
android:editable is deprecated: Use an <EditText> to make it editable android:editable is depr ...
- C++ 记事本: 变量
C++ 变量也许和其他语言的变量没有什么差别.就是用来存储一些可能会变值的容器. 当然 C++ 变量里又分为 原子类型 的(int , char ,bool 等等),复合类型 的(struct ,cl ...
- MYSQL INSERT INTO SELECT 不插入重复数据
INSERT INTO `b_common_member_count` (uid) SELECT uid FROM `b_common_member` WHERE uid NOT IN (SELECT ...
- 更改chrome底色为护目色
找到chrome目录:C:\Documents and Settings\用户名\Local Settings\Application Data\Google\Chrome\User Data\Def ...
- SSH在Jenkins中的使用
我们今天在迁移Jenkins的时候又出现无法调用私钥来获取oschina的git代码和使用scp拷贝无法验证的问题.我发现主要的问题实际上是关于ssh的问题,因为git和scp都是通过ssh来实现与远 ...
- ecshop利用.htaccess实现301重定向的方法
实现方法如下(空间必须支持对目录中的.htaccess文件解析) 打开 .htaccess 找到 RewriteEngine on 它的下方添加 RewriteCond %{HTTP_HOST} ^需 ...
- 优先队列求解Huffman编码 c++
优先队列小析 优先队列的模板: template <class T, class Container = vector<T>,class Compare = less< ...