PageRank ranks pages by:

  1. In-degree of links
  2. "Flow" model: the traffic flowing into a page indicates how much it is favored
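As a rough illustration of the flow idea, here is a minimal power-iteration sketch. The damping factor of 0.85, the fixed iteration count, and the handling of pages without out-links are assumptions of this sketch, not something stated in these notes.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.

    damping and iterations are illustrative defaults (assumptions).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this toy graph C receives flow from both A and B, so it ends up with the highest rank.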

What, then, is the traditional approach?

Every term has a weight (its importance) within a given doc.

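For the terms shared by the Query and the Doc, multiply each term's weight in the Query by its weight in the Doc (after normalizing both vectors to unit length), then sum all the products. The result is the cosine similarity.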
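The computation above can be sketched as follows. Representing a vector as a dict from term to weight is an assumption of this sketch, and the function names are made up for illustration.

```python
import math


def normalize(weights):
    """Scale a term -> weight mapping to unit length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}


def cosine_similarity(query, doc):
    """Sum the products of normalized weights over the shared terms."""
    q = normalize(query)
    d = normalize(doc)
    return sum(q[term] * d[term] for term in q.keys() & d.keys())


score = cosine_similarity({"a": 3.0, "b": 4.0}, {"a": 3.0})
```

Only the shared term "a" contributes here: its normalized weight is 0.6 in the query and 1.0 in the doc.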

  


Efficient cosine ranking

Traditional approach of pushing scores into a heap: O(n log k)
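A minimal sketch of the heap approach, assuming the scores are already computed and stored as a dict from doc id to score (the names are illustrative). The min-heap never holds more than k entries, which is where the log k factor comes from.

```python
import heapq


def top_k_heap(scores, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    if k <= 0:
        return []
    heap = []  # min-heap of (score, doc_id), never larger than k
    for doc_id, score in scores.items():
        entry = (score, doc_id)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # The new entry beats the weakest of the current top k.
            heapq.heapreplace(heap, entry)
    return [(doc_id, score) for score, doc_id in sorted(heap, reverse=True)]


best = top_k_heap({"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7}, 2)
```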

Using Quick Select: O(n + k log k) on average, i.e. "find top k" + "sort top k"
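The two phases can be sketched like this: a quickselect pass moves the k largest entries to the front of the list, then only those k entries are sorted. The random pivot and the dict input are choices made for this sketch.

```python
import random


def top_k_quickselect(scores, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    items = [(score, doc_id) for doc_id, score in scores.items()]
    k = min(k, len(items))
    if k <= 0:
        return []
    lo, hi = 0, len(items) - 1
    # Phase 1, "find top k": partition in descending order until
    # position k - 1 holds the k-th largest entry.
    while lo < hi:
        pivot = items[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:
            while items[i] > pivot:
                i += 1
            while items[j] < pivot:
                j -= 1
            if i <= j:
                items[i], items[j] = items[j], items[i]
                i += 1
                j -= 1
        # Now items[lo..j] >= pivot and items[i..hi] <= pivot.
        if k - 1 <= j:
            hi = j
        elif k - 1 >= i:
            lo = i
        else:
            break
    # Phase 2, "sort top k": only the first k entries need sorting.
    top = sorted(items[:k], reverse=True)
    return [(doc_id, score) for score, doc_id in top]


best = top_k_quickselect({"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7}, 2)
```

The linear bound for the first phase holds on average. An unlucky sequence of pivots can make it slower.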

Threshold Methods

  Solution: 

An inexact approach is also an option: why insist on an exactly correct top k?

Index Elimination (heuristic function)

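  1. A term with low idf is probably a stop word, so it can be ignored.
  2. Only consider docs that contain several of the query terms. The risk is that fewer than k docs are returned.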

For example: require at least 3 of the 4 query terms.

In other words, deliberately look at only the subset of docs that satisfy some hand-crafted condition.
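Both heuristics can be sketched in a few lines. The postings layout (term to set of doc ids), the idf threshold, and the function names are assumptions made for illustration.

```python
def drop_low_idf_terms(query_terms, idf, min_idf=1.0):
    """Heuristic 1: ignore query terms whose idf is below a threshold.

    min_idf is an arbitrary illustrative value.
    """
    return [term for term in query_terms if idf.get(term, 0.0) >= min_idf]


def docs_with_enough_terms(query_terms, postings, min_terms):
    """Heuristic 2: keep only docs containing at least min_terms query terms."""
    counts = {}
    for term in query_terms:
        for doc_id in postings.get(term, ()):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return {doc_id for doc_id, count in counts.items() if count >= min_terms}


postings = {
    "a": {1, 2, 3},
    "b": {1, 2},
    "c": {1, 2, 3},
    "d": {1, 4},
}
# "3 of 4 query terms": docs 1 and 2 qualify, docs 3 and 4 do not.
candidates = docs_with_enough_terms(["a", "b", "c", "d"], postings, min_terms=3)
```

Cosine scores are then computed only for `candidates`, which may contain fewer than k docs.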

Champion List

Term 1: the R docs with the highest weights

Term 2: the R docs with the highest weights

Term 3: the R docs with the highest weights

Take the union of the results above to get the champion set, then compute cosine similarity only within that set.
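A minimal sketch of building the lists and taking the union at query time. The input layout (term to a dict of doc id and weight) and the function names are assumptions for illustration.

```python
import heapq


def build_champion_lists(term_weights, r):
    """For each term, keep the r doc ids with the highest weight."""
    return {
        term: heapq.nlargest(r, docs, key=docs.get)
        for term, docs in term_weights.items()
    }


def champion_set(query_terms, champions):
    """Union of the champion lists of all query terms."""
    result = set()
    for term in query_terms:
        result.update(champions.get(term, ()))
    return result


term_weights = {
    "cat": {"d1": 0.9, "d2": 0.1, "d3": 0.5},
    "dog": {"d2": 0.8, "d4": 0.7, "d5": 0.2},
}
champions = build_champion_lists(term_weights, r=2)
candidates = champion_set(["cat", "dog"], champions)
```

Here d5 never enters the candidate set, so its cosine score is never computed.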

Cluster Pruning Method

Can you propose some modification to this method such that it guarantees returning
the closest vector for any query? Describe your method and illustrate it with a small
example.

Step 1: Sort leaders.
Step 2: In the high-dimensional space, check whether the query is surrounded by the top k leaders. The
initial value of k is greater than 1.
Step 3: If the query is surrounded by top k leaders, we retrieve all the followers around top k
leaders.
Step 4: If not, k = k+1 and goto Step 2.
Let's illustrate it in 2D space.

When k = 3, Q1 is not surrounded by top 3 leaders (A1, A2, A3). Then, k = 4, Q1 is surrounded by
top 4 leaders. We retrieve all the followers around top 4 leaders and get the result. In this case, the
followers around other leaders cannot be closer than this result. This guarantees returning
the closest vector for any query.
This method depends on how we define “surround” in a high-dimensional space. Normally, at
least k+1 points are needed in k-dimensional space to surround one point.

If Q1 (query terms: a, b, c) is surrounded by 4 leaders, as follows,
Query (a, b, c)
leader 1: (A1, B1, C1)
leader 2: (A2, B2, C2)
leader 3: (A3, B3, C3)
leader 4: (A4, B4, C4)
a must be between min(A1, A2, A3, A4) and max(A1, A2, A3, A4).
b must be between min(B1, B2, B3, B4) and max(B1, B2, B3, B4).
c must be between min(C1, C2, C3, C4) and max(C1, C2, C3, C4).
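Steps 1 to 4 with the per-coordinate check above can be sketched as follows. Sorting leaders by Euclidean distance to the query is an assumption of this sketch (the notes only say "sort leaders"), and the function names are made up.

```python
import math


def is_surrounded(query, leaders):
    """Per-coordinate test: each query coordinate lies between the
    minimum and maximum of that coordinate over the leaders."""
    return all(
        min(leader[i] for leader in leaders) <= q <= max(leader[i] for leader in leaders)
        for i, q in enumerate(query)
    )


def leaders_to_search(query, leaders, k=2):
    """Grow k until the top k leaders surround the query (or none are left)."""
    # Step 1: sort leaders, here by Euclidean distance (assumption).
    ordered = sorted(leaders, key=lambda leader: math.dist(query, leader))
    k = min(k, len(ordered))
    # Steps 2 to 4: check, and add one more leader while the check fails.
    while k < len(ordered) and not is_surrounded(query, ordered[:k]):
        k += 1
    return ordered[:k]


leaders = [(1, 1), (2, 1), (1, 2), (-3, -3), (10, 10)]
chosen = leaders_to_search((0, 0), leaders)
```

With these points the check fails for k = 2 and k = 3 and succeeds at k = 4, so the far leader (10, 10) and its followers are skipped. The followers of the chosen leaders are then scored as usual.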
