Learning to rank with scikit-learn: the pairwise transform

http://fa.bianp.net/blog/2012/learning-to-rank-with-scikit-learn-the-pairwise-transform/

tags: pythonscikit-learnranking

This tutorial introduces the concept of pairwise preference used in most ranking problems. I'll use scikit-learn and for learning and matplotlib for visualization.

In the ranking setting, training data consists of lists of items with some order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") for each item, so that for any two samples a and b, either a < bb > a or band a are not comparable.

For example, in the case of a search engine, our dataset consists of results that belong to different queries and we would like to only compare the relevance for results coming from the same query.

This order relation is usually domain-specific. For instance, in information retrieval the set of comparable samples is referred to as a "query id". The goal behind this is to compare only documents that belong to the same query (Joachims 2002). In medical imaging on the other hand, the order of the labels usually depend on the subject so the comparable samples is given by the different subjects in the study (Pedregosa et al 2012).

import itertools
import numpy as np
from scipy import stats
import pylab as pl
from sklearn import svm, linear_model, cross_validation

To start with, we'll create a dataset in which the target values consists of three graded measurements Y = {0, 1, 2} and the input data is a collection of 30 samples, each one with two features.

The set of comparable elements (queries in information retrieval) will consist of two equally sized blocks, X=X1∪X2, where each block is generated using a normal distribution with different mean and covariance. In the pictures, we represent X1 with round markers and X2 with triangular markers.

np.random.seed(0)
theta = np.deg2rad(60)
w = np.array([np.sin(theta), np.cos(theta)])
K = 20
X = np.random.randn(K, 2)
y = [0] * K
for i in range(1, 3):
X = np.concatenate((X, np.random.randn(K, 2) + i * 4 * w))
y = np.concatenate((y, [i] * K)) # slightly displace data corresponding to our second partition
X[::2] -= np.array([3, 7])
blocks = np.array([0, 1] * (X.shape[0] / 2)) # split into train and test set
cv = cross_validation.StratifiedShuffleSplit(y, test_size=.5)
train, test = iter(cv).next()
X_train, y_train, b_train = X[train], y[train], blocks[train]
X_test, y_test, b_test = X[test], y[test], blocks[test] # plot the result
idx = (b_train == 0)
pl.scatter(X_train[idx, 0], X_train[idx, 1], c=y_train[idx],
marker='^', cmap=pl.cm.Blues, s=100)
pl.scatter(X_train[~idx, 0], X_train[~idx, 1], c=y_train[~idx],
marker='o', cmap=pl.cm.Blues, s=100)
pl.arrow(0, 0, 8 * w[0], 8 * w[1], fc='gray', ec='gray',
head_width=0.5, head_length=0.5)
pl.text(0, 1, '$w$', fontsize=20)
pl.arrow(-3, -8, 8 * w[0], 8 * w[1], fc='gray', ec='gray',
head_width=0.5, head_length=0.5)
pl.text(-2.6, -7, '$w$', fontsize=20)
pl.axis('equal')
pl.show()

In the plot we clearly see that for both blocks there's a common vector w such that the projection onto w gives a list with the correct ordering.

However, because linear considers that output labels live in a metric space it will consider that all pairs are comparable. Thus if we fit this model to the problem above it will fit both blocks at the same time, yielding a result that is clearly not optimal. In the following plot we estimate w^ using an l2-regularized linear model.

ridge = linear_model.Ridge(1.)
ridge.fit(X_train, y_train)
coef = ridge.coef_ / linalg.norm(ridge.coef_)
pl.scatter(X_train[idx, 0], X_train[idx, 1], c=y_train[idx],
marker='^', cmap=pl.cm.Blues, s=100)
pl.scatter(X_train[~idx, 0], X_train[~idx, 1], c=y_train[~idx],
marker='o', cmap=pl.cm.Blues, s=100)
pl.arrow(0, 0, 7 * coef[0], 7 * coef[1], fc='gray', ec='gray',
head_width=0.5, head_length=0.5)
pl.text(2, 0, '$\hat{w}$', fontsize=20)
pl.axis('equal')
pl.title('Estimation by Ridge regression')
pl.show()

To assess the quality of our model we need to define a ranking score. Since we are interesting in a model that ordersthe data, it is natural to look for a metric that compares the ordering of our model to the given ordering. For this, we use Kendall's tau correlation coefficient, which is defined as (P - Q)/(P + Q), being P the number of concordant pairs and Q is the number of discordant pairs. This measure is used extensively in the ranking literature (e.g Optimizing Search Engines using Clickthrough Data).

We thus evaluate this metric on the test set for each block separately.

for i in range(2):
tau, _ = stats.kendalltau(
ridge.predict(X_test[b_test == i]), y_test[b_test == i])
print('Kendall correlation coefficient for block %s: %.5f' % (i, tau))
Kendall correlation coefficient for block 0: 0.71122
Kendall correlation coefficient for block 1: 0.84387

The pairwise transform

As proved in (Herbrich 1999), if we consider linear ranking functions, the ranking problem can be transformed into a two-class classification problem. For this, we form the difference of all comparable elements such that our data is transformed into (x′k,y′k)=(xi−xj,sign(yi−yj)) for all comparable pairs.

This way we transformed our ranking problem into a two-class classification problem. The following plot shows this transformed dataset, and color reflects the difference in labels, and our task is to separate positive samples from negative ones. The hyperplane {x^T w = 0} separates these two classes.

# form all pairwise combinations
comb = itertools.combinations(range(X_train.shape[0]), 2)
k = 0
Xp, yp, diff = [], [], []
for (i, j) in comb:
if y_train[i] == y_train[j] \
or blocks[train][i] != blocks[train][j]:
# skip if same target or different group
continue
Xp.append(X_train[i] - X_train[j])
diff.append(y_train[i] - y_train[j])
yp.append(np.sign(diff[-1]))
# output balanced classes
if yp[-1] != (-1) ** k:
yp[-1] *= -1
Xp[-1] *= -1
diff[-1] *= -1
k += 1
Xp, yp, diff = map(np.asanyarray, (Xp, yp, diff))
pl.scatter(Xp[:, 0], Xp[:, 1], c=diff, s=60, marker='o', cmap=pl.cm.Blues)
x_space = np.linspace(-10, 10)
pl.plot(x_space * w[1], - x_space * w[0], color='gray')
pl.text(3, -4, '$\{x^T w = 0\}$', fontsize=17)
pl.axis('equal')
pl.show()

As we see in the previous plot, this classification is separable. This will not always be the case, however, in our training set there are no order inversions, thus the respective classification problem is separable.

We will now finally train an Support Vector Machine model on the transformed data. This model is known as RankSVM. We will then plot the training data together with the estimated coefficient w^ by RankSVM.

clf = svm.SVC(kernel='linear', C=.1)
clf.fit(Xp, yp)
coef = clf.coef_.ravel() / linalg.norm(clf.coef_)
pl.scatter(X_train[idx, 0], X_train[idx, 1], c=y_train[idx],
marker='^', cmap=pl.cm.Blues, s=100)
pl.scatter(X_train[~idx, 0], X_train[~idx, 1], c=y_train[~idx],
marker='o', cmap=pl.cm.Blues, s=100)
pl.arrow(0, 0, 7 * coef[0], 7 * coef[1], fc='gray', ec='gray',
head_width=0.5, head_length=0.5)
pl.arrow(-3, -8, 7 * coef[0], 7 * coef[1], fc='gray', ec='gray',
head_width=0.5, head_length=0.5)
pl.text(1, .7, '$\hat{w}$', fontsize=20)
pl.text(-2.6, -7, '$\hat{w}$', fontsize=20)
pl.axis('equal')
pl.show()

Finally we will check that as expected, the ranking score (Kendall tau) increases with the RankSVM model respect to linear regression.

for i in range(2):
tau, _ = stats.kendalltau(
np.dot(X_test[b_test == i], coef), y_test[b_test == i])
print('Kendall correlation coefficient for block %s: %.5f' % (i, tau))
Kendall correlation coefficient for block 0: 0.83627
Kendall correlation coefficient for block 1: 0.84387

This is indeed higher than the values (0.71122, 0.84387) obtained in the case of linear regression.

Original ipython notebook for this blog post can be found here


  1. "Large Margin Rank Boundaries for Ordinal Regression", R. Herbrich, T. Graepel, and K. Obermayer. Advances in Large Margin Classifiers, 115-132, Liu Press, 2000 

  2. "Optimizing Search Engines Using Clickthrough Data", T. Joachims. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002. 

  3. "Learning to rank from medical imaging data", Pedregosa et al. [arXiv

  4. "Efficient algorithms for ranking with SVMs", O. Chapelle and S. S. Keerthi, Information Retrieval Journal, Special Issue on Learning to Rank, 2009 

Comments !

转:pairwise 代码参考的更多相关文章

  1. Session id实现通过Cookie来传输方法及代码参考

    1. Web中的Session指的就是用户在浏览某个网站时,从进入网站到浏览器关闭所经过的这段时间,也就是用户浏览这个网站所花费的时间.因此从上述的定义中我们可以看到,Session实际上是一个特定的 ...

  2. Jquery 代码参考

    jquery 代码参考 jQuery(document).ready(function($){}); jQuery(window).on('load', function(){}); $('.vide ...

  3. php 修改后端代码参考

    后端代码参考:

  4. 【原创】C#模拟Post请求,正文为json数据的代码参考

    由于之前一直在做键值对post数据的提交,没遇到过json正文的提交,遇到的问题截图: 对于此种情况的post,我用 谷歌插件 PostMan 模拟试了下成功了,截图如下: Postman插件在你选择 ...

  5. 公共代码参考(Volley)

    Volley 是google提供的一个网络库,相对于自己写httpclient确实方便很多,本文参考部分网上例子整理如下,以作备忘: 定义一个缓存类: public class BitmapCache ...

  6. 固定表头/锁定前几列的代码参考[JS篇]

    引语:做有难度的事情,才是成长最快的时候.前段时间,接了一个公司的稍微大点的项目,急着赶进度,本人又没有独立带过队,因此,把自己给搞懵逼了.总是没有多余的时间来做自己想做的事,而且,经常把工作带入生活 ...

  7. C语言实现冒泡排序法和选择排序法代码参考

    为了易用,我编写排序函数,这和直接在主调函数中用是差不多的. 我认为选择排序法更好理解!请注意 i 和 j ,在写代码时别弄错了,不然很难找到错误! 冒泡排序法 void sort(int * ar, ...

  8. .OpenWrt驱动程序Makefile的分析概述 、驱动程序代码参考、以及测试程序代码参考

    # # # include $(TOPDIR)/rules.mk //一般在 Makefile 的开头 include $(INCLUDE_DIR)/kernel.mk // 文件对于 软件包为内核时 ...

  9. Flex组件参考 代码参考汇总

    1:tourdeflex快速熟悉各种组件用法的参考http://www.adobe.com/devnet/flex/tourdeflex.html在线:http://www.adobe.com/dev ...

随机推荐

  1. CF 96 D. Volleyball

    D. Volleyball http://codeforces.com/contest/96/problem/D 题意: n个路口,m条双向路,每条长度为w.每个路口有一个出租车司机,最多可以乘坐这辆 ...

  2. Nginx入门篇(二)之Nginx部署与配置文件解析

    一.Nginx编译安装 ()查看系统环境 [root@localhost tools]# cat /etc/redhat-release CentOS Linux release (Core) [ro ...

  3. Vue 数组封装和组件data定义为函数一些猜测

     数组封装 var vm={ list:[0,1] } var push=vm.list.push;//把数组原来的方法存起来 vm.list.push=function(arg){//重新定义数组的 ...

  4. 八、EnterpriseFrameWork框架基础功能之自定义报表

    本章写关于框架中的“自定义报表”,类似上章“字典管理”也是三部分功能组成,包括配置报表.对报表按角色授权.查看报表:其核心思想就是实现新增一个报表而不用修改程序代码.不用升级,只需要编写一个存储过程, ...

  5. 进阶篇:4.1)DFA设计指南:简化产品设计(kiss原则)

    本章目的:理解kiss原则,明确如何简化产品的设计. 1.前言:kiss原则,优化产品的第一原则 如果要作者选出一个优化产品的最好方法,那一定是kiss原则莫属.从产品的整体设计到公差的分析,kiss ...

  6. 第七篇 Postman+Node.js+Newman+Jenkins实现自动化测试

    今天终于不咋忙了,学习整理一下一直想做却没实现的事儿,这事已经折磨团队半年之久了.因为项目是B端业务的测试,测试过程中需要生产大量的测试数据,而且都是跨多个系统的测试,对于后置流程的测试,这些同学往往 ...

  7. Android错误:can not get file data of lua/start_v2.op [LUA ERROR] [string "require "lua/start_v2””] 已解决

    错误: can not get file data of lua/start_v2.op [LUA ERROR] [string "require "lua/start_v2””] ...

  8. oracle数据库之组函数

    组函数也叫聚合函数,用来对一组值进行运算,并且可以返回单个值 常见的组函数: (1)count(*),count(列名)  统计行数:找到所有不为 null 的数据来统计行数 (2)avg(列名)  ...

  9. Oracle同义词和序列

    同义词:是表.索引.视图的模式对象的一个别名,通过模式对象创建同意词,可以隐藏对象的实际名称和 所有者信息,为对象提供一定的安全性,开发应用程序时:应该尽量避免直接使用表,视图 或其他对象,改用对象的 ...

  10. 生成dataset的几种方式

    1.常用的方式通过sparksession读取外部文件或者数据生成dataset(这里就不讲了)  注: 生成Row对象的方法提一下:RowFactory.create(x,y,z),取Row中的数据 ...