Kernelized Locality-Sensitive Hashing Page

Brian Kulis (1) and Kristen Grauman (2)
(1) UC Berkeley EECS and ICSI, Berkeley, CA
(2) University of Texas, Department of Computer Sciences, Austin, TX
Introduction
Fast indexing and search for large databases is critical to content-based image and video retrieval---particularly given the ever-increasing availability of visual data in a variety of interesting domains, such as scientific image data, community photo collections on the Web, news photo collections, or surveillance archives. The most basic but essential task in image search is the "nearest neighbor" problem: to take a query image and accurately find the examples that are most similar to it within a large database. A naive solution entails searching over all n database items and sorting them according to their similarity to the query, but this becomes prohibitively expensive when n is large or when the individual similarity evaluations are expensive to compute. For vision applications, this complexity is amplified by the fact that the most effective representations are often high-dimensional or structured, and the best known distance functions can require considerable computation to compare even a single pair of objects.
To make large-scale search practical, vision researchers have recently explored approximate similarity search techniques, most notably locality-sensitive hashing (Indyk and Motwani 1998, Charikar 2002), where a predictable degree of accuracy is sacrificed in order to allow fast queries even for high-dimensional inputs. In spite of hashing's success for visual similarity search tasks, existing techniques have some important restrictions. Current methods generally assume that the data to be hashed come from a multidimensional vector space, and require that the underlying embedding of the data be explicitly known and computable. For example, LSH relies on random projections of the input vectors; spectral hashing (Weiss et al., NIPS 2008) assumes vectors drawn from a known probability distribution.
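For concreteness, here is a minimal sketch of the standard random-hyperplane hash of Charikar (2002), the scheme that KLSH later kernelizes. It assumes explicit d-dimensional input vectors and is illustrative code only, not the authors' implementation.

```python
import numpy as np

def random_hyperplane_codes(X, num_bits, seed=0):
    """Standard random-hyperplane LSH for explicit vectors (Charikar 2002).

    Each bit is sign(r . x) for a random direction r ~ N(0, I); two vectors
    agree on a bit with probability 1 - theta(x, y) / pi, so Hamming distance
    between codes approximates the angle between the original vectors.
    """
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], num_bits))  # one Gaussian direction per bit
    return (X @ R > 0).astype(np.uint8)              # n x num_bits binary codes
```

KLSH's goal is to reproduce this construction when only a kernel function, and not the vectors themselves, is available.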
This is a problematic limitation, given that many recent successful vision results employ kernel functions for which the underlying embedding is known only implicitly (i.e., only the kernel function is computable). It has thus far been impossible to apply LSH and its variants to search data with a number of powerful kernels---including many kernels designed specifically for image comparisons, as well as some basic, widely used functions like the Gaussian RBF. Further, since visual representations are often most naturally encoded with structured inputs (e.g., sets, graphs, trees), the lack of fast search methods with performance guarantees for flexible kernels is inconvenient.
In this work, we present an LSH-based technique for performing fast similarity searches over arbitrary kernel functions. The problem is as follows: given a kernel function and a database of n objects, how can we quickly find the most similar item to a query object in terms of the kernel function? Like standard LSH, our hash functions involve computing random projections; however, unlike standard LSH, these random projections are constructed using only the kernel function and a sparse set of examples from the database itself. Our main technical contribution is to formulate the random projections necessary for LSH in kernel space. Our construction relies on an appropriate use of the central limit theorem, which allows us to approximate a random vector using items from our database. The resulting scheme, which we call kernelized LSH (KLSH), generalizes LSH to scenarios when the feature space embeddings are either unknown or incomputable.
Method
The main idea behind our approach is to construct a random hyperplane hash function, as in standard LSH, but to perform the computations purely in kernel space. The construction is based on the central limit theorem, which lets us compute an approximate random vector using items from the database. The central limit theorem states that, under very mild conditions, the mean of a set of objects drawn from some underlying distribution will be Gaussian distributed in the limit as more objects are included in the set. Since LSH requires a random vector from a particular Gaussian distribution---a zero-mean, identity-covariance Gaussian---we can use the central limit theorem, along with an appropriate mean-shift and whitening, to form an approximate random vector from a zero-mean, identity-covariance Gaussian. By performing this construction appropriately, the algorithm can be carried out entirely in kernel space, and can also be applied efficiently over very large data sets.
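The following is a minimal sketch of this construction under our reading of the method described above; it is not the authors' released MATLAB code. It assumes a user-supplied `kernel(x, y)` function, and `anchors` denotes the p database items sampled to estimate the kernel-space mean and covariance; each hash direction averages t of them and is whitened with the (pseudo-)inverse square root of the centered kernel matrix.

```python
import numpy as np

def klsh_hash_weights(anchors, kernel, num_bits, t=30, seed=0):
    """Form KLSH hash weights from p sampled database items ("anchors").

    Returns W (num_bits x p); the bit for an item x is then
    sign(sum_i W[b, i] * kernel(x, anchors[i])).  Centering the kernel matrix
    plays the role of the mean-shift, and K^{-1/2} plays the role of whitening.
    """
    rng = np.random.default_rng(seed)
    p = len(anchors)
    K = np.array([[kernel(a, b) for b in anchors] for a in anchors])
    H = np.eye(p) - np.full((p, p), 1.0 / p)
    Kc = H @ K @ H                                   # centered kernel matrix
    vals, vecs = np.linalg.eigh(Kc)
    inv_sqrt = np.where(vals > 1e-8, 1.0 / np.sqrt(np.clip(vals, 1e-8, None)), 0.0)
    K_inv_sqrt = vecs @ np.diag(inv_sqrt) @ vecs.T   # pseudo-inverse square root
    W = np.zeros((num_bits, p))
    for b in range(num_bits):                        # average t anchors, then whiten
        e_s = np.zeros(p)
        e_s[rng.choice(p, size=t, replace=False)] = 1.0
        W[b] = K_inv_sqrt @ e_s
    return W

def klsh_code(x, anchors, kernel, W):
    """Hash a single item x into its KLSH binary code."""
    k_x = np.array([kernel(x, a) for a in anchors])
    return (W @ k_x > 0).astype(np.uint8)
```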
Once the hash functions have been computed, we use standard LSH techniques to retrieve nearest neighbors of a query from the database in sublinear time. In particular, we employ the method of Charikar to obtain a small set of candidate approximate nearest neighbors, and these candidates are then sorted by the kernel function to yield the final list of hashed nearest neighbors.
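Continuing the sketch above, a hedged illustration of the query step: compute the query's code, take the database items whose codes are closest in Hamming distance as candidates, and re-rank them with the kernel. A brute-force Hamming comparison is shown for clarity; a real implementation would use hash tables or Charikar's permutation-based Hamming search so that this step is sublinear.

```python
import numpy as np
# uses klsh_code from the sketch above

def klsh_query(q, database, db_codes, anchors, kernel, W, num_candidates=100):
    """Approximate nearest neighbors of q under the kernel, via KLSH codes.

    db_codes holds the precomputed code (from klsh_code) of every database item.
    """
    q_code = klsh_code(q, anchors, kernel, W)
    hamming = (db_codes != q_code).sum(axis=1)         # distance to every stored code
    candidates = np.argsort(hamming)[:num_candidates]  # small candidate set
    return sorted(candidates, key=lambda i: -kernel(q, database[i]))  # kernel re-rank
```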
There are some limitations to the method. The random vector constructed by the KLSH routine is only approximately random; general bounds on the central limit theorem are unknown, so it is not clear how many database objects are required to get a sufficiently random vector for hashing. Further, we implicitly assume that the objects from the database selected to form the random vectors span the subspace from which the queries are drawn. That said, in practice the method is robust to the number of database objects chosen for the construction of the random vectors, and behaves comparably to standard LSH on non-kernelized data.
Experimental Results
80 Million Tiny Images. We ran KLSH over the 80 million images in the Tiny Images data set. We used the GIST features extracted from these images, and performed nearest neighbor search on top of a Gaussian kernel.
The top left image in each set is the query. The remainder of the top row shows the top nearest neighbors found by a linear scan (with the Gaussian kernel), and the second row shows the nearest neighbors found by KLSH. Note that, on this data set, the hashing technique searched less than 1 percent of the database, and nearest neighbors were retrieved in approximately 0.57 seconds (versus 45 seconds for a linear scan). Typically the hashing results appear qualitatively similar to (or match exactly) the linear scan results.
The accompanying plot quantifies how the nearest neighbors returned by KLSH compare to those from a linear scan: for 10, 20, and 30 hashing nearest neighbors, it shows how many linear scan nearest neighbors are required to cover the hashing nearest neighbors.
Flickr Scene Recognition. We performed a similar experiment with a collection of Flickr images containing tourist photos of a set of landmarks. Here, we applied a chi-squared kernel on top of SIFT features for the nearest neighbor search. Note that these results did not appear in the conference paper.
We can also measure how the accuracy of a k-nearest-neighbor classifier built on KLSH approaches the accuracy of a linear scan k-NN classifier on this data set. The accompanying plot shows that, as epsilon decreases, the hashing accuracy approaches the linear scan accuracy.
Object Recognition on Caltech-101. We applied our method to the Caltech-101 data set for object recognition, since several recently proposed kernel functions for images have shown very good performance for object recognition but have unknown or very complex feature embeddings. This data set also allowed us to test how changes in parameters affect hashing results.
The parameters p (the number of database items sampled to form the kernel matrix) and t (the number of sampled items averaged to form each hash function), as well as the number of hash bits, affect hashing accuracy only marginally. The main parameter of interest is epsilon, a parameter from standard LSH that trades off speed for accuracy.
Local Patch Indexing with the Photo Tourism Data Set. Finally, we applied KLSH over a data set of 100,000 image patches from the Photo Tourism data set. We compared a standard Euclidean distance function (linear scan and hashing) with a learned kernel (linear scan and hashing). The particular learned kernel we used has no simple, explicit feature embedding (see the paper for details), but its linear scan retrieval results are significantly better than the baseline Euclidean distance, providing another example where KLSH is useful for retrieval. The results indicate that the hashing schemes do not degrade retrieval performance considerably on this data.
Summary. We have shown that hashing can be performed over arbitrary kernels to yield significant speed-ups for similarity search with little loss in accuracy. In experiments, we have applied KLSH over several kernels and several domains (illustrative definitions of two of these kernels follow the list):
- Gaussian kernel (Tiny Images)
- Chi-squared kernel (Flickr)
- Correspondence kernel (Caltech-101)
- Learned kernel (Photo Tourism)
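For illustration, here are hedged definitions of two of the kernels listed above, written so they could be passed as the `kernel` argument of the earlier sketches. The bandwidth `gamma` is a placeholder rather than a value used in the experiments, and the exponentiated form of the chi-squared kernel is one common variant; the page does not specify which form was used.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel, e.g. over GIST descriptors (gamma is illustrative)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def chi_squared_kernel(x, y, gamma=1.0):
    """Exponentiated chi-squared kernel over non-negative histogram features."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    chi2 = np.sum((x - y) ** 2 / (x + y + 1e-10))
    return float(np.exp(-gamma * chi2))
```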
Code
The code is available here. NOTE: the code was updated July 5, 2010 and September 23, 2010 to correct bugs in createHashTable.m. Please use the most recent version.
Paper
- Kernelized Locality-Sensitive Hashing for Scalable Image Search
Brian Kulis & Kristen Grauman
In Proc. 12th International Conference on Computer Vision (ICCV), 2009. [pdf]

Also see the following related papers, which apply LSH to learned Mahalanobis metrics:

- Fast Similarity Search for Learned Metrics
Brian Kulis, Prateek Jain, & Kristen Grauman
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2143--2157, 2009. [pdf]
- Fast Image Search for Learned Metrics
Prateek Jain, Brian Kulis, & Kristen Grauman
In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008. [pdf]

Source: http://web.cse.ohio-state.edu/~kulis/klsh/klsh.htm