Searching for Approximate Nearest Neighbours

Nearest neighbour search is a common task: given a query object represented as a point in some (often high-dimensional) space, we want to find other objects in that space that lie close to it. For example, a mapping application will perform a nearest neighbours search when we ask it for restaurants close to our location.

Nearest neighbour search at Lyst

Nearest neighbour search underpins two crucial systems at Lyst.

  1. Product de-duplication. Because we aggregate products from a vast range of different retailers, often the same product sold by two retailers will be described by different metadata. To identify these cases, we describe our images with a set of numerical features derived using the BRISK algorithm. This makes our images points in a high-dimensional space, and we use nearest neighbour search to perform de-duplication.
  2. Related products. Our product pages feature a set of related products that the user might also be interested in. For this task, each product is again represented as a vector (a point in a high-dimensional space) derived from a matrix factorisationrecommendation model. When a user visits a product page, we perform real-time nearest neighbours search in that space to find suitable related products.

The challenge with making these systems work is scaling them to the size of our product catalogue. To serve related products, we need to perform a NN search over 8 million products in under 100 ms; to perform de-duplication, we need to search over 80 million images in a under half a second.

Within these constraints, it is impossible to perform exhaustive NN search --- taking the query point and computing its distance to every other point. We therefore use approximate nearest neighbour (ANN) search: algorithms and data structures that allow us to trade off a small amount of accuracy for a massive boost in speed.

ANN via Random Projection Forests

The essence of approximate nearest neighbour search consists in roughly dividing the search space into a number of buckets that contain points that are close to each other, and then only looking within a given bucket when performing a search. This gives us speed (we only have to scan the contents of a given bucket) at the expense of accuracy (it's possible for a point's nearest neighbours to lie in a different bucket).

This is quite similar to building a hash table (dictionary, associative array etc.), but instead of using a uniform hash function, we use a special function that hashes points close to each other to the same hash code (hence, locality sensitive hashing).

To do this, we use forests of Random Projection trees. This is roughly how it works.

BUILDING AN ANN SEARCH TREE

We start with the entire set of points (blue) and try to recursively slice it into smaller and smaller buckets that contain only similar points. In this example, we'll pick a query point (red) and see how we can narrow down its approximate nearest neighbours. Note that here we are interested in cosine similarity rather than Euclidean distance (the angle between two points versus the length of the line that connects them).

In the first image, we have all of our candidate points and the query point. At this stage, if we wanted to run our query, we'd have to calculate the similarity of the red point with all the blue points --- and that would be too slow.

What we do instead is draw a random vector pointing out from the origin (the red arrow). It's clear that some of the blue points point in the same direction as the arrow (to the right of the red line), and some point away from it (to the left of it). That's our first split: all the dark red points are assigned to one bucket, and the blue points to the other.

Mathematically, we take the dot product of the random vector and our points: if it is greater than zero, we assign the points to the left subtree; if it is smaller, we assign them to the right subtree.

At this stage, each bucket still contains too many points, and so we continue the process by drawing another random vector, and doing the split again. After these two splits, only the dark red points are in the same bucket as the query point.

We continue the splits until we are satisfied with the resulting bucket size. In the last image, only 7 our of the initial 100 points are in the ANN bucket, giving us (in theory) a 10x speed-up when querying.

The resulting data structure is a binary tree: at each internal node, we split our set of points into two. The leaf nodes contain the points themselves.

QUERYING

Once the tree is built, querying it is very straightforward. If we query for the NNs of a point in the tree, we simply look up its bucket and perform brute force search only within that bucket.

If we query for a new point, we first need to traverse the tree to find the appropriate leaf node. We recursively take the dot product of the query point with the internal node vectors, moving down the correct subtree at every split until we hit a leaf node. In this example, with a tree depth of 4 and 7 points in the leaf node, we would perform 11 distance calculations: far fewer than the 99 we would have to do in a brute force search.

BUILDING A FOREST OF TREES

So far, we have built only on tree. In practice, we build many such trees --- a random projection forest. Because we are using a probabilistic algorithm, it is likely, but not guaranteed, that a leaf node will contain a query point's nearest neighbours. In fact, if we look at the first split in the example above, we can see that there are some points immediately to the left of the query point that fall on the other side of the partition. If we built only one tree, these points would be (erroneously) never retrieved. We build many trees to make this occurence less likely, trading off query time for retrieval accuracy.

ANN search in Python

With theory out of the way, on to the important question: what can we use to do this in Python?

There are a number of packages that implement approximate nearest neighbour search.

  1. LSHForest, easy to obtain as part of scikit-learn, supports indexing sparse vectors.
  2. panns, supports both Euclidean and angular distance, small index file size.
  3. annoy by Erik Bernhardsson, a Python wrapper of C++ code, very fast.

The advantage of the first two lies in their accessibility: they are implemented in pure Python, and LSHForest is built into scikit-learn. Unfortunately, they seem to be quite slow according to the ANN performance shootout maintained by Erik, the author of annoy.

Annoy itself is very fast and pleasant to use. However, it does not support indexing new points into an existing data structure, and has to keep vectors for all indexed points in memory (or in a memmapped file). For our problems, we found it useful to construct lightweight ANN structures that act as indexes into an external database: we obtain row ids from the index, but perform data retrieval and final scoring using a separate service.

To make this possible, we have released our own Python implementation of Random Projection Forests: rpforest.

RPForest

RPForest is a Python package for approximate nearest neighbours search, with performance critical parts written in Cython. Install it from pip using pip install rpforest. You'll need to install numpy first and have a C++ compiler.

Using it is straightforward. To fit the model, run:

from rpforest import RPForest

model = RPForest(leaf_size=50, no_trees=10)
model.fit(X)

The speed-precision tradeoff is governed by the leaf_size and no_trees parameters. Increasing leaf_size leads the model to produce shallower trees with larger leaf nodes; increasing no_trees fits more trees.

RPForest supports in-memory ANN queries. After fitting, ANNs can be obtained by calling:

nns = model.query(x_query, 10)

It also supports indexing and candidate ANN queries on datasets larger than would fit in available memory. This is accomplished by first fitting the model on a subset of the data, then indexing a larger set of data into the fitted model:

model = RPForest(leaf_size=50, no_trees=10)
model.fit(X_train) model.clear() # Deletes X_train vectors for point_id, x in get_x_vectors():
model.index(point_id, x) nns = model.get_candidates(x_query, 10)

While not as fast as annoy, RPForest handily beats LSHForest and panns in the ANN performance shootout:

Contributing

We wrote RPForest to provide the functionality we needed, and we hope it's useful for you too. Please help us improve it --- all issues and pull requests are welcome on the RPForestGithub page.

Searching for Approximate Nearest Neighbours的更多相关文章

  1. 快速近似最近邻搜索库 FLANN - Fast Library for Approximate Nearest Neighbors

    What is FLANN? FLANN is a library for performing fast approximate nearest neighbor searches in high ...

  2. Approximate Nearest Neighbors.接近最近邻搜索

    (一):次优最近邻:http://en.wikipedia.org/wiki/Nearest_neighbor_search 有少量修改:如有疑问,请看链接原文.....1.Survey:Neares ...

  3. ann搜索算法(Approximate Nearest Neighbor)

    ANN的方法分为三大类:基于树的方法.哈希方法.矢量量化方法.brute-force搜索的方式是在全空间进行搜索,为了加快查找的速度,几乎所有的ANN方法都是通过对全空间分割,将其分割成很多小的子空间 ...

  4. Computer Graphics Research Software

    Computer Graphics Research Software Helping you avoid re-inventing the wheel since 2009! Last update ...

  5. 机器学习资源汇总----来自于tensorflow中文社区

    新手入门完整教程进阶指南 API中文手册精华文章TF社区 INTRODUCTION 1. 新手入门 1.1. 介绍 1.2. 下载及安装 1.3. 基本用法 2. 完整教程 2.1. 总览 2.2.  ...

  6. Searching with Deep Learning 深度学习的搜索应用

    本文首发于 vivo 互联网技术微信公众号 https://mp.weixin.qq.com/s/wLMvJPXXaND9xq-XMwY2Mg作者:Eike Dehling翻译:杨振涛 本文由来自 T ...

  7. Dream team: Stacking for combining classifiers梦之队:组合分类器

     sklearn实战-乳腺癌细胞数据挖掘(博主亲自录制视频) https://study.163.com/course/introduction.htm?courseId=1005269003& ...

  8. {Reship}{Code}{CV}

    UIUC的Jia-Bin Huang同学收集了很多计算机视觉方面的代码,链接如下: https://netfiles.uiuc.edu/jbhuang1/www/resources/vision/in ...

  9. PRML读书笔记——2 Probability Distributions

    2.1. Binary Variables 1. Bernoulli distribution, p(x = 1|µ) = µ 2.Binomial distribution + 3.beta dis ...

随机推荐

  1. Bootloader简介

    来介绍一下Bootloader,在专用的嵌入式开发板上运行GNU/Linux 系统已经变得越来越流行.一个嵌入式Linux 系统从软件的角度看通常可以分为四个层次: 1.引导加载程序.包括固化在固件( ...

  2. 为Debain &&Centos安装dig

    Debain&Ubuntu sudo apt-get install dnsutils Fdeoar&Centos yum install bind-utils

  3. Ubuntu 安装php mcrypt

    sudo apt-get install php5-mcrypt libmcrypt4 libmcrypt-dev sudo php5enmod mcrypt sudo /etc/init.d/apa ...

  4. scala的REPL shell的调用

    最近突然对spark的spark-shell发生了兴趣 它是如何启动scala的REPL的,并且在此前写入了常用的环境变量的呢? 通过查看spark的源码,找到了SparkILoop.scala im ...

  5. 常用meta整理【转载】

    < meta > 元素 概要 标签提供关于HTML文档的元数据.元数据不会显示在页面上,但是对于机器是可读的.它可用于浏览器(如何显示内容或重新加载页面),搜索引擎(关键词),或其他 we ...

  6. Sqlserver基于流程控制

    流程控制语句只能在单个批处理段,用户自定义函数和存储过程中使用不能夸多个批处理或者用户自定义函数或者存储过程 批处理:一个或者多个语句组成的一个批处理,是因为所有语句一次性地被提交到一个sql实例,如 ...

  7. 未完待续的JAVA基础知识

    第二卷 1.每个JAVA程序必须有一个main函数,但并非是每个类都有,main函数必须声明为static函数. 2.println与print之间的区别是换行与不换行. 3.在JAVA中,不想C/C ...

  8. Decorator设计模式浅谈

    装饰类跟基础组件都实现了目标接口,是为了匹配正确的类型.Java中的IO设计就是典型的Decorator设计模式. 装饰模式产生的初衷是, 对默认实现类的行为进行扩展. 由于装饰类的构造器接受的参数是 ...

  9. Centos搭建PHP5.3.8+Nginx1.0.9+Mysql5.5.17

    操作环境 操作系统:Mac Os Lion 虚拟主机:VMware Fusion 虚拟系统:Centos 5.5+ 操作用户:Root 实现目的:搭建LNMP环境. 安装依赖库和开发环境 #依赖库和开 ...

  10. C#变量初始化问题:字段初始值无法引用非静态字段、方法或属性

    http://www.cnblogs.com/bluestorm/p/3432190.html 问题:字段初始值设定项无法引用非静态字段.方法或属性的问题 下面代码出错的原因,在类中定义的字段为什么不 ...