Hnswlib - fast approximate nearest neighbor search

Header-only C++ HNSW implementation with python bindings.

NEWS:

Hnswlib is now 0.5.2. Bugfixes - thanks @marekhanus for fixing the missing arguments, adding support for python 3.8, 3.9 in Travis, improving python wrapper and fixing typos/code style; @apoorv-sharma for fixing the bug in the insertion/deletion logic; @shengjun1985 for simplifying the memory reallocation logic; @TakaakiFuruse for improved description of add_items; @psobot for improving error handling; @ShuAiii for reporting the bug in the python interface.
Hnswlib is now 0.5.0. Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bug and documentation fixes. Many thanks to @dbespalov, @dyashuni, @groodt, @uestc-lfs, @vinnitu, @fabiencastan, @JinHai-CN, @js1010!
Thanks to Apoorv Sharma @apoorv-sharma, hnswlib now supports true element updates (the interface remained the same, but the performance/memory should not degrade as you update the element embeddings).
Thanks to Dmitry @2ooom, hnswlib got a boost in performance for vector dimensions that are not a multiple of 4.
Thanks to Louis Abraham (@louisabraham) hnswlib can now be installed via pip!

Highlights:

Lightweight, header-only, no dependencies other than C++ 11.
Interfaces for C++, python and R (https://github.com/jlmelville/rcpphnsw).
Has full support for incremental index construction. Has support for element deletions (currently, without actual freeing of the memory).
Can work with custom user defined distances (C++).
Significantly smaller memory footprint and faster build time compared to the current nmslib implementation.

Description of the algorithm parameters can be found in ALGO_PARAMS.md.

Python bindings

Supported distances:

Distance	Parameter	Equation
Squared L2	'l2'	d = sum((Ai-Bi)^2)
Inner product	'ip'	d = 1.0 - sum(Ai*Bi)
Cosine similarity	'cosine'	d = 1.0 - sum(Ai*Bi) / sqrt(sum(Ai*Ai) * sum(Bi*Bi))

Note that inner product is not an actual metric. An element can be closer to some other element than to itself. That allows some speedup if you remove all elements that are not the closest to themselves from the index.
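
The three formulas above, and the remark that inner product is not a true metric, can be checked with a few lines of plain Python. This is only an illustrative sketch of the equations in the table, not the library's implementation; the function names are made up for the example.

```python
import math

def l2_squared(a, b):
    """Squared L2 distance: sum((Ai - Bi)^2)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product_distance(a, b):
    """Inner product 'distance': 1.0 - sum(Ai * Bi)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """Cosine distance: 1.0 - sum(Ai*Bi) / sqrt(sum(Ai*Ai) * sum(Bi*Bi))."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return 1.0 - dot / norm

a = [1.0, 0.0]
b = [3.0, 0.0]

# Under 'ip', a is "closer" to b than to itself, because b has a larger norm.
d_self = inner_product_distance(a, a)
d_other = inner_product_distance(a, b)
print("ip distance a->a:", d_self, " a->b:", d_other)

# The other two spaces do not show this effect for the same pair.
print("l2 distance a->b:", l2_squared(a, b))
print("cosine distance a->b:", cosine_distance(a, b))
```

Here a and b point in the same direction, so their cosine distance is zero, while the inner product distance rewards the longer vector b.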

For other spaces use the nmslib library https://github.com/nmslib/nmslib.

Short API description

hnswlib.Index(space, dim) creates a non-initialized HNSW index in the space named by space, with integer dimension dim.

hnswlib.Index methods:

init_index(max_elements, M = 16, ef_construction = 200, random_seed = 100) initializes the index with no elements.
- max_elements defines the maximum number of elements that can be stored in the structure (can be increased/shrunk).
- ef_construction defines a construction time/accuracy trade-off (see ALGO_PARAMS.md).
- M defines the maximum number of outgoing connections in the graph (see ALGO_PARAMS.md).
add_items(data, ids, num_threads = -1) - inserts the data (numpy array of vectors, shape:N*dim) into the structure.
- num_threads sets the number of cpu threads to use (-1 means use default).
- ids is an optional N-size numpy array of integer labels for all elements in data.
  - If the index already has elements with the same labels, their features will be updated. Note that the update procedure is slower than insertion of a new element, but more memory- and query-efficient.
- Thread-safe with other add_items calls, but not with knn_query.
mark_deleted(label) - marks the element as deleted, so it will be omitted from search results.
resize_index(new_size) - changes the maximum capacity of the index. Not thread-safe with add_items and knn_query.
set_ef(ef) - sets the query time accuracy/speed trade-off, defined by the ef parameter (see ALGO_PARAMS.md). Note that the parameter is currently not saved along with the index, so you need to set it manually after loading.
knn_query(data, k = 1, num_threads = -1) - makes a batch query for the k closest elements for each element of data (shape:N*dim). Returns two numpy arrays, labels and distances (each of shape:N*k).
- num_threads sets the number of cpu threads to use (-1 means use default).
- Thread-safe with other knn_query calls, but not with add_items.
load_index(path_to_index, max_elements = 0) loads the index from persistence to the uninitialized index.
- max_elements(optional) resets the maximum number of elements in the structure.
save_index(path_to_index) saves the index to persistence.
set_num_threads(num_threads) sets the default number of cpu threads used during data insertion/querying.
get_items(ids) - returns a numpy array (shape:N*dim) of vectors that have integer identifiers specified in ids numpy vector (shape:N). Note that for cosine similarity it currently returns normalized vectors.
get_ids_list() - returns a list of all elements' ids.
get_max_elements() - returns the current capacity of the index.
get_current_count() - returns the current number of elements stored in the index.
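
The note that get_items returns normalized vectors for the cosine space is easier to interpret with one identity in mind: the cosine distance between two raw vectors equals the inner product distance between their unit-length versions. The plain-Python sketch below only verifies that identity; it makes no claim about how the library stores vectors internally, and the helper names are hypothetical.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner_product_distance(a, b):
    """1.0 - sum(Ai * Bi), as in the 'ip' row of the distance table."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1.0 - sum(Ai*Bi) / sqrt(sum(Ai*Ai) * sum(Bi*Bi)), as in the 'cosine' row."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

a = [3.0, 4.0]
b = [1.0, 2.0]
na, nb = normalize(a), normalize(b)

raw_cosine = cosine_distance(a, b)
normalized_ip = inner_product_distance(na, nb)
print("normalized a:", na)
print("cosine on raw vectors:", raw_cosine)
print("ip on normalized vectors:", normalized_ip)
```

If you need the original magnitudes back after inserting into a cosine index, keep your own copy of the raw vectors.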

Read-only properties of hnswlib.Index class:

space - name of the space (can be one of "l2", "ip", or "cosine").
dim - dimensionality of the space.
M - parameter that defines the maximum number of outgoing connections in the graph.
ef_construction - parameter that controls speed/accuracy trade-off during the index construction.
max_elements - current capacity of the index. Equivalent to p.get_max_elements().
element_count - number of items in the index. Equivalent to p.get_current_count().

Properties of hnswlib.Index that support reading and writing:

ef - parameter controlling query time/accuracy trade-off.
num_threads - default number of threads to use in add_items or knn_query. Note that calling p.set_num_threads(3) is equivalent to p.num_threads=3.

Python bindings examples

import hnswlib
import numpy as np
import pickle

dim = 128
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
ids = np.arange(num_elements)

# Declaring index
p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip

# Initializing index - the maximum number of elements should be known beforehand
p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)

# Element insertion (can be called several times):
p.add_items(data, ids)

# Controlling the recall by setting ef:
p.set_ef(50) # ef should always be > k

# Query dataset, k - number of closest elements (returns 2 numpy arrays)
labels, distances = p.knn_query(data, k = 1)

# Index objects support pickling
# WARNING: serialization via pickle.dumps(p) or p.__getstate__() is NOT thread-safe with p.add_items method!
# Note: ef parameter is included in serialization; random number generator is initialized with random_seed on Index load
p_copy = pickle.loads(pickle.dumps(p)) # creates a copy of index p using pickle round-trip

# Index parameters are exposed as class properties:
print(f"Parameters passed to constructor:  space={p_copy.space}, dim={p_copy.dim}")
print(f"Index construction: M={p_copy.M}, ef_construction={p_copy.ef_construction}")
print(f"Index size is {p_copy.element_count} and index capacity is {p_copy.max_elements}")
print(f"Search speed/quality trade-off parameter: ef={p_copy.ef}")

An example with updates after serialization/deserialization:

import hnswlib
import numpy as np

dim = 16
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))

# We split the data in two batches:
data1 = data[:num_elements // 2]
data2 = data[num_elements // 2:]

# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initializing index
# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
# during insertion of an element.
# The capacity can be increased by saving/loading the index, see below.
#
# ef_construction - controls index search speed/build speed tradeoff
#
# M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)
# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction
p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)

# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)

# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)

print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")

# Serializing and deleting the index:
index_path = 'first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index(index_path)
del p

# Re-initializing, loading the index
p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function.

print("\nLoading index from '%s'\n" % index_path)

# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index(index_path, max_elements = num_elements)

print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")
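
The recall check above works because each element is queried for itself with k=1. For k > 1, or for queries that are not in the index, one common approach is to compare against an exact brute-force search. The sketch below is a plain-Python illustration under that assumption: brute_force_knn and recall_at_k are hypothetical helpers, not part of hnswlib, and the "approximate" result is simulated rather than produced by an index.

```python
import random

def brute_force_knn(data, queries, k):
    """Exact k nearest neighbours by squared L2 distance; returns (labels, distances)."""
    labels, distances = [], []
    for q in queries:
        scored = sorted(
            (sum((x - y) ** 2 for x, y in zip(q, v)), i) for i, v in enumerate(data)
        )[:k]
        labels.append([i for _, i in scored])
        distances.append([d for d, _ in scored])
    return labels, distances

def recall_at_k(approx_labels, exact_labels):
    """Fraction of exact neighbours that also appear in the approximate result."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_labels, exact_labels))
    total = sum(len(e) for e in exact_labels)
    return hits / total

random.seed(0)
data = [[random.random() for _ in range(4)] for _ in range(50)]
exact_labels, exact_distances = brute_force_knn(data, data, k=3)

# Simulated approximate result: the exact answer with one neighbour replaced
# by a label that cannot match, standing in for a single missed neighbour.
approx_labels = [row[:] for row in exact_labels]
approx_labels[0][-1] = -1

recall = recall_at_k(approx_labels, exact_labels)
print("Recall@3 of the simulated result:", recall)
```

With a real index, the labels array returned by knn_query could be converted with labels.tolist() and passed as approx_labels. Brute force is quadratic, so it is only practical on a sample of the queries.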

Bindings installation

You can install from sources:

apt-get install -y python-setuptools python-pip
git clone https://github.com/nmslib/hnswlib.git
cd hnswlib
pip install .

or you can install via pip: pip install hnswlib

Other implementations

Non-metric space library (nmslib) - main library (python, C++), supports exotic distances: https://github.com/nmslib/nmslib
Faiss library by facebook, uses own HNSW implementation for coarse quantization (python, C++): https://github.com/facebookresearch/faiss
Code for the paper "Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors" (current state-of-the-art in compressed indexes, C++): https://github.com/dbaranchuk/ivf-hnsw
TOROS N2 (python, C++): https://github.com/kakao/n2
Online HNSW (C++): https://github.com/andrusha97/online-hnsw
Go implementation: https://github.com/Bithack/go-hnsw
Python implementation (as a part of the clustering code by Matteo Dell'Amico): https://github.com/matteodellamico/flexible-clustering
Java implementation: https://github.com/jelmerk/hnswlib
Java bindings using Java Native Access: https://github.com/stepstone-tech/hnswlib-jna
.Net implementation: https://github.com/microsoft/HNSW.Net
CUDA implementation: https://github.com/js1010/cuhnsw

Contributing to the repository

Contributions are highly welcome!

Please make pull requests against the develop branch.

200M SIFT test reproduction

To download and extract the bigann dataset (from root directory):

python3 download_bigann.py

To compile:

mkdir build
cd build
cmake ..
make all

To run the test on 200M SIFT subset:

./main

The size of the BigANN subset (in millions) is controlled by the variable subset_size_millions hardcoded in sift_1b.cpp.
