hnsw
Hnswlib - fast approximate nearest neighbor search
Header-only C++ HNSW implementation with python bindings.
NEWS:
- Hnswlib is now 0.5.2. Bugfixes - thanks @marekhanus for fixing the missing arguments, adding support for Python 3.8 and 3.9 in Travis, improving the Python wrapper and fixing typos/code style; @apoorv-sharma for fixing the bug in the insertion/deletion logic; @shengjun1985 for simplifying the memory reallocation logic; @TakaakiFuruse for the improved description of `add_items`; @psobot for improving error handling; @ShuAiii for reporting the bug in the Python interface.
- Hnswlib is now 0.5.0. Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bug and documentation fixes. Many thanks to @dbespalov, @dyashuni, @groodt, @uestc-lfs, @vinnitu, @fabiencastan, @JinHai-CN, @js1010!
- Thanks to Apoorv Sharma @apoorv-sharma, hnswlib now supports true element updates (the interface remained the same, but the performance/memory should no longer degrade as you update the element embeddings).
- Thanks to Dmitry @2ooom, hnswlib got a boost in performance for vector dimensions that are not a multiple of 4.
- Thanks to Louis Abraham (@louisabraham), hnswlib can now be installed via pip!
Highlights:
- Lightweight, header-only, no dependencies other than C++ 11.
- Interfaces for C++, python and R (https://github.com/jlmelville/rcpphnsw).
- Has full support for incremental index construction. Has support for element deletions (currently, without actual freeing of the memory).
- Can work with custom user defined distances (C++).
- Significantly smaller memory footprint and faster build time compared to the current nmslib implementation.
Description of the algorithm parameters can be found in ALGO_PARAMS.md.
Python bindings
Supported distances:
| Distance | parameter | Equation |
|---|---|---|
| Squared L2 | 'l2' | d = sum((Ai-Bi)^2) |
| Inner product | 'ip' | d = 1.0 - sum(Ai*Bi) |
| Cosine similarity | 'cosine' | d = 1.0 - sum(Ai*Bi) / sqrt(sum(Ai*Ai) * sum(Bi*Bi)) |
Note that inner product is not an actual metric. An element can be closer to some other element than to itself. That allows some speedup if you remove all elements that are not the closest to themselves from the index.
For other spaces use the nmslib library https://github.com/nmslib/nmslib.
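The three formulas in the table can be written out directly. The sketch below is plain Python for illustration only, not the library's optimized C++ code, and the function names are made up for this example. It also shows the inner-product caveat from the note above: with `a = [1, 0]` and `b = [2, 0]`, `b` is closer to `a` than `a` is to itself.

```python
import math

def squared_l2(a, b):
    """'l2': d = sum((Ai - Bi)^2)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product_distance(a, b):
    """'ip': d = 1.0 - sum(Ai * Bi)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """'cosine': d = 1.0 - sum(Ai*Bi) / sqrt(sum(Ai*Ai) * sum(Bi*Bi))."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return 1.0 - dot / norm

a = [1.0, 0.0]
b = [2.0, 0.0]

print(squared_l2(a, b))              # 1.0
print(inner_product_distance(a, a))  # 0.0
print(inner_product_distance(a, b))  # -1.0: b is "closer" to a than a is to itself
print(cosine_distance(a, b))         # 0.0: a and b point in the same direction
```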
Short API description
`hnswlib.Index(space, dim)` creates a non-initialized HNSW index in space `space` with integer dimension `dim`.
hnswlib.Index methods:
- `init_index(max_elements, M = 16, ef_construction = 200, random_seed = 100)` initializes the index with no elements.
    - `max_elements` defines the maximum number of elements that can be stored in the structure (can be increased/shrunk).
    - `ef_construction` defines a construction time/accuracy trade-off (see ALGO_PARAMS.md).
    - `M` defines the maximum number of outgoing connections in the graph (see ALGO_PARAMS.md).
- `add_items(data, ids, num_threads = -1)` inserts the `data` (numpy array of vectors, shape: `N*dim`) into the structure.
    - `num_threads` sets the number of CPU threads to use (-1 means use default).
    - `ids` is an optional N-size numpy array of integer labels for all elements in `data`.
        - If the index already has elements with the same labels, their features will be updated. Note that the update procedure is slower than insertion of a new element, but more memory- and query-efficient.
    - Thread-safe with other `add_items` calls, but not with `knn_query`.
- `mark_deleted(label)` marks the element as deleted, so it will be omitted from search results.
- `resize_index(new_size)` changes the maximum capacity of the index. Not thread-safe with `add_items` and `knn_query`.
- `set_ef(ef)` sets the query time accuracy/speed trade-off, defined by the `ef` parameter (see ALGO_PARAMS.md). Note that the parameter is currently not saved along with the index, so you need to set it manually after loading.
- `knn_query(data, k = 1, num_threads = -1)` makes a batch query for the `k` closest elements for each element of `data` (shape: `N*dim`). Returns two numpy arrays, labels and distances, each of shape `N*k`.
    - `num_threads` sets the number of CPU threads to use (-1 means use default).
    - Thread-safe with other `knn_query` calls, but not with `add_items`.
- `load_index(path_to_index, max_elements = 0)` loads the index from persistence into the uninitialized index.
    - `max_elements` (optional) resets the maximum number of elements in the structure.
- `save_index(path_to_index)` saves the index to persistence.
- `set_num_threads(num_threads)` sets the default number of CPU threads used during data insertion/querying.
- `get_items(ids)` returns a numpy array (shape: `N*dim`) of vectors that have integer identifiers specified in the `ids` numpy vector (shape: `N`). Note that for cosine similarity it currently returns normalized vectors.
- `get_ids_list()` returns a list of all elements' ids.
- `get_max_elements()` returns the current capacity of the index.
- `get_current_count()` returns the current number of elements stored in the index.
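Because `add_items` and `knn_query` are not thread-safe with each other, an application that inserts and queries from different threads needs its own synchronization. One simple option, sketched below, is to route both calls through a single lock. `GuardedIndex` is a hypothetical helper for illustration, not part of hnswlib, and it assumes the wrapped object exposes `add_items` and `knn_query` with the signatures listed above.

```python
import threading

class GuardedIndex:
    """Hypothetical wrapper that never lets add_items and knn_query overlap."""

    def __init__(self, index):
        self._index = index
        self._lock = threading.Lock()

    def add_items(self, data, ids=None, num_threads=-1):
        # Only one batch call (insert or query) runs at a time.
        with self._lock:
            return self._index.add_items(data, ids=ids, num_threads=num_threads)

    def knn_query(self, data, k=1, num_threads=-1):
        with self._lock:
            return self._index.knn_query(data, k=k, num_threads=num_threads)
```

A single lock also serializes concurrent `add_items` calls, which the library itself would allow. That is usually an acceptable cost, since each batch call can still use several threads internally through `num_threads`.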
Read-only properties of hnswlib.Index class:
- `space` - name of the space (can be one of "l2", "ip", or "cosine").
- `dim` - dimensionality of the space.
- `M` - parameter that defines the maximum number of outgoing connections in the graph.
- `ef_construction` - parameter that controls speed/accuracy trade-off during the index construction.
- `max_elements` - current capacity of the index. Equivalent to `p.get_max_elements()`.
- `element_count` - number of items in the index. Equivalent to `p.get_current_count()`.
Properties of hnswlib.Index that support reading and writing:
- `ef` - parameter controlling query time/accuracy trade-off.
- `num_threads` - default number of threads to use in `add_items` or `knn_query`. Note that calling `p.set_num_threads(3)` is equivalent to `p.num_threads=3`.
Python bindings examples
import hnswlib
import numpy as np
import pickle

dim = 128
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
ids = np.arange(num_elements)

# Declaring index
p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip

# Initializing index - the maximum number of elements should be known beforehand
p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)

# Element insertion (can be called several times):
p.add_items(data, ids)

# Controlling the recall by setting ef:
p.set_ef(50) # ef should always be > k

# Query dataset, k - number of closest elements (returns 2 numpy arrays)
labels, distances = p.knn_query(data, k = 1)

# Index objects support pickling
# WARNING: serialization via pickle.dumps(p) or p.__getstate__() is NOT thread-safe with p.add_items method!
# Note: ef parameter is included in serialization; random number generator is initialized with random_seed on Index load
p_copy = pickle.loads(pickle.dumps(p)) # creates a copy of index p using pickle round-trip

### Index parameters are exposed as class properties:
print(f"Parameters passed to constructor: space={p_copy.space}, dim={p_copy.dim}")
print(f"Index construction: M={p_copy.M}, ef_construction={p_copy.ef_construction}")
print(f"Index size is {p_copy.element_count} and index capacity is {p_copy.max_elements}")
print(f"Search speed/quality trade-off parameter: ef={p_copy.ef}")
An example with updates after serialization/deserialization:
import hnswlib
import numpy as np

dim = 16
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))

# We split the data in two batches:
data1 = data[:num_elements // 2]
data2 = data[num_elements // 2:]

# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initializing index
# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
# during insertion of an element.
# The capacity can be increased by saving/loading the index, see below.
#
# ef_construction - controls index search speed/build speed tradeoff
#
# M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)
# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction

p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)

# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)

# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)

print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")

# Serializing and deleting the index:
index_path='first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index("first_half.bin")
del p

# Re-initializing, loading the index
p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function.

print("\nLoading index from 'first_half.bin'\n")

# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index("first_half.bin", max_elements = num_elements)

print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")
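The example above measures recall by querying each element for itself with `k=1`. For `k > 1`, or for queries that are not in the index, recall is usually measured against exact neighbors found by brute force. The sketch below shows one way to do that in plain Python under squared L2. `exact_neighbors` and `recall_at_k` are hypothetical helper names, and the "approximate" results are simulated rather than taken from an index.

```python
import random

def exact_neighbors(data, queries, k):
    """Brute-force k nearest neighbors under squared L2; returns lists of labels."""
    result = []
    for q in queries:
        dists = [(sum((x - y) ** 2 for x, y in zip(q, v)), i) for i, v in enumerate(data)]
        dists.sort()
        result.append([i for _, i in dists[:k]])
    return result

def recall_at_k(approx_labels, exact_labels):
    """Fraction of the true neighbors that the approximate search returned."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_labels, exact_labels))
    total = sum(len(e) for e in exact_labels)
    return hits / total

random.seed(0)
data = [[random.random() for _ in range(4)] for _ in range(50)]
truth = exact_neighbors(data, data[:5], k=3)

# Stand-in for approximate results: replace the last true neighbor of each
# query with an invalid label, so 2 of 3 neighbors are found per query.
approx = [row[:2] + [-1] for row in truth]

print(recall_at_k(truth, truth))   # 1.0
print(recall_at_k(approx, truth))  # 10 of the 15 true neighbors were found
```

With a real index, the labels returned by `knn_query` can be passed as `labels.tolist()` in place of `approx`.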
Bindings installation
You can install from sources:
apt-get install -y python-setuptools python-pip
git clone https://github.com/nmslib/hnswlib.git
cd hnswlib
pip install .
or you can install via pip: pip install hnswlib
Other implementations
- Non-metric space library (nmslib) - main library(python, C++), supports exotic distances: https://github.com/nmslib/nmslib
- Faiss library by facebook, uses own HNSW implementation for coarse quantization (python, C++): https://github.com/facebookresearch/faiss
- Code for the paper "Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors" (current state-of-the-art in compressed indexes, C++): https://github.com/dbaranchuk/ivf-hnsw
- TOROS N2 (python, C++): https://github.com/kakao/n2
- Online HNSW (C++): https://github.com/andrusha97/online-hnsw
- Go implementation: https://github.com/Bithack/go-hnsw
- Python implementation (as a part of the clustering code by Matteo Dell'Amico): https://github.com/matteodellamico/flexible-clustering
- Java implementation: https://github.com/jelmerk/hnswlib
- Java bindings using Java Native Access: https://github.com/stepstone-tech/hnswlib-jna
- .Net implementation: https://github.com/microsoft/HNSW.Net
- CUDA implementation: https://github.com/js1010/cuhnsw
Contributing to the repository
Contributions are highly welcome!
Please make pull requests against the develop branch.
200M SIFT test reproduction
To download and extract the bigann dataset (from root directory):
python3 download_bigann.py
To compile:
mkdir build
cd build
cmake ..
make all
To run the test on 200M SIFT subset:
./main
The size of the BigANN subset (in millions) is controlled by the variable subset_size_millions hardcoded in sift_1b.cpp.