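Hnswlib is a powerful approximate nearest neighbor (ANN) search library. Its official description reads: "Header-only C++ HNSW implementation with python bindings, insertions and updates." Hnswlib is one of the ANN libraries underlying the popular vector database Milvus, where it provides HNSW retrieval.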

How HNSW works

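HNSW organizes nodes into layers. A search greedily traverses the elements of the top layer until it reaches a local minimum, then drops to the next layer and restarts from that local minimum as the new entry point. This repeats until the bottom layer has been traversed.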
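The layer-by-layer descent can be sketched in a few lines of Python. This is a toy illustration, not hnswlib's implementation: the names greedy_search_layer and layered_search, the adjacency-dict representation of each layer, and the tiny one-dimensional dataset are all assumptions made for the example. It also tracks only a single best candidate per layer, whereas a real HNSW search keeps a list of candidates.

```python
import numpy as np


def greedy_search_layer(query, entry, neighbors, vectors):
    """Walk one layer greedily until no neighbor is closer (a local minimum)."""
    current = entry
    current_dist = float(np.sum((vectors[current] - query) ** 2))
    while True:
        best, best_dist = current, current_dist
        for candidate in neighbors.get(current, []):
            d = float(np.sum((vectors[candidate] - query) ** 2))
            if d < best_dist:
                best, best_dist = candidate, d
        if best == current:
            return current  # local minimum of this layer
        current, current_dist = best, best_dist


def layered_search(query, entry, layers, vectors):
    """Descend from the top layer to the bottom layer, reusing each local minimum."""
    for neighbors in layers:  # layers are ordered top to bottom
        entry = greedy_search_layer(query, entry, neighbors, vectors)
    return entry


# Hypothetical data: eight points on a line, ids 0..7 at coordinates 0.0..7.0.
vectors = np.arange(8, dtype=np.float64).reshape(-1, 1)
top_layer = {0: [4], 4: [0]}  # sparse "express" layer
bottom_layer = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}

result = layered_search(np.array([6.2]), 0, [top_layer, bottom_layer], vectors)
print(result)  # 6
```

The top layer jumps from node 0 to node 4 in one hop, and the bottom layer then refines the answer to node 6.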

Installation and usage

Install from source:

apt-get install -y python-setuptools python-pip
git clone https://github.com/nmslib/hnswlib.git
cd hnswlib
pip install .

Or install directly with pip:

pip install hnswlib

Python usage

import hnswlib
import numpy as np

dim = 16
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))

# We split the data in two batches:
data1 = data[:num_elements // 2]
data2 = data[num_elements // 2:]

# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initializing index
# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
# during insertion of an element.
# The capacity can be increased by saving/loading the index, see below.
#
# ef_construction - controls index search speed/build speed tradeoff
#
# M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)
# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction
p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)

# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)

# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)

print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")

# Serializing and deleting the index:
index_path='first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index("first_half.bin")
del p

# Re-initializing, loading the index
p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function.
print("\nLoading index from 'first_half.bin'\n")

# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index("first_half.bin", max_elements = num_elements)

print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")

Each part is covered in turn below.

Distances

Three distance metrics are supported: squared L2 (l2), inner product (ip), and cosine.

Distance            Parameter   Equation
Squared L2          'l2'        d = sum((Ai-Bi)^2)
Inner product       'ip'        d = 1.0 - sum(Ai*Bi)
Cosine similarity   'cosine'    d = 1.0 - sum(Ai*Bi) / sqrt(sum(Ai*Ai) * sum(Bi*Bi))
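The three equations in the table translate directly into NumPy. The sketch below is only a reference implementation of the formulas for checking values by hand; the function names l2_distance, ip_distance, and cosine_distance are made up for this example and are not part of the hnswlib API.

```python
import numpy as np


def l2_distance(a, b):
    """Squared L2: d = sum((Ai - Bi)^2)."""
    return float(np.sum((a - b) ** 2))


def ip_distance(a, b):
    """Inner product: d = 1.0 - sum(Ai * Bi)."""
    return float(1.0 - np.dot(a, b))


def cosine_distance(a, b):
    """Cosine: d = 1.0 - sum(Ai * Bi) / sqrt(sum(Ai * Ai) * sum(Bi * Bi))."""
    return float(1.0 - np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)))


a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
print(l2_distance(a, b))      # 8.0
print(ip_distance(a, b))      # -10.0
print(cosine_distance(a, b))  # approximately 0.0161
```

Note that the 'ip' value can be negative for vectors that are not normalized, as the second line of output shows.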

API

Defining the index

p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

space selects the distance metric, and dim is the dimensionality of the vectors.

Initializing the index

p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)
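  • max_elements - the maximum capacity. Inserting more elements than this raises an exception; the capacity can be expanded later.
  • ef_construction - trades index build speed against search accuracy. A larger ef_construction gives higher accuracy but a slower build. Raising ef_construction does not improve index quality indefinitely; a common value is 128.
  • M - the number of edges created per vector during index construction. M affects memory consumption: a higher M means more memory, higher accuracy, and a slower build. Values between 8 and 32 are usually recommended.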
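To get a feel for how M drives memory use, a back-of-the-envelope estimate can help. The formula below is an assumption made for illustration, not a figure from this article: it counts 4 bytes per float32 component plus up to 2*M neighbor links of 4 bytes each on the bottom layer, and it ignores upper layers, labels, and allocator overhead. The function name estimate_index_bytes is hypothetical.

```python
def estimate_index_bytes(num_elements, dim, M):
    """Rough lower-bound estimate of index size in bytes (assumed formula)."""
    vector_bytes = dim * 4    # one float32 per dimension
    link_bytes = 2 * M * 4    # up to 2*M bottom-layer links, 4 bytes per id
    return num_elements * (vector_bytes + link_bytes)


for M in (8, 16, 32):
    mib = estimate_index_bytes(num_elements=10000, dim=16, M=M) / 2**20
    print(f"M={M:2d}: about {mib:.2f} MiB")
```

For low-dimensional vectors like the 16-dimensional ones used here, the links can outweigh the vectors themselves, which is why M has such a visible effect on memory.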

Adding and querying data

# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)

# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)

print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")
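  • p.set_ef(10): sets ef, the size of the dynamic candidate list kept during search. It applies at query time, not at build time, which is governed by ef_construction. A higher ef gives better accuracy but slower search, and ef should not be lower than the number of neighbors k requested.
  • p.set_num_threads(4): sets the number of threads used for batch search and index construction. By default all available cores are used.
  • p.add_items(data1): adds the data to the index.
  • labels, distances = p.knn_query(data1, k=1): queries the index with every element of data1, finds the nearest neighbor of each (k=1), and returns the neighbors' labels and distances.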
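When checking recall, it helps to have an exact baseline to compare against. The sketch below computes exact nearest neighbors by brute force with NumPy and applies the same recall formula as the snippet above. It does not use hnswlib at all; brute_force_knn and recall_at_1 are hypothetical helper names, and the output shape (num_queries, k) is chosen to match how the labels are reshaped in the article's code.

```python
import numpy as np


def brute_force_knn(data, queries, k=1):
    """Exact k nearest neighbors under squared L2 distance."""
    # Pairwise squared distances, shape (num_queries, num_data).
    diff = queries[:, None, :] - data[None, :, :]
    dists = np.sum(diff ** 2, axis=-1)
    labels = np.argsort(dists, axis=1)[:, :k]
    distances = np.take_along_axis(dists, labels, axis=1)
    return labels, distances


def recall_at_1(labels, expected):
    """Fraction of queries whose top result equals the expected label."""
    return float(np.mean(labels.reshape(-1) == expected))


rng = np.random.default_rng(0)
data = rng.random((200, 16), dtype=np.float32)

# Query every element for itself: an exact search always finds it.
labels, distances = brute_force_knn(data, data, k=1)
print(recall_at_1(labels, np.arange(len(data))))  # 1.0
```

An approximate index can score below 1.0 on the same check, and comparing its labels against this baseline on a small sample is a simple way to tune ef.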

Saving and loading the index


# Serializing and deleting the index:
index_path='first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index("first_half.bin")
del p

# Re-initializing, loading the index
p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function.
print("\nLoading index from 'first_half.bin'\n")

# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index("first_half.bin", max_elements = num_elements)

print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")
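  • save_index writes the index to disk.
  • load_index reloads it. Passing a larger max_elements to load_index raises the capacity, and as long as max_elements is not exceeded, add_items can be called again.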

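C++ usage

The official repository provides a C++ example that covers creating an index, inserting elements, searching, and serialization: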

#include "../../hnswlib/hnswlib.h"

int main() {
    int dim = 16;               // Dimension of the elements
    int max_elements = 10000;   // Maximum number of elements, should be known beforehand
    int M = 16;                 // Tightly connected with internal dimensionality of the data
                                // strongly affects the memory consumption
    int ef_construction = 200;  // Controls index search speed/build speed tradeoff

    // Initing index
    hnswlib::L2Space space(dim);
    hnswlib::HierarchicalNSW<float>* alg_hnsw = new hnswlib::HierarchicalNSW<float>(&space, max_elements, M, ef_construction);

    // Generate random data
    std::mt19937 rng;
    rng.seed(47);
    std::uniform_real_distribution<> distrib_real;
    float* data = new float[dim * max_elements];
    for (int i = 0; i < dim * max_elements; i++) {
        data[i] = distrib_real(rng);
    }

    // Add data to index
    for (int i = 0; i < max_elements; i++) {
        alg_hnsw->addPoint(data + i * dim, i);
    }

    // Query the elements for themselves and measure recall
    float correct = 0;
    for (int i = 0; i < max_elements; i++) {
        std::priority_queue<std::pair<float, hnswlib::labeltype>> result = alg_hnsw->searchKnn(data + i * dim, 1);
        hnswlib::labeltype label = result.top().second;
        if (label == i) correct++;
    }
    float recall = correct / max_elements;
    std::cout << "Recall: " << recall << "\n";

    // Serialize index
    std::string hnsw_path = "hnsw.bin";
    alg_hnsw->saveIndex(hnsw_path);
    delete alg_hnsw;

    // Deserialize index and check recall
    alg_hnsw = new hnswlib::HierarchicalNSW<float>(&space, hnsw_path);
    correct = 0;
    for (int i = 0; i < max_elements; i++) {
        std::priority_queue<std::pair<float, hnswlib::labeltype>> result = alg_hnsw->searchKnn(data + i * dim, 1);
        hnswlib::labeltype label = result.top().second;
        if (label == i) correct++;
    }
    recall = (float)correct / max_elements;
    std::cout << "Recall of deserialized index: " << recall << "\n";

    delete[] data;
    delete alg_hnsw;
    return 0;
}

Usage in Milvus

Milvus calls knowhere through cgo. Knowhere is an abstraction layer over vector search that integrates open-source ANN libraries such as FAISS and HNSW.

Knowhere pulls the hnswlib source code directly into its own tree. The code that uses hnswlib lives at:

https://github.com/zilliztech/knowhere/blob/main/src/index/hnsw/hnsw.cc

It implements HnswIndexNode mainly on top of hnswlib's C++ interface:

namespace knowhere {
class HnswIndexNode : public IndexNode {
 public:
    HnswIndexNode(const int32_t& /*version*/, const Object& object) : index_(nullptr) {
        search_pool_ = ThreadPool::GetGlobalSearchThreadPool();
    }

    Status
    Train(const DataSet& dataset, const Config& cfg) override {
        auto rows = dataset.GetRows();
        auto dim = dataset.GetDim();
        auto hnsw_cfg = static_cast<const HnswConfig&>(cfg);
        hnswlib::SpaceInterface<float>* space = nullptr;
        if (IsMetricType(hnsw_cfg.metric_type.value(), metric::L2)) {
            space = new (std::nothrow) hnswlib::L2Space(dim);
        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::IP)) {
            space = new (std::nothrow) hnswlib::InnerProductSpace(dim);
        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::COSINE)) {
            space = new (std::nothrow) hnswlib::CosineSpace(dim);
        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::HAMMING)) {
            space = new (std::nothrow) hnswlib::HammingSpace(dim);
        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::JACCARD)) {
            space = new (std::nothrow) hnswlib::JaccardSpace(dim);
        } else {
            LOG_KNOWHERE_WARNING_ << "metric type not support in hnsw: " << hnsw_cfg.metric_type.value();
            return Status::invalid_metric_type;
        }
        auto index = new (std::nothrow)
            hnswlib::HierarchicalNSW<float>(space, rows, hnsw_cfg.M.value(), hnsw_cfg.efConstruction.value());
        if (index == nullptr) {
            LOG_KNOWHERE_WARNING_ << "memory malloc error.";
            return Status::malloc_error;
        }
        if (this->index_) {
            delete this->index_;
            LOG_KNOWHERE_WARNING_ << "index not empty, deleted old index";
        }
        this->index_ = index;
        return Status::success;
    }

    Status
    Add(const DataSet& dataset, const Config& cfg) override {
        // ...
        std::atomic<uint64_t> counter{0};
        uint64_t one_tenth_row = rows / 10;
        for (int i = 1; i < rows; ++i) {
            futures.emplace_back(build_pool->push([&, idx = i]() {
                index_->addPoint(((const char*)tensor + index_->data_size_ * idx), idx);
                uint64_t added = counter.fetch_add(1);
                if (added % one_tenth_row == 0) {
                    LOG_KNOWHERE_INFO_ << "HNSW build progress: " << (added / one_tenth_row) << "0%";
                }
            }));
        }
        // ...
    }
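The Add method in the excerpt pushes one addPoint task per row onto a thread pool and logs progress about every tenth of the rows, using a shared atomic counter. The same pattern can be sketched in Python. This is an analogy only, not knowhere code: the function parallel_add and its add_point callback are hypothetical, a lock stands in for std::atomic, progress is recorded after each increment, and the step is clamped to at least 1 so that inputs with fewer than ten rows do not divide by zero.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def parallel_add(rows, add_point, num_threads=4):
    """Call add_point(idx) for every row on a thread pool; return progress percentages."""
    counter = 0
    lock = threading.Lock()
    one_tenth = max(rows // 10, 1)  # guard against a zero step for small inputs
    progress = []

    def task(idx):
        nonlocal counter
        add_point(idx)  # stands in for index_->addPoint(...)
        with lock:      # stands in for the atomic counter
            counter += 1
            if counter % one_tenth == 0:
                progress.append(counter * 100 // rows)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(task, range(rows)))  # wait for all tasks to finish
    return progress


inserted = []
progress = parallel_add(100, inserted.append)
print(progress)  # [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
```

The rows may be inserted in any order across threads, but the progress values are always ascending because the counter update and the append happen under the same lock.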

Other implementations
