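Hnswlib: Introduction and Getting Started

Hnswlib is a powerful approximate nearest neighbor (ANN) search library. The official description reads: "Header-only C++ HNSW implementation with python bindings, insertions and updates." It is one of the ANN libraries underlying the popular vector database Milvus, where it provides HNSW search.

How HNSW Works

Nodes are assigned to different layers. A search greedily traverses the elements of the top layer until it reaches a local minimum, then drops to the next layer and restarts the traversal from that local minimum as the new entry point, repeating until the lowest layer has been traversed.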
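The layered descent can be sketched in plain Python. This is only an illustration under stated assumptions, not hnswlib's implementation: the `layers` structure (a list of adjacency dicts, top layer first), the function names, and the toy 1-D graph are all made up for the example, and a real HNSW search keeps a candidate list of size ef on the bottom layer instead of a single best node.

```python
def l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def greedy_layer_search(query, entry, neighbors, vectors):
    """Walk one layer greedily until no neighbor is closer to the query."""
    current = entry
    current_dist = l2(query, vectors[current])
    while True:
        best, best_dist = current, current_dist
        for candidate in neighbors.get(current, ()):
            d = l2(query, vectors[candidate])
            if d < best_dist:
                best, best_dist = candidate, d
        if best == current:  # local minimum reached on this layer
            return current
        current, current_dist = best, best_dist


def hnsw_like_search(query, layers, entry, vectors):
    """Descend layer by layer, reusing each local minimum as the next entry point."""
    current = entry
    for neighbors in layers:  # layers[0] is the top (sparsest) layer
        current = greedy_layer_search(query, current, neighbors, vectors)
    return current


# Toy data (hypothetical): eight points on a line, ids 0..7.
vectors = {i: [float(i)] for i in range(8)}
layers = [
    {0: [4], 4: [0]},                                    # top layer
    {0: [2], 2: [0, 4], 4: [2, 6], 6: [4]},              # middle layer
    {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)},  # layer 0
]

nearest = hnsw_like_search([5.2], layers, entry=0, vectors=vectors)
print(nearest)
```

With this toy graph the search moves 0 -> 4 on the top layer, 4 -> 6 on the middle layer, and 6 -> 5 on the bottom layer, so each layer narrows the distance before the finer layer takes over.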

Installation

Install from source:

apt-get install -y python-setuptools python-pip

git clone https://github.com/nmslib/hnswlib.git

cd hnswlib

pip install .

Or install directly with pip: pip install hnswlib

Python Usage

import hnswlib

import numpy as np

dim = 16

num_elements = 10000

# Generating sample data

data = np.float32(np.random.random((num_elements, dim)))

# We split the data in two batches:

data1 = data[:num_elements // 2]

data2 = data[num_elements // 2:]

# Declaring index

p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initializing index

# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded

# during insertion of an element.

# The capacity can be increased by saving/loading the index, see below.

#

# ef_construction - controls index search speed/build speed tradeoff

#

# M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)

# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction

p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)

# Controlling the recall by setting ef:

# higher ef leads to better accuracy, but slower search

p.set_ef(10)

# Set number of threads used during batch search/construction

# By default using all available cores

p.set_num_threads(4)

print("Adding first batch of %d elements" % (len(data1)))

p.add_items(data1)

# Query the elements for themselves and measure recall:

labels, distances = p.knn_query(data1, k=1)

print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")

# Serializing and deleting the index:

index_path='first_half.bin'

print("Saving index to '%s'" % index_path)

p.save_index("first_half.bin")

del p

# Re-initializing, loading the index

p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function.

print("\nLoading index from 'first_half.bin'\n")

# Increase the total capacity (max_elements), so that it will handle the new data

p.load_index("first_half.bin", max_elements = num_elements)

print("Adding the second batch of %d elements" % (len(data2)))

p.add_items(data2)

# Query the elements for themselves and measure recall:

labels, distances = p.knn_query(data, k=1)

print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")

Each part is explained in turn below.

Distances

Three distance metrics are supported: l2 (squared L2), ip (inner product), and cosine.

Distance	parameter	Equation
Squared L2	'l2'	d = sum((Ai-Bi)^2)
Inner product	'ip'	d = 1.0 - sum(Ai*Bi)
Cosine similarity	'cosine'	d = 1.0 - sum(Ai*Bi) / sqrt(sum(Ai*Ai) * sum(Bi*Bi))
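The three equations in the table can be written out directly. The sketch below is a plain-Python transcription of those formulas for checking values by hand; the function names are made up for the example and it is not how hnswlib computes distances internally.

```python
import math


def squared_l2(a, b):
    """d = sum((Ai - Bi)^2)"""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a, b):
    """d = 1.0 - sum(Ai * Bi)"""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """d = 1.0 - sum(Ai * Bi) / sqrt(sum(Ai * Ai) * sum(Bi * Bi))"""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return 1.0 - dot / norm


a = [1.0, 0.0]
b = [0.0, 1.0]
print(squared_l2(a, b))              # 2.0
print(inner_product_distance(a, b))  # 1.0
print(cosine_distance(a, b))         # 1.0
```

All three are distances, so smaller means closer. The inner product form can go negative when the vectors are not normalized.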

API

Defining the index

p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

space selects the distance metric, and dim is the dimensionality of the vectors.

Initializing the index

p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)

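max_elements - the maximum capacity. Inserting more elements than this raises an exception; the capacity can be expanded dynamically.
ef_construction - trades index build speed against search accuracy: the larger ef_construction is, the higher the accuracy but the slower the build. Raising ef_construction does not improve index quality indefinitely; 128 is a commonly used value.
M - the number of edges created for each vector during index construction. M affects memory consumption: the higher M is, the more memory is used, the higher the accuracy, and the slower the build. A value between 8 and 32 is usually recommended.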
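To get a feel for how M drives memory use, here is a rough back-of-the-envelope estimate of the bottom-layer footprint. It is a sketch under assumptions, not a measurement: it assumes float32 components, 4-byte neighbor ids, 8-byte labels, and up to 2*M link slots per element on the bottom layer, and it ignores upper layers and allocator overhead. The function name and defaults are hypothetical.

```python
def estimate_level0_bytes(num_elements, dim, M,
                          bytes_per_component=4,  # assumed float32 vectors
                          bytes_per_link=4,       # assumed 4-byte neighbor ids
                          bytes_per_count=4,      # assumed 4-byte link counter
                          bytes_per_label=8):     # assumed 8-byte labels
    """Rough bottom-layer footprint: vector + up to 2*M links + label."""
    links = 2 * M * bytes_per_link + bytes_per_count
    per_element = dim * bytes_per_component + links + bytes_per_label
    return num_elements * per_element


# Same dim and element count as the Python example in this post.
for M in (8, 16, 32):
    print(M, estimate_level0_bytes(10000, 16, M))
```

Under these assumptions the link storage grows linearly with M while the vector storage stays fixed, so M matters most for low-dimensional data.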

Adding and querying data

# Controlling the recall by setting ef:

# higher ef leads to better accuracy, but slower search

p.set_ef(10)

# Set number of threads used during batch search/construction

# By default using all available cores

p.set_num_threads(4)

print("Adding first batch of %d elements" % (len(data1)))

p.add_items(data1)

# Query the elements for themselves and measure recall:

labels, distances = p.knn_query(data1, k=1)

print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")

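p.set_ef(10): sets ef, the size of the dynamic candidate list kept during search (the build-time counterpart is ef_construction). A higher ef gives better accuracy but slower search.
p.set_num_threads(4): sets the number of threads used for batch search and index construction. By default all available cores are used.
p.add_items(data1): adds the data to the index.
labels, distances = p.knn_query(data1, k=1): queries every element of the data for its nearest neighbor and returns the neighbors' labels and distances.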
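The self-query check in the example is a quick sanity test. To measure recall for arbitrary queries, the approximate labels are usually compared against exact neighbors from a brute-force scan. The sketch below shows that comparison in plain Python; the function names and the `approx` list are hypothetical stand-ins, with `approx` playing the role of the labels an ANN index would return.

```python
def brute_force_knn(data, query, k):
    """Return the ids of the k rows of `data` closest to `query` (squared L2)."""
    dists = [(sum((x - y) ** 2 for x, y in zip(row, query)), i)
             for i, row in enumerate(data)]
    dists.sort()
    return [i for _, i in dists[:k]]


def recall_at_k(approx_labels, exact_labels):
    """Fraction of the exact neighbors that the approximate result recovered."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_labels, exact_labels))
    total = sum(len(e) for e in exact_labels)
    return hits / total


data = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
queries = [[0.1, 0.0], [4.0, 4.0]]

exact = [brute_force_knn(data, q, k=2) for q in queries]
approx = [[0, 2], [3, 1]]  # hypothetical output of an ANN index
print(exact, recall_at_k(approx, exact))
```

When the approximate labels come back as a NumPy array, converting them with `labels.tolist()` gives the list-of-lists shape this helper expects.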

Saving and loading the index



# Serializing and deleting the index:

index_path='first_half.bin'

print("Saving index to '%s'" % index_path)

p.save_index("first_half.bin")

del p

# Re-initializing, loading the index

p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function.

print("\nLoading index from 'first_half.bin'\n")

# Increase the total capacity (max_elements), so that it will handle the new data

p.load_index("first_half.bin", max_elements = num_elements)

print("Adding the second batch of %d elements" % (len(data2)))

p.add_items(data2)

# Query the elements for themselves and measure recall:

labels, distances = p.knn_query(data, k=1)

print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")

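save_index saves the index to disk.
load_index then reloads it; as long as max_elements is not exceeded, add_items can be called again.

C++ Usage

The official repository provides a C++ example covering index creation, element insertion, search, and serialization.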

#include "../../hnswlib/hnswlib.h"

int main() {

    int dim = 16;               // Dimension of the elements

    int max_elements = 10000;   // Maximum number of elements, should be known beforehand

    int M = 16;                 // Tightly connected with internal dimensionality of the data

                                // strongly affects the memory consumption

    int ef_construction = 200;  // Controls index search speed/build speed tradeoff

    // Initing index

    hnswlib::L2Space space(dim);

    hnswlib::HierarchicalNSW<float>* alg_hnsw = new hnswlib::HierarchicalNSW<float>(&space, max_elements, M, ef_construction);

    // Generate random data

    std::mt19937 rng;

    rng.seed(47);

    std::uniform_real_distribution<> distrib_real;

    float* data = new float[dim * max_elements];

    for (int i = 0; i < dim * max_elements; i++) {

        data[i] = distrib_real(rng);

    }

    // Add data to index

    for (int i = 0; i < max_elements; i++) {

        alg_hnsw->addPoint(data + i * dim, i);

    }

    // Query the elements for themselves and measure recall

    float correct = 0;

    for (int i = 0; i < max_elements; i++) {

        std::priority_queue<std::pair<float, hnswlib::labeltype>> result = alg_hnsw->searchKnn(data + i * dim, 1);

        hnswlib::labeltype label = result.top().second;

        if (label == i) correct++;

    }

    float recall = correct / max_elements;

    std::cout << "Recall: " << recall << "\n";

    // Serialize index

    std::string hnsw_path = "hnsw.bin";

    alg_hnsw->saveIndex(hnsw_path);

    delete alg_hnsw;

    // Deserialize index and check recall

    alg_hnsw = new hnswlib::HierarchicalNSW<float>(&space, hnsw_path);

    correct = 0;

    for (int i = 0; i < max_elements; i++) {

        std::priority_queue<std::pair<float, hnswlib::labeltype>> result = alg_hnsw->searchKnn(data + i * dim, 1);

        hnswlib::labeltype label = result.top().second;

        if (label == i) correct++;

    }

    recall = (float)correct / max_elements;

    std::cout << "Recall of deserialized index: " << recall << "\n";

    delete[] data;

    delete alg_hnsw;

    return 0;

}

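Usage in Milvus

Milvus calls knowhere through cgo. knowhere is an abstraction layer for vector search that integrates open-source ANN libraries such as FAISS and HNSW.

knowhere vendors the hnswlib code directly. The code that uses hnswlib is at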

https://github.com/zilliztech/knowhere/blob/main/src/index/hnsw/hnsw.cc

It mainly implements HnswIndexNode on top of hnswlib's C++ interface.

namespace knowhere {

class HnswIndexNode : public IndexNode {

 public:

    HnswIndexNode(const int32_t& /*version*/, const Object& object) : index_(nullptr) {

        search_pool_ = ThreadPool::GetGlobalSearchThreadPool();

    }

    Status

    Train(const DataSet& dataset, const Config& cfg) override {

        auto rows = dataset.GetRows();

        auto dim = dataset.GetDim();

        auto hnsw_cfg = static_cast<const HnswConfig&>(cfg);

        hnswlib::SpaceInterface<float>* space = nullptr;

        if (IsMetricType(hnsw_cfg.metric_type.value(), metric::L2)) {

            space = new (std::nothrow) hnswlib::L2Space(dim);

        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::IP)) {

            space = new (std::nothrow) hnswlib::InnerProductSpace(dim);

        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::COSINE)) {

            space = new (std::nothrow) hnswlib::CosineSpace(dim);

        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::HAMMING)) {

            space = new (std::nothrow) hnswlib::HammingSpace(dim);

        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::JACCARD)) {

            space = new (std::nothrow) hnswlib::JaccardSpace(dim);

        } else {

            LOG_KNOWHERE_WARNING_ << "metric type not support in hnsw: " << hnsw_cfg.metric_type.value();

            return Status::invalid_metric_type;

        }

        auto index = new (std::nothrow)

            hnswlib::HierarchicalNSW<float>(space, rows, hnsw_cfg.M.value(), hnsw_cfg.efConstruction.value());

        if (index == nullptr) {

            LOG_KNOWHERE_WARNING_ << "memory malloc error.";

            return Status::malloc_error;

        }

        if (this->index_) {

            delete this->index_;

            LOG_KNOWHERE_WARNING_ << "index not empty, deleted old index";

        }

        this->index_ = index;

        return Status::success;

    }

    Status

    Add(const DataSet& dataset, const Config& cfg) override {

        // ...

        std::atomic<uint64_t> counter{0};

        uint64_t one_tenth_row = rows / 10;

        for (int i = 1; i < rows; ++i) {

            futures.emplace_back(build_pool->push([&, idx = i]() {

                index_->addPoint(((const char*)tensor + index_->data_size_ * idx), idx);

                uint64_t added = counter.fetch_add(1);

                if (added % one_tenth_row == 0) {

                    LOG_KNOWHERE_INFO_ << "HNSW build progress: " << (added / one_tenth_row) << "0%";

                }

            }));

        }

        // ...

    }
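The Add excerpt above submits one addPoint call per row to a thread pool and logs progress roughly every tenth of the rows. The same pattern can be sketched in Python with the standard library. This is an illustration of the pattern only, not knowhere's code: `parallel_add`, the stand-in `add_point` callback, and the message format are hypothetical, and the step is clamped to at least 1 so that fewer than ten rows do not cause a division by zero.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def parallel_add(items, add_point, num_threads=4):
    """Call add_point(item, idx) for every item on a thread pool and
    return the progress messages recorded at each tenth of the work."""
    total = len(items)
    step = max(total // 10, 1)  # guard against total < 10
    lock = threading.Lock()
    done = 0
    progress = []

    def task(idx):
        nonlocal done
        add_point(items[idx], idx)
        with lock:  # the counter and the log are shared between workers
            done += 1
            if done % step == 0:
                progress.append("build progress: %d/%d" % (done, total))

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # list() waits for all tasks and re-raises any worker exception
        list(pool.map(task, range(total)))
    return progress


added = {}


def add_point(vec, idx):
    added[idx] = vec  # stand-in for an index insertion call


messages = parallel_add([[float(i)] for i in range(20)], add_point)
print(len(added), messages[-1])
```

Counting completed tasks under a lock keeps the progress messages in order even though the insertions themselves may finish in any order.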

Other implementations

Go implementation: https://github.com/Bithack/go-hnsw
Java implementation: https://github.com/jelmerk/hnswlib
Java bindings using Java Native Access: https://github.com/stepstone-tech/hnswlib-jna
