Frame of Reference and Roaring Bitmaps

2024-10-18 19:53:16 原文

https://www.elastic.co/cn/blog/frame-of-reference-and-roaring-bitmaps

http://roaringbitmap.org/

2015年2月18日Engineering

Frame of Reference and Roaring Bitmaps

作者

Postings lists

While it may surprise you if you are new to search engine internals, one of the most important building blocks of a search engine is the ability to efficiently compress and quickly decode sorted lists of integers. Why is this useful? As you may know, Elasticsearch shards, which are Lucene indices under the hood, split the data that they store into segments which are regularly merged together. Inside each segment, documents are given an identifier between 0 and the number of documents in the segment (up to 231-1). This is conceptually like an index in an array: it is stored nowhere but is enough to identity an item. Segments store data about documents sequentially, and a doc ID is the index of a document in a segment. So the first document in a segment would have a doc ID of 0, the second 1, etc. until the last document, which has a doc ID equal to the total number of documents in the segment minus one.

Why are these doc IDs useful? An inverted index needs to map terms to the list of documents that contain this term, called a postings list, and these doc IDs that we just discussed are a perfect fit since they can be compressed efficiently.

Frame Of Reference

In order to be able to compute intersections and unions efficiently, we require that these postings lists are sorted. A nice side-effect of this decision is that postings lists can be compressed with delta-encoding.

For instance, if your postings list is [73, 300, 302, 332, 343, 372], the list of deltas would be [73, 227, 2, 30, 11, 29]. What is interesting to note here is that all deltas are between 0 and 255, so you only need one byte per value. This is the technique that Lucene is using in order to encode your inverted index on disk: postings lists are split into blocks of 256 doc IDs and then each block is compressed separately using delta-encoding and bit packing: Lucene computes the maximum number of bits required to store deltas in a block, adds this information to the block header, and then encodes all deltas of the block using this number of bits. This encoding technique is known as Frame Of Reference (FOR) in the literature and has been used since Lucene 4.1.

Here is an example with a block size of 3 (instead of 256 in practice):

Frame of Reference and Roaring Bitmaps的更多相关文章

OD: Register, Stack Frame, Function Reference
几个重要的 Win32 寄存器 EIP 指令寄存器(Extended Instruction Pointer) 存放一个指针,指向下一条等待执行的指令地址 ESP 栈指针寄存器(Extended St ...
Elasticsearch 通关教程（七）： Elasticsearch 的性能优化
硬件选择 Elasticsearch(后文简称 ES)的基础是 Lucene,所有的索引和文档数据是存储在本地的磁盘中,具体的路径可在 ES 的配置文件../config/elasticsearch. ...
Elasticsearch 技术分析（九）：Elasticsearch的使用和原理总结
前言之前已经分享过Elasticsearch的使用和原理的知识,由于近期在公司内部做了一次内部分享,所以本篇主要是基于之前的博文的一个总结,希望通过这篇文章能让读者大致了解Elasticsearch ...
全文搜索引擎Elasticsearch详细介绍
我们生活中的数据总体分为两种:结构化数据和非结构化数据. 结构化数据:也称作行数据,是由二维表结构来逻辑表达和实现的数据,严格地遵循数据格式与长度规范,主要通过关系型数据库进行存储和管理.指具有固 ...
L ==> E · L · K
三剑客:Elastic Stack 在学习ELK前,先对 Lucene作基本了解. 今天才知道关系型数据库的索引是 B-Tree,罪过... 减少磁盘寻道次数 ---> 提高查询性能 Lucen ...
带你走进神一样的Elasticsearch索引机制
更多精彩内容请看我的个人博客前言相比于大多数人熟悉的MySQL数据库的索引,Elasticsearch的索引机制是完全不同于MySQL的B+Tree结构.索引会被压缩放入内存用于加速搜索过程,这一 ...
Busting Frame Busting: a Study of Clickjacking Vulnerabilities on Popular Sites
Busting Frame Busting Reference From: http://seclab.stanford.edu/websec/framebusting/framebust.pdf T ...
Frames of Reference参考框架
Frames of Reference参考框架 When describing the position and orientation of something (for example, your ...
Elasticsearch索引原理
转载 http://blog.csdn.net/endlu/article/details/51720299 最近在参与一个基于Elasticsearch作为底层数据框架提供大数据量(亿级)的实时统计 ...

随机推荐

redis scan 命令指南
redis scan 命令指南 1. 模糊查询键值 redis 中模糊查询key有 keys,scan等,一下是一些具体用法. -- 命令用法:keys [pattern] keys name* -- ...
[数据库]000 - 🍳Sysbench 数据库压力测试工具
000 - Sysbench 数据库压力测试工具 sysbench 是一个开源的.模块化的.跨平台的多线程性能测试工具,可以用来进行CPU.内存.磁盘I/O.线程.数据库的性能测试.目前支持的数据库有 ...
python对离散数据进行编码
机器学习中会遇到一些离散型数据,无法带入模型进行训练,所以要对其进行编码,常用的编码方式有两种: 1.特征不具备大小意义的直接独热编码(one-hot encoding) 2.特征有大小意义的采用映射 ...
C语言测一个浮点数的位数长度
测浮点数的位数牵扯到一个精度的问题,用普通的测整形数值的方法不能实现,于是我自己写了一个测浮点数的函数. #include <stdio.h> //for printf int lengt ...
OpenGL投影矩阵(Projection Matrix)构造方法
(翻译,图片也来自原文) 一.概述绝大部分计算机的显示器是二维的(a 2D surface).在OpenGL中一个3D场景需要被投影到屏幕上成为一个2D图像(image).这称为投影变换(参见这或这 ...
各开源协议BSD、Apache Licence 2.0、GPL
以下是上述协议的简单介绍:BSD开源协议BSD开源协议是一个给于使用者很大自由的协议.基本上使用者可以"为所欲为",可以自由的使用,修改源代码,也可以将修改后的代码作为开源或者专有 ...
java数组之binarySearch查找
/** * 1.如果找到目标对象则返回<code>[公式:-插入点-1]</code> * 插入点:第一个大与查找对象的元素在数组中的位置,如果数组中的所有元素都小于要查找的对 ...
RocketMQ踩坑记
一.前言现在的主流消息队列基本都是kafka.RabbitMQ和RocketMQ,只有了解各自的优缺点才能在不同的场景选择合适的MQ,对比图如下: MQ对比图本篇文章主要介绍我自己在跑官方demo ...
在.NET Core中使用Channel（三）
到目前为止,我们一直在使用所谓的"Unbounded"通道.你会注意到,当我们创建通道时,我们这样做: var myChannel = Channel.CreateUnbounde ...
ACL技术（访问控制列表）
• Access Control List • 访问控制列表 • 是一种包过滤技术 • ACL基于IP包头的IP地址.四层TCP/UDP头部的端口号.[五层数据]进行过滤 • ...