Frame of Reference and Roaring Bitmaps
https://www.elastic.co/cn/blog/frame-of-reference-and-roaring-bitmaps
2015年2月18日Engineering
Frame of Reference and Roaring Bitmaps
Postings lists
While it may surprise you if you are new to search engine internals, one of the most important building blocks of a search engine is the ability to efficiently compress and quickly decode sorted lists of integers. Why is this useful? As you may know, Elasticsearch shards, which are Lucene indices under the hood, split the data that they store into segments which are regularly merged together. Inside each segment, documents are given an identifier between 0 and the number of documents in the segment (up to 231-1). This is conceptually like an index in an array: it is stored nowhere but is enough to identity an item. Segments store data about documents sequentially, and a doc ID is the index of a document in a segment. So the first document in a segment would have a doc ID of 0, the second 1, etc. until the last document, which has a doc ID equal to the total number of documents in the segment minus one.
Why are these doc IDs useful? An inverted index needs to map terms to the list of documents that contain this term, called a postings list, and these doc IDs that we just discussed are a perfect fit since they can be compressed efficiently.
Frame Of Reference
In order to be able to compute intersections and unions efficiently, we require that these postings lists are sorted. A nice side-effect of this decision is that postings lists can be compressed with delta-encoding.
For instance, if your postings list is [73, 300, 302, 332, 343, 372], the list of deltas would be [73, 227, 2, 30, 11, 29]. What is interesting to note here is that all deltas are between 0 and 255, so you only need one byte per value. This is the technique that Lucene is using in order to encode your inverted index on disk: postings lists are split into blocks of 256 doc IDs and then each block is compressed separately using delta-encoding and bit packing: Lucene computes the maximum number of bits required to store deltas in a block, adds this information to the block header, and then encodes all deltas of the block using this number of bits. This encoding technique is known as Frame Of Reference (FOR) in the literature and has been used since Lucene 4.1.
Here is an example with a block size of 3 (instead of 256 in practice):
Frame of Reference and Roaring Bitmaps的更多相关文章
- OD: Register, Stack Frame, Function Reference
几个重要的 Win32 寄存器 EIP 指令寄存器(Extended Instruction Pointer) 存放一个指针,指向下一条等待执行的指令地址 ESP 栈指针寄存器(Extended St ...
- Elasticsearch 通关教程(七): Elasticsearch 的性能优化
硬件选择 Elasticsearch(后文简称 ES)的基础是 Lucene,所有的索引和文档数据是存储在本地的磁盘中,具体的路径可在 ES 的配置文件../config/elasticsearch. ...
- Elasticsearch 技术分析(九):Elasticsearch的使用和原理总结
前言 之前已经分享过Elasticsearch的使用和原理的知识,由于近期在公司内部做了一次内部分享,所以本篇主要是基于之前的博文的一个总结,希望通过这篇文章能让读者大致了解Elasticsearch ...
- 全文搜索引擎Elasticsearch详细介绍
我们生活中的数据总体分为两种:结构化数据 和 非结构化数据. 结构化数据:也称作行数据,是由二维表结构来逻辑表达和实现的数据,严格地遵循数据格式与长度规范,主要通过关系型数据库进行存储和管理.指具有固 ...
- L ==> E · L · K
三剑客:Elastic Stack 在学习ELK前,先对 Lucene作基本了解. 今天才知道关系型数据库的索引是 B-Tree,罪过... 减少磁盘寻道次数 ---> 提高查询性能 Lucen ...
- 带你走进神一样的Elasticsearch索引机制
更多精彩内容请看我的个人博客 前言 相比于大多数人熟悉的MySQL数据库的索引,Elasticsearch的索引机制是完全不同于MySQL的B+Tree结构.索引会被压缩放入内存用于加速搜索过程,这一 ...
- Busting Frame Busting: a Study of Clickjacking Vulnerabilities on Popular Sites
Busting Frame Busting Reference From: http://seclab.stanford.edu/websec/framebusting/framebust.pdf T ...
- Frames of Reference参考框架
Frames of Reference参考框架 When describing the position and orientation of something (for example, your ...
- Elasticsearch索引原理
转载 http://blog.csdn.net/endlu/article/details/51720299 最近在参与一个基于Elasticsearch作为底层数据框架提供大数据量(亿级)的实时统计 ...
随机推荐
- 浅入kubernetes(1):Kubernetes 入门基础
目录 Kubernetes 入门基础 Introduction basic of kubernetes What Is Kubernetes? Components of Kubernetes Kub ...
- HashMap的循环姿势你真的掌握了吗?
hashMap 应该是java程序员工作中用的比较多的一个键值对处理的数据的类型了.这种数据类型一般都会有增删查的方法,今天我们就来看看它的循环方法以前写过一篇关于ArrayList的循环效率问题&l ...
- [leetcode]49. Group Anagrams重排列字符串分组
是之前的重排列字符串的延伸,判断是重排列后存到HashMap中进行分组 这种HashMap进行分组的方式很常用 public List<List<String>> groupA ...
- JAVA基础之接口
接口 学习完框架之后,整合SSM过程中对于接口的认识加深了许多.根据<java核心技术>这本书进一步研究了一下. 1.概念 java核心技术是这样说的:"在Java程序设计中,接 ...
- 【陪你系列】5 千字长文+ 30 张图解 | 陪你手撕 STL 空间配置器源码
大家好,我是小贺. 点赞再看,养成习惯 文章每周持续更新,可以微信搜索「herongwei」第一时间阅读和催更,本文 GitHub https://github.com/rongweihe/MoreT ...
- JAVA数据结构(十一)—— 堆及堆排序
堆 堆基本介绍 堆排序是利用堆这种数据结构而设计的一种排序算法,堆排序是一种选择排序,最坏,最好,平均时间复杂度都是O(nlogn),不稳定的排序 堆是具有以下性质的完全二叉树:每个节点的值都大于或等 ...
- java数组之system.arrayCopy
public class ArrayDemo { /* public static void main(String[] args) { int[] a=new int[4]; int[] b=new ...
- 【函数分享】每日PHP函数分享(2021-1-12)
str_pad() 使用另一个字符串填充字符串为指定长度 . string str_pad ( string $input, int $pad_length[, string $pad_string= ...
- 一网打尽,一文讲通虚拟机VirtualBox及Linux使用
本文将从虚拟机的选择.安装.Linux系统安装.SSH客户端工具使用四个方面来详细介绍Linux系统在虚拟机下的安装及使用方法,为你在虚拟机下正常使用Linux保驾护航. 1.虚拟机的选择 在讲虚拟机 ...
- FastApi学习(二)
前言 继续学习 此为第二篇, 还差些知识点就可以结束, 更多的比如用户的身份校验/ swagger 文档修改等以后会单独写 正文 使用枚举来限定参数 可以使用枚举的方式来限定参数为某几个值之内才通过 ...