https://www.elastic.co/cn/blog/frame-of-reference-and-roaring-bitmaps

http://roaringbitmap.org/

2015年2月18日Engineering

Frame of Reference and Roaring Bitmaps

作者

Postings lists

While it may surprise you if you are new to search engine internals, one of the most important building blocks of a search engine is the ability to efficiently compress and quickly decode sorted lists of integers. Why is this useful? As you may know, Elasticsearch shards, which are Lucene indices under the hood, split the data that they store into segments which are regularly merged together. Inside each segment, documents are given an identifier between 0 and the number of documents in the segment (up to 231-1). This is conceptually like an index in an array: it is stored nowhere but is enough to identity an item. Segments store data about documents sequentially, and a doc ID is the index of a document in a segment. So the first document in a segment would have a doc ID of 0, the second 1, etc. until the last document, which has a doc ID equal to the total number of documents in the segment minus one.

Why are these doc IDs useful? An inverted index needs to map terms to the list of documents that contain this term, called a postings list, and these doc IDs that we just discussed are a perfect fit since they can be compressed efficiently.

Frame Of Reference

In order to be able to compute intersections and unions efficiently, we require that these postings lists are sorted. A nice side-effect of this decision is that postings lists can be compressed with delta-encoding.

For instance, if your postings list is [73, 300, 302, 332, 343, 372], the list of deltas would be [73, 227, 2, 30, 11, 29]. What is interesting to note here is that all deltas are between 0 and 255, so you only need one byte per value. This is the technique that Lucene is using in order to encode your inverted index on disk: postings lists are split into blocks of 256 doc IDs and then each block is compressed separately using delta-encoding and bit packing: Lucene computes the maximum number of bits required to store deltas in a block, adds this information to the block header, and then encodes all deltas of the block using this number of bits. This encoding technique is known as Frame Of Reference (FOR) in the literature and has been used since Lucene 4.1.

Here is an example with a block size of 3 (instead of 256 in practice):

Frame of Reference and Roaring Bitmaps的更多相关文章

  1. OD: Register, Stack Frame, Function Reference

    几个重要的 Win32 寄存器 EIP 指令寄存器(Extended Instruction Pointer) 存放一个指针,指向下一条等待执行的指令地址 ESP 栈指针寄存器(Extended St ...

  2. Elasticsearch 通关教程(七): Elasticsearch 的性能优化

    硬件选择 Elasticsearch(后文简称 ES)的基础是 Lucene,所有的索引和文档数据是存储在本地的磁盘中,具体的路径可在 ES 的配置文件../config/elasticsearch. ...

  3. Elasticsearch 技术分析(九):Elasticsearch的使用和原理总结

    前言 之前已经分享过Elasticsearch的使用和原理的知识,由于近期在公司内部做了一次内部分享,所以本篇主要是基于之前的博文的一个总结,希望通过这篇文章能让读者大致了解Elasticsearch ...

  4. 全文搜索引擎Elasticsearch详细介绍

    我们生活中的数据总体分为两种:结构化数据 和 非结构化数据. 结构化数据:也称作行数据,是由二维表结构来逻辑表达和实现的数据,严格地遵循数据格式与长度规范,主要通过关系型数据库进行存储和管理.指具有固 ...

  5. L ==> E · L · K

    三剑客:Elastic Stack 在学习ELK前,先对 Lucene作基本了解. 今天才知道关系型数据库的索引是 B-Tree,罪过... 减少磁盘寻道次数 ---> 提高查询性能 Lucen ...

  6. 带你走进神一样的Elasticsearch索引机制

    更多精彩内容请看我的个人博客 前言 相比于大多数人熟悉的MySQL数据库的索引,Elasticsearch的索引机制是完全不同于MySQL的B+Tree结构.索引会被压缩放入内存用于加速搜索过程,这一 ...

  7. Busting Frame Busting: a Study of Clickjacking Vulnerabilities on Popular Sites

    Busting Frame Busting Reference From: http://seclab.stanford.edu/websec/framebusting/framebust.pdf T ...

  8. Frames of Reference参考框架

    Frames of Reference参考框架 When describing the position and orientation of something (for example, your ...

  9. Elasticsearch索引原理

    转载 http://blog.csdn.net/endlu/article/details/51720299 最近在参与一个基于Elasticsearch作为底层数据框架提供大数据量(亿级)的实时统计 ...

随机推荐

  1. 还在使用Future轮询获取结果吗?CompletionService快来了解下吧。

    背景 二胖上次写完参数校验(<二胖写参数校验的坎坷之路>)之后,领导一直不给他安排其他开发任务,就一直让他看看代码熟悉业务.二胖每天上班除了偶尔跟坐在隔壁的前端小姐姐聊聊天,就是看看这些 ...

  2. 本地缓存性能之王Caffeine

    前言 随着互联网的高速发展,市面上也出现了越来越多的网站和app.我们判断一个软件是否好用,用户体验就是一个重要的衡量标准.比如说我们经常用的微信,打开一个页面要十几秒,发个语音要几分钟对方才能收到. ...

  3. [LeetCode]501. Find Mode in Binary Search Tree二叉搜索树寻找众数

    这次是二叉搜索树的遍历 感觉只要和二叉搜索树的题目,都要用到一个重要性质: 中序遍历二叉搜索树的结果是一个递增序列: 而且要注意,在递归遍历树的时候,有些参数如果是要随递归不断更新(也就是如果递归返回 ...

  4. 远程控制卡 使用ipmitools设置ipmi

    远程控制卡 使用ipmitools设置ipmi 使用DELL的远程控制卡可以方便的管理服务器 在CentOS中可以使用ipmitools管理 IPMI( Intelligent Platform Ma ...

  5. 关于Iterator

    1.在迭代过程中,用list来删除元素的坑 1 package test; 2 3 import java.util.ArrayList; 4 import java.util.Iterator; 5 ...

  6. “You may need an appropriate loader to handle this file type”

    这里不能为空!!!!!!!!!!!!!!!!!!!!

  7. 看图知义,Winform开发的技术特点分析

    整理一下自己之前的Winform开发要点,以图文的方式展示一些关键性的技术特点,总结一下. 1.主体界面布局 2.权限管理系统 3.工作流模块 4.字典管理 5.通用的附件管理模块 6.系统模块化开发 ...

  8. 2021升级版微服务教程—为什么会有微服务?什么是SpringCloud?

    2021升级版SpringCloud教程从入门到实战精通「H版&alibaba&链路追踪&日志&事务&锁」 教程全目录「含视频」:https://gitee.c ...

  9. navicat for mysql 破解版

    Navicat for MySQL下载地址:Navicat for MySQL 软件和破解程序 第1步.安装Navicat软件,最后点击完成 第2步.安装成功之后还要进行破解.点击patchNavic ...

  10. Ts有限状态机

    ts版本的有限状态机 最近做小游戏要做切换人物状态,花点时间写了一个有限状态机,使用语言为Ts,也可改成自己的语言 按照目前的逻辑,这个可以继续横向扩展,某些做流程管理 先上预览图 Fsm:状态机类 ...