Column-store compression

At a high level, doc values are essentially a serialized column-store. As we discussed in the last section, column-stores excel at certain operations because the data is naturally laid out in a fashion that is amenable to those queries.

But they also excel at compressing data, particularly numbers. This is important for both saving space on disk and for faster access. Modern CPU’s are many orders of magnitude faster than disk drives (although the gap is narrowing quickly with upcoming NVMe drives). That means it is often advantageous to minimize the amount of data that must be read from disk, even if it requires extra CPU cycles to decompress.

To see how it can help compression, take this set of doc values for a numeric field:

Doc      Terms
-----------------------------------------------------------------
Doc_1 | 100
Doc_2 | 1000
Doc_3 | 1500
Doc_4 | 1200
Doc_5 | 300
Doc_6 | 1900
Doc_7 | 4200
-----------------------------------------------------------------

The column-stride layout means we have a contiguous block of numbers:[100,1000,1500,1200,300,1900,4200].

xxx

Doc values use several tricks like this. In order, the following compression schemes are checked:

  1. If all values are identical (or missing), set a flag and record the value
  2. If there are fewer than 256 values, a simple table encoding is used
  3. If there are > 256 values, check to see if there is a common divisor
  4. If there is no common divisor, encode everything as an offset from the smallest value

You’ll note that these compression schemes are not "traditional" general purpose compression like DEFLATE or LZ4. Because the structure of column-stores are rigid and well-defined, we can achieve higher compression by using specialized schemes rather than the more general compression algorithms like LZ4.

You may be thinking "Well that’s great for numbers, but what about strings?" Strings are encoded similarly, with the help of an ordinal table. The strings are de-duplicated and sorted into a table, assigned an ID, and then those ID’s are used as numeric doc values. Which means strings enjoy many of the same compression benefits that numerics do.

The ordinal table itself has some compression tricks, such as using fixed, variable or prefix-encoded strings.

   

摘自:https://www.elastic.co/guide/en/elasticsearch/guide/current/_deep_dive_on_doc_values.html

列存储压缩技巧,除公共除数或者同时减去最小数,字符串压缩的话,直接去重后用数字ID压缩的更多相关文章

  1. ES doc_values介绍——本质是field value的列存储,做聚合分析用,ES默认开启,会占用存储空间(列存储压缩技巧,除公共除数或者同时减去最小数,字符串压缩的话,直接去重后用数字ID压缩)

    doc_values Doc values are the on-disk data structure, built at document index time, which makes this ...

  2. Oracle 12.1.0.2 New Feature翻译学习【In-Memory column store内存列存储】【原创】

    翻译没有追求信达雅,不是为了学英语翻译,是为了快速了解新特性,如有语义理解错误可以指正.欢迎加微信12735770或QQ12735770探讨oracle技术问题:) In-Memory Column ...

  3. SQL Server 列存储索引 第三篇:维护

    列存储索引分为两种类型:聚集的列存储索引和非聚集的列存储索引,在一个表上只能创建一个聚集索引,要么是聚集的列存储索引,要么是聚集的行存储索引,然而一个表上可以创建多个非聚集索引. 一,创建列存储索引 ...

  4. lucene底层数据结构——FST,针对field使用列存储,delta encode压缩doc ids数组,LZ4压缩算法

    参考: http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal http://www.slideshare.ne ...

  5. Druid(准)实时分析统计数据库——列存储+高效压缩

    Druid是一个开源的.分布式的.列存储系统,特别适用于大数据上的(准)实时分析统计.且具有较好的稳定性(Highly Available). 其相对比较轻量级,文档非常完善,也比较容易上手. Dru ...

  6. 腾讯Hermes设计概要——数据分析用的是列存储,词典文件前缀压缩,倒排文件递增id、变长压缩、依然是跳表-本质是lucene啊

    转自:http://data.qq.com/article?id=817 三.Hermes设计概要 架构描述 系统核心进程均采用分散化设计,根据业务发展需求,可随意扩缩容机器; 周期性数据直接通过td ...

  7. 列存储段消除(ColumnStore Segment Elimination)

    列存储索引是好的!对于数据仓库和报表工作量,它们是真正的性能加速器.与聚集列存储结合,你会在常规行存储索引(聚集索引,非聚集索引)上获得巨大的压缩好处.而且创建聚集列存储索引非常简单: CREATE ...

  8. SQL Server 2014聚集列存储索引

    转发请注明引用和原文博客(http://www.cnblogs.com/wenBlog) 简介 之前已经写过两篇介绍列存储索引的文章,但是只有非聚集列存储索引,今天再来简单介绍一下聚集的列存储索引,也 ...

  9. SQL Server 2014新特性探秘(3)-可更新列存储聚集索引

    简介      列存储索引其实在在SQL Server 2012中就已经存在,但SQL Server 2012中只允许建立非聚集列索引,这意味着列索引是在原有的行存储索引之上的引用了底层的数据,因此会 ...

随机推荐

  1. 【Python+selenium Wendriver API】之下拉框定位

    上代码: # coding:utf-8 from selenium import webdriver from selenium.webdriver.common.action_chains impo ...

  2. 安装virtualBox 增强包

    1 在原始操作系统安装. 2 打开USB设置. 3 运行虚拟机中的Linux中,Device->install guest additions 再安装增强包. 4 插入U盘,如果这时可以看到U盘 ...

  3. SDOI2012 Round1 day2 拯救小云公主(dis)解题报告

    #include<cstdio> #include<cmath> #include<iostream> using namespace std; typedef l ...

  4. CentOS7.1安装 Vsftpd FTP 服务器

    # yum install vsftpd 安装 Vsftpd FTP 编辑配置文件 ‘/etc/vsftpd/vsftpd.conf’ 用于保护 vsftpd. # vi /etc/vsftpd/vs ...

  5. WIn10远程:mstsc:出现身份验证错误,要求的函数不支持, 这可能是由于CredSSP加密Oracle修正

    a.单击 开始 > 运行,输入 regedit,单击 确定. b.定位到 HKLM\Software\Microsoft\Windows\CurrentVersion\Policies\Syst ...

  6. debian安装oracle jdk

    1 去官网下载linux jdk https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.htm ...

  7. TFS 中工作项的定制-修改工作流

    我们都会用到TFS中的工作项.一般来说,最主要的会用到任务.bug这些工作流来进行项目管理里.但我们发现,实际上,有些模板中的工作流并不能完全符合我们的需要,因此我们会进行工作流的定制操作.下面就会通 ...

  8. 我的Android进阶之旅------>MIME类型大全

    今天在实现一个安装apk的代码中看到一段代码为:application/vnd.android.package-archive,不知其意,所以百度了一下,了解到这是一种MIME的类型,代表apk类型. ...

  9. 使用oracle10g官方文档找到监听文件(listener.ora)的模板

    ***********************************************声明*************************************************** ...

  10. 存储过程 & 触发器

    触发器主要是通过事件进行触发而被执行的,而存储过程可以通过存储过程名字而被直接调用.当对某一表进行诸如UPDATE. INSERT. DELETE 这些操作时, 就会自动执行触发器所定义的SQL 语句 ...