Column-store compression

At a high level, doc values are essentially a serialized column-store. As we discussed in the last section, column-stores excel at certain operations because the data is naturally laid out in a fashion that is amenable to those queries.

But they also excel at compressing data, particularly numbers. This is important for both saving space on disk and for faster access. Modern CPU’s are many orders of magnitude faster than disk drives (although the gap is narrowing quickly with upcoming NVMe drives). That means it is often advantageous to minimize the amount of data that must be read from disk, even if it requires extra CPU cycles to decompress.

To see how it can help compression, take this set of doc values for a numeric field:

Doc      Terms
-----------------------------------------------------------------
Doc_1 | 100
Doc_2 | 1000
Doc_3 | 1500
Doc_4 | 1200
Doc_5 | 300
Doc_6 | 1900
Doc_7 | 4200
-----------------------------------------------------------------

The column-stride layout means we have a contiguous block of numbers:[100,1000,1500,1200,300,1900,4200].

xxx

Doc values use several tricks like this. In order, the following compression schemes are checked:

  1. If all values are identical (or missing), set a flag and record the value
  2. If there are fewer than 256 values, a simple table encoding is used
  3. If there are > 256 values, check to see if there is a common divisor
  4. If there is no common divisor, encode everything as an offset from the smallest value

You’ll note that these compression schemes are not "traditional" general purpose compression like DEFLATE or LZ4. Because the structure of column-stores are rigid and well-defined, we can achieve higher compression by using specialized schemes rather than the more general compression algorithms like LZ4.

You may be thinking "Well that’s great for numbers, but what about strings?" Strings are encoded similarly, with the help of an ordinal table. The strings are de-duplicated and sorted into a table, assigned an ID, and then those ID’s are used as numeric doc values. Which means strings enjoy many of the same compression benefits that numerics do.

The ordinal table itself has some compression tricks, such as using fixed, variable or prefix-encoded strings.

   

摘自:https://www.elastic.co/guide/en/elasticsearch/guide/current/_deep_dive_on_doc_values.html

列存储压缩技巧,除公共除数或者同时减去最小数,字符串压缩的话,直接去重后用数字ID压缩的更多相关文章

  1. ES doc_values介绍——本质是field value的列存储,做聚合分析用,ES默认开启,会占用存储空间(列存储压缩技巧,除公共除数或者同时减去最小数,字符串压缩的话,直接去重后用数字ID压缩)

    doc_values Doc values are the on-disk data structure, built at document index time, which makes this ...

  2. Oracle 12.1.0.2 New Feature翻译学习【In-Memory column store内存列存储】【原创】

    翻译没有追求信达雅,不是为了学英语翻译,是为了快速了解新特性,如有语义理解错误可以指正.欢迎加微信12735770或QQ12735770探讨oracle技术问题:) In-Memory Column ...

  3. SQL Server 列存储索引 第三篇:维护

    列存储索引分为两种类型:聚集的列存储索引和非聚集的列存储索引,在一个表上只能创建一个聚集索引,要么是聚集的列存储索引,要么是聚集的行存储索引,然而一个表上可以创建多个非聚集索引. 一,创建列存储索引 ...

  4. lucene底层数据结构——FST,针对field使用列存储,delta encode压缩doc ids数组,LZ4压缩算法

    参考: http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal http://www.slideshare.ne ...

  5. Druid(准)实时分析统计数据库——列存储+高效压缩

    Druid是一个开源的.分布式的.列存储系统,特别适用于大数据上的(准)实时分析统计.且具有较好的稳定性(Highly Available). 其相对比较轻量级,文档非常完善,也比较容易上手. Dru ...

  6. 腾讯Hermes设计概要——数据分析用的是列存储,词典文件前缀压缩,倒排文件递增id、变长压缩、依然是跳表-本质是lucene啊

    转自:http://data.qq.com/article?id=817 三.Hermes设计概要 架构描述 系统核心进程均采用分散化设计,根据业务发展需求,可随意扩缩容机器; 周期性数据直接通过td ...

  7. 列存储段消除(ColumnStore Segment Elimination)

    列存储索引是好的!对于数据仓库和报表工作量,它们是真正的性能加速器.与聚集列存储结合,你会在常规行存储索引(聚集索引,非聚集索引)上获得巨大的压缩好处.而且创建聚集列存储索引非常简单: CREATE ...

  8. SQL Server 2014聚集列存储索引

    转发请注明引用和原文博客(http://www.cnblogs.com/wenBlog) 简介 之前已经写过两篇介绍列存储索引的文章,但是只有非聚集列存储索引,今天再来简单介绍一下聚集的列存储索引,也 ...

  9. SQL Server 2014新特性探秘(3)-可更新列存储聚集索引

    简介      列存储索引其实在在SQL Server 2012中就已经存在,但SQL Server 2012中只允许建立非聚集列索引,这意味着列索引是在原有的行存储索引之上的引用了底层的数据,因此会 ...

随机推荐

  1. 【Python + Selenium】之JS定位总结

    感谢:小琰子 Python+Selenium 脚本中的一些js的用法汇总: 1.滚动条 driver.set_window_size(500,500) js = "window.scroll ...

  2. 【Postman】接口测试工具:在谷歌浏览器安装插件方法以及使用说明

    安装插件方法: <如何在谷歌浏览器chrome中离线安装.crx扩展程序的三种方法?> <postman chrome插件的安装与使用> 下载地址:http://www.cnp ...

  3. Oracle关于快速缓存区应用原理

    为什么oracle可以对于大量数据进行訪问时候能彰显出更加出色表现,就是通过所谓的快速缓存来实现数据的快速运算与操作.在之前的博文中我已经说过sql的运行原理,当我们訪问数据库的数据时候,首先不是从数 ...

  4. history命令使用方法详解

    history是一条非常实用的shell命令,可以显示出之前在shell中运行的命令,配合last显示之前登录的用户,就可以追溯是哪个用户执行了某些命令.以下详细说明history使用中常见的命令或技 ...

  5. Muduo网络库源代码分析(四)EventLoopThread和EventLoopThreadPool的封装

    muduo的并发模型为one loop per thread+ threadpool.为了方便使用,muduo封装了EventLoop和Thread为EventLoopThread,为了方便使用线程池 ...

  6. poj2891(线性同余方程组)

    一个exgcd解决一个线性同余问题,多个exgcd解决线性同余方程组. Strange Way to Express Integers Time Limit: 1000MS   Memory Limi ...

  7. nginx中使用waf防火墙

    1.安装依赖 yum install -y readline-devel ncurses-devel 2.安装Lua # .tar.gz # cd lua- # make linux # make i ...

  8. Latent Activity Trajectory (LAT)

    https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/funcZone_TKDE_Zheng.pdf Specific ...

  9. 【python】-- 递归函数、高阶函数、嵌套函数、匿名函数

    递归函数 在函数内部,可以调用其他函数.但是在一个函数在内部调用自身,这个函数被称为递归函数 def calc(n): print(n) if int(n/2) == 0: #结束符 return n ...

  10. mongodb 的注意点

    昨天同事安装mongodb遇到了些问题,问了下我,后拉发现都是些细节没注意(讲道理这应该是很简单,一顿操作就ok的事情) 首先,下载 mongo包, 然后 ,解压安装, 启动之. 问题就出现在他后台启 ...