列存储压缩技巧，除公共除数或者同时减去最小数，字符串压缩的话，直接去重后用数字ID压缩

Column-store compression

At a high level, doc values are essentially a serialized column-store. As we discussed in the last section, column-stores excel at certain operations because the data is naturally laid out in a fashion that is amenable to those queries.

But they also excel at compressing data, particularly numbers. This is important for both saving space on disk and for faster access. Modern CPU’s are many orders of magnitude faster than disk drives (although the gap is narrowing quickly with upcoming NVMe drives). That means it is often advantageous to minimize the amount of data that must be read from disk, even if it requires extra CPU cycles to decompress.

To see how it can help compression, take this set of doc values for a numeric field:

Doc      Terms

-----------------------------------------------------------------

Doc_1 | 100

Doc_2 | 1000

Doc_3 | 1500

Doc_4 | 1200

Doc_5 | 300

Doc_6 | 1900

Doc_7 | 4200

-----------------------------------------------------------------

The column-stride layout means we have a contiguous block of numbers:[100,1000,1500,1200,300,1900,4200].

xxx

Doc values use several tricks like this. In order, the following compression schemes are checked:

If all values are identical (or missing), set a flag and record the value
If there are fewer than 256 values, a simple table encoding is used
If there are > 256 values, check to see if there is a common divisor
If there is no common divisor, encode everything as an offset from the smallest value

You’ll note that these compression schemes are not "traditional" general purpose compression like DEFLATE or LZ4. Because the structure of column-stores are rigid and well-defined, we can achieve higher compression by using specialized schemes rather than the more general compression algorithms like LZ4.

You may be thinking "Well that’s great for numbers, but what about strings?" Strings are encoded similarly, with the help of an ordinal table. The strings are de-duplicated and sorted into a table, assigned an ID, and then those ID’s are used as numeric doc values. Which means strings enjoy many of the same compression benefits that numerics do.

The ordinal table itself has some compression tricks, such as using fixed, variable or prefix-encoded strings.

摘自：https://www.elastic.co/guide/en/elasticsearch/guide/current/_deep_dive_on_doc_values.html

列存储压缩技巧，除公共除数或者同时减去最小数，字符串压缩的话，直接去重后用数字ID压缩的更多相关文章

ES doc_values介绍——本质是field value的列存储，做聚合分析用，ES默认开启，会占用存储空间（列存储压缩技巧，除公共除数或者同时减去最小数，字符串压缩的话，直接去重后用数字ID压缩）
doc_values Doc values are the on-disk data structure, built at document index time, which makes this ...
Oracle 12.1.0.2 New Feature翻译学习【In-Memory column store内存列存储】【原创】
翻译没有追求信达雅,不是为了学英语翻译,是为了快速了解新特性,如有语义理解错误可以指正.欢迎加微信12735770或QQ12735770探讨oracle技术问题:) In-Memory Column ...
SQL Server 列存储索引第三篇：维护
列存储索引分为两种类型:聚集的列存储索引和非聚集的列存储索引,在一个表上只能创建一个聚集索引,要么是聚集的列存储索引,要么是聚集的行存储索引,然而一个表上可以创建多个非聚集索引. 一,创建列存储索引 ...
lucene底层数据结构——FST，针对field使用列存储，delta encode压缩doc ids数组，LZ4压缩算法
参考: http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal http://www.slideshare.ne ...
Druid（准）实时分析统计数据库——列存储+高效压缩
Druid是一个开源的.分布式的.列存储系统,特别适用于大数据上的(准)实时分析统计.且具有较好的稳定性(Highly Available). 其相对比较轻量级,文档非常完善,也比较容易上手. Dru ...
腾讯Hermes设计概要——数据分析用的是列存储，词典文件前缀压缩，倒排文件递增id、变长压缩、依然是跳表-本质是lucene啊
转自:http://data.qq.com/article?id=817 三.Hermes设计概要架构描述系统核心进程均采用分散化设计,根据业务发展需求,可随意扩缩容机器; 周期性数据直接通过td ...
列存储段消除（ColumnStore Segment Elimination）
列存储索引是好的!对于数据仓库和报表工作量,它们是真正的性能加速器.与聚集列存储结合,你会在常规行存储索引(聚集索引,非聚集索引)上获得巨大的压缩好处.而且创建聚集列存储索引非常简单: CREATE ...
SQL Server 2014聚集列存储索引
转发请注明引用和原文博客(http://www.cnblogs.com/wenBlog) 简介之前已经写过两篇介绍列存储索引的文章,但是只有非聚集列存储索引,今天再来简单介绍一下聚集的列存储索引,也 ...
SQL Server 2014新特性探秘(3)-可更新列存储聚集索引
简介列存储索引其实在在SQL Server 2012中就已经存在,但SQL Server 2012中只允许建立非聚集列索引,这意味着列索引是在原有的行存储索引之上的引用了底层的数据,因此会 ...

随机推荐

shiro自定义拦截url
在实际项目上,我们针对不同的用户(guste,user,admin,mobile user)等等,需要进入不同的页面,比如,手机端用户需要进入Mobile/这个路径下的,这个时候,我们需要自定义拦截u ...
详解path和classpath的区别
详解path和classpath的区别 path的作用 path是系统用来指定可执行文件的完整路径,即使不在path中设置JDK的路径也可执行JAVA文件,但必须把完整的路径写出来,如C:\Progr ...
laravel学习之路2： jwt集成
"tymon/jwt-auth": "^1.0@dev", 执行 composer update 'providers' => [ .... Tymon\ ...
Linux进程间通信(一) - 管道
管道(pipe) 普通的Linux shell都允许重定向,而重定向使用的就是管道. 例如:ps | grep vsftpd .管道是单向的.先进先出的.无结构的.固定大小的字节流,它把一个进程的标准 ...
MySQL常见问题和命令
问题: 1.centos MySQL启动失败:关闭selinux, vi /etc/selinux/config, 设置SELINUX=disabled,重启电脑: 命令: 停止.启动mysql服务器 ...
lua(仿类)
Account = { balance = } function Account:deposit(v) self.balance = self.balance + v end function Acc ...
lua注册函数
#include <stdio.h> #include <math.h> #define MAX_COLOR 255 extern "C" { #inclu ...
Educational Codeforces Round 1 (C) （atan2 + long double | 大数）
这题只能呵呵了. 东搞西搞,折腾快一天,最后用了一个800多行的代码AC了. 好好的题目你卡这种精度干啥. 还有要卡您就多卡点行不,为什么long double 又可以过... 废了N长时间写个了不管 ...
使用Zxing.net实现asp.net mvc二维码功能
新建一个html辅助类 public static class HtmlHelperExtensions { , , ) { var barcodeWriter = new BarcodeWriter ...
mysql中的乐观锁和悲观锁
mysql中的乐观锁和悲观锁的简介以及如何简单运用. 关于mysql中的乐观锁和悲观锁面试的时候被问到的概率还是比较大的. mysql的悲观锁: 其实理解起来非常简单,当数据被外界修改持保守态度,包括 ...

列存储压缩技巧，除公共除数或者同时减去最小数，字符串压缩的话，直接去重后用数字ID压缩

列存储压缩技巧，除公共除数或者同时减去最小数，字符串压缩的话，直接去重后用数字ID压缩的更多相关文章

随机推荐

热门专题