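Note: the test data consists entirely of repeated values, so the approach does not generalize, and this article is of limited value.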

Excerpted from: https://segmentfault.com/a/1190000002695169

Doc Values compress repeated content when storing it. Take a simple mapping like this:

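mappings = {
    'testdata': {
        # Do not store the original document or the catch-all field
        '_source': {'enabled': False},
        '_all': {'enabled': False},
        'properties': {
            'name': {
                'type': 'string',
                'index': 'no',   # not searchable, so no inverted index for this field
                'store': False,
                'dynamic': 'strict',
                # Keep the field values on disk as doc values
                'fielddata': {'format': 'doc_values'}
            }
        }
    }
}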

Insert one million rows of randomly chosen, repeated values:

import random

words = ['hello', 'world', 'there', 'here']

def read_test_data_in_batches():
    """Yield bulk-index actions in batches, one million rows in total."""
    batch = []
    for i in range(10000 * 100):
        if i % 50000 == 0:
            print(i)  # progress indicator
        if len(batch) > 10000:
            yield batch
            batch = []
        batch.append({
            '_index': 'wentao-test-doc-values',
            '_type': 'testdata',
            '_source': {'name': random.choice(words)}
        })
    print(i)
    yield batch  # remaining rows
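
The original does not show how the batches are sent to Elasticsearch. A minimal sketch of a consumer follows, assuming the sender is passed in as a callable; `index_in_batches`, `send`, and the demo data are hypothetical names introduced here for illustration, not part of the original experiment.

```python
from typing import Callable, Iterable, List


def index_in_batches(batches: Iterable[List[dict]],
                     send: Callable[[List[dict]], None]) -> int:
    """Pass each non-empty batch to `send`; return the number of actions sent."""
    total = 0
    for batch in batches:
        if not batch:  # skip an empty trailing batch
            continue
        send(batch)
        total += len(batch)
    return total


# Stand-in sender that only records batch sizes (hypothetical, no cluster needed).
sent_sizes = []
demo_batches = [
    [{'_source': {'name': 'hello'}}] * 3,
    [{'_source': {'name': 'world'}}] * 2,
    [],
]
total_sent = index_in_batches(demo_batches, lambda b: sent_sizes.append(len(b)))
print(total_sent, sent_sizes)
```

With the official Python client, `send` could plausibly be `lambda batch: elasticsearch.helpers.bulk(es, batch)`, where `es` is a connected client instance; that wiring is an assumption about the setup rather than something the source confirms.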

Disk usage:

size: 28.5Mi (28.5Mi)
docs: 1,000,000 (1,000,000)

Now make each word longer and again insert one million rows:

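import random

# Each word repeated 100 times, so every value is 400-500 characters long
words = ['hello' * 100, 'world' * 100, 'there' * 100, 'here' * 100]

def read_test_data_in_batches():
    """Yield bulk-index actions in batches, one million rows in total."""
    batch = []
    for i in range(10000 * 100):
        if i % 50000 == 0:
            print(i)  # progress indicator
        if len(batch) > 10000:
            yield batch
            batch = []
        batch.append({
            '_index': 'wentao-test-doc-values',
            '_type': 'testdata',
            '_source': {'name': random.choice(words)}
        })
    print(i)
    yield batch  # remaining rows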

Disk usage goes down instead of up:

size: 14.4Mi (14.4Mi)
docs: 1,000,000 (1,000,000)

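This shows that Lucene compresses these strings when it stores them column-wise under the hood. In a commercial columnar database, even an optimization this small would be heavily advertised as "dictionary encoding".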
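
To make the dictionary-encoding idea concrete, here is a simplified model in Python. It is a sketch under stated assumptions, not Lucene's actual on-disk format: each distinct term is stored once, and every document keeps only a fixed-width ordinal pointing into that term list. The function names and the size estimate are illustrative.

```python
import math
import random


def dictionary_encode(values):
    """Return (sorted unique terms, per-document ordinals into that term list)."""
    terms = sorted(set(values))
    ordinal_of = {term: i for i, term in enumerate(terms)}
    return terms, [ordinal_of[v] for v in values]


def estimated_bytes(terms, ordinals):
    """Rough size: each term stored once plus fixed-width packed ordinals."""
    bits_per_ordinal = max(1, math.ceil(math.log2(len(terms)))) if terms else 0
    dictionary_bytes = sum(len(t.encode('utf-8')) for t in terms)
    ordinal_bytes = math.ceil(len(ordinals) * bits_per_ordinal / 8)
    return dictionary_bytes + ordinal_bytes


random.seed(0)
words = ['hello', 'world', 'there', 'here']
column = [random.choice(words) for _ in range(100000)]

terms, ordinals = dictionary_encode(column)
raw_bytes = sum(len(v) for v in column)          # every value stored in full
encoded_bytes = estimated_bytes(terms, ordinals)  # 4 terms + 2 bits per row
decoded = [terms[o] for o in ordinals]            # encoding is lossless

print(raw_bytes, encoded_bytes)
```

In this model the per-row cost depends only on the number of distinct terms, not on how long each term is, which is why a column with four distinct values stays small even when the values themselves are long. The model does not by itself explain why the measured size above dropped from 28.5Mi to 14.4Mi.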

Nested Document

Experiments show that packing many small documents into one large nested document reduces storage space. Change the earlier mapping to this:

mappings = {
    'testdata': {
        '_source': {'enabled': False},
        '_all': {'enabled': False},
        'properties': {
            # Each parent document holds a list of nested child objects
            'children': {
                'type': 'nested',
                'properties': {
                    'name': {
                        'type': 'string',
                        'index': 'no',
                        'store': False,
                        'dynamic': 'strict',
                        'fielddata': {'format': 'doc_values'}
                    }
                }
            }
        }
    }
}

Again insert one million rows, but pack every thousand rows into one large document:

import random

words = ['hello', 'world', 'there', 'here']

def read_test_data_in_batches():
    """Yield one parent document per 1,000 child rows, one million rows in total."""
    batch = []
    for i in range(10000 * 100):
        if i % 50000 == 0:
            print(i)  # progress indicator
        if len(batch) >= 1000:  # flush once 1,000 children have accumulated
            yield [{
                '_index': 'wentao-test-doc-values2',
                '_type': 'testdata',
                '_source': {'children': batch}
            }]
            batch = []
        batch.append({'name': random.choice(words)})
    print(i)
    # Remaining children go into the last parent document
    yield [{
        '_index': 'wentao-test-doc-values2',
        '_type': 'testdata',
        '_source': {'children': batch}
    }]

Disk usage:

size: 2.47Mi (2.47Mi)
docs: 1,001,000 (1,001,000)

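The document count did not shrink (it is 1,001,000 because each nested child is still indexed as its own Lucene document, plus 1,000 parent documents), yet disk usage is only 2.47Mi. This presumably benefits from Lucene's internal storage optimizations for nested documents.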
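
The reported count of 1,001,000 can be checked with a small model of how a nested document is laid out: a contiguous block with the children first and the parent last. This is a simplified sketch; `flatten_nested` and `lucene_doc_count` are hypothetical helpers introduced for illustration, not Elasticsearch or Lucene APIs.

```python
def lucene_doc_count(num_children: int, children_per_parent: int) -> int:
    """Count index documents when children are grouped under nested parents."""
    # Ceiling division: the last parent may hold fewer children than the rest.
    num_parents = -(-num_children // children_per_parent)
    return num_children + num_parents


def flatten_nested(parent_id, children):
    """Model one nested document as a block: children first, parent last."""
    block = [{'kind': 'child', 'parent': parent_id, **c} for c in children]
    block.append({'kind': 'parent', 'id': parent_id})
    return block


block = flatten_nested(0, [{'name': 'hello'}, {'name': 'world'}])
total_docs = lucene_doc_count(1_000_000, 1000)
print(len(block), total_docs)
```

Grouping by 1,000 or by 1,001 children per parent gives the same total in this model, since both need 1,000 parents to hold one million children.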

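Elasticsearch index compression: Lucene's inverted index is essentially columnar storage, and nested documents can greatly improve the compression ratio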