ES doc_values介绍——本质是field value的列存储,做聚合分析用,ES默认开启,会占用存储空间(列存储压缩技巧,除公共除数或者同时减去最小数,字符串压缩的话,直接去重后用数字ID压缩)
doc_values
Doc values are the on-disk data structure, built at document index time, which makes this data access pattern possible. They store the same values as the _source but in a column-oriented fashion that is way more efficient for sorting and aggregations.(本质!!!) Doc values are supported on almost all field types, with the notable exception of analyzed string fields.
All fields which support doc values have them enabled by default. If you are sure that you don’t need to sort or aggregate on a field, or access the field value from a script, you can disable doc values in order to save disk space:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"status_code": {
"type": "keyword"
},
"session_id": {
"type": "keyword",
"doc_values": false
}
}
}
}
}
| 
 
  | 
 The   | 
| 
 
  | 
 The   | 
摘自:https://www.elastic.co/guide/en/elasticsearch/reference/current/doc-values.html
Column-store compression
At a high level, doc values are essentially a serialized column-store. As we discussed in the last section, column-stores excel at certain operations because the data is naturally laid out in a fashion that is amenable to those queries.
But they also excel at compressing data, particularly numbers. This is important for both saving space on disk and for faster access. Modern CPU’s are many orders of magnitude faster than disk drives (although the gap is narrowing quickly with upcoming NVMe drives). That means it is often advantageous to minimize the amount of data that must be read from disk, even if it requires extra CPU cycles to decompress.
To see how it can help compression, take this set of doc values for a numeric field:
Doc Terms
-----------------------------------------------------------------
Doc_1 | 100
Doc_2 | 1000
Doc_3 | 1500
Doc_4 | 1200
Doc_5 | 300
Doc_6 | 1900
Doc_7 | 4200
-----------------------------------------------------------------
The column-stride layout means we have a contiguous block of numbers:[100,1000,1500,1200,300,1900,4200].
xxx
Doc values use several tricks like this. In order, the following compression schemes are checked:
- If all values are identical (or missing), set a flag and record the value
 - If there are fewer than 256 values, a simple table encoding is used
 - If there are > 256 values, check to see if there is a common divisor
 - If there is no common divisor, encode everything as an offset from the smallest value
 
You’ll note that these compression schemes are not "traditional" general purpose compression like DEFLATE or LZ4. Because the structure of column-stores are rigid and well-defined, we can achieve higher compression by using specialized schemes rather than the more general compression algorithms like LZ4.
You may be thinking "Well that’s great for numbers, but what about strings?" Strings are encoded similarly, with the help of an ordinal table. The strings are de-duplicated and sorted into a table, assigned an ID, and then those ID’s are used as numeric doc values. Which means strings enjoy many of the same compression benefits that numerics do.
The ordinal table itself has some compression tricks, such as using fixed, variable or prefix-encoded strings.
摘自:https://www.elastic.co/guide/en/elasticsearch/guide/current/_deep_dive_on_doc_values.html
ES doc_values介绍——本质是field value的列存储,做聚合分析用,ES默认开启,会占用存储空间(列存储压缩技巧,除公共除数或者同时减去最小数,字符串压缩的话,直接去重后用数字ID压缩)的更多相关文章
- 列存储压缩技巧,除公共除数或者同时减去最小数,字符串压缩的话,直接去重后用数字ID压缩
		
Column-store compression At a high level, doc values are essentially a serialized column-store. As w ...
 - ES doc_values介绍2——本质是field value的列存储,做聚合分析用,ES默认开启,会占用存储空间
		
一.doc_values介绍 doc values是一个我们再三重复的重要话题了,你是否意识到一些东西呢? 搜索时,我们需要一个“词”到“文档”列表的映射 排序时,我们需要一个“文档”到“词“列表的映 ...
 - ES doc_values的来源,field data——就是doc->terms的正向索引啊,不过它是在查询阶段通过读取倒排索引loading segments放在内存而得到的?
		
Support in the Wild: My Biggest Elasticsearch Problem at Scale Java Heap Pressure Elasticsearch has ...
 - ES系列十四、ES聚合分析(聚合分析简介、指标聚合、桶聚合)
		
一.聚合分析简介 1. ES聚合分析是什么? 聚合分析是数据库中重要的功能特性,完成对一个查询的数据集中数据的聚合计算,如:找出某字段(或计算表达式的结果)的最大值.最小值,计算和.平均值等.ES作为 ...
 - CQRS\ES架构介绍
		
大家好,我叫汤雪华.我平时工作使用Java,业余时间喜欢用C#做点开源项目,如ENode, EQueue.我个人对DDD领域驱动设计.CQRS架构.事件溯源(Event Sourcing,简称ES). ...
 - MYSQL删除表的记录后如何使ID从1开始
		
MYSQL删除表的记录后如何使ID从1开始 MYSQL删除表的记录后如何使ID从1开始 http://hi.baidu.com/289766516/blog/item/a3f85500556e2c09 ...
 - es简单介绍及使用注意事项
		
是什么? Elasticsearch是一个基于Apache Lucene(TM)的开源搜索引擎.无论在开源还是专有领域,Lucene可以被认为是迄今为止最先进.性能最好的.功能最全的搜索引擎库. El ...
 - 在数据库中使用数字ID作为主键的表生成主键方法
		
在数据库开发中,很多时候建一个表的时候会使用一个数字类型来作为主键,使用自增长类型自然会更方便,只是本人从来不喜欢有内容不在自己掌控之中,况且自增长类型在进行数据库复制时会比较麻烦.所以本人一直使用自 ...
 - Oracle 去重后排序
		
因项目需求,需要将查询结果,去重后,在按照主键(自增列)排序,百度一番,记录下来 DEMO SELECT * FROM (SELECT ROW_NUMBER() OVER(PARTITION BY S ...
 
随机推荐
- nginx大量TIME_WAIT的解决办法   netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
			
vi /etc/sysctl.conf net.ipv4.tcp_syncookies = 1 net.ipv4.tcp_tw_reuse=1 #让TIME_WAIT状态可以重用,这样即使TIME_W ...
 - oracle10g安装问题
			
oracle10g的安装还是比较容易的,一直下一步就行了,但是今天安装的时候遇到了一个新问题,在安装的过程中提示提示一些 Configuration Assistant失败刚开始,我直接跳过去,但后面 ...
 - Android Studio 2.0 稳定版新特性介绍
			
Android Studio 2.0 最终迎来了稳定版本号,喜大普奔. 以下这篇文章是2.0新特性的一些简介. 假设想看具体内容请看这里<Android Studio有用指南> 文章转自这 ...
 - Coursera machine learning 第二周 quiz 答案 Octave/Matlab Tutorial
			
https://www.coursera.org/learn/machine-learning/exam/dbM1J/octave-matlab-tutorial Octave Tutorial 5 ...
 - iOS开发---业务逻辑
			
iOS开发---业务逻辑 1. 业务逻辑 iOS的app开发的最终目的是要让用户使用, 用户使用app完成自己的事就是业务逻辑, 业务逻辑的是最显眼开发工作.但是业务逻辑对于开发任务来说, 只是露 ...
 - 【BZOJ4358】permu kd-tree
			
[BZOJ4358]permu Description 给出一个长度为n的排列P(P1,P2,...Pn),以及m个询问.每次询问某个区间[l,r]中,最长的值域连续段长度. Input 第一行两个整 ...
 - 使用tomcat7-maven-plugin部署Web项目
			
一.环境准备 我使用的环境是:Window 10.Tomcat 8.0.36.maven3.tomcat7-maven-plugin 2.2版本. 二.设置环境变量 安装Tomcat8.0.36和 ...
 - 不怕慢 就怕站 不怕单线程 不怕 裸露ip
			
import sys import os import requests import threading from time import sleep from bs4 import Beautif ...
 - CSS 布局实例系列(二)如何通过 CSS 实现一个左边固定宽度、右边自适应的两列布局
			
最近在百度 IFE 训练营中看见的一道题目: 用两种不同的方法来实现一个两列布局,其中左侧部分宽度固定.右侧部分宽度随浏览器宽度的变化而自适应变化 个人总结出如下三种实现思路: 通过绝对定位实现 S ...
 - MyBatis:学习笔记(4)——动态SQL
			
MyBatis:学习笔记(4)——动态SQL 如果使用JDBC或者其他框架,很多时候需要你根据需求手动拼装SQL语句,这是一件非常麻烦的事情.MyBatis提供了对SQL语句动态的组装能力,而且他只有 ...