目的

研究聚合查询的BUCKETS桶·到底是如何计算?

PS:es版本为7.8.1

Bucket概念

关于es聚合查询,官方介绍,可以参考 es聚合查询-bucket

有道翻译:

桶聚合不像指标聚合那样计算字段的指标,相反,它们创建文档的桶。每个桶都与一个标准相关联(取决于聚合类型),该标准确定当前上下文中的文档是否“属于”它。换句话说,存储桶有效地定义了文档集。除了存储桶本身,存储桶聚合还计算并返回“落入”每个存储桶的文档数量。

与度量聚合相反,桶聚合可以容纳子聚合。这些子聚合将为它们的“父”桶聚合所创建的桶聚合。

有不同的桶聚合器,每个都有不同的“桶”策略。有的定义单个桶,有的定义固定数量的多个桶,还有的在聚合过程中动态创建桶。

备注:单个响应中允许的最大桶数受名为search.max_buckets的动态集群设置限制。它默认为10,000,尝试返回超过限制的请求将失败并出现异常。

search.max_buckets

官网看下search.max_buckets这个参数:

有道翻译:

search.max_buckets
(Dynamic, integer)单个响应中允许的最大聚合桶数。默认值为10000。 Requests that attempt to return more than this limit will return an error.
试图返回超过此限制的请求将返回错误。

缘起

在一次排查问题中,遇到如下报错日志:

trying to create too many buckets. must be less than or equal to: [10000] but was [10001].

关于以上问题的分析以及原因可参看我的这篇实战分析博文进行了解:trying to create too many buckets,本篇博文,我主要是要来验证一下search.max_buckets这个配置项的计算桶的个数究竟是如何进行统计算桶数的。

数据准备

1、创建测试索引库(PUT请求)

注意:此处建库有一定数据倾向性,多数字段mapping我设置了字段存储类型为keyword类型,是为了后面方便测试聚合操作,原因是keyword类型的数据可以满足类似名称、类别、状态码、邮政编码和标签等数据的要求,不进行分词,常常被用来过滤、排序和聚合。

如下:我构建一个用于测试聚合分桶查询的手机信息索引库,用于演示我下面的操作使用。

localhost:9200/phones_test_bucket
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"price": {
"type": "long"
},
"color": {
"type": "keyword"
},
"size": {
"type": "long"
},
"category": {
"type": "keyword"
},
"label": {
"type": "keyword"
},
"release_date": {
"type": "date"
}
}
}
} ===返回===
{
"acknowledged": true,
"shards_acknowledged": true,
"index": "phones_test_bucket"
}

2、添加模拟数据

下面我将根据手机名称,颜色,类别进行聚合分桶查询,然后通过更改search.max_buckets的配置参数来验证分桶参数的取值关系。

localhost:9200/phones_test_bucket/_bulk
{"index":{}}
{"name":"小米","price":3400,"color":"白色","size":6.21,"category":"标准版","label":"性价比1","release_date":"2023-02-06"}
{"index":{}}
{"name":"小米","price":3400,"color":"白色","size":6.21,"category":"升级版","label":"性价比2","release_date":"2023-03-06"}
{"index":{}}
{"name":"小米","price":2400,"color":"黑色","size":6.21,"category":"升级版","label":"性价比3","release_date":"2023-02-06"}
{"index":{}}
{"name":"小米","price":3400,"color":"黑色","size":6.21,"category":"标准版","label":"性价比4","release_date":"2023-03-06"}
{"index":{}}
{"name":"苹果","price":2400,"color":"远峰蓝色","size":6.21,"category":"标准版","label":"流畅","release_date":"2023-02-06"}
{"index":{}}
{"name":"华为","price":5200,"color":"白色","size":6.21,"category":"标准版","label":"高端1","release_date":"2023-03-06"}
{"index":{}}
{"name":"华为","price":5200,"color":"黑色","size":6.21,"category":"标准版","label":"高端2","release_date":"2023-04-06"}
{"index":{}}
{"name":"华为","price":5900,"color":"黑色","size":6.21,"category":"升级版","label":"高端3","release_date":"2023-05-06"}
{"index":{}}
{"name":"华为","price":5900,"color":"白色","size":6.21,"category":"升级版","label":"高端4","release_date":"2023-05-06"}

3、分桶参数设置

在开始测试之前,我们需要关注下search.max_buckets这个参数的设置API,在一开始我就截图了,官网对这个参数说明的默认值是10000(我的es版本是7.8.1),截至我写这篇博文时,es最新版本已经更新到8.6,感兴趣可以去官网看看,8.6版本分桶参数说明了,此参数的默认值也变更了,变更为65536。

修改es分桶最大配置(PUT请求)
http://127.0.0.1:9200/_cluster/settings
{
"persistent": {
"search.max_buckets": 2
}
} ===返回===
{
"acknowledged": true,
"persistent": {
"search": {
"max_buckets": "2"
}
},
"transient": {}
}
修改查看分桶最大配置(GET请求)
http://127.0.0.1:9200/_cluster/settings
//无请求参数
====返回====
{
"persistent": {
"search": {
"max_buckets": "2"
}
},
"transient": {}
}

4、测试

1、第一组测试

单字段分组-最大分桶2-结果失败

http://127.0.0.1:9200/phones_test_bucket/_search
// 第一组数据,"max_buckets": "2"的情况下,分组失败
{
"size":0,
"aggs":{
"group_by_name":{
"terms":{
"field":"name" } }
}
}
===返回====
{
"error": {
"root_cause": [
{
"type": "too_many_buckets_exception",
"reason": "Trying to create too many buckets. Must be less than or equal to: [2] but was [3]. This limit can be set by changing the [search.max_buckets] cluster level setting.",
"max_buckets": 2
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "phones_test_bucket",
"node": "UuMcBk37TNWHjY4hVtzyVA",
"reason": {
"type": "too_many_buckets_exception",
"reason": "Trying to create too many buckets. Must be less than or equal to: [2] but was [3]. This limit can be set by changing the [search.max_buckets] cluster level setting.",
"max_buckets": 2
}
}
]
},
"status": 503
}

单字段分组-最大分桶数3-结果成功

http://127.0.0.1:9200/phones_test_bucket/_search
// 第一组数据,"max_buckets": "3"的情况下,分组成功
{
"size":0,
"aggs":{
"group_by_name":{
"terms":{
"field":"name" } }
}
}
===返回结果===
{
"took": 32,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 9,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"group_by_name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "华为",
"doc_count": 4
},
{
"key": "小米",
"doc_count": 4
},
{
"key": "苹果",
"doc_count": 1
}
]
}
}
}

2、第二组测试

多字段分组-最大分桶7-结果失败

http://127.0.0.1:9200/phones_test_bucket/_search
// 多字段分组查询,name+color,第二组,"max_buckets": "7"的情况下,分组失败
{
"aggs": {
"group_by_name": {
"terms": {
"field": "name"
},
"aggs": {
"group_by_color": {
"terms": {
"field": "color"
}
}
}
}
}
}
===返回===
{
"error": {
"root_cause": [
{
"type": "too_many_buckets_exception",
"reason": "Trying to create too many buckets. Must be less than or equal to: [7] but was [8]. This limit can be set by changing the [search.max_buckets] cluster level setting.",
"max_buckets": 7
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "phones_test_bucket",
"node": "UuMcBk37TNWHjY4hVtzyVA",
"reason": {
"type": "too_many_buckets_exception",
"reason": "Trying to create too many buckets. Must be less than or equal to: [7] but was [8]. This limit can be set by changing the [search.max_buckets] cluster level setting.",
"max_buckets": 7
}
}
]
},
"status": 503
}

多字段分组-最大分桶8-结果成功

http://127.0.0.1:9200/phones_test_bucket/_search
// 多字段分组查询,name+color,第二组,"max_buckets": "8"的情况下,分组成功
{
"size":0,
"aggs": {
"group_by_name": {
"terms": {
"field": "name"
},
"aggs": {
"group_by_color": {
"terms": {
"field": "color"
}
}
}
}
}
}
===返回===
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 9,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"group_by_name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "华为",
"doc_count": 4,
"group_by_color": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "白色",
"doc_count": 2
},
{
"key": "黑色",
"doc_count": 2
}
]
}
},
{
"key": "小米",
"doc_count": 4,
"group_by_color": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "白色",
"doc_count": 2
},
{
"key": "黑色",
"doc_count": 2
}
]
}
},
{
"key": "苹果",
"doc_count": 1,
"group_by_color": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "远峰蓝色",
"doc_count": 1
}
]
}
}
]
}
}
}

3、第三组测试

多字段分组-最大分桶16-结果失败

http://127.0.0.1:9200/phones_test_bucket/_search
// 多字段分组查询,name+color,第三组,"max_buckets": "17"的情况下,分组成功
{
"size":0,
"aggs": {
"group_by_name": {
"terms": {
"field": "name"
},
"aggs": {
"group_by_color": {
"terms": {
"field": "color"
},
"aggs": {
"group_by_category": {
"terms": {
"field": "category"
}
}
}
}
}
}
}
}
===返回===
{
"error": {
"root_cause": [],
"type": "search_phase_execution_exception",
"reason": "",
"phase": "fetch",
"grouped": true,
"failed_shards": [],
"caused_by": {
"type": "too_many_buckets_exception",
"reason": "Trying to create too many buckets. Must be less than or equal to: [16] but was [17]. This limit can be set by changing the [search.max_buckets] cluster level setting.",
"max_buckets": 16
}
},
"status": 503
}

多字段分组-最大分桶17-结果成功

http://127.0.0.1:9200/phones_test_bucket/_search
// 多字段分组查询,name+color,第三组,"max_buckets": "17"的情况下,分组成功
{
"size":0,
"aggs": {
"group_by_name": {
"terms": {
"field": "name"
},
"aggs": {
"group_by_color": {
"terms": {
"field": "color"
},
"aggs": {
"group_by_category": {
"terms": {
"field": "category"
}
}
}
}
}
}
}
}
===返回===
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 9,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"group_by_name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "华为",
"doc_count": 4,
"group_by_color": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "白色",
"doc_count": 2,
"group_by_category": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "升级版",
"doc_count": 1
},
{
"key": "标准版",
"doc_count": 1
}
]
}
},
{
"key": "黑色",
"doc_count": 2,
"group_by_category": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "升级版",
"doc_count": 1
},
{
"key": "标准版",
"doc_count": 1
}
]
}
}
]
}
},
{
"key": "小米",
"doc_count": 4,
"group_by_color": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "白色",
"doc_count": 2,
"group_by_category": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "升级版",
"doc_count": 1
},
{
"key": "标准版",
"doc_count": 1
}
]
}
},
{
"key": "黑色",
"doc_count": 2,
"group_by_category": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "升级版",
"doc_count": 1
},
{
"key": "标准版",
"doc_count": 1
}
]
}
}
]
}
},
{
"key": "苹果",
"doc_count": 1,
"group_by_color": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "远峰蓝色",
"doc_count": 1,
"group_by_category": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "标准版",
"doc_count": 1
}
]
}
}
]
}
}
]
}
}
}

测试结论

以第三组测试数据为例子,按名称+颜色+类别进行聚合分组,最终发现临界值出现在max_buckets为17,那这个17是怎么算的,为甚设置17就可以查出来,设置max_buckets为16就报Trying to create too many buckets错呢。来,我画了一张分桶模拟图,动动你的小手,我们来数一数。



结合我上面测试的返回值,看一下,结果是不是正好对应了17个桶,想必到这里你应该也知道分桶的个数到底是怎么计算的了吧。

测试结论:es聚合分组的桶数计算规则,具体分了多少桶,是和数据相关的,数据异同越大,
分桶数目越多,对于多字段嵌套查询,嵌套的层数越深,分桶数越大。所以不建议大量字
段嵌套进行聚合查询,容易引发分桶爆炸,触发熔断查询。

ES实战-桶查询的更多相关文章

  1. es实战之查询大量数据

    背景 项目中已提供海量日志数据的多维实时查询,客户提出新需求:将数据导出. 将数据导出分两步: 查询大量数据 将数据生成文件并下载 本文主要探讨第一步,在es中查询大量数据或者说查询大数据集. es支 ...

  2. es实战之数据导出成csv文件

    从es将数据导出分两步: 查询大量数据 将数据生成文件并下载 本篇主要是将第二步,第一步在<es实战之查询大量数据>中已讲述. csv vs excel excel2003不能超过6553 ...

  3. 1W字|40 图|硬核 ES 实战

    前言 上篇我们讲到了 Elasticsearch 全文检索的原理<别只会搜日志了,求你懂点检索原理吧>,通过在本地搭建一套 ES 服务,以多个案例来分析了 ES 的原理以及基础使用.这次我 ...

  4. 【Elasticsearch】ES中时间查询报错:Caused by: ElasticsearchParseException[failed to parse date field [Sun Dec 31 16:00:00 UTC 2017] with format [yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis]];

    ES中时间查询报错:Caused by: ElasticsearchParseException[failed to parse date field [Sun Dec 31 16:00:00 UTC ...

  5. ES 07 - Elasticsearch查询文档的六种方法

    目录 1 Query String Search(查询串检索) 2 Query DSL(ES特定语法检索) 3 Query Filter(过滤检索) 4 Full Text Search(全文检索) ...

  6. ES系列九、ES优化聚合查询之深度优先和广度优先

    1.优化聚合查询示例 假设我们现在有一些关于电影的数据集,每条数据里面会有一个数组类型的字段存储表演该电影的所有演员的名字. { "actors" : [ "Fred J ...

  7. 在es中用scroll查询与completableFuture

    一般而言,es返回数据的上限是10000条,如果超过这个数量,就必须使用scroll查询. 所谓scroll查询就类似DBMS中的游标,或者快照吧,利用查询条件,在第一次查询时,在所有的结果上形成了一 ...

  8. ES开启慢查询日志

    默认情况,慢日志是不开启的.要开启它,需要定义具体动作(query,fetch 还是 index),你期望的事件记录等级( WARN.INFO.DEBUG.TRACE 等),以及时间阈值. es有几种 ...

  9. ES 调优查询亿级数据毫秒级返回!怎么做到的?--文件系统缓存

    一道面试题的引入: 如果面试的时候碰到这样一个面试题:ElasticSearch(以下简称ES) 在数据量很大的情况下(数十亿级别)如何提高查询效率? 这个问题说白了,就是看你有没有实际用过 ES,因 ...

  10. ES的索引查询和删除

    postman 1.查看es状态 get http://127.0.0.1:9200/_cat/health 红色表示数据不可用,黄色表示数据可用,部分副本没有分配,绿色表示一切正常 2.查看所有索引 ...

随机推荐

  1. python爬取丁香园疫情数据

    毕设需求了就是说 导师要做关于时间线的- -看发展趋势 不得不今天又现学现卖 首先 创建一个python文件 python.file 引入一点资源 # 发送请求 import requests # 页 ...

  2. springboot如何在拦截器中拦截post请求参数以及解决文件类型上传问题

      我们经常有这样一个场景,比如:在springboot拦截器中想截取post请求的body参数做一些中间处理,或者用到自定义注解,想拦截一些特定post请求的方法的参数,记录一些请求日志. 想到了使 ...

  3. .net core 从(本地)服务器获取APK文件并解析APK信息

    1.apk解析除了使用客户端利用aapt.exe.unzip.exe开发客户端解析外,还可以直接利用服务进行解析 /// <summary> /// 从本地服务器获取APK文件并解析APK ...

  4. 使用花生壳进行内网穿透实验SQLserver

    在一次编写软件的过程中,想让远在河南的同学对试一试,但是他的电脑上没有与之对应的SQL数据库不能便不能够运行软件.于是我想到了远程连接.在使用花生壳的时候,遇到了一些问题,然而网络上并没有与之对应的解 ...

  5. 【TS】object类型

    object是一个对象,在ts中定义对象类型的语法为:let 变量名 :object = { } 在object类型中,对象内部定义的值是不受类型约束的,只要是一个object类型即可,例如: let ...

  6. Java 集合中的排序算法浅析

    作者:京东物流 秦彪 1.  引言 排序是一个Java开发者,在日常开发过程中随处可见的开发内容,Java中有丰富的API可以调用使用.在Java语言中,作为集合工具类的排序方法,必定要做到通用.高效 ...

  7. python3 os.getcwd()和os.path.realpath()区别

    在写一个android手机的自动化测试,用appium+python3,跟室友讨论拟定了一个框架,在写2个通用的简单模块时,遇到问题. 同一个目录下2个模块,driver.py(用于获取app 句柄) ...

  8. 2020.11.30【NOIP提高A组】模拟

    总结与反思 很不幸,估分 \(170\),可惜 \(T2\) 暴力 \(50pts\) 全掉了 \(T1\) 结论题,如果想到了,\(O(n)\) 过,只有十几行代码 感觉不好想,不过还是 \(A\) ...

  9. JZOJ 5346. 【NOIP2017提高A组模拟9.5】NYG的背包

    题目 分析 很神奇的贪心 \(Code\) #include<cstdio> #include<algorithm> using namespace std; typedef ...

  10. Android:Activity

    什么是 Activity 简单来说 Activity 类似于一个 Web 的页面,表示拥有整个屏幕的界面.电子邮件应用可能有一个显示新电子邮件列表的 Activity.一个用于撰写电子邮件的 Acti ...