ElasticSearch Cardinality Aggregation聚合计算的误差

使用ES不久，今天发现生产环境数据异常，其使用的ES版本是2.1.2，其它版本也类似。通过使用ES的HTTP API进行查询，发现得到的数据跟javaClient API 查询得到的数据不一致，于是对代码逻辑以及ES查询工具产生了怀疑。通过查阅官方文档找到如下描述：

Precision controledit

This aggregation also supports the precision_threshold option:

The precision_threshold option is specific to the current internal implementation of the cardinality agg, which may change in the future
{

    "aggs" : {

        "author_count" : {

            "cardinality" : {

                "field" : "author_hash",

                "precision_threshold": 100 
            }

        }

    }

}
The precision_threshold options allows to trade memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000, thresholds above this number will have the same effect as a threshold of 40000. Default value depends on the number of parent aggregations that multiple create buckets (such as terms or histograms).
Counts are approximateedit

Computing exact counts requires loading values into a hash set and returning its size. This doesn’t scale when working on high-cardinality sets and/or large values as the required memory usage and the need to communicate those per-shard sets between nodes would utilize too many resources of the cluster.

This cardinality aggregation is based on the HyperLogLog++ algorithm, which counts based on the hashes of the values with some interesting properties:

configurable precision, which decides on how to trade memory for accuracy,

excellent accuracy on low-cardinality sets,

fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.

For a precision threshold of c, the implementation that we are using requires about c * 8 bytes.

The following chart shows how the error varies before and after the threshold:

For all 3 thresholds, counts have been accurate up to the configured threshold (although not guaranteed, this is likely to be the case). Please also note that even with a threshold as low as 100, the error remains under 5%, even when counting millions of items.

　　其意思就是：聚合查询存在误差，在5%范围之内，通过调整“precision_threshold”参数进行调整。

　　于是翻阅查询代码：加入如下部分问题得到解决。该参数在查询时未设置的情况下，默认值为3000。

 private void buildSearchQueryForAgg(NativeSearchQueryBuilder nativeSearchQueryBuilder) {

        // 设置聚合条件

        TermsBuilder agg = AggregationBuilders.terms(aggreName).field(XXX.XXX).size(Integer.MAX_VALUE);

        // 查询条件构建

        BoolQueryBuilder packBoolQuery = QueryBuilders.boolQuery();

        FilterAggregationBuilder packAgg = AggregationBuilders.filter(xxx).filter(packBoolQuery);

        packAgg.subAggregation(AggregationBuilders.cardinality(xxx).field(ZZZZ.XXX).precisionThreshold(CARDINALITY_PRECISION_THRESHOLD));//指定精度值

        agg.subAggregation(packAgg);

        nativeSearchQueryBuilder.addAggregation(agg);

    }

ElasticSearch Cardinality Aggregation聚合计算的误差的更多相关文章

Elasticsearch：aggregation介绍
聚合(aggregation)功能集是整个Elasticsearch产品中最令人兴奋和有益的功能之一,主要是因为它提供了一个非常有吸引力对之前的facets的替代. 在本教程中,我们将解释Elasti ...
Django Aggregation聚合 django orm 求平均、去重、总和等常用方法
Django Aggregation聚合在当今根据需求而不断调整而成的应用程序中,通常不仅需要能依常规的字段,如字母顺序或创建日期,来对项目进行排序,还需要按其他某种动态数据对项目进行排序.Djng ...
JS数字计算精度误差的解决方法
本篇文章主要是对javascript避免数字计算精度误差的方法进行了介绍,需要的朋友可以过来参考下,希望对大家有所帮助. 如果我问你 0.1 + 0.2 等于几?你可能会送我一个白眼,0.1 + 0. ...
mssql sqlserver 对不同群组对象进行聚合计算的方法分享
摘要: 下文讲述通过一条sql语句,采用over关键字同时对不同类型进行分组的方法,如下所示: 实验环境:sql server 2008 R2 当有一张明细表,我们需同时按照不同的规则,计算平均.计数 ...
开发中使用mongoTemplate进行Aggregation聚合查询
笔记:使用mongo聚合查询(一开始根本没接触过mongo,一点一点慢慢的查资料完成了工作需求) 需求:在订单表中,根据buyerNick分组,统计每个buyerNick的电话.地址.支付总金额以及总 ...
java使用elasticsearch分组进行聚合查询（group by）-项目中实际应用
java连接elasticsearch 进行聚合查询进行相应操作一:对单个字段进行分组求和 1.表结构图片: 根据任务id分组,分别统计出每个任务id下有多少个文字标题 .SQL:select id ...
小试牛刀ElasticSearch大数据聚合统计
ElasticSearch相信有不少朋友都了解,即使没有了解过它那相信对ELK也有所认识E即是ElasticSearch.ElasticSearch最开始更多用于检索,作为一搜索的集群产品简单易用绝对 ...
Django Aggregation聚合
在当今根据需求而不断调整而成的应用程序中,通常不仅需要能依常规的字段,如字母顺序或创建日期,来对项目进行排序,还需要按其他某种动态数据对项目进行排序.Djngo聚合就能满足这些要求. 以下面的Mode ...
MDX Step by Step 读书笔记(七) - Performing Aggregation 聚合函数之 Sum, Aggregate, Avg
开篇介绍 SSAS 分析服务中记录了大量的聚合值,这些聚合值在 Cube 中实际上指的就是度量值.一个给定的度量值可能聚合了来自事实表中上千上万甚至百万条数据,因此在设计阶段我们所能看到的度量实际上就 ...

随机推荐

selenium爬虫
Web自动化测试工具,可运行在浏览器,根据指令操作浏览器,只是工具,必须与第三方浏览器结合使用,相比于之前学的爬虫只是慢了一点而已.而且这种方法爬取的东西不用在意时候ajax动态加载等反爬机制.因此找 ...
【LeetCode】55-跳跃游戏
题目描述给定一个非负整数数组,你最初位于数组的第一个位置. 数组中的每个元素代表你在该位置可以跳跃的最大长度. 判断你是否能够到达最后一个位置. 示例 1: 输入: [2,3,1,1,4] 输出: ...
SpringBoot应用启动过程分析
真的好奇害死猫!之前写过几个SpringBoot应用,但是一直没搞明白应用到底是怎么启动的,心里一直有点膈应.好吧,趁有空去看了下源码,写下这篇博客作为学习记录吧! 个人拙见,若哪里有理解不对的地方, ...
ORACLE官网JAVA学习文档
Trails Covering the Basics 1 Getting Started 1.1 The Java Technology Phenomenon 1.1.1 About the Ja ...
Mac破解软件下载的几个网站
一.关于破解(盗版)软件的个人看法(可忽略,网址在文末): 1.在经济(预算)允许的范围内,尽量支持正版: 2.软件如衣服,在不同的季节,不同的店铺买价格不一样,国内的一些代理网站经常会打折: 3.作 ...
webhook 自动部署代码
前话: 一般情况,自己在本地开发,代码改动后要push放到线上去看效果,但是我们还要到线上环境手动拉取代码库 git pull 下来, 一来一回太麻烦了. 现在用webhook就可以实现本地开发,pu ...
动态设置 view 在布局中位置
一.概述有时项目需要动态设置一个底部列表,比如 popupwindow ,listview 底部显示 ,所以记录一下此处, android.support.v7.widget.CardView ...
Redis的实现（java）
日常操作 public static void main(String[] args) { Jedis jedis = ); //1.开启事务 Transaction transaction = je ...
Django--路由层、视图层、模版层
路由层: 路由匹配 url(正则表达式,视图函数内存地址) 只要正则匹配到了内容,就不再往下匹配,而是直接运行后面的视图函数匹配首页) url(r'^&', home) 匹配尾页 url(r ...
松软科技课堂:SQL--FULLJOIN关键字
SQL FULL JOIN 关键字(from:www.sysoft.net.cn) 只要其中某个表存在匹配,FULL JOIN 关键字就会返回行. FULL JOIN 关键字语法 SELECT col ...

ElasticSearch Cardinality Aggregation聚合计算的误差

Precision controledit

Counts are approximateedit

ElasticSearch Cardinality Aggregation聚合计算的误差的更多相关文章

随机推荐

热门专题