Estimation Error in ElasticSearch Cardinality Aggregation Results

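I have not been using ES for long. Today I found anomalous data in the production environment, which runs ES 2.1.2 (other versions behave similarly). Querying through the ES HTTP API returned figures that did not match those returned by the Java client API, so I began to suspect both the code logic and the ES query tools. The official documentation contains the following description: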

Precision control

This aggregation also supports the precision_threshold option:

The precision_threshold option is specific to the current internal implementation of the cardinality agg, which may change in the future
{
    "aggs" : {
        "author_count" : {
            "cardinality" : {
                "field" : "author_hash",
                "precision_threshold": 100
            }
        }
    }
}
The precision_threshold option allows you to trade memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000; thresholds above this number will have the same effect as a threshold of 40000. The default value depends on the number of parent aggregations that create multiple buckets (such as terms or histograms).
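
The cap described above can be illustrated with a minimal sketch. The class and method names here are hypothetical and are not part of the Elasticsearch API; the code only mirrors the documented rule that thresholds above 40000 behave like 40000. Rejecting negative input is this sketch's own choice.

```java
// Hypothetical helper, not part of Elasticsearch.
final class PrecisionThresholdSketch {
    // Maximum supported value according to the documentation quoted above.
    static final long MAX_SUPPORTED_THRESHOLD = 40_000L;

    // Returns the threshold that would actually take effect for a requested value.
    static long effectiveThreshold(long requested) {
        if (requested < 0) {
            // Assumption of this sketch: negative thresholds are treated as invalid.
            throw new IllegalArgumentException("threshold must be non-negative");
        }
        return Math.min(requested, MAX_SUPPORTED_THRESHOLD);
    }

    public static void main(String[] args) {
        System.out.println(effectiveThreshold(100));       // below the cap, unchanged
        System.out.println(effectiveThreshold(1_000_000)); // above the cap, clamped
    }
}
```

Requesting a larger threshold therefore buys no extra accuracy once the cap is reached.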

Counts are approximate

Computing exact counts requires loading values into a hash set and returning its size. This doesn’t scale when working on high-cardinality sets and/or large values as the required memory usage and the need to communicate those per-shard sets between nodes would utilize too many resources of the cluster.
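
To make the cost of exact counting concrete, here is a minimal sketch of the hash-set approach the paragraph above describes. The per-shard lists are illustrative stand-ins for shard data, and the class name is hypothetical; this is not how Elasticsearch is implemented internally.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of exact distinct counting across shards.
final class ExactDistinctCountSketch {
    // Each inner list stands in for the values stored on one shard.
    static int exactDistinct(List<List<String>> shards) {
        Set<String> merged = new HashSet<>();
        for (List<String> shard : shards) {
            // Every shard has to build its own set of values...
            Set<String> perShard = new HashSet<>(shard);
            // ...and ship the whole set to the coordinating node for merging.
            merged.addAll(perShard);
        }
        return merged.size();
    }

    public static void main(String[] args) {
        List<List<String>> shards = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("b", "c", "d"));
        System.out.println(exactDistinct(shards));
    }
}
```

Memory and network traffic both grow with the number of distinct values, which is the scaling problem the documentation points out.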

This cardinality aggregation is based on the HyperLogLog++ algorithm, which counts based on the hashes of the values with some interesting properties:

configurable precision, which decides on how to trade memory for accuracy,

excellent accuracy on low-cardinality sets,

fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.
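
The properties listed above can be seen in a simplified sketch of the plain HyperLogLog algorithm. This is not HyperLogLog++ and not Elasticsearch's implementation: the class name, the SplitMix64-style hash, and the restriction of the precision p to the range 7..18 are all assumptions made for this illustration.

```java
// Simplified HyperLogLog (not HyperLogLog++), for illustration only.
final class HyperLogLogSketch {
    private final int p;            // precision: number of bits used as register index
    private final int m;            // number of registers, 2^p
    private final byte[] registers; // fixed-size state, independent of the input size

    HyperLogLogSketch(int p) {
        if (p < 7 || p > 18) {
            throw new IllegalArgumentException("p must be in [7, 18] for this sketch");
        }
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    // Stand-in 64-bit hash (SplitMix64 step); an assumption of this sketch.
    static long mix64(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void add(long value) {
        long h = mix64(value);
        int index = (int) (h >>> (64 - p)); // top p bits select a register
        long rest = h << p;                 // remaining bits, left-aligned
        // Rank = position of the first 1-bit in the remaining bits, capped at 64 - p + 1.
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - p + 1);
        if (rank > registers[index]) {
            registers[index] = (byte) rank;
        }
    }

    double estimate() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.scalb(1.0, -r); // 2^(-r)
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m); // bias constant for m >= 128
        double raw = alpha * m * m / sum;
        if (raw <= 2.5 * m && zeros > 0) {
            // Small-range correction: fall back to linear counting.
            return m * Math.log((double) m / zeros);
        }
        return raw;
    }

    int registerCount() {
        return m;
    }

    public static void main(String[] args) {
        HyperLogLogSketch hll = new HyperLogLogSketch(14);
        for (long i = 0; i < 100_000; i++) {
            hll.add(i);
        }
        System.out.println("estimate = " + hll.estimate() + " using " + hll.registerCount() + " registers");
    }
}
```

The state is a fixed array of registers, so adding more values never increases memory, and adding a value that was already seen leaves the estimate unchanged.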

For a precision threshold of c, the implementation that we are using requires about c * 8 bytes.
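
As a quick worked example of the c * 8 rule above, the following sketch (hypothetical class and method names) prints the approximate memory for a few thresholds. The figures are rough estimates derived only from that rule, not measurements.

```java
// Hypothetical helper applying the "about c * 8 bytes" rule quoted above.
final class CardinalityMemorySketch {
    static long approxBytes(long precisionThreshold) {
        return precisionThreshold * 8L;
    }

    public static void main(String[] args) {
        long[] thresholds = {100L, 3_000L, 40_000L};
        for (long c : thresholds) {
            System.out.println("threshold " + c + " -> about " + approxBytes(c) + " bytes");
        }
    }
}
```

Even the maximum threshold of 40000 costs roughly 320000 bytes for each cardinality aggregator, although the total grows when the aggregation sits under many parent buckets.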

The following chart shows how the error varies before and after the threshold:

For all 3 thresholds, counts have been accurate up to the configured threshold (although not guaranteed, this is likely to be the case). Please also note that even with a threshold as low as 100, the error remains under 5%, even when counting millions of items.

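In other words, the cardinality aggregation returns approximate counts. According to the documentation the error stays under 5% even with a low threshold, and accuracy can be traded against memory by adjusting the "precision_threshold" parameter.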
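
To check whether a discrepancy falls within that bound, the relative error can be computed as in the sketch below. The class name and the sample numbers are purely illustrative and are not taken from the production data mentioned earlier.

```java
// Hypothetical helper for comparing an approximate count with an exact one.
final class RelativeErrorSketch {
    static double relativeError(long approximate, long exact) {
        if (exact <= 0) {
            throw new IllegalArgumentException("exact count must be positive");
        }
        return Math.abs(approximate - exact) / (double) exact;
    }

    public static void main(String[] args) {
        long exact = 1_000_000L;        // illustrative value
        long approximate = 1_030_000L;  // illustrative value
        double error = relativeError(approximate, exact);
        System.out.println("relative error = " + error + ", under 5%: " + (error < 0.05));
    }
}
```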

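So I went back through the query code, and adding the part shown below solved the problem. If the parameter is not set in the query, the default is 3000; per the documentation quoted above, the effective default also depends on the number of bucket-creating parent aggregations (here, the terms aggregation).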

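    private void buildSearchQueryForAgg(NativeSearchQueryBuilder nativeSearchQueryBuilder) {
        // Set the aggregation condition. aggreName, XXX.XXX, xxx and ZZZZ.XXX are
        // placeholders kept from the original code.
        TermsBuilder agg = AggregationBuilders.terms(aggreName).field(XXX.XXX).size(Integer.MAX_VALUE);
        // Build the query condition used by the filter aggregation.
        BoolQueryBuilder packBoolQuery = QueryBuilders.boolQuery();
        FilterAggregationBuilder packAgg = AggregationBuilders.filter(xxx).filter(packBoolQuery);
        // Specify the precision threshold explicitly instead of relying on the default.
        // CARDINALITY_PRECISION_THRESHOLD is a constant defined elsewhere in the class.
        packAgg.subAggregation(AggregationBuilders.cardinality(xxx).field(ZZZZ.XXX)
                .precisionThreshold(CARDINALITY_PRECISION_THRESHOLD));
        agg.subAggregation(packAgg);
        nativeSearchQueryBuilder.addAggregation(agg);
    }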


