Deduplicated queries in Elasticsearch: filtering duplicate data with aggregations

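Hi everyone, I'm 马儿. This time I want to walk through a problem I ran into recently.

One of our company's environments had duplicate data imported into its ES cluster, so queries started returning duplicate records. A few of us developers were asked to come up with fixes, and after talking it over we arrived at the following options: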

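1. Fix it at the source: enforce a uniqueness check when data is imported

2. Fix it in the data: clean the data by finding the duplicates, removing them, and re-indexing the result

3. Fix it in the query: filter out duplicates at query time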

I took the query route and ended up with aggregation queries.

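Aggregations

Aggregations give ES its statistical analysis capabilities, similar to GROUP BY, AVG, SUM and the like in SQL.

Buckets: sets of documents that match a criterion, the equivalent of GROUP BY in SQL.

The bucket concept shows up in many places: bucket sort, for instance, and the array inside a HashMap implementation can also be viewed as a set of buckets.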
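
To make the bucket idea concrete, here is a minimal local sketch in Python that imitates what grouping by `city` and counting computes. It does not talk to Elasticsearch at all; the sample documents and the `bucket_counts` helper are made up for illustration.

```python
from collections import Counter

def bucket_counts(docs, field):
    """Group docs by `field` and count each group, like GROUP BY + COUNT in SQL."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    # Largest buckets first, mirroring the default ordering of a terms aggregation.
    return [{"key": key, "doc_count": n} for key, n in counts.most_common()]

# Hypothetical sample data, not taken from the real index.
sample_docs = [
    {"user": "A", "city": "北京"},
    {"user": "B", "city": "北京"},
    {"user": "C", "city": "上海"},
]

print(bucket_counts(sample_docs, "city"))
```

Each entry in the returned list plays the role of one bucket: a key plus the number of documents that fell into it.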

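Example:

Group the documents in the twitter index by city.

aggs: declares the aggregations

my: a name of your choosing for this aggregation

terms: bucket the documents by the distinct values of a field

field: the key that names the field to bucket on

city: the field being grouped on here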

GET /twitter/doc/_search
{
  "from": 0,
  "size": 0,
  "aggs": {
    "my": {
      "terms": {
        "field": "city"
      }
    }
  }
}

The aggregation part of the result:

It lists each distinct value together with the number of documents that hit it.

"aggregations": {
  "my": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "北京",
        "doc_count": 105
      },
      {
        "key": "上海",
        "doc_count": 1
      }
    ]
  }
}
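
For readers who consume this response in code, a small Python sketch of pulling the key and count out of each bucket might look like the following. The `bucket_summary` helper is hypothetical, and the `response` dict is simply the aggregation part shown above written as a Python literal.

```python
def bucket_summary(response, agg_name="my"):
    """Return (key, doc_count) pairs from a terms aggregation response."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]

# The aggregation part of the response above, as a Python dict.
response = {
    "aggregations": {
        "my": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "北京", "doc_count": 105},
                {"key": "上海", "doc_count": 1},
            ],
        }
    }
}

for key, count in bucket_summary(response):
    print(key, count)
```

Note that nothing in these buckets carries the documents themselves, only the grouping value and a count.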

But that is only statistics. What I want is the filtered data itself.

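The top_hits metric aggregator

The top_hits metric aggregator keeps track of the most relevant documents being aggregated. Used as a sub-aggregation under a bucket aggregator, it is an effective way to group a result set by some field.

Options:

from: the offset of the first result to fetch.

size: the maximum number of top matching hits to return per bucket. By default the top three hits are returned.

sort: how the top matching hits are sorted. By default hits are sorted by the score of the main query.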
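
The options above map directly onto keys in the body of a top_hits aggregation. As a rough Python sketch (the `top_hits_body` helper and its parameter names are hypothetical, chosen only for this illustration), the body could be assembled like this:

```python
import json

def top_hits_body(sort_field=None, order="desc", size=3, offset=0, includes=None):
    """Build the body of a top_hits aggregation from the options listed above."""
    body = {"from": offset, "size": size}  # size=3 matches the documented default
    if sort_field is not None:
        body["sort"] = [{sort_field: {"order": order}}]
    if includes:
        body["_source"] = {"includes": list(includes)}
    return body

# One document per bucket, oldest first, with only three fields returned.
print(json.dumps(top_hits_body(sort_field="age", size=1,
                               includes=["user", "age", "city"]), indent=2))
```

Leaving `sort_field` unset keeps the default behaviour of ordering hits by the score of the main query.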

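Example:

Group the documents in the twitter index by city, sort by age, include only user, age and city in the result, and show one document per group.

aggs: declares the aggregations

my: a name of your choosing for this aggregation

terms: bucket the documents by the distinct values of a field

field: the key that names the field to bucket on

city: the field being grouped on here

sort: the sort specification

age: the field to sort by

order: the sort direction

desc: descending

_source includes: the fields to include in the result

size: the number of documents shown per group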

{
  "from": 0,
  "size": 0,
  "aggs": {
    "my": {
      "terms": {
        "field": "city"
      },
      "aggs": {
        "my_top_hits": {
          "top_hits": {
            "sort": [
              {
                "age": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                "user",
                "age",
                "city"
              ]
            },
            "size": 1
          }
        }
      }
    }
  }
}

The aggregation part of the result:

"aggregations": {
  "my": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "北京",
        "doc_count": 105,
        "my_top_hits": {
          "hits": {
            "total": 105,
            "max_score": null,
            "hits": [
              {
                "_index": "twitter",
                "_type": "doc",
                "_id": "AW5jwgirrweXGTc7-cPA",
                "_score": null,
                "_source": {
                  "city": "北京",
                  "user": "朝阳区-老王",
                  "age": 50
                },
                "sort": [
                  50
                ]
              }
            ]
          }
        }
      },
      {
        "key": "上海",
        "doc_count": 1,
        "my_top_hits": {
          "hits": {
            "total": 1,
            "max_score": null,
            "hits": [
              {
                "_index": "twitter",
                "_type": "doc",
                "_id": "AW5jwiM1rweXGTc7-cPB",
                "_score": null,
                "_source": {
                  "city": "上海",
                  "user": "虹桥-老吴",
                  "age": 90
                },
                "sort": [
                  90
                ]
              }
            ]
          }
        }
      }
    ]
  }
}
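
To turn a response shaped like this into a flat list of documents, one per group, a client could walk the buckets and take the first top_hits entry of each. The Python sketch below assumes the aggregation names `my` and `my_top_hits` used in this post; the `first_hit_per_bucket` helper is hypothetical and the sample dict keeps only the keys the helper reads.

```python
def first_hit_per_bucket(response, agg_name="my", hits_name="my_top_hits"):
    """Collect the _source of the first top_hits entry in every bucket."""
    docs = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        if hits:  # stay defensive in case a bucket comes back without hits
            docs.append(hits[0]["_source"])
    return docs

# Trimmed copy of the response above: only the keys this helper reads.
response = {
    "aggregations": {
        "my": {
            "buckets": [
                {
                    "key": "北京",
                    "doc_count": 105,
                    "my_top_hits": {"hits": {"hits": [
                        {"_source": {"city": "北京", "user": "朝阳区-老王", "age": 50}}
                    ]}},
                },
                {
                    "key": "上海",
                    "doc_count": 1,
                    "my_top_hits": {"hits": {"hits": [
                        {"_source": {"city": "上海", "user": "虹桥-老吴", "age": 90}}
                    ]}},
                },
            ]
        }
    }
}

for doc in first_hit_per_bucket(response):
    print(doc)
```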

But with terms alone, once I added several fields the query stopped returning anything. Is even this not enough?

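Aggregating with a script

A regular aggregation cannot carry out complex operations while it aggregates, so we bring in a script.

Example:

Change the contents of terms to the following, which concatenates the three conditions into a single key

"terms": {
  "script": "doc['user.keyword'].value + '#' + doc['age'].value + '#' + doc['city'].value"
},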

Query result:

key: the concatenated conditions

doc_count: how many documents share that key, i.e. the number of duplicates in each group

"aggregations": {
  "my": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "双榆树-张三#20#北京",
        "doc_count": 101,
        "my_top_hits": {
          "hits": {
            "total": 101,
            "max_score": null,
            "hits": [
              {
                "_index": "twitter",
                "_type": "doc",
                "_id": "AW9lr8sBP5iHlpen8GYt",
                "_score": null,
                "_source": {
                  "city": "北京",
                  "user": "双榆树-张三",
                  "age": 20
                },
                "sort": [
                  20
                ]
              }
            ]
          }
        }
      },
      {
        "key": "东城区-李四#30#北京",
        "doc_count": 1,
        "my_top_hits": {
          "hits": {
            "total": 1,
            "max_score": null,
            "hits": [
              {
                "_index": "twitter",
                "_type": "doc",
                "_id": "AW5jwaOIrweXGTc7-cO-",
                "_score": null,
                "_source": {
                  "city": "北京",
                  "user": "东城区-李四",
                  "age": 30
                },
                "sort": [
                  30
                ]
              }
            ]
          }
        }
      },
      {
        "key": "东城区-老刘#30#北京",
        "doc_count": 1,
        "my_top_hits": {
          "hits": {
            "total": 1,
            "max_score": null,
            "hits": [
              {
                "_index": "twitter",
                "_type": "doc",
                "_id": "AW5jwXhcrweXGTc7-cO9",
                "_score": null,
                "_source": {
                  "city": "北京",
                  "user": "东城区-老刘",
                  "age": 30
                },
                "sort": [
                  30
                ]
              }
            ]
          }
        }
      },
      {
        "key": "朝阳区-老王#50#北京",
        "doc_count": 1,
        "my_top_hits": {
          "hits": {
            "total": 1,
            "max_score": null,
            "hits": [
              {
                "_index": "twitter",
                "_type": "doc",
                "_id": "AW5jwgirrweXGTc7-cPA",
                "_score": null,
                "_source": {
                  "city": "北京",
                  "user": "朝阳区-老王",
                  "age": 50
                },
                "sort": [
                  50
                ]
              }
            ]
          }
        }
      },
      {
        "key": "朝阳区-老贾#35#北京",
        "doc_count": 1,
        "my_top_hits": {
          "hits": {
            "total": 1,
            "max_score": null,
            "hits": [
              {
                "_index": "twitter",
                "_type": "doc",
                "_id": "AW5jwcvBrweXGTc7-cO_",
                "_score": null,
                "_source": {
                  "city": "北京",
                  "user": "朝阳区-老贾",
                  "age": 35
                },
                "sort": [
                  35
                ]
              }
            ]
          }
        }
      },
      {
        "key": "虹桥-老吴#90#上海",
        "doc_count": 1,
        "my_top_hits": {
          "hits": {
            "total": 1,
            "max_score": null,
            "hits": [
              {
                "_index": "twitter",
                "_type": "doc",
                "_id": "AW5jwiM1rweXGTc7-cPB",
                "_score": null,
                "_source": {
                  "city": "上海",
                  "user": "虹桥-老吴",
                  "age": 90
                },
                "sort": [
                  90
                ]
              }
            ]
          }
        }
      }
    ]
  }
}

As you can see, every group is now distinct. Scripts really are powerful.
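
The effect of the concatenated key can be reproduced locally without a cluster. The Python sketch below is only an illustration of the idea, not what Elasticsearch runs internally: the `dedupe_by_key` helper and the sample documents are hypothetical.

```python
def dedupe_by_key(docs, fields, sep="#"):
    """Map each concatenated key to the first document that produced it."""
    unique = {}
    for doc in docs:
        key = sep.join(str(doc[name]) for name in fields)
        unique.setdefault(key, doc)  # later duplicates of the same key are ignored
    return unique

# Hypothetical sample: the first two documents are exact duplicates.
docs = [
    {"user": "双榆树-张三", "age": 20, "city": "北京"},
    {"user": "双榆树-张三", "age": 20, "city": "北京"},
    {"user": "东城区-李四", "age": 30, "city": "北京"},
]

unique = dedupe_by_key(docs, ["user", "age", "city"])
print(list(unique))
```

One thing worth checking before relying on this: if a field value can itself contain the separator `#`, two different combinations may collapse into the same key, so a separator that cannot occur in the data is the safer choice.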

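Java implementation

Use the utility classes in the elasticsearch package to concatenate all the fields of the index, then pass the result to the query as the aggregation parameter.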
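
The post does not show the Java code, so the following is not the author's implementation. It is a language-neutral sketch, written in Python with only the standard library, of the same idea: generate the concatenation script from a list of field names and place it in the terms aggregation. The helper names `concat_script` and `dedup_request` are hypothetical, and the field list is an assumption taken from the earlier example.

```python
import json

def concat_script(fields, sep="#"):
    """Build a script expression that joins the doc values of `fields`."""
    parts = [f"doc['{name}'].value" for name in fields]
    return f" + '{sep}' + ".join(parts)

def dedup_request(fields, agg_name="my", hits_name="my_top_hits"):
    """Request body: one bucket per distinct key, one document per bucket."""
    return {
        "size": 0,  # skip the normal hits, only the aggregation is needed
        "aggs": {
            agg_name: {
                "terms": {"script": concat_script(fields)},
                "aggs": {hits_name: {"top_hits": {"size": 1}}},
            }
        },
    }

print(json.dumps(dedup_request(["user.keyword", "age", "city"]), indent=2))
```

The printed JSON has the same shape as the request bodies used throughout this post, so it can be compared against them or sent with whichever client is at hand.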

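Summary:

This post introduced the aggregation feature of ES: aggs + top_hits + script is enough to filter out duplicate data and get unique results.

There is a catch, though: the terms aggregation used here has no built-in pagination.

I'll cover pagination another time if I get the chance.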