Hive学习之路（十七）Hive分析窗口函数(五) GROUPING SETS、GROUPING_

概述

GROUPING SETS,GROUPING__ID,CUBE,ROLLUP

这几个分析函数通常用于OLAP中，不能累加，而且需要根据不同维度上钻和下钻的指标统计，比如，分小时、天、月的UV数。

数据准备

数据格式

2015-03,2015-03-10,cookie1

2015-03,2015-03-10,cookie5

2015-03,2015-03-12,cookie7

2015-04,2015-04-12,cookie3

2015-04,2015-04-13,cookie2

2015-04,2015-04-13,cookie4

2015-04,2015-04-16,cookie4

2015-03,2015-03-10,cookie2

2015-03,2015-03-10,cookie3

2015-04,2015-04-12,cookie5

2015-04,2015-04-13,cookie6

2015-04,2015-04-15,cookie3

2015-04,2015-04-15,cookie2

2015-04,2015-04-16,cookie1

创建表

use cookie;

drop table if exists cookie5;

create table cookie5(month string, day string, cookieid string)

row format delimited fields terminated by ',';

load data local inpath "/home/hadoop/cookie5.txt" into table cookie5;

select * from cookie5;

玩一玩GROUPING SETS和GROUPING__ID

说明

在一个GROUP BY查询中，根据不同的维度组合进行聚合，等价于将不同维度的GROUP BY结果集进行UNION ALL

GROUPING__ID，表示结果属于哪一个分组集合。

查询语句

select

  month,

  day,

  count(distinct cookieid) as uv,

  GROUPING__ID

from cookie.cookie5

group by month,day

grouping sets (month,day)

order by GROUPING__ID;

等价于

SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month

UNION ALL

SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day

查询结果

结果说明

第一列是按照month进行分组

第二列是按照day进行分组

第三列是按照month或day分组是，统计这一组有几个不同的cookieid

第四列grouping_id表示这一组结果属于哪个分组集合，根据grouping sets中的分组条件month，day，1是代表month，2是代表day

再比如

SELECT  month, day,

COUNT(DISTINCT cookieid) AS uv,

GROUPING__ID

FROM cookie5

GROUP BY month,day

GROUPING SETS (month,day,(month,day))

ORDER BY GROUPING__ID;

等价于

SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month

UNION ALL

SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day

UNION ALL

SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM cookie5 GROUP BY month,day

玩一玩CUBE

说明

根据GROUP BY的维度的所有组合进行聚合

查询语句

SELECT  month, day,

COUNT(DISTINCT cookieid) AS uv,

GROUPING__ID

FROM cookie5

GROUP BY month,day

WITH CUBE

ORDER BY GROUPING__ID;

等价于

SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM cookie5

UNION ALL

SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month

UNION ALL

SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day

UNION ALL

SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM cookie5 GROUP BY month,day

查询结果

玩一玩ROLLUP

说明

是CUBE的子集，以最左侧的维度为主，从该维度进行层级聚合

查询语句

-- 比如，以month维度进行层级聚合

SELECT  month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID

FROM cookie5

GROUP BY month,day WITH ROLLUP  ORDER BY GROUPING__ID;

可以实现这样的上钻过程：
月天的UV->月的UV->总UV

--把month和day调换顺序，则以day维度进行层级聚合：

可以实现这样的上钻过程：
天月的UV->天的UV->总UV
（这里，根据天和月进行聚合，和根据天聚合结果一样，因为有父子关系，如果是其他维度组合的话，就会不一样）