Hive函数:GROUPING SETS,GROUPING__ID,CUBE,ROLLUP
参考:lxw大数据田地:http://lxw1234.com/archives/2015/04/193.htm
数据准备:
CREATE EXTERNAL TABLE test_data (
month STRING,
day STRING,
cookieid STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile location '/user/jc_rc_ftp/test_data'; select * from test_data l;
+----------+-------------+-------------+--+
| l.month | l.day | l.cookieid |
+----------+-------------+-------------+--+
| 2015-03 | 2015-03-10 | cookie1 |
| 2015-03 | 2015-03-10 | cookie5 |
| 2015-03 | 2015-03-12 | cookie7 |
| 2015-04 | 2015-04-12 | cookie3 |
| 2015-04 | 2015-04-13 | cookie2 |
| 2015-04 | 2015-04-13 | cookie4 |
| 2015-04 | 2015-04-16 | cookie4 |
| 2015-03 | 2015-03-10 | cookie2 |
| 2015-03 | 2015-03-10 | cookie3 |
| 2015-04 | 2015-04-12 | cookie5 |
| 2015-04 | 2015-04-13 | cookie6 |
| 2015-04 | 2015-04-15 | cookie3 |
| 2015-04 | 2015-04-15 | cookie2 |
| 2015-04 | 2015-04-16 | cookie1 |
+----------+-------------+-------------+--+
14 rows selected (0.249 seconds)
GROUPING SETS
在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行UNION ALL
SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM test_data
GROUP BY month,day
GROUPING SETS (month,day)
ORDER BY GROUPING__ID; 等价于
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test_data GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test_data GROUP BY day +----------+-------------+-----+---------------+--+
| month | day | uv | grouping__id |
+----------+-------------+-----+---------------+--+
| 2015-04 | NULL | 6 | 1 |
| 2015-03 | NULL | 5 | 1 |
| NULL | 2015-04-16 | 2 | 2 |
| NULL | 2015-04-15 | 2 | 2 |
| NULL | 2015-04-13 | 3 | 2 |
| NULL | 2015-04-12 | 2 | 2 |
| NULL | 2015-03-12 | 1 | 2 |
| NULL | 2015-03-10 | 4 | 2 |
+----------+-------------+-----+---------------+--+
8 rows selected (177.299 seconds) SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM test_data
GROUP BY month,day
GROUPING SETS (month,day,(month,day))
ORDER BY GROUPING__ID; 等价于
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test_data GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test_data GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM test_data GROUP BY month,day
+----------+-------------+-----+---------------+--+
| month | day | uv | grouping__id |
+----------+-------------+-----+---------------+--+
| 2015-04 | NULL | 6 | 1 |
| 2015-03 | NULL | 5 | 1 |
| NULL | 2015-03-10 | 4 | 2 |
| NULL | 2015-04-16 | 2 | 2 |
| NULL | 2015-04-15 | 2 | 2 |
| NULL | 2015-04-13 | 3 | 2 |
| NULL | 2015-04-12 | 2 | 2 |
| NULL | 2015-03-12 | 1 | 2 |
| 2015-04 | 2015-04-16 | 2 | 3 |
| 2015-04 | 2015-04-12 | 2 | 3 |
| 2015-04 | 2015-04-13 | 3 | 3 |
| 2015-03 | 2015-03-12 | 1 | 3 |
| 2015-03 | 2015-03-10 | 4 | 3 |
| 2015-04 | 2015-04-15 | 2 | 3 |
+----------+-------------+-----+---------------+--+
备注:其中的 GROUPING__ID,表示结果属于哪一个分组集合。
CUBE
根据GROUP BY的维度的所有组合进行聚合。
SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM test_data
GROUP BY month,day
WITH CUBE
ORDER BY GROUPING__ID; 等价于
SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM test_data
UNION ALL
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM test_data GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM test_data GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM test_data GROUP BY month,day
+----------+-------------+-----+---------------+--+
| month | day | uv | grouping__id |
+----------+-------------+-----+---------------+--+
| NULL | NULL | 7 | 0 |
| 2015-03 | NULL | 5 | 1 |
| 2015-04 | NULL | 6 | 1 |
| NULL | 2015-04-16 | 2 | 2 |
| NULL | 2015-04-15 | 2 | 2 |
| NULL | 2015-04-13 | 3 | 2 |
| NULL | 2015-04-12 | 2 | 2 |
| NULL | 2015-03-12 | 1 | 2 |
| NULL | 2015-03-10 | 4 | 2 |
| 2015-04 | 2015-04-12 | 2 | 3 |
| 2015-04 | 2015-04-16 | 2 | 3 |
| 2015-03 | 2015-03-12 | 1 | 3 |
| 2015-03 | 2015-03-10 | 4 | 3 |
| 2015-04 | 2015-04-15 | 2 | 3 |
| 2015-04 | 2015-04-13 | 3 | 3 |
+----------+-------------+-----+---------------+--+
ROLLUP
是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。
比如,以month维度进行层级聚合:
SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM test_data
GROUP BY month,day
WITH ROLLUP
ORDER BY GROUPING__ID;
可以实现这样的上钻过程:月天的UV->月的UV->总UV
+----------+-------------+-----+---------------+--+
| month | day | uv | grouping__id |
+----------+-------------+-----+---------------+--+
| NULL | NULL | 7 | 0 |
| 2015-04 | NULL | 6 | 1 |
| 2015-03 | NULL | 5 | 1 |
| 2015-04 | 2015-04-16 | 2 | 3 |
| 2015-04 | 2015-04-15 | 2 | 3 |
| 2015-04 | 2015-04-13 | 3 | 3 |
| 2015-04 | 2015-04-12 | 2 | 3 |
| 2015-03 | 2015-03-12 | 1 | 3 |
| 2015-03 | 2015-03-10 | 4 | 3 |
+----------+-------------+-----+---------------+--+ --把month和day调换顺序,则以day维度进行层级聚合:
SELECT
day,
month,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM test_data
GROUP BY day,month
WITH ROLLUP
ORDER BY GROUPING__ID;
+-------------+----------+-----+---------------+--+
| day | month | uv | grouping__id |
+-------------+----------+-----+---------------+--+
| NULL | NULL | 7 | 0 |
| 2015-04-12 | NULL | 2 | 1 |
| 2015-04-15 | NULL | 2 | 1 |
| 2015-03-12 | NULL | 1 | 1 |
| 2015-04-16 | NULL | 2 | 1 |
| 2015-03-10 | NULL | 4 | 1 |
| 2015-04-13 | NULL | 3 | 1 |
| 2015-04-16 | 2015-04 | 2 | 3 |
| 2015-04-15 | 2015-04 | 2 | 3 |
| 2015-04-13 | 2015-04 | 3 | 3 |
| 2015-03-12 | 2015-03 | 1 | 3 |
| 2015-03-10 | 2015-03 | 4 | 3 |
| 2015-04-12 | 2015-04 | 2 | 3 |
+-------------+----------+-----+---------------+--+
可以实现这样的上钻过程:
天月的UV->天的UV->总UV
(这里,根据天和月进行聚合,和根据天聚合结果一样,因为有父子关系,如果是其他维度组合的话,就会不一样)
Hive函数:GROUPING SETS,GROUPING__ID,CUBE,ROLLUP的更多相关文章
- Hive高阶聚合函数 GROUPING SETS、Cube、Rollup
-- GROUPING SETS作为GROUP BY的子句,允许开发人员在GROUP BY语句后面指定多个统计选项,可以简单理解为多条group by语句通过union all把查询结果聚合起来结合起 ...
- Hive SQL grouping sets 用法
概述 GROUPING SETS,GROUPING__ID,CUBE,ROLLUP 这几个分析函数通常用于OLAP中,不能累加,而且需要根据不同维度上钻和下钻的指标统计,比如,分小时.天.月的UV数. ...
- hive中grouping sets的使用
hive中grouping sets 数量较多时如何处理? 可以使用如下设置来 set hive.new.job.grouping.set.cardinality = 30; 这条设置的意义在于 ...
- GROUPING SETS、CUBE、ROLLUP
其实还是写一个Demo 比较好 USE tempdb IF OBJECT_ID( 'dbo.T1' , 'U' )IS NOT NULL BEGIN DROP TABLE dbo.T1; END; G ...
- Hive学习之路 (十七)Hive分析窗口函数(五) GROUPING SETS、GROUPING__ID、CUBE和ROLLUP
概述 GROUPING SETS,GROUPING__ID,CUBE,ROLLUP 这几个分析函数通常用于OLAP中,不能累加,而且需要根据不同维度上钻和下钻的指标统计,比如,分小时.天.月的UV数. ...
- 解析数仓OLAP函数:ROLLUP、CUBE、GROUPING SETS
摘要:GaussDB(DWS) ROLLUP,CUBE,GROUPING SETS等OLAP函数的原理解析. 本文分享自华为云社区<GaussDB(DWS) OLAP函数浅析>,作者: D ...
- Oracle的rollup、cube、grouping sets函数
转载自:https://blog.csdn.net/huang_xw/article/details/6402396 Oracle的group by除了基本用法以外,还有3种扩展用法,分别是rollu ...
- SQL Server2008 程序设计 汇总 GROUP BY,WITH ROLLUP,WITH CUBE,GROUPING SETS(..)
--SQL Server2008 程序设计 汇总 GROUP BY ,WITH ROLLUP WITH CUBE GROUPING SET(..) /*********************** ...
- TSQL 分组集(Grouping Sets)
分组集(Grouping Sets)是多个分组的并集,用于在一个查询中,按照不同的分组列对集合进行聚合运算,等价于对单个分组使用“union all”,计算多个结果集的并集.使用分组集的聚合查询,返回 ...
随机推荐
- iframe标签的定时刷新
由于有个项目是大数据类型的,需要时时展现数据,这就出现了这个需求,页面不断刷新,这个其实很简单了,window.location.reload(); 这个就轻松搞定了,但是灵机一动,加上个控制吧,这下 ...
- 1-7 hibernate关联关系映射
1.关联关系分为单向关联(一对一,一对多,多对一,多对多),多向关联(一对一,一对多,多对多). 2.单向一对一主键关联实例 需要为one-to-one元素指定constrained属性值为true. ...
- TCP 详解
计算机网络中比较中要的无非就是 TCP/IP 协议栈,以及应用层的 HTTP 和 HTTPS . 前几天一直炒的的比较火的就是 HTTP/2.0 了,但是其实 HTTP/2.0 早在2015年的时候就 ...
- 如何通过TortoiseGit(小乌龟)把本地项目上传到github上
1.第一步: 安装git for windows(链接:https://gitforwindows.org/)一路next就好了, 如果遇到什么问题可以参考我另外一篇文章~^ - ^ 2.第二步:安装 ...
- poj-1146 ID codes
Description It is 2084 and the year of Big Brother has finally arrived, albeit a century late. In or ...
- Flashing Back a Failed Primary Database into a Physical Standby Database(闪回FAILOVER失败的物理备库)
文档操作依据来自官方网址:https://docs.oracle.com/cd/E11882_01/server.112/e41134/scenarios.htm#SBYDB4888 闪回FAILOV ...
- Lucene详解
一.lucene原理 Lucene 是apache软件基金会一个开放源代码的全文检索引擎工具包,是一个全文检索引擎的架构,提供了完整的查询引擎和索引引擎,部分文本分析引擎.它不是一个完整的搜索应用程序 ...
- 常用linux日志查询命令
1.查看实时日志: tail -f nohup.out 2.分页查看所有日志: cat nohup.out | more 4.分页查看前N行日志: tail -n 1000 nohup.out | m ...
- openjudge(四)
关于switch的应用: #include <iostream>#include<iomanip>using namespace std;int main(){int a,b; ...
- 『转载』从内存资源中加载C++程序集:CMemLoadDll
MemLoadDll.h #if !defined(Q_OS_LINUX) #pragma once typedef BOOL (__stdcall *ProcDllMain)(HINSTANCE, ...