[Hive - LanguageManual] GroupBy
Group By Syntax
groupByClause: GROUP BY groupByExpression (, groupByExpression)*groupByExpression: expressiongroupByQuery: SELECT expression (, expression)* FROM src groupByClause? |
Simple Examples
In order to count the number of rows in a table:
SELECT COUNT(*) FROM table2; |
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
In order to count the number of distinct users by gender one could write the following query:
INSERT OVERWRITE TABLE pv_gender_sumSELECT pv_users.gender, count (DISTINCT pv_users.userid)FROM pv_usersGROUP BY pv_users.gender; |
Multiple aggregations can be done at the same time, however, no two aggregations can have different DISTINCT columns. For example, the following is possible because count(DISTINCT) and sum(DISTINCT) specify the same column:
INSERT OVERWRITE TABLE pv_gender_aggSELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)FROM pv_usersGROUP BY pv_users.gender; |
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
However, the following query is not allowed. We don't allow multiple DISTINCT expressions in the same query.
INSERT OVERWRITE TABLE pv_gender_aggSELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)FROM pv_usersGROUP BY pv_users.gender; |
Select statement and group by clause
When using group by clause, the select statement can only include columns included in the group by clause. Of course, you can have as many aggregation functions (e.g. count) in the select statement as well.
Let's take a simple example
CREATE TABLE t1(a INTEGER, b INTGER); |
A group by query on the above table could look like:
SELECT a, sum(b)FROM t1GROUP BY a; |
The above query works because the select clause contains a (the group by key) and an aggregation function (sum(b)).
However, the query below DOES NOT work:
SELECT a, bFROM t1GROUP BY a; |
This is because the select clause has an additional column (b) that is not included in the group by clause (and it's not an aggregation function either). This is because, if the table t1 looked like:
a b------100 1100 2100 3 |
Since the grouping is only done on a, what value of b should Hive display for the group a=100? One can argue that it should be the first value or the lowest value but we all agree that there are multiple possible options. Hive does away with this guessing by making it invalid SQL (HQL, to be precise) to have a column in the select clause that is not included in the group by clause.
Advanced Features 高级特性
Multi-Group-By Inserts
The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilitites). (可将输出结果插入新表中或者直接覆盖到HDFS上的文件目录)
e.g. if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:
FROM pv_usersINSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count(DISTINCT pv_users.userid) GROUP BY pv_users.genderINSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum' SELECT pv_users.age, count(DISTINCT pv_users.userid) GROUP BY pv_users.age; |
Map-side Aggregation for Group By
hive.map.aggr controls how we do aggregations. The default is false. If it is set to true, Hive will do the first-level aggregation directly in the map task.
This usually provides better efficiency, but may require more memory to run successfully.
hive.map.aggr控制如何聚合,默认是false,如果设置为true, Hive将会在map端做第一级的聚合,这通常提供更好的效果,但是要求更多的内存才能运行成功。
set hive.map.aggr=true;SELECT COUNT(*) FROM table2; |
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
Grouping Sets, Cubes, Rollups, and the GROUPING__ID Function
Version
Icon
Grouping sets, CUBE and ROLLUP operators, and the GROUPING__ID function were added in Hive release 0.10.0.
See Enhanced Aggregation, Cube, Grouping and Rollup for information about these aggregation operators.
Also see the JIRAs:
- HIVE-2397 Support with rollup option for group by
- HIVE-3433 Implement CUBE and ROLLUP operators in Hive
- HIVE-3471 Implement grouping sets in Hive
- HIVE-3613 Implement grouping_id function
New in Hive release 0.11.0:
- HIVE-3552 HIVE-3552 performant manner for performing cubes/rollups/grouping sets for a high number of grouping set keys
[Hive - LanguageManual] GroupBy的更多相关文章
- [Hive - LanguageManual ] Windowing and Analytics Functions (待)
LanguageManual WindowingAndAnalytics Skip to end of metadata Added by Lefty Leverenz, last edi ...
- [HIve - LanguageManual] Hive Operators and User-Defined Functions (UDFs)
Hive Operators and User-Defined Functions (UDFs) Hive Operators and User-Defined Functions (UDFs) Bu ...
- [Hive - LanguageManual] Import/Export
LanguageManual ImportExport Skip to end of metadata Added by Carl Steinbach, last edited by Le ...
- [Hive - LanguageManual] DML: Load, Insert, Update, Delete
LanguageManual DML Hive Data Manipulation Language Hive Data Manipulation Language Loading files int ...
- [Hive - LanguageManual] Alter Table/Partition/Column
Alter Table/Partition/Column Alter Table Rename Table Alter Table Properties Alter Table Comment Add ...
- Hive LanguageManual DDL
hive语法规则LanguageManual DDL SQL DML 和 DDL 数据操作语言 (DML) 和 数据定义语言 (DDL) 一.数据库 增删改都在文档里说得也很明白,不重复造车轮 二.表 ...
- [Hive - LanguageManual ] ]SQL Standard Based Hive Authorization
Status of Hive Authorization before Hive 0.13 SQL Standards Based Hive Authorization (New in Hive 0. ...
- [Hive - LanguageManual] Hive Concurrency Model (待)
Hive Concurrency Model Hive Concurrency Model Use Cases Turn Off Concurrency Debugging Configuration ...
- [Hive - LanguageManual ] Explain (待)
EXPLAIN Syntax EXPLAIN Syntax Hive provides an EXPLAIN command that shows the execution plan for a q ...
随机推荐
- JDBC学习总结(一)
1.JDBC概述 JDBC是一种可以执行SQL语句并可返回结果的Java API,其全称是Java DataBase Connectivity,也是一套面向对象的应用程序接口API,它由一组用 ...
- Android上常见度量单位【xdpi、hdpi、mdpi、ldpi】解读
术语和概念 屏幕尺寸 屏幕的物理尺寸,以屏幕的对角线长度作为依据(比如 2.8寸, 3.5寸). 简而言之, Android把所有的屏幕尺寸简化为三大类:大,正常,和小. 程序可以针对这三种尺 ...
- MySQL增加列,移动列
ALTER TABLE test ADD COLUMN id INT UNSIGNED NOT NULL auto_increment PRIMARY KEY FIRST 给表添加列是一个常用的操作, ...
- jenkins mac slave 设置
1.在jenkins上增加节点, 2,在mac系统中将ssh的服务打开在偏好设置- 互联网与无线 - 共享中 3,使用mac root用户修改sshd-config的鉴权方式 首先获取到root用户登 ...
- MemSQL Start[c]UP 2.0 - Round 1 B. 4-point polyline (线段的 枚举)
昨天cf做的不好,居然挂零了,还是1点开始的呢.,,, a题少了一个条件,没判断长度. 写一下B题吧 题目链接 题意: 给出(n, m),可以得到一个矩形 让你依次连接矩形内的4个点使它们的长度和最长 ...
- tomcat发布web service教程
这几天一直在准备找工作,自学了关于web service的一些基本的内容,也遇到了不少问题.现在就把我自己学到的知识和大家分享一下,由于是初学,所以有什么错误的地方请大家帮忙指正,感激不尽~~!! 1 ...
- bzoj2535 2109
做过4010这题其实就水了 把图反向之后直接拓扑排序做即可,我们可以用链表来优化 每个航班的最小起飞序号就相当于在反向图中不用这个点最迟到哪 type node=record po,next:long ...
- Js 读写cookies
//写cookies函数 function setCookie(name, value)//两个参数,一个是cookie的名子,一个是值 { var Days = 30; //此 cookie 将被保 ...
- 添加gif效果图
1.贴加第三方包 http://blog.csdn.net/iamlazybone/article/details/5972234 2. <FrameLayout android:id=&quo ...
- Meta标签详解(转)
引言 您的个人网站即使做得再精彩,在“浩瀚如海”的网络空间中,也如一叶扁舟不易为人发现,如何推广个人网站,人们首先想到的方法无外乎以下几种: ● 在搜索引擎中登录自己的个人网站 ● 在知名网站加入你个 ...