[Hive - LanguageManual] GroupBy
Group By Syntax
groupByClause: GROUP BY groupByExpression (, groupByExpression)* groupByExpression: expression groupByQuery: SELECT expression (, expression)* FROM src groupByClause? |
Simple Examples
In order to count the number of rows in a table:
SELECT COUNT(*) FROM table2; |
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
In order to count the number of distinct users by gender one could write the following query:
INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count (DISTINCT pv_users.userid) FROM pv_users GROUP BY pv_users.gender; |
Multiple aggregations can be done at the same time, however, no two aggregations can have different DISTINCT columns. For example, the following is possible because count(DISTINCT) and sum(DISTINCT) specify the same column:
INSERT OVERWRITE TABLE pv_gender_agg SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid) FROM pv_users GROUP BY pv_users.gender; |
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
However, the following query is not allowed. We don't allow multiple DISTINCT expressions in the same query.
INSERT OVERWRITE TABLE pv_gender_agg SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip) FROM pv_users GROUP BY pv_users.gender; |
Select statement and group by clause
When using group by clause, the select statement can only include columns included in the group by clause. Of course, you can have as many aggregation functions (e.g. count
) in the select statement as well.
Let's take a simple example
CREATE TABLE t1(a INTEGER, b INTGER); |
A group by query on the above table could look like:
SELECT a, sum(b) FROM t1 GROUP BY a; |
The above query works because the select clause contains a
(the group by key) and an aggregation function (sum(b)
).
However, the query below DOES NOT work:
SELECT a, b FROM t1 GROUP BY a; |
This is because the select clause has an additional column (b
) that is not included in the group by clause (and it's not an aggregation function either). This is because, if the table t1
looked like:
a b ------ 100 1 100 2 100 3 |
Since the grouping is only done on a
, what value of b
should Hive display for the group a=100
? One can argue that it should be the first value or the lowest value but we all agree that there are multiple possible options. Hive does away with this guessing by making it invalid SQL (HQL, to be precise) to have a column in the select clause that is not included in the group by clause.
Advanced Features 高级特性
Multi-Group-By Inserts
The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilitites). (可将输出结果插入新表中或者直接覆盖到HDFS上的文件目录)
e.g. if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:
FROM pv_users INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count(DISTINCT pv_users.userid) GROUP BY pv_users.gender INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum' SELECT pv_users.age, count(DISTINCT pv_users.userid) GROUP BY pv_users.age; |
Map-side Aggregation for Group By
hive.map.aggr controls how we do aggregations. The default is false. If it is set to true, Hive will do the first-level aggregation directly in the map task.
This usually provides better efficiency, but may require more memory to run successfully.
hive.map.aggr控制如何聚合,默认是false,如果设置为true, Hive将会在map端做第一级的聚合,这通常提供更好的效果,但是要求更多的内存才能运行成功。
set hive.map.aggr= true ; SELECT COUNT(*) FROM table2; |
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
Grouping Sets, Cubes, Rollups, and the GROUPING__ID Function
Version
Icon
Grouping sets, CUBE and ROLLUP operators, and the GROUPING__ID function were added in Hive release 0.10.0.
See Enhanced Aggregation, Cube, Grouping and Rollup for information about these aggregation operators.
Also see the JIRAs:
- HIVE-2397 Support with rollup option for group by
- HIVE-3433 Implement CUBE and ROLLUP operators in Hive
- HIVE-3471 Implement grouping sets in Hive
- HIVE-3613 Implement grouping_id function
New in Hive release 0.11.0:
- HIVE-3552 HIVE-3552 performant manner for performing cubes/rollups/grouping sets for a high number of grouping set keys
[Hive - LanguageManual] GroupBy的更多相关文章
- [Hive - LanguageManual ] Windowing and Analytics Functions (待)
LanguageManual WindowingAndAnalytics Skip to end of metadata Added by Lefty Leverenz, last edi ...
- [HIve - LanguageManual] Hive Operators and User-Defined Functions (UDFs)
Hive Operators and User-Defined Functions (UDFs) Hive Operators and User-Defined Functions (UDFs) Bu ...
- [Hive - LanguageManual] Import/Export
LanguageManual ImportExport Skip to end of metadata Added by Carl Steinbach, last edited by Le ...
- [Hive - LanguageManual] DML: Load, Insert, Update, Delete
LanguageManual DML Hive Data Manipulation Language Hive Data Manipulation Language Loading files int ...
- [Hive - LanguageManual] Alter Table/Partition/Column
Alter Table/Partition/Column Alter Table Rename Table Alter Table Properties Alter Table Comment Add ...
- Hive LanguageManual DDL
hive语法规则LanguageManual DDL SQL DML 和 DDL 数据操作语言 (DML) 和 数据定义语言 (DDL) 一.数据库 增删改都在文档里说得也很明白,不重复造车轮 二.表 ...
- [Hive - LanguageManual ] ]SQL Standard Based Hive Authorization
Status of Hive Authorization before Hive 0.13 SQL Standards Based Hive Authorization (New in Hive 0. ...
- [Hive - LanguageManual] Hive Concurrency Model (待)
Hive Concurrency Model Hive Concurrency Model Use Cases Turn Off Concurrency Debugging Configuration ...
- [Hive - LanguageManual ] Explain (待)
EXPLAIN Syntax EXPLAIN Syntax Hive provides an EXPLAIN command that shows the execution plan for a q ...
随机推荐
- 251. Flatten 2D Vector
题目: Implement an iterator to flatten a 2d vector. For example,Given 2d vector = [ [1,2], [3], [4,5,6 ...
- 231. Power of Two
题目: Given an integer, write a function to determine if it is a power of two. 链接: http://leetcode.com ...
- 【Spring】如何在单个Boot应用中配置多数据库?
原创 BOOT 为什么需要多数据库? 默认情况下,Spring Boot使用的是单数据库配置(通过spring.datasource.*配置具体数据库连接信息).对于绝大多数Spring Boot应用 ...
- taglist
http://blog.csdn.net/duguteng/article/details/7412652 这两天看到网上有将vim 改造成功能强大的IDE的blog,突然心血来潮,亲身经历了一下. ...
- linux sort命令学习
linux sort命令以行为单位对文本文件进行排序. 接下来我们会以/tmp/sort_test.txt这个文本文件为例对sort命令的用法进行说明. sh-# cat /tmp/sort_test ...
- IS_ERR、PTR_ERR、ERR_PTR
最近在使用filp_open打开文件时遇到到一个问题,当打开一个并不存在的文件时,filp_open返回值值为0xfffffffe,而并不是0(NULL),这是因为内核对返回指针的函数做了特殊处理.内 ...
- 将SQLServer2005中的数据同步到Oracle中
有时由于项目开发的需要,必须将SQLServer2005中的某些表同步到Oracle数据库中,由其他其他系统来读取这些数据.不同数据库类型之间的数据同步我们可以使用链接服务器和SQLAgent来实现. ...
- 1124. Mosaic(dfs)
1124 需要想那么一点点吧 一个连通块中肯定不需要伸进手不拿的情况 不是一个肯定会需要这种情况 然后注意一点 sum=0的时候 就输出0就可以了 不要再减一了 #include <iostre ...
- [转载]Python模块学习 ---- subprocess 创建子进程
[转自]http://blog.sciencenet.cn/blog-600900-499638.html 最近,我们老大要我写一个守护者程序,对服务器进程进行守护.如果服务器不幸挂掉了,守护者能即时 ...
- 信息:Could not publish server configuration for Tomcat v6.0 Server at localhost. Multiple Context
需要把server.xml更正一下,去掉重复的context.或者把整个server文件夹都删掉,重新添加服务器.也可以在server窗口中删除server,再新添加一个server.