Group By Syntax

groupByClause: GROUP BY groupByExpression (, groupByExpression)*
 
groupByExpression: expression
 
groupByQuery: SELECT expression (, expression)* FROM src groupByClause?

Simple Examples

In order to count the number of rows in a table:

SELECT COUNT(*) FROM table2;

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

In order to count the number of distinct users by gender one could write the following query:

INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

Multiple aggregations can be done at the same time, however, no two aggregations can have different DISTINCT columns. For example, the following is possible because count(DISTINCT) and sum(DISTINCT) specify the same column:

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

However, the following query is not allowed. We don't allow multiple DISTINCT expressions in the same query.

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;

Select statement and group by clause

When using group by clause, the select statement can only include columns included in the group by clause. Of course, you can have as many aggregation functions (e.g. count) in the select statement as well.
Let's take a simple example

CREATE TABLE t1(a INTEGER, b INTGER);

A group by query on the above table could look like:

SELECT
   a,
   sum(b)
FROM
   t1
GROUP BY
   a;

The above query works because the select clause contains a (the group by key) and an aggregation function (sum(b)).

However, the query below DOES NOT work:

SELECT
   a,
   b
FROM
   t1
GROUP BY
   a;

This is because the select clause has an additional column (b) that is not included in the group by clause (and it's not an aggregation function either). This is because, if the table t1 looked like:

a    b
------
100  1
100  2
100  3

Since the grouping is only done on a, what value of b should Hive display for the group a=100? One can argue that it should be the first value or the lowest value but we all agree that there are multiple possible options. Hive does away with this guessing by making it invalid SQL (HQL, to be precise) to have a column in the select clause that is not included in the group by clause.

Advanced Features  高级特性

Multi-Group-By Inserts

The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilitites). (可将输出结果插入新表中或者直接覆盖到HDFS上的文件目录)

e.g. if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
  SELECT pv_users.age, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;

Map-side Aggregation for Group By

hive.map.aggr controls how we do aggregations. The default is false. If it is set to true, Hive will do the first-level aggregation directly in the map task.
This usually provides better efficiency, but may require more memory to run successfully.

hive.map.aggr控制如何聚合,默认是false,如果设置为true, Hive将会在map端做第一级的聚合,这通常提供更好的效果,但是要求更多的内存才能运行成功。

set hive.map.aggr=true;
SELECT COUNT(*) FROM table2;

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

Grouping Sets, Cubes, Rollups, and the GROUPING__ID Function

Version

Icon

Grouping sets, CUBE and ROLLUP operators, and the GROUPING__ID function were added in Hive release 0.10.0.

See Enhanced Aggregation, Cube, Grouping and Rollup for information about these aggregation operators.

Also see the JIRAs:

  • HIVE-2397 Support with rollup option for group by
  • HIVE-3433 Implement CUBE and ROLLUP operators in Hive
  • HIVE-3471 Implement grouping sets in Hive
  • HIVE-3613 Implement grouping_id function

New in Hive release 0.11.0:

  • HIVE-3552 HIVE-3552 performant manner for performing cubes/rollups/grouping sets for a high number of grouping set keys
 

[Hive - LanguageManual] GroupBy的更多相关文章

  1. [Hive - LanguageManual ] Windowing and Analytics Functions (待)

    LanguageManual WindowingAndAnalytics     Skip to end of metadata   Added by Lefty Leverenz, last edi ...

  2. [HIve - LanguageManual] Hive Operators and User-Defined Functions (UDFs)

    Hive Operators and User-Defined Functions (UDFs) Hive Operators and User-Defined Functions (UDFs) Bu ...

  3. [Hive - LanguageManual] Import/Export

    LanguageManual ImportExport     Skip to end of metadata   Added by Carl Steinbach, last edited by Le ...

  4. [Hive - LanguageManual] DML: Load, Insert, Update, Delete

    LanguageManual DML Hive Data Manipulation Language Hive Data Manipulation Language Loading files int ...

  5. [Hive - LanguageManual] Alter Table/Partition/Column

    Alter Table/Partition/Column Alter Table Rename Table Alter Table Properties Alter Table Comment Add ...

  6. Hive LanguageManual DDL

    hive语法规则LanguageManual DDL SQL DML 和 DDL 数据操作语言 (DML) 和 数据定义语言 (DDL) 一.数据库 增删改都在文档里说得也很明白,不重复造车轮 二.表 ...

  7. [Hive - LanguageManual ] ]SQL Standard Based Hive Authorization

    Status of Hive Authorization before Hive 0.13 SQL Standards Based Hive Authorization (New in Hive 0. ...

  8. [Hive - LanguageManual] Hive Concurrency Model (待)

    Hive Concurrency Model Hive Concurrency Model Use Cases Turn Off Concurrency Debugging Configuration ...

  9. [Hive - LanguageManual ] Explain (待)

    EXPLAIN Syntax EXPLAIN Syntax Hive provides an EXPLAIN command that shows the execution plan for a q ...

随机推荐

  1. Android ActionBar中的下拉菜单

    在ActionBar中添加下拉菜单,主要有一下几个关键步骤: 1. 生成一个SpinnerAdapter,设置ActionBar的下拉菜单的菜单项 2. 实现ActionBar.OnNavigatio ...

  2. PCL—低层次视觉—点云分割(最小割算法)

    1.点云分割的精度 在之前的两个章节里介绍了基于采样一致的点云分割和基于临近搜索的点云分割算法.基于采样一致的点云分割算法显然是意识流的,它只能割出大概的点云(可能是杯子的一部分,但杯把儿肯定没分割出 ...

  3. git全局配置

    使用git的童鞋都知道,git是非常好的版本管理工具,工具再好要想用的得心应手还是要下凡功夫的,比如可以通过对git的全局配置文件.gitconfig进行适当的配置,可以在日常项目开发中节省很多的时间 ...

  4. Visaul Studio2015安装以及c++单元测试使用方法

    Visual Studio 2015安装流程     vs2015是一款十分好用的IDE,接下来就介绍一下安装流程.这里采用在线安装方式,从官网下载使得安装更加安全. 第一步:在百度中搜索Visual ...

  5. TCP/IP 与OSI结构图

    OSI参考模型各层的作用 物理层:在物理媒体上传输原始的数据比特流. 数据链路层:将数据分成一个个数据帧,以数据帧为单位传输.有应有答,遇错重发. 网络层:将数据分成一定长度的分组,将分组穿过通信子网 ...

  6. moment 和ko 绑定msdate格式的日期值(静态text)

    <!DOCTYPE html> <html lang="zh-cn"> <head> <meta charset="utf-8& ...

  7. BZOJ_1025_[SHOI2009]_游戏_(素数表+最小公倍数+DP)

    描述 http://www.lydsy.com/JudgeOnline/problem.php?id=1025 分析 对于\(n\),转一圈回来之后其实是好几个环各转了整数圈.这些环中的数为\(1,2 ...

  8. Brew 编译mod错误Error: L6265E: Non-RWPI Section libspace.o(.bss) cannot be assigned to PI Exec region ER_ZI

    Error: L6265E: Non-RWPI Section libspace.o(.bss) cannot be assigned to PI Exec region ER_ZI.: Error: ...

  9. LeetCode Binary Tree Level Order Traversal II (二叉树颠倒层序)

    题意:从左到右统计将同一层的值放在同一个容器vector中,要求上下颠倒,左右不颠倒. 思路:广搜逐层添加进来,最后再反转. /** * Definition for a binary tree no ...

  10. 省常中模拟 Test3 Day1

    tile 贪心 题意:给出一个矩形,用不同字母代表的正方形填充,要求相邻的方块字母不能相同,求字典序(将所有行拼接起来)最小的方案. 初步解法:一开始没怎么想,以为策略是每次填充一个尽量大的正方形.但 ...