Group By Syntax

groupByClause: GROUP BY groupByExpression (, groupByExpression)*
 
groupByExpression: expression
 
groupByQuery: SELECT expression (, expression)* FROM src groupByClause?

Simple Examples

In order to count the number of rows in a table:

SELECT COUNT(*) FROM table2;

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

In order to count the number of distinct users by gender one could write the following query:

INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

Multiple aggregations can be done at the same time; however, no two aggregations can have different DISTINCT columns. For example, the following is possible because count(DISTINCT) and sum(DISTINCT) specify the same column:

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

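However, the following query is not allowed, because its two DISTINCT expressions specify different columns (userid and ip). Hive does not allow multiple DISTINCT expressions on different columns in the same query.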

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;
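
If both distinct counts are needed, one possible workaround is to compute each in its own subquery, so that each subquery contains a single DISTINCT column, and then join the results on the grouping key. This is a sketch rather than the only approach: the aliases u, i, user_cnt and ip_cnt are made up for illustration, and rows with a NULL gender would be dropped by the equality join.

```sql
-- Each subquery has exactly one DISTINCT column, which Hive accepts.
SELECT u.gender, u.user_cnt, i.ip_cnt
FROM (
  SELECT pv_users.gender AS gender, count(DISTINCT pv_users.userid) AS user_cnt
  FROM pv_users
  GROUP BY pv_users.gender
) u
JOIN (
  SELECT pv_users.gender AS gender, count(DISTINCT pv_users.ip) AS ip_cnt
  FROM pv_users
  GROUP BY pv_users.gender
) i
ON (u.gender = i.gender);
```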

Select statement and group by clause

When a query uses a group by clause, the select statement can include only columns that appear in the group by clause, plus any number of aggregation functions (e.g. count).
Let's take a simple example:

CREATE TABLE t1(a INTEGER, b INTEGER);

A group by query on the above table could look like:

SELECT
   a,
   sum(b)
FROM
   t1
GROUP BY
   a;

The above query works because the select clause contains a (the group by key) and an aggregation function (sum(b)).

However, the query below DOES NOT work:

SELECT
   a,
   b
FROM
   t1
GROUP BY
   a;

This is because the select clause has an additional column (b) that is neither included in the group by clause nor wrapped in an aggregation function. To see why this is a problem, suppose the table t1 looked like:

a    b
------
100  1
100  2
100  3

Since the grouping is done only on a, what value of b should Hive display for the group a=100? One could argue that it should be the first value or the lowest value, but there are clearly multiple possible options. Hive does away with this guessing by making it invalid SQL (HQL, to be precise) to have a column in the select clause that is not included in the group by clause.
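
One way to resolve the ambiguity is to state explicitly which value of b is wanted by wrapping it in an aggregation function. The sketch below assumes the three sample rows above; the column aliases are illustrative, and collect_set is assumed to be available as a built-in in the Hive release being used.

```sql
-- Every non-grouped column is wrapped in an aggregate, so the query is valid.
SELECT
   a,
   min(b) AS lowest_b,           -- 1 for the sample rows
   max(b) AS highest_b,          -- 3 for the sample rows
   collect_set(b) AS distinct_b  -- the distinct values 1, 2, 3 (order not guaranteed)
FROM
   t1
GROUP BY
   a;
```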

Advanced Features

Multi-Group-By Inserts

The output of the aggregations or simple selects can be further sent into multiple tables or even to Hadoop DFS files (which can then be manipulated using HDFS utilities).

For example, if along with the gender breakdown one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
  SELECT pv_users.age, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;

Map-side Aggregation for Group By

hive.map.aggr controls how we do aggregations. The default was false in the earliest releases and is true in Hive 0.3 and later. If it is set to true, Hive will do the first-level aggregation directly in the map task.
This usually provides better efficiency, but may require more memory to run successfully.

set hive.map.aggr=true;
SELECT COUNT(*) FROM table2;
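
To check whether the first-level aggregation is actually planned on the map side, one option is to compare the EXPLAIN output of the same query with the setting on and off. This is a sketch that reuses pv_users from the earlier examples; the exact plan text varies between Hive versions, so no expected output is shown.

```sql
-- Plan with map-side aggregation enabled: look for a Group By Operator
-- inside the map-side operator tree.
set hive.map.aggr=true;
EXPLAIN SELECT gender, count(*) FROM pv_users GROUP BY gender;

-- Plan with map-side aggregation disabled, for comparison.
set hive.map.aggr=false;
EXPLAIN SELECT gender, count(*) FROM pv_users GROUP BY gender;
```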

Grouping Sets, Cubes, Rollups, and the GROUPING__ID Function

Version

Grouping sets, CUBE and ROLLUP operators, and the GROUPING__ID function were added in Hive release 0.10.0.

See Enhanced Aggregation, Cube, Grouping and Rollup for information about these aggregation operators.
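
As a rough illustration of these operators, the queries below reuse pv_users from the earlier examples and assume Hive 0.10.0 or later (and a release that accepts count(*)). The alias cnt is made up for illustration.

```sql
-- Counts per (gender, age) pair, per gender, per age, and a grand total.
SELECT gender, age, count(*) AS cnt
FROM pv_users
GROUP BY gender, age GROUPING SETS ((gender, age), gender, age, ());

-- WITH ROLLUP on (gender, age) is shorthand for the grouping sets
-- (gender, age), (gender) and ().
SELECT gender, age, count(*) AS cnt
FROM pv_users
GROUP BY gender, age WITH ROLLUP;
```

In the result, a column that is not part of the grouping set that produced a given row is returned as NULL.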

Also see the JIRAs:

  • HIVE-2397 Support with rollup option for group by
  • HIVE-3433 Implement CUBE and ROLLUP operators in Hive
  • HIVE-3471 Implement grouping sets in Hive
  • HIVE-3613 Implement grouping_id function

New in Hive release 0.11.0:

  • HIVE-3552 performant manner for performing cubes/rollups/grouping sets for a high number of grouping set keys
 
