Syntax of Order By

The ORDER BY syntax in Hive QL is similar to the syntax of ORDER BY in SQL language.

colOrder: ( ASC | DESC )
orderBy: ORDER BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src orderBy

There are some limitations in the "order by" clause. In the strict mode (i.e., hive.mapred.mode=strict), the order by clause has to be followed by a "limit" clause. (如果hive.mapred.mode=strict,那么order by 必须和limit一起使用)The limit clause is not necessary if you set hive.mapred.mode to nonstrict. The reason is that in order to impose total order of all results, there has to be one reducer to sort the final output. If the number of rows in the output is too large, the single reducer could take a very long time to finish.(排序数据是由一个reducer完成并且最终输出,如果没有limit限制那么,那么整个操作必须相当长的时间才能完成。)

Syntax of Sort By

The SORT BY syntax is similar to the syntax of ORDER BY in SQL language.

colOrder: ( ASC | DESC )
sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src sortBy

Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.

Difference between Sort By and Order By

Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.

Note: It may be confusing as to the difference between SORT BY alone of a single column and CLUSTER BY. The difference is that CLUSTER BY partitions by the field and SORT BY if there are multiple reducers partitions randomly in order to distribute data (and load) uniformly across the reducers.

Basically, the data in each reducer will be sorted according to the order that the user specified. The following example shows

SELECT key, value FROM src SORT BY key ASC, value DESC

The query had 2 reducers, and the output of each is:

0   5
0   3
3   6
9   1
0   4
0   3
1   1
2   5

Setting Types for Sort By

After a transform, variable types are generally considered to be strings, meaning that numeric data will be sorted lexicographically. To overcome this, a second SELECT statement with casts can be used before using SORT BY.

FROM (FROM (FROM src
            SELECT TRANSFORM(value)
            USING 'mapper'
            AS value, count) mapped
      SELECT cast(value as double) AS value, cast(count as int) AS count
      SORT BY value, count) sorted
SELECT TRANSFORM(value, count)
USING 'reducer'
AS whatever

Syntax of Cluster By and Distribute By

Cluster By and Distribute By are used mainly with the Transform/Map-Reduce Scripts. But, it is sometimes useful in SELECT statements if there is a need to partition and sort the output of a query for subsequent queries.

Cluster By is a short-cut for both Distribute By and Sort By.

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However,Distribute By does not guarantee clustering or sorting properties on the distributed keys.

For example, we are Distributing By x on the following 5 rows to 2 reducer:

x1
x2
x4
x3
x1

Reducer 1 got

x1
x2
x1

Reducer 2 got

x4
x3

Note that all rows with the same key x1 is guaranteed to be distributed to the same reducer (reducer 1 in this case), but they are not guaranteed to be clustered in adjacent positions.

In contrast, if we use Cluster By x, the two reducers will further sort rows on x:

Reducer 1 got

x1
x1
x2

Reducer 2 got

x3
x4

Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.

SELECT col1, col2 FROM t1 CLUSTER BY col1
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1
 
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC
FROM (
  FROM pv_users
  MAP ( pv_users.userid, pv_users.date )
  USING 'map_script'
  AS c1, c2, c3
  DISTRIBUTE BY c2
  SORT BY c2, c1) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE ( map_output.c1, map_output.c2, map_output.c3 )
  USING 'reduce_script'
  AS date, count;
 

[HIve - LanguageManual] Sort/Distribute/Cluster/Order By的更多相关文章

  1. hive中Sort By,Order By,Cluster By,Distribute By,Group By的区别

    order by:  hive中的order by 和传统sql中的order by 一样,对数据做全局排序,加上排序,会新启动一个job进行排序,会把所有数据放到同一个reduce中进行处理,不管数 ...

  2. [转]hive中order by,distribute by,sort by,cluster by

    转至http://my.oschina.net/repine/blog/296562 order by,distribute by,sort by,cluster by  查询使用说明 1 2 3 4 ...

  3. [HIve - LanguageManual] Hive Operators and User-Defined Functions (UDFs)

    Hive Operators and User-Defined Functions (UDFs) Hive Operators and User-Defined Functions (UDFs) Bu ...

  4. Hive LanguageManual DDL

    hive语法规则LanguageManual DDL SQL DML 和 DDL 数据操作语言 (DML) 和 数据定义语言 (DDL) 一.数据库 增删改都在文档里说得也很明白,不重复造车轮 二.表 ...

  5. [HIve - LanguageManual] Joins

    Hive Joins Hive Joins Join Syntax Examples MapJoin Restrictions Join Optimization Predicate Pushdown ...

  6. [HIve - LanguageManual] Transform [没懂]

    Transform/Map-Reduce Syntax SQL Standard Based Authorization Disallows TRANSFORM TRANSFORM Examples ...

  7. [Hive - LanguageManual] Select base use

    Select Syntax WHERE Clause ALL and DISTINCT Clauses Partition Based Queries HAVING Clause LIMIT Clau ...

  8. [Hive - LanguageManual] DML: Load, Insert, Update, Delete

    LanguageManual DML Hive Data Manipulation Language Hive Data Manipulation Language Loading files int ...

  9. hive 中的Sort By、 Order By、Cluster By、Distribute By 区别

    Order by: order by 会对输入做全局排序,因此只有一个reducer(多个reducer无法保证全局有序)只有一个reducer,会导致当输入规模较大时,需要较长的计算时间.在hive ...

随机推荐

  1. SELinux开启与关闭

    SELinux是「Security-Enhanced Linux」的简称,是美国国家安全局「NSA=The National Security Agency」 和SCC(Secure Computin ...

  2. 机器学习 —— 概率图模型(Homework: Representation)

    前两周的作业主要是关于Factor以及有向图的构造,但是概率图模型中还有一种更强大的武器——双向图(无向图.Markov Network).与有向图不同,双向图可以描述两个var之间相互作用以及联系. ...

  3. <<c 和指针 >> 部分笔记。

    最近竟然对指针有些迷惑了,分不清指针的指向.废话少说,复习.(下面内容来自<<c和指针>>) =指针 ==内存和地址 尽管一个字包含了4个字节,它仍然只有一个地址.至于是最左边 ...

  4. java匿名对象

    java学习面向对象之匿名内部类 之前我们提到“匿名”这个字眼的时候,是在学习new对象的时候,创建匿名对象的时候用到的,之所以说是匿名,是因为直接创建对象,而没有把这个对象赋值给某个值,才称之为匿名 ...

  5. Android开发之Okhttp:java.lang.IllegalStateException: closed

    在使用Okhttp的时候 运行到response.body().string()一步时抛异常,java.lang.IllegalStateException: closed 查阅各种资料大致意思是Th ...

  6. uva10375 Choose and divide

    唯一分解定理. 挨个记录下每个质数的指数. #include<cstdio> #include<algorithm> #include<cstring> #incl ...

  7. td内容自动换行 ,td超过宽度显示点点点… , td 使用 overflow:hidden 无效,英文 数字 不换行 撑破div容器

    我们可以先给表格 table上 固定一个宽度 不让表格撑破 width: 747px; table-layout:fixed; 然后我们在td上加上如下样式 style="width:100 ...

  8. 完全二叉树的高度为什么是对lgN向下取整

    完全二叉树的高度为什么是对lgN向下取整呢? 说明一下这里的高度:只有根节点的树高度是0. 设一棵完全二叉树节点个数为N,高度为h.所以总节点个数N满足以下不等式: 1 + 21 + 22 +……+ ...

  9. [Sciter系列] MFC下的Sciter–2.Sciter中的事件,tiscript,语法

    [Sciter系列] MFC下的Sciter–2.Sciter中的事件,tiscript,CSS部分自觉学习,重点说明Tiscript部分的常见语法和事件用法. 本系列文章的目的就是一步步构建出一个功 ...

  10. poj 1201/zoj 1508 intervals 差分约束系统

      // 思路 : // 图建好后 剩下的就和上一篇的 火烧连营那题一样了 求得解都是一样的 // 所以稍微改了就过了 // 最下面还有更快的算法 速度是这个算法的2倍#include <ios ...