[Hive - LanguageManual] Sort/Distribute/Cluster/Order By
Syntax of Order By
The ORDER BY syntax in Hive QL is similar to the ORDER BY syntax in SQL.
| colOrder: ( ASC | DESC ) |
| orderBy: ORDER BY colName colOrder? (',' colName colOrder?)* |
| query: SELECT expression (',' expression)* FROM src orderBy |
The ORDER BY clause has some limitations. In strict mode (i.e., hive.mapred.mode=strict), the ORDER BY clause must be followed by a LIMIT clause. The LIMIT clause is not necessary if you set hive.mapred.mode to nonstrict. The reason is that imposing a total order on all results requires a single reducer to sort the final output. If the number of rows in the output is too large, that single reducer could take a very long time to finish.
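To make the single-reducer cost concrete, here is a small Python sketch. It is not Hive code and does not describe Hive's actual execution plan: `order_by` and its parameters are hypothetical names, and the rows are assumed to fit in memory on one reducer. Without a limit the reducer must sort every row; with a limit it only needs to keep the first N rows of the ordering.

```python
import heapq

def order_by(rows, key, limit=None):
    """Simulate one reducer producing a total order over all rows.

    Hypothetical helper for illustration only.
    """
    if limit is None:
        # No LIMIT: the single reducer sorts and emits every row.
        return sorted(rows, key=key)
    # With LIMIT n: only the n smallest rows have to be retained.
    return heapq.nsmallest(limit, rows, key=key)

rows = [(9, "a"), (0, "b"), (3, "c"), (1, "d"), (2, "e")]
full = order_by(rows, key=lambda r: r[0])
top2 = order_by(rows, key=lambda r: r[0], limit=2)
print(full)
print(top2)
```

The work done in the unlimited case grows with the whole input, which is why strict mode insists on a LIMIT.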
Syntax of Sort By
The SORT BY syntax is similar to the ORDER BY syntax in SQL.
| colOrder: ( ASC | DESC ) |
| sortBy: SORT BY colName colOrder? (',' colName colOrder?)* |
| query: SELECT expression (',' expression)* FROM src sortBy |
Hive uses the columns in SORT BY to sort the rows before feeding them to a reducer. The sort order depends on the column types. If the column is of a numeric type, the rows are sorted in numeric order. If the column is of string type, the rows are sorted in lexicographical order.
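The effect of the column type on sort order can be illustrated with a short Python sketch. This is only an analogy to show the two orderings, not Hive code; the sample values are made up.

```python
# The same four values, held as strings.
values = ["10", "9", "2", "100"]

# String type: compared character by character (lexicographical order).
as_strings = sorted(values)

# Numeric type: compared by numeric value.
as_numbers = sorted(values, key=int)

print(as_strings)
print(as_numbers)
```

Sorted as strings, "100" comes before "2"; sorted as numbers, 2 comes first.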
Difference between Sort By and Order By
Hive supports SORT BY, which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees a total order in the output, while the latter only guarantees the ordering of rows within each reducer. If there is more than one reducer, "sort by" may give partially ordered final results.
Note: The difference between SORT BY on a single column and CLUSTER BY can be confusing. CLUSTER BY partitions rows by the field, whereas SORT BY, when there are multiple reducers, partitions rows randomly in order to distribute data (and load) uniformly across the reducers.
Basically, the data in each reducer will be sorted according to the order that the user specified. The following example shows this:
| SELECT key, value FROM src SORT BY key ASC, value DESC | 
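The query had 2 reducers, and the output of each is:
Reducer 1:
| key | value |
| 0 | 5 |
| 0 | 3 |
| 3 | 6 |
| 9 | 1 |
Reducer 2:
| key | value |
| 0 | 4 |
| 0 | 3 |
| 1 | 1 |
| 2 | 5 |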
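The per-reducer guarantee can be simulated in Python. The sketch below is not Hive code: the assignment of rows to the two reducers is fixed by hand as a stand-in for Hive's distribution, and the rows are the example output above read as (key, value) pairs. Each reducer sorts by key ascending and value descending, yet the concatenated result is not totally ordered.

```python
def sort_by(partition):
    # key ASC, value DESC within a single reducer
    return sorted(partition, key=lambda r: (r[0], -r[1]))

# Hand-picked reducer inputs (an assumption, standing in for Hive's distribution).
reducer_inputs = [
    [(3, 6), (0, 3), (9, 1), (0, 5)],
    [(2, 5), (0, 4), (1, 1), (0, 3)],
]

reducer_outputs = [sort_by(p) for p in reducer_inputs]

# SORT BY: the final result is just the reducer outputs placed one after another.
concatenated = [row for out in reducer_outputs for row in out]

# ORDER BY: a single sort over everything gives a total order.
total_order = sort_by(concatenated)

for i, out in enumerate(reducer_outputs, start=1):
    print("reducer", i, out)
print("globally sorted?", concatenated == total_order)
```

Each reducer's list is sorted on its own, but key 0 appears in both outputs, so the combined result is only partially ordered.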
Setting Types for Sort By
After a transform, variable types are generally considered to be strings, meaning that numeric data will be sorted lexicographically. To overcome this, a second SELECT statement with casts can be used before using SORT BY.
| FROM (FROM (FROM src |
|             SELECT TRANSFORM(value) |
|             USING 'mapper' |
|             AS value, count) mapped |
|       SELECT cast(value as double) AS value, cast(count as int) AS count |
|       SORT BY value, count) sorted |
| SELECT TRANSFORM(value, count) |
| USING 'reducer' |
| AS whatever |
Syntax of Cluster By and Distribute By
Cluster By and Distribute By are used mainly with Transform/Map-Reduce scripts. However, they are sometimes useful in SELECT statements when the output of a query needs to be partitioned and sorted for subsequent queries.
Cluster By is a short-cut for both Distribute By and Sort By.
Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.
For example, suppose we Distribute By x on the following 5 rows across 2 reducers:
| x1 |
| x2 |
| x4 |
| x3 |
| x1 |
Reducer 1 got
| x1 |
| x2 |
| x1 |
Reducer 2 got
| x4 |
| x3 |
Note that all rows with the same key x1 are guaranteed to be distributed to the same reducer (reducer 1 in this case), but they are not guaranteed to be clustered in adjacent positions.
In contrast, if we use Cluster By x, the two reducers will further sort rows on x:
Reducer 1 got
| x1 |
| x1 |
| x2 |
Reducer 2 got
| x3 |
| x4 |
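The five-row example above can be reproduced with a small Python simulation. This is a sketch, not Hive code: `partition` is a hypothetical stand-in chosen so that x1 and x2 land on the first reducer and x3 and x4 on the second, as in the example, whereas Hive would use its own hash of the Distribute By columns.

```python
def partition(key, num_reducers=2):
    # Stand-in partitioner (an assumption): x1, x2 -> 0 and x3, x4 -> 1.
    return ((int(key[1:]) - 1) // 2) % num_reducers

def distribute_by(rows, num_reducers=2):
    # Equal keys always reach the same reducer; arrival order is kept.
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[partition(row, num_reducers)].append(row)
    return reducers

def cluster_by(rows, num_reducers=2):
    # Cluster By = Distribute By followed by a sort inside each reducer.
    return [sorted(r) for r in distribute_by(rows, num_reducers)]

rows = ["x1", "x2", "x4", "x3", "x1"]
distributed = distribute_by(rows)
clustered = cluster_by(rows)
print(distributed)
print(clustered)
```

With Distribute By alone the two x1 rows share a reducer but are separated by x2; after Cluster By they are adjacent.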
Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.
| SELECT col1, col2 FROM t1 CLUSTER BY col1 | 
| SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 |
| SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC |
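When the partition columns and sort columns differ, the two steps can be modelled separately. The Python sketch below is an illustration under stated assumptions, not Hive code: `distribute_and_sort` is a hypothetical helper, col1 is assumed to be an integer, and `col1 % num_reducers` stands in for Hive's partitioning hash. It mimics DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC.

```python
def distribute_and_sort(rows, distribute_key, sort_key, num_reducers=2):
    # Step 1: route each row to a reducer using only the distribute key.
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[distribute_key(row) % num_reducers].append(row)
    # Step 2: sort inside each reducer using the (possibly different) sort key.
    return [sorted(r, key=sort_key) for r in reducers]

# (col1, col2) rows; the values are made up for the example.
rows = [(1, 10), (2, 5), (1, 30), (3, 7), (2, 9), (1, 20)]

result = distribute_and_sort(
    rows,
    distribute_key=lambda r: r[0],       # DISTRIBUTE BY col1
    sort_key=lambda r: (r[0], -r[1]),    # SORT BY col1 ASC, col2 DESC
)
for i, out in enumerate(result):
    print("reducer", i, out)
```

Here the partition column (col1) is a prefix of the sort columns (col1, col2), the usual case, but the helper accepts any pair of keys.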
| FROM ( |
|   FROM pv_users |
|   MAP ( pv_users.userid, pv_users.date ) |
|   USING 'map_script' |
|   AS c1, c2, c3 |
|   DISTRIBUTE BY c2 |
|   SORT BY c2, c1) map_output |
| INSERT OVERWRITE TABLE pv_users_reduced |
|   REDUCE ( map_output.c1, map_output.c2, map_output.c3 ) |
|   USING 'reduce_script' |
|   AS date, count; |