[Hive - LanguageManual] Sampling
Sampling Syntax
Sampling Bucketized Table
table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])
The TABLESAMPLE clause lets users write queries against a sample of the data instead of the whole table. It can be added to any table in the FROM clause. Buckets are numbered starting from 1. colname indicates the column on which to sample each row in the table; it can be one of the non-partition columns in the table, or rand() to sample on the entire row instead of an individual column. The rows of the table are 'bucketed' on colname randomly into y buckets numbered 1 through y, and the rows that belong to bucket x are returned.
The following example returns the 3rd bucket out of 32 buckets of the table source. 's' is the table alias.
SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
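The bucketing rule described above can be modelled in a few lines of Python. This is only a sketch of the idea, not Hive's implementation: the function name tablesample_bucket, the use of Python's built-in hash(), and the "hash modulo y, plus 1" bucket assignment are all assumptions made for illustration.

```python
import random


def tablesample_bucket(rows, x, y, key=None, seed=None):
    """Return the rows that fall into bucket x (1-based) out of y buckets.

    key=None models ON rand(): every row is assigned a random bucket.
    Otherwise key(row) is hashed, so rows with equal values share a bucket.
    The hash-modulo rule is an illustrative assumption, not Hive's hash.
    """
    if not 1 <= x <= y:
        raise ValueError("bucket number x must satisfy 1 <= x <= y")
    rng = random.Random(seed)
    sample = []
    for row in rows:
        if key is None:
            bucket = rng.randrange(y) + 1      # random bucket in 1..y
        else:
            bucket = hash(key(row)) % y + 1    # deterministic bucket in 1..y
        if bucket == x:
            sample.append(row)
    return sample


# Hypothetical table with a single integer column "id".
rows = [{"id": i} for i in range(64)]
on_id = tablesample_bucket(rows, 3, 32, key=lambda r: r["id"])
on_rand = tablesample_bucket(rows, 3, 32, seed=7)
```

Sampling on a column always returns the same rows for the same data, while sampling on rand() depends on the random draw, so on average each bucket holds about 1/y of the rows.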
Input pruning: Typically, TABLESAMPLE scans the entire table to fetch the sample, which is not very efficient. Instead, the table can be created with a CLUSTERED BY clause, which indicates the set of columns on which the table is hash-partitioned (clustered). If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table.
Example:
So in the above example, if table 'source' was created with 'CLUSTERED BY (id) INTO 32 BUCKETS', the clause
TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)
would pick out the 3rd and 19th clusters, as each bucket would be composed of (32/16)=2 clusters.
On the other hand, the clause
TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)
would pick out half of the 3rd cluster, as each bucket would be composed of (32/64)=1/2 of a cluster.
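The arithmetic in these two cases can be checked with a short Python sketch. The function name clusters_to_scan is hypothetical, and the general rule it encodes (clusters x, x+y, x+2y, ... when y divides the cluster count; a slice of a single cluster when the cluster count divides y) is an assumption extrapolated from the two examples in the text, not a statement of Hive's exact pruning logic.

```python
from fractions import Fraction


def clusters_to_scan(x, y, num_clusters):
    """Model which stored clusters BUCKET x OUT OF y touches.

    Returns (cluster_numbers, fraction_of_each_cluster_returned).
    Only handles the cases where one count evenly divides the other.
    """
    if not 1 <= x <= y:
        raise ValueError("bucket number x must satisfy 1 <= x <= y")
    if num_clusters % y == 0:
        # Each sample bucket is made of num_clusters / y whole clusters.
        return list(range(x, num_clusters + 1, y)), Fraction(1)
    if y % num_clusters == 0:
        # Each sample bucket is a num_clusters / y slice of one cluster.
        cluster = (x - 1) % num_clusters + 1
        return [cluster], Fraction(num_clusters, y)
    raise ValueError("y and num_clusters must divide one another")


coarse = clusters_to_scan(3, 16, 32)   # BUCKET 3 OUT OF 16 on 32 clusters
fine = clusters_to_scan(3, 64, 32)     # BUCKET 3 OUT OF 64 on 32 clusters
```

Here coarse names clusters 3 and 19 read in full, and fine names cluster 3 with half of its rows returned, matching the two cases above.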
For information about creating bucketed tables with the CLUSTERED BY clause, see Create Table (especially Bucketed Sorted Tables) and Bucketed Tables.
Block Sampling
Block sampling is available starting with Hive 0.8. It was addressed under JIRA HIVE-2121: https://issues.apache.org/jira/browse/HIVE-2121
block_sample: TABLESAMPLE (n PERCENT)
This lets Hive pick up at least n% of the data size (note that this does not necessarily mean n% of the rows) as input. Only CombineHiveInputFormat is supported, and some special compression formats are not handled. If sampling fails, the input of the MapReduce job is the whole table or partition. Sampling is done at the HDFS block level, so the sampling granularity is the block size. For example, if the block size is 256MB, you get 256MB of data even if n% of the input size is only 100MB.
In the following example, 0.1% or more of the input size will be used for the query.
SELECT * FROM source TABLESAMPLE(0.1 PERCENT) s;
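The effect of block-level granularity can be sketched as simple rounding. This is a simplified model under stated assumptions: it rounds the requested size up to whole blocks and ignores how CombineHiveInputFormat actually picks splits. The function name block_sample_size and the 10000MB table size are hypothetical.

```python
import math


def block_sample_size(total_size, percent, block_size):
    """Estimate how much data TABLESAMPLE(n PERCENT) reads.

    All sizes use the same unit (MB here). The requested amount is
    rounded up to whole blocks, and at least one block is always read.
    """
    requested = total_size * percent / 100
    blocks = max(1, math.ceil(requested / block_size))
    # The sample can never be larger than the input itself.
    return min(blocks * block_size, total_size)


# 1% of a 10000MB input is 100MB, but one 256MB block is the minimum unit.
one_percent = block_sample_size(10000, 1, 256)
half = block_sample_size(10000, 50, 256)
```

In this model one_percent is 256 (MB), the case described above, and half is 5120 (MB), which is 20 whole blocks rather than exactly 5000MB.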
If you want to sample the same data with different blocks, you can change the seed number:
set hive.sample.seednumber=<INTEGER>;
Alternatively, the user can specify the total length to be read, but this has the same limitations as PERCENT sampling. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)
block_sample: TABLESAMPLE (ByteLengthLiteral)
ByteLengthLiteral : (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')
In the following example, 100M or more of the input size will be used for the query.
SELECT * FROM source TABLESAMPLE(100M) s;
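The ByteLengthLiteral grammar above can be read as "one or more digits followed by a single unit letter". The following Python sketch parses such a literal into a byte count. The function name parse_byte_length is hypothetical, and treating k, m and g as powers of 1024 is an assumption; the text above does not state the multipliers.

```python
import re

# One or more ASCII digits followed by exactly one unit letter.
_BYTE_LENGTH = re.compile(r"([0-9]+)([bBkKmMgG])")

# Assumed multipliers (binary units); not confirmed by the text above.
_MULTIPLIER = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_byte_length(literal):
    """Convert a literal such as '100M' into a number of bytes."""
    match = _BYTE_LENGTH.fullmatch(literal)
    if match is None:
        raise ValueError(f"not a ByteLengthLiteral: {literal!r}")
    digits, unit = match.groups()
    return int(digits) * _MULTIPLIER[unit.lower()]


hundred_mb = parse_byte_length("100M")
```

Under the assumed multipliers, hundred_mb is 100 * 1024 * 1024 bytes. A bare number with no unit letter, such as "100", is rejected because the grammar requires the suffix.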
Hive also supports limiting input by row count, but this behaves differently from the two methods above. First, it does not need CombineHiveInputFormat, which means it can be used with non-native tables. Second, the row count given by the user is applied to each split, so the total row count can vary with the number of input splits. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)
block_sample: TABLESAMPLE (n ROWS)
For example, the following query will take the first 10 rows from each input split.
SELECT * FROM source TABLESAMPLE(10 ROWS);
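Because the row count applies per split, the total number of rows returned depends on how the input is split. A minimal Python sketch of that behaviour follows; the function name tablesample_rows and the three example splits are hypothetical, and input splits are modelled as plain lists.

```python
def tablesample_rows(splits, n):
    """Take the first n rows from every input split.

    A split with fewer than n rows contributes all of its rows.
    """
    sample = []
    for split in splits:
        sample.extend(split[:n])
    return sample


# Three hypothetical splits holding 100, 5 and 50 rows.
splits = [list(range(100)), list(range(100, 105)), list(range(200, 250))]
sample = tablesample_rows(splits, 10)
```

With these splits the sample holds 10 + 5 + 10 = 25 rows, not 10. The same data stored as a single split would yield only 10 rows.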