[Hive - LanguageManual] Sampling
Sampling Syntax
Sampling Bucketized Table
table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])
The TABLESAMPLE clause lets users write queries against a sample of the data instead of the whole table. It can be added to any table in the FROM clause. Buckets are numbered starting from 1. colname indicates the column on which to sample each row in the table; it can be one of the non-partition columns in the table, or rand() to sample on the entire row instead of an individual column. The rows of the table are 'bucketed' on colname randomly into y buckets numbered 1 through y, and the rows that belong to bucket x are returned.
In the following example, the 3rd bucket out of the 32 buckets of the table source is chosen. 's' is the table alias.
SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
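The bucketing rule described above can be sketched in a few lines of Python. This is only an illustration of the idea (assign every row to one of y buckets, return bucket x), not Hive's implementation: the helper names `bucket_of` and `table_sample` are hypothetical, and CRC32 stands in for whatever hash function Hive actually applies to the column.

```python
import zlib


def bucket_of(value, y):
    """Map a column value to a bucket number in 1..y.

    CRC32 is an illustrative stand-in; it is NOT the hash Hive uses.
    """
    digest = zlib.crc32(str(value).encode("utf-8"))
    return digest % y + 1


def table_sample(rows, x, y, key):
    """Return the rows whose `key` column falls into bucket x out of y."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    return [row for row in rows if bucket_of(row[key], y) == x]


# Hypothetical table with 1000 rows, sampled like BUCKET 3 OUT OF 32 ON id.
rows = [{"id": i, "name": f"user{i}"} for i in range(1000)]
sample = table_sample(rows, x=3, y=32, key="id")
print(len(sample), "of", len(rows), "rows sampled")
```

Because every row lands in exactly one bucket, the y samples together cover the table without overlap, and each one holds roughly 1/y of the rows when the hash spreads values evenly.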
Input pruning: Typically, TABLESAMPLE scans the entire table to fetch the sample, which is not very efficient. Instead, the table can be created with a CLUSTERED BY clause, which indicates the set of columns on which the table is hash-partitioned (clustered). If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table.
Example:
Continuing the example above, if table 'source' was created with 'CLUSTERED BY (id) INTO 32 BUCKETS', then the clause
TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)
would pick out the 3rd and 19th clusters, as each bucket would be composed of (32/16)=2 clusters.
On the other hand, the clause
TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)
would pick out half of the 3rd cluster, as each bucket would be composed of (32/64)=1/2 of a cluster.
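The arithmetic in these two cases can be reproduced with a small Python sketch. It models only the reasoning in the text (which clusters are read and what fraction of each), under the assumption that y divides the table's bucket count n or is a multiple of it; the function name `clusters_to_scan` is hypothetical and does not correspond to anything in Hive.

```python
from fractions import Fraction


def clusters_to_scan(x, y, n):
    """Model TABLESAMPLE(BUCKET x OUT OF y) on a table clustered into n buckets.

    Returns (cluster numbers read, fraction of each cluster used).
    Assumes y divides n or n divides y, as in the examples in the text.
    """
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if y <= n:
        if n % y != 0:
            raise ValueError("this sketch assumes y divides n")
        # Each sample bucket is made of n/y whole clusters: x, x+y, x+2y, ...
        return list(range(x, n + 1, y)), Fraction(1)
    if y % n != 0:
        raise ValueError("this sketch assumes n divides y")
    # Each sample bucket is only n/y of a single cluster.
    return [(x - 1) % n + 1], Fraction(n, y)


clusters, fraction = clusters_to_scan(3, 16, 32)
print(clusters, fraction)   # [3, 19] 1

clusters, fraction = clusters_to_scan(3, 64, 32)
print(clusters, fraction)   # [3] 1/2
```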
For information about creating bucketed tables with the CLUSTERED BY clause, see Create Table (especially Bucketed Sorted Tables) and Bucketed Tables.
Block Sampling
Block sampling is available starting with Hive 0.8. It was added under JIRA HIVE-2121 - https://issues.apache.org/jira/browse/HIVE-2121
block_sample: TABLESAMPLE (n PERCENT)
This allows Hive to pick up at least n% of the data size (note that this does not necessarily mean n% of the rows) as input. Only CombineHiveInputFormat is supported, and some special compression formats are not handled. If sampling fails, the input of the MapReduce job is the whole table or partition. Sampling is done at the HDFS block level, so the sampling granularity is the block size. For example, if the block size is 256MB, then even if n% of the input size is only 100MB, you get 256MB of data.
In the following example, 0.1% or more of the input size is used for the query.
SELECT * FROM source TABLESAMPLE(0.1 PERCENT) s;
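To make the block-granularity point concrete, the following Python sketch estimates how much data a PERCENT sample would read. It is a simplified model that assumes whole blocks are picked until at least the requested share is covered; the function name `sampled_bytes` and the table size are made up for illustration and are not taken from Hive.

```python
import math

MB = 1024 * 1024


def sampled_bytes(total_bytes, percent, block_bytes):
    """Estimate bytes read for TABLESAMPLE(percent PERCENT).

    Simplified model: round the requested size up to whole blocks,
    never reading more than the table itself.
    """
    target = total_bytes * percent / 100
    blocks = max(1, math.ceil(target / block_bytes))
    return min(blocks * block_bytes, total_bytes)


# Hypothetical 100000MB table with 256MB blocks: 0.1% is only 100MB,
# but the smallest unit that can be read is one full block.
read = sampled_bytes(100_000 * MB, 0.1, 256 * MB)
print(read // MB, "MB")   # 256 MB
```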
Sometimes you want to sample the same data with different blocks. To do so, change the seed number:
set hive.sample.seednumber=<INTEGER>;
Alternatively, the user can specify the total length to be read, but this has the same limitations as PERCENT sampling. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)
block_sample: TABLESAMPLE (ByteLengthLiteral)
ByteLengthLiteral : (Digit)+ ( 'b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G' )
In the following example, an input size of 100M or more is used for the query.
SELECT * FROM source TABLESAMPLE(100M) s;
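The ByteLengthLiteral grammar above is small enough to check with a regular expression. The Python sketch below parses such a literal into a byte count; the function name `parse_byte_length` is hypothetical, and the binary multipliers (k = 2^10, m = 2^20, g = 2^30) are an assumption of this sketch, since the text does not say which multiples Hive applies.

```python
import re

# (Digit)+ followed by exactly one unit suffix, as in the grammar above.
_LITERAL = re.compile(r"([0-9]+)([bBkKmMgG])")

# ASSUMPTION: binary multiples; the source text does not specify this.
_SHIFT = {"b": 0, "k": 10, "m": 20, "g": 30}


def parse_byte_length(literal):
    """Convert a literal such as '100M' into a number of bytes."""
    match = _LITERAL.fullmatch(literal)
    if match is None:
        raise ValueError(f"not a ByteLengthLiteral: {literal!r}")
    digits, suffix = match.groups()
    return int(digits) << _SHIFT[suffix.lower()]


print(parse_byte_length("100M"))   # 104857600
```

Note that the grammar requires a suffix, so a bare number such as `100` is rejected by this sketch.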
Hive also supports limiting input by row count, but this behaves differently from the two methods above. First, it does not need CombineHiveInputFormat, which means it can be used with non-native tables. Second, the row count given by the user is applied to each split, so the total row count can vary with the number of input splits. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)
block_sample: TABLESAMPLE (n ROWS)
For example, the following query will take the first 10 rows from each input split.
SELECT * FROM source TABLESAMPLE(10 ROWS);
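Because the row limit applies to each split, the total number of rows returned depends on how the input is split. A short Python sketch of that arithmetic follows; the function name `rows_returned` and the split sizes are invented for illustration.

```python
def rows_returned(split_row_counts, n):
    """Total rows for TABLESAMPLE(n ROWS): up to n rows from each split."""
    return sum(min(n, count) for count in split_row_counts)


# Hypothetical input with three splits; the last one holds only 4 rows.
print(rows_returned([1000, 1000, 4], 10))   # 24

# The same data read as a single split yields only 10 rows.
print(rows_returned([2004], 10))            # 10
```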