Sampling Syntax

Sampling Bucketized Table

table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])

The TABLESAMPLE clause lets users write queries against a sample of the data instead of the whole table. It can be added to any table in the FROM clause. Buckets are numbered starting from 1. colname indicates the column on which to sample each row in the table; it can be any non-partition column of the table, or rand() to sample on the entire row instead of an individual column. The rows of the table are 'bucketed' on colname randomly into y buckets numbered 1 through y, and the rows that belong to bucket x are returned.

In the following example, the 3rd bucket out of the 32 buckets of the table source is chosen. 's' is the table alias.

SELECT *
FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
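The bucketing rule described above, for the case where colname is a column rather than rand(), can be modeled in a few lines of Python. This is only a sketch: the names bucket_of and sample_bucket are hypothetical, and zlib.crc32 is a stand-in hash chosen for illustration, so the rows it selects will not match what Hive selects.

```python
import zlib

def bucket_of(value, y):
    """Return the 1-based bucket number of a value among y buckets."""
    # crc32 is an assumption of this sketch; Hive applies its own hash function.
    return zlib.crc32(str(value).encode("utf-8")) % y + 1

def sample_bucket(rows, key, x, y):
    """Keep the rows whose key column falls in bucket x out of y (x is 1-based)."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    return [row for row in rows if bucket_of(row[key], y) == x]

# Hypothetical data: 1000 rows with an integer id column.
rows = [{"id": i} for i in range(1000)]
sample = sample_bucket(rows, "id", 3, 32)
```

Every row lands in exactly one of the y buckets, so the 32 possible samples together cover the table without overlap.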

Input pruning: Typically, TABLESAMPLE scans the entire table and then fetches the sample, which is not very efficient. Instead, the table can be created with a CLUSTERED BY clause, which indicates the set of columns on which the table is hash-partitioned (clustered). If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table.

Example:

If the table 'source' in the example above was created with 'CLUSTERED BY (id) INTO 32 BUCKETS', then

TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)

would pick out the 3rd and 19th clusters as each bucket would be composed of (32/16)=2 clusters.

On the other hand, the TABLESAMPLE clause

TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)

would pick out half of the 3rd cluster as each bucket would be composed of (32/64)=1/2 of a cluster.
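The arithmetic in the two cases above can be sketched in Python. The function name clusters_to_scan is hypothetical, and the sketch covers only the two situations described here (y divides the cluster count, or the cluster count divides y); it is not a description of how Hive handles other combinations.

```python
def clusters_to_scan(x, y, num_clusters):
    """Return (clusters, fraction) for BUCKET x OUT OF y on a clustered table.

    clusters is the list of 1-based clusters that are read, and fraction is
    the share of each of those clusters that ends up in the sample.
    """
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if num_clusters % y == 0:
        # Each bucket spans num_clusters / y whole clusters: x, x + y, x + 2y, ...
        return list(range(x, num_clusters + 1, y)), 1.0
    if y % num_clusters == 0:
        # Each bucket is a num_clusters / y slice of a single cluster.
        return [(x - 1) % num_clusters + 1], num_clusters / y
    raise ValueError("this sketch only handles y dividing the cluster count, or the reverse")

print(clusters_to_scan(3, 16, 32))  # the 3rd and 19th clusters, read in full
print(clusters_to_scan(3, 64, 32))  # half of the 3rd cluster
```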

For information about creating bucketed tables with the CLUSTERED BY clause, see Create Table (especially Bucketed Sorted Tables) and Bucketed Tables.

Block Sampling

Block sampling is available starting with Hive 0.8 (HIVE-2121: https://issues.apache.org/jira/browse/HIVE-2121).

block_sample: TABLESAMPLE (n PERCENT)

This allows Hive to pick up at least n% of the data size (note that this does not necessarily mean n% of the rows) as input. Only CombineHiveInputFormat is supported, and some special compression formats are not handled. If sampling fails, the input of the MapReduce job is the whole table or partition. Sampling is done at the HDFS block level, so the sampling granularity is the block size. For example, if the block size is 256MB, you get 256MB of data even if n% of the input size is only 100MB.

In the following example, 0.1% or more of the input size will be used for the query.

SELECT *
FROM source TABLESAMPLE(0.1 PERCENT) s;
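The effect of block-level granularity can be estimated with a short Python sketch. It is a simplification under stated assumptions: the function name sampled_bytes is hypothetical, the requested size is simply rounded up to whole blocks, and the way splits are actually combined by the input format is not modeled.

```python
import math

def sampled_bytes(total_bytes, percent, block_bytes):
    """Estimate the input size when sampling at whole-block granularity."""
    target = total_bytes * percent / 100.0
    # At least one block is read, and the target is rounded up to whole blocks.
    blocks = max(1, math.ceil(target / block_bytes))
    # The sample can never exceed the table or partition itself.
    return min(blocks * block_bytes, total_bytes)

MB = 1024 * 1024
# 1% of a 10000MB input is 100MB, but with 256MB blocks a full block is read.
print(sampled_bytes(10000 * MB, 1, 256 * MB) // MB)
```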

If you want to sample the same data with different blocks, you can change this seed number:

set hive.sample.seednumber=<INTEGER>;

Alternatively, the user can specify the total length to be read, but this has the same limitations as PERCENT sampling. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)

block_sample: TABLESAMPLE (ByteLengthLiteral)
 
ByteLengthLiteral : (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')

In the following example, 100M or more of the input will be used for the query.

SELECT *
FROM source TABLESAMPLE(100M) s;
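To make the ByteLengthLiteral grammar concrete, here is a Python sketch that turns such a literal into a byte count. The function name parse_byte_length is hypothetical, and the binary multipliers (k = 1024 bytes, m = 1024 k, g = 1024 m) are an assumption of this sketch rather than something stated on this page.

```python
import re

# Assumption: binary multipliers for the unit suffixes.
_UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
# One or more digits followed by exactly one unit letter, in either case.
_PATTERN = re.compile(r"([0-9]+)([bBkKmMgG])")

def parse_byte_length(literal):
    """Convert a literal such as '100M' into a number of bytes."""
    match = _PATTERN.fullmatch(literal)
    if match is None:
        raise ValueError(f"not a ByteLengthLiteral: {literal!r}")
    digits, unit = match.groups()
    return int(digits) * _UNITS[unit.lower()]

print(parse_byte_length("100M"))
```

A literal without a unit suffix, such as '100', does not match the grammar and is rejected by the sketch.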

Hive also supports limiting input by row count, but this acts differently from the two methods above. First, it does not need CombineHiveInputFormat, which means it can be used with non-native tables. Second, the row count given by the user is applied to each split, so the total row count can vary with the number of input splits. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)

block_sample: TABLESAMPLE (n ROWS)

For example, the following query will take the first 10 rows from each input split.

SELECT * FROM source TABLESAMPLE(10 ROWS);
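Because the row limit applies per split, the total number of rows returned depends on how the input is split. A small Python sketch of that arithmetic follows; the function name rows_sampled and the split sizes are hypothetical.

```python
def rows_sampled(rows_per_split, n):
    """Total rows returned when at most n rows are taken from each split."""
    return sum(min(n, rows) for rows in rows_per_split)

# Three splits of 500 rows each yield 30 rows for TABLESAMPLE(10 ROWS).
print(rows_sampled([500, 500, 500], 10))
# A split holding fewer than n rows contributes only what it has.
print(rows_sampled([500, 4], 10))
```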
