Sampling Syntax

Sampling Bucketized Table

table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])

The TABLESAMPLE clause lets users write queries against a sample of the data instead of the whole table. It can be added to any table in the FROM clause. Buckets are numbered starting from 1. colname indicates the column on which to sample each row in the table; it can be one of the non-partition columns in the table, or rand() to sample on the entire row instead of an individual column. The rows of the table are 'bucketed' on colname randomly into y buckets numbered 1 through y, and the rows that belong to bucket x are returned.

In the following example, the 3rd bucket out of the 32 buckets of the table source is chosen. 's' is the table alias.

SELECT *
FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
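The bucketing rule described above can be modelled in a few lines of Python. This is a minimal sketch, not Hive's implementation: `sample_bucket` is a hypothetical helper, and the hash of the sampling column is assumed to be the integer key itself, purely for illustration. A random key plays the role of rand().

```python
import random

def sample_bucket(rows, x, y, key):
    """Return the rows that fall into bucket x (1-based) out of y buckets.

    key(row) must return a non-negative integer. It stands in for the hash
    of the sampling column (an assumption made for this sketch).
    """
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    return [row for row in rows if key(row) % y + 1 == x]

rows = [{"id": i} for i in range(100)]

# Sampling ON id: the bucket depends only on the column value.
by_id = sample_bucket(rows, 3, 32, key=lambda r: r["id"])
print([r["id"] for r in by_id])  # [2, 34, 66, 98]

# Sampling ON rand(): each row lands in a random bucket regardless of content.
rng = random.Random(0)
by_rand = sample_bucket(rows, 3, 32, key=lambda r: rng.randrange(2**31))
print(len(by_rand))
```

With 100 rows and 32 buckets, either variant returns roughly 1/32 of the rows. Only the column-based variant returns the same rows on every run.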

Input pruning: Typically, TABLESAMPLE scans the entire table to fetch the sample, which is not very efficient. Instead, the table can be created with a CLUSTERED BY clause, which indicates the set of columns on which the table is hash-partitioned (clustered). If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table.

Example: if the table 'source' in the query above was created with 'CLUSTERED BY id INTO 32 BUCKETS', then

TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)

would pick out the 3rd and 19th clusters as each bucket would be composed of (32/16)=2 clusters.

On the other hand, the TABLESAMPLE clause

TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)

would pick out half of the 3rd cluster as each bucket would be composed of (32/64)=1/2 of a cluster.
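The arithmetic in these two cases can be checked with a short Python sketch. It assumes a simple modular model inferred from the examples above (cluster c belongs to sample bucket ((c - 1) mod y) + 1) and only handles the case where one of y and the cluster count divides the other. `clusters_for_sample` is a hypothetical name, not a Hive API.

```python
from fractions import Fraction

def clusters_for_sample(x, y, num_clusters):
    """Return (1-based cluster numbers scanned, fraction of each cluster used)
    for TABLESAMPLE(BUCKET x OUT OF y) on a table with num_clusters clusters."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    if y <= num_clusters:
        if num_clusters % y != 0:
            raise ValueError("this sketch needs y to divide num_clusters")
        # Each sample bucket is made of num_clusters / y whole clusters.
        clusters = [c for c in range(1, num_clusters + 1)
                    if (c - 1) % y == x - 1]
        return clusters, Fraction(1)
    if y % num_clusters != 0:
        raise ValueError("this sketch needs num_clusters to divide y")
    # Each sample bucket is a num_clusters / y slice of a single cluster.
    return [(x - 1) % num_clusters + 1], Fraction(num_clusters, y)

print(clusters_for_sample(3, 16, 32))  # ([3, 19], Fraction(1, 1))
print(clusters_for_sample(3, 64, 32))  # ([3], Fraction(1, 2))
```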

For information about creating bucketed tables with the CLUSTERED BY clause, see Create Table (especially Bucketed Sorted Tables) and Bucketed Tables.

Block Sampling

Block sampling is available starting with Hive 0.8 (see JIRA HIVE-2121: https://issues.apache.org/jira/browse/HIVE-2121).

block_sample: TABLESAMPLE (n PERCENT)

This allows Hive to pick up at least n% of the data size (note that this does not necessarily mean n% of the rows) as input. Only CombineHiveInputFormat is supported, and some special compression formats are not handled. If sampling fails, the input of the MapReduce job will be the whole table/partition. Sampling is done at the HDFS block level, so the sampling granularity is the block size. For example, if the block size is 256MB, even if n% of the input size is only 100MB, you get 256MB of data.

In the following example, 0.1% or more of the input size will be used for the query.
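The block-granularity effect can be illustrated with a small Python calculation. This is a simplified model rather than Hive's actual split-selection logic: it assumes the sample is rounded up to whole blocks and capped at the total input size, and `sampled_size` is a hypothetical helper. Sizes are in MB.

```python
import math

def sampled_size(total_size, percent, block_size):
    """Estimate how much data is read when sampling `percent` of the input
    at block granularity (all sizes in the same unit, e.g. MB)."""
    target = total_size * percent / 100
    # At least one block is read, and only whole blocks are read.
    blocks = max(1, math.ceil(target / block_size))
    return min(total_size, blocks * block_size)

# 1% of a 10,000MB table is 100MB, but one 256MB block is the minimum unit.
print(sampled_size(10_000, 1, 256))   # 256
# 10% is 1,000MB, which rounds up to four 256MB blocks.
print(sampled_size(10_000, 10, 256))  # 1024
```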

SELECT *
FROM source TABLESAMPLE(0.1 PERCENT) s;

To sample the same data using different blocks, change the seed number:

set hive.sample.seednumber=<INTEGER>;

Alternatively, the user can specify the total length to be read, but this has the same limitations as PERCENT sampling (as of Hive 0.10.0, https://issues.apache.org/jira/browse/HIVE-3401).

block_sample: TABLESAMPLE (ByteLengthLiteral)
 
ByteLengthLiteral : (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')

In the following example, 100M or more of the input will be used for the query.

SELECT *
FROM source TABLESAMPLE(100M) s;
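To make the ByteLengthLiteral grammar concrete, here is a hedged Python sketch that converts such a literal to a byte count. `parse_byte_length` is a hypothetical helper, and the use of binary multiples (1K = 1024 bytes) is an assumption, since the text above does not state which multiples the suffixes stand for.

```python
import re

# Assumed binary multiples: k = 2**10, m = 2**20, g = 2**30 bytes.
_UNIT_SHIFT = {"b": 0, "k": 10, "m": 20, "g": 30}

def parse_byte_length(literal):
    """Parse a literal of the form (Digit)+ followed by one unit letter."""
    match = re.fullmatch(r"([0-9]+)([bBkKmMgG])", literal)
    if match is None:
        raise ValueError(f"not a ByteLengthLiteral: {literal!r}")
    digits, unit = match.groups()
    return int(digits) << _UNIT_SHIFT[unit.lower()]

print(parse_byte_length("100M"))  # 104857600
```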

Hive also supports limiting input by row count, but this behaves differently from the two methods above. First, it does not need CombineHiveInputFormat, which means it can be used with non-native tables. Second, the row count given by the user is applied to each split, so the total row count can vary with the number of input splits. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)

block_sample: TABLESAMPLE (n ROWS)

For example, the following query will take the first 10 rows from each input split.

SELECT * FROM source TABLESAMPLE(10 ROWS);
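Because the limit applies per split, the total number of rows returned depends on how the input is split. The Python sketch below models this under the assumption that each split contributes at most n rows. `total_sampled_rows` and the split sizes are made up for illustration.

```python
def total_sampled_rows(rows_per_split, n):
    """Total rows returned when each input split contributes at most n rows."""
    return sum(min(n, rows) for rows in rows_per_split)

# Three splits, one of which holds fewer than 10 rows.
print(total_sampled_rows([1000, 1000, 4], 10))  # 24
# The same 2004 rows in a single split yield only 10 rows.
print(total_sampled_rows([2004], 10))           # 10
```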
