Sampling Syntax  抽样语法

Sampling Bucketized Table

table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])

The TABLESAMPLE clause allows the users to write queries for samples of the data instead of the whole table. The TABLESAMPLE clause can be added to any table in the FROM clause. The buckets are numbered starting from 1. colname indicates the column on which to sample each row in the table. colname can be one of the non-partition columns in the table or rand() indicating sampling on the entire row instead of an individual column. The rows of the table are 'bucketed' on the colname randomly into y buckets numbered 1 through y. Rows which belong to bucket x are returned.

In the following example the 3rd bucket out of the 32 buckets of the table source. 's' is the table alias.

SELECT *
FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;

Input pruning: Typically, TABLESAMPLE will scan the entire table and fetch the sample. But, that is not very efficient. Instead, the table can be created with a CLUSTERED BY clause which indicates the set of columns on which the table is hash-partitioned/clustered on. If the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, TABLESAMPLE scans only the required hash-partitions of the table.

Example:

So in the above example, if table 'source' was created with 'CLUSTERED BY id INTO 32 BUCKETS'

TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)

would pick out the 3rd and 19th clusters as each bucket would be composed of (32/16)=2 clusters.

On the other hand the tablesample clause

TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)

would pick out half of the 3rd cluster as each bucket would be composed of (32/64)=1/2 of a cluster.

For information about creating bucketed tables with the CLUSTERED BY clause, see Create Table (especially Bucketed Sorted Tables) and Bucketed Tables.

Block Sampling

Block sampling is available starting with Hive 0.8. Addressed under JIRA - https://issues.apache.org/jira/browse/HIVE-2121

block_sample: TABLESAMPLE (n PERCENT)

This will allow Hive to pick up at least n% data size (notice it doesn't necessarily mean number of rows) as inputs. Only CombineHiveInputFormat is supported and some special compression formats are not handled. If we fail to sample it, the input of MapReduce job will be the whole table/partition. We do it in HDFS block level so that the sampling granularity is block size. For example, if block size is 256MB, even if n% of input size is only 100MB, you get 256MB of data.

In the following example the input size 0.1% or more will be used for the query.

SELECT *
FROM source TABLESAMPLE(0.1 PERCENT) s;

Sometimes you want to sample the same data with different blocks, you can change this seed number:

set hive.sample.seednumber=<INTEGER>;

Or user can specify total length to be read, but it has same limitation with PERCENT sampling. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)

block_sample: TABLESAMPLE (ByteLengthLiteral)
 
ByteLengthLiteral : (Digit)+ ('b' 'B' 'k' 'K' 'm' 'M' 'g' 'G')

In the following example the input size 100M or more will be used for the query.

SELECT *
FROM source TABLESAMPLE(100M) s;

Hive also supports limiting input by row count basis, but it acts differently with above two. First, it does not need CombineHiveInputFormat which means this can be used with non-native tables. Second, the row count given by user is applied to each split. So total row count can be vary by number of input splits. (As of Hive 0.10.0 - https://issues.apache.org/jira/browse/HIVE-3401)

block_sample: TABLESAMPLE (n ROWS)

For example, the following query will take the first 10 rows from each input split.

SELECT * FROM source TABLESAMPLE(10 ROWS);

[Hive - LanguageManual] Sampling的更多相关文章

  1. [Hive - LanguageManual ] Windowing and Analytics Functions (待)

    LanguageManual WindowingAndAnalytics     Skip to end of metadata   Added by Lefty Leverenz, last edi ...

  2. [HIve - LanguageManual] Hive Operators and User-Defined Functions (UDFs)

    Hive Operators and User-Defined Functions (UDFs) Hive Operators and User-Defined Functions (UDFs) Bu ...

  3. [Hive - LanguageManual] Select base use

    Select Syntax WHERE Clause ALL and DISTINCT Clauses Partition Based Queries HAVING Clause LIMIT Clau ...

  4. [Hive - LanguageManual] Import/Export

    LanguageManual ImportExport     Skip to end of metadata   Added by Carl Steinbach, last edited by Le ...

  5. [Hive - LanguageManual] DML: Load, Insert, Update, Delete

    LanguageManual DML Hive Data Manipulation Language Hive Data Manipulation Language Loading files int ...

  6. [Hive - LanguageManual] Alter Table/Partition/Column

    Alter Table/Partition/Column Alter Table Rename Table Alter Table Properties Alter Table Comment Add ...

  7. [Hive - LanguageManual] Create/Drop/Alter Database Create/Drop/Truncate Table

    Hive Data Definition Language Hive Data Definition Language Overview Create/Drop/Alter Database Crea ...

  8. Hive LanguageManual DDL

    hive语法规则LanguageManual DDL SQL DML 和 DDL 数据操作语言 (DML) 和 数据定义语言 (DDL) 一.数据库 增删改都在文档里说得也很明白,不重复造车轮 二.表 ...

  9. [Hive - LanguageManual ] ]SQL Standard Based Hive Authorization

    Status of Hive Authorization before Hive 0.13 SQL Standards Based Hive Authorization (New in Hive 0. ...

随机推荐

  1. Go推出的主要目的之一就是G内部大东西太多了,系统级开发巨型项目非常痛苦,Go定位取代C++,Go以简单取胜(KISS)

    以前为了做compiler,研读+实现了几乎所有种类的语言.现在看语法手册几乎很快就可以理解整个语言的内容.后来我对比了一下go和rust,发现go的类型系统简直就是拼凑的.这会导致跟C语言一样,需要 ...

  2. JavaScript基础精华03(String对象,Array对象,循环遍历数组,JS中的Dictionary,Array的简化声明)

    String对象(*) length属性:获取字符串的字符个数.(无论中文字符还是英文字符都算1个字符.) charAt(index)方法:获取指定索引位置的字符.(索引从0开始) indexOf(‘ ...

  3. Maven Failed to execute goal org.apache.maven.plugins:maven-clean-plugin:2.5:clean Failed to delete access_log

    I'm trying to run simple struts project using maven and tomcat. When I'm trying to exucute next goal ...

  4. C++:析构函数

    析构函数的特点: 1.析构函数与类名相同,但它前面必须加上波浪号~ 2.析构函数不返回任何值,在定义析构函数时,是不能说明它的类型的,甚至说明void类型也不能 3.析构函数没有参数,因此不能被重载. ...

  5. Python第一天——初识Python

    python是由荷兰人Guido van Rossum 于1989年发明的一种面向对象的的解释型计算机程序设语言,也可以称之为编程语言.例如java.php.c语言等都是编程语言. 那么为什么会有编程 ...

  6. Java API —— TreeSet类

    1.TreeSet类    1)TreeSet类概述         使用元素的自然顺序对元素进行排序         或者根据创建 set 时提供的 Comparator 进行排序          ...

  7. WIN7建立网络映射磁盘

    建立网络映射磁盘 如果需要经常访问网络中的同一个共享文件夹,则可以将这个共享文件夹直接映射为本地计算机中的一个虚拟驱动器.其具体操作如下. (1)双击桌面上"计算机"图标,打开&q ...

  8. JS对象基础

    JavaScript 对象 JavaScript 提供多个内建对象,比如 String.Date.Array 等等. 对象只是带有属性和方法的特殊数据类型. 访问对象的属性 属性是与对象相关的值. 访 ...

  9. zlib代码生成

    1.主页下载zlib-1.2.8的source code的压缩包:F:\Develop Tools\zlib-1.2.8 2.下载安装cmake-2.8.1-win32-x86 3.用cmake生成z ...

  10. setBackgroundDrawable和setBackgroundColor的用法

        1.设置背景图片,图片来源于drawable: flightInfoPanel.setBackgroundDrawable(getResources().getDrawable(R.drawa ...