Statistics in Hive

This document describes the support of statistics for Hive tables (see HIVE-33).

Motivation

Statistics such as the number of rows in a table or partition and histograms of a column of interest are important in many ways. One key use case is query optimization: statistics serve as input to the optimizer's cost functions so that it can compare different plans and choose among them. Statistics can also answer some user queries directly. Users can quickly get answers to some of their queries by reading stored statistics instead of launching long-running execution plans. Examples include the quantiles of the users' age distribution, the top 10 apps used by people, and the number of distinct sessions.

Scope

Table and Partition Statistics

The first milestone in supporting statistics was table and partition level statistics. Table and partition statistics are now stored in the Hive Metastore for both newly created and existing tables. The following statistics are currently supported for partitions:

  • Number of rows
  • Number of files
  • Size in Bytes

For tables, the same statistics are supported with the addition of the number of partitions of the table.

Version: Table and partition statistics

Table and partition level statistics were added in Hive 0.7.0 by HIVE-1361.

Column Statistics

The second milestone was to support column level statistics. See Column Statistics in Hive in the Design Documents.

Version: Column statistics

Column level statistics were added in Hive 0.10.0 by HIVE-1362.

Top K Statistics

Column level top K statistics are still pending; see HIVE-3421.

Implementation

The way the statistics are calculated is similar for both newly created and existing tables.

For newly created tables, the job that creates a new table is a MapReduce job. During creation, every mapper, while copying rows from the source table in the FileSink operator, gathers statistics for the rows it encounters and publishes them into a database (possibly MySQL). At the end of the MapReduce job, the published statistics are aggregated and stored in the Metastore.

A similar process applies to existing tables: a map-only job is created, and every mapper, while processing the table in the TableScan operator, gathers statistics for the rows it encounters. The rest of the process is the same.

A database is therefore needed to store the temporarily gathered statistics. Currently there are two implementations, one using MySQL and the other using HBase. Developers can implement two pluggable interfaces, IStatsPublisher and IStatsAggregator, to support any other storage. The interfaces are listed below:

package org.apache.hadoop.hive.ql.stats;

import org.apache.hadoop.conf.Configuration;

/**
 * An interface for any possible implementation for publishing statistics.
 */
public interface IStatsPublisher {

  /**
   * This method does the necessary initializations according to the implementation requirements.
   */
  public boolean init(Configuration hconf);

  /**
   * This method publishes a given statistic into a disk storage, possibly HBase or MySQL.
   *
   * rowID : a string identifying the statistics to be published then gathered, possibly the table name + the partition specs.
   *
   * key : a string noting the key to be published. Ex: "numRows".
   *
   * value : a string holding the integer value of the published key.
   */
  public boolean publishStat(String rowID, String key, String value);

  /**
   * This method executes the necessary termination procedures, possibly closing all database connections.
   */
  public boolean terminate();

}

package org.apache.hadoop.hive.ql.stats;

import org.apache.hadoop.conf.Configuration;

/**
 * An interface for any possible implementation for gathering statistics.
 */
public interface IStatsAggregator {

  /**
   * This method does the necessary initializations according to the implementation requirements.
   */
  public boolean init(Configuration hconf);

  /**
   * This method aggregates a given statistic from a disk storage.
   * After aggregation, this method does cleaning by removing all records from the disk storage that have the same given rowID.
   *
   * rowID : a string identifying the statistic to be gathered, possibly the table name + the partition specs.
   *
   * key : a string noting the key to be gathered. Ex: "numRows".
   */
  public String aggregateStats(String rowID, String key);

  /**
   * This method executes the necessary termination procedures, possibly closing all database connections.
   */
  public boolean terminate();

}
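
The publish-then-aggregate flow described above can be sketched with an in-memory store. This is only an illustration, not Hive's implementation: the class InMemoryStatsStore, the rowID format, and the choice to sum values per key are assumptions, and the class deliberately does not implement the real interfaces so that it runs without Hadoop on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the temporary statistics store.
// It mirrors the publishStat/aggregateStats shapes but does not implement
// the real Hive interfaces, which need a Hadoop Configuration.
class InMemoryStatsStore {

  // rowID -> (statistic key -> running sum of the published values)
  private final Map<String, Map<String, Long>> partials = new HashMap<>();

  // Called once per mapper and key; assumes the statistic is additive.
  public synchronized boolean publishStat(String rowID, String key, String value) {
    long parsed;
    try {
      parsed = Long.parseLong(value);
    } catch (NumberFormatException e) {
      return false; // reject values that are not whole numbers
    }
    partials.computeIfAbsent(rowID, id -> new HashMap<>()).merge(key, parsed, Long::sum);
    return true;
  }

  // Returns the summed value as a string, or null if nothing was published.
  // All records for rowID are removed afterwards, as the Javadoc describes.
  public synchronized String aggregateStats(String rowID, String key) {
    Map<String, Long> forRow = partials.remove(rowID);
    if (forRow == null || !forRow.containsKey(key)) {
      return null;
    }
    return Long.toString(forRow.get(key));
  }

  public static void main(String[] args) {
    InMemoryStatsStore store = new InMemoryStatsStore();
    String rowID = "Table1/ds=2008-04-09/hr=11"; // illustrative rowID format
    store.publishStat(rowID, "numRows", "100"); // mapper 1
    store.publishStat(rowID, "numRows", "150"); // mapper 2
    store.publishStat(rowID, "numRows", "250"); // mapper 3
    System.out.println("numRows=" + store.aggregateStats(rowID, "numRows"));
  }
}
```

Because the sketch follows the documented cleanup literally, a second call to aggregateStats for the same rowID returns null. A real implementation would also have to cope with concurrent writers and failed tasks.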

Usage

Configuration Variables

See Statistics in Configuration Properties for a list of the variables that configure Hive table statistics. Configuring Hive describes how to use the variables.

Newly Created Tables

For newly created tables and/or partitions (populated through the INSERT OVERWRITE command), statistics are automatically computed by default. To prevent statistics from being automatically computed and stored in the Hive Metastore, the user has to explicitly set the boolean variable hive.stats.autogather to false:

set hive.stats.autogather=false;

The user can also specify the implementation used to store temporary statistics by setting the variable hive.stats.dbclass. For example, to use HBase for temporary statistics storage (the default is jdbc:derby or fs, depending on the Hive version), the user should issue the following command:

set hive.stats.dbclass=hbase;

For JDBC implementations of temporary statistics storage (e.g., Derby or MySQL), the user should specify the appropriate connection string to the database by setting the variable hive.stats.dbconnectionstring, and the appropriate JDBC driver by setting the variable hive.stats.jdbcdriver.

set hive.stats.dbclass=jdbc:derby;
set hive.stats.dbconnectionstring="jdbc:derby:;databaseName=TempStatsStore;create=true";
set hive.stats.jdbcdriver="org.apache.derby.jdbc.EmbeddedDriver";

Queries can fail to collect statistics completely accurately. The setting hive.stats.reliable makes a query fail if its statistics cannot be collected reliably. It is false by default.

Existing Tables

For existing tables and/or partitions, the user can issue the ANALYZE command to gather statistics and write them into Hive MetaStore. The syntax for that command is described below:

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
  COMPUTE STATISTICS 
  [FOR COLUMNS]          -- (Note: Hive 0.10.0 and later.)
  [NOSCAN];

When issuing that command, the user may or may not specify partition specs. If no partition specs are given, statistics are gathered for the table as well as all of its partitions (if any). If certain partition specs are given, statistics are gathered for only those partitions. When computing statistics across all partitions, the partition columns still need to be listed.

When the optional parameter NOSCAN is specified, the command does not scan files, so it should be fast. Instead of all statistics, it gathers only the following:

  • Number of files
  • Physical size in bytes
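
To illustrate why these two statistics need no scan, the sketch below derives both from file-system metadata alone. It uses the local java.nio.file API as a stand-in, which is an assumption; the class name FileLevelStats is hypothetical, and Hive itself works against Hadoop file systems rather than this API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical holder for the two statistics that NOSCAN gathers.
class FileLevelStats {
  final long numFiles;
  final long totalSize;

  FileLevelStats(long numFiles, long totalSize) {
    this.numFiles = numFiles;
    this.totalSize = totalSize;
  }

  // Walks the directory and reads only file metadata; contents are never opened.
  static FileLevelStats of(Path dir) throws IOException {
    long files = 0;
    long bytes = 0;
    try (Stream<Path> paths = Files.walk(dir)) {
      for (Path p : (Iterable<Path>) paths::iterator) {
        if (Files.isRegularFile(p)) {
          files++;
          bytes += Files.size(p);
        }
      }
    }
    return new FileLevelStats(files, bytes);
  }

  public static void main(String[] args) throws IOException {
    // A temporary directory stands in for one partition's data directory.
    Path dir = Files.createTempDirectory("partition");
    Files.write(dir.resolve("000000_0"), new byte[1024]);
    Files.write(dir.resolve("000001_0"), new byte[2048]);
    FileLevelStats stats = FileLevelStats.of(dir);
    System.out.println("numFiles=" + stats.numFiles + ", totalSize=" + stats.totalSize);
  }
}
```

The number of rows, by contrast, cannot be obtained this way, since it requires reading the records themselves.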

Version 0.10.0: FOR COLUMNS

As of Hive 0.10.0, the optional parameter FOR COLUMNS computes column statistics for all columns in the specified table (and for all partitions if the table is partitioned). See Column Statistics in Hive for details.

To display these statistics, use DESCRIBE FORMATTED [db_name.]table_name.column_name [PARTITION (partition_spec)].

Examples

Suppose table Table1 has 4 partitions with the following specs:

  • Partition1: (ds='2008-04-08', hr=11)
  • Partition2: (ds='2008-04-08', hr=12)
  • Partition3: (ds='2008-04-09', hr=11)
  • Partition4: (ds='2008-04-09', hr=12)

and you issue the following command:

ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS;

then statistics are gathered for Partition3 (ds='2008-04-09', hr=11) only.

If you issue the command:

ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS FOR COLUMNS;

then column statistics are gathered for all columns of Partition3 (ds='2008-04-09', hr=11). This is available in Hive 0.10.0 and later.

If you issue the command:

ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS;

then statistics are gathered for partitions 3 and 4 only (hr=11 and hr=12).

If you issue the command:

ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS FOR COLUMNS;

then column statistics for all columns are gathered for partitions 3 and 4 only (Hive 0.10.0 and later).

If you issue the command:

ANALYZE TABLE Table1 PARTITION(ds, hr) COMPUTE STATISTICS;

then statistics are gathered for all four partitions.

If you issue the command:

ANALYZE TABLE Table1 PARTITION(ds, hr) COMPUTE STATISTICS FOR COLUMNS;

then column statistics for all columns are gathered for all four partitions (Hive 0.10.0 and later).
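
The way a partition spec selects partitions in the examples above can be modeled as a simple match: a column given with a value must equal that value, while a column listed without a value matches anything. The sketch below is a model of that behavior only, not Hive's code; the class PartitionSpecMatcher and its helpers are hypothetical, and a null map value is used here to represent a column listed without a value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PartitionSpecMatcher {

  // Builds a (ds, hr) map; a null value means "column listed without a value".
  static Map<String, String> spec(String ds, String hr) {
    Map<String, String> m = new LinkedHashMap<>();
    m.put("ds", ds);
    m.put("hr", hr);
    return m;
  }

  // The four partitions of Table1 from the example.
  static Map<String, Map<String, String>> table1() {
    Map<String, Map<String, String>> parts = new LinkedHashMap<>();
    parts.put("Partition1", spec("2008-04-08", "11"));
    parts.put("Partition2", spec("2008-04-08", "12"));
    parts.put("Partition3", spec("2008-04-09", "11"));
    parts.put("Partition4", spec("2008-04-09", "12"));
    return parts;
  }

  // A partition matches when every column with a required value agrees.
  static boolean matches(Map<String, String> spec, Map<String, String> partition) {
    for (Map.Entry<String, String> column : spec.entrySet()) {
      String required = column.getValue();
      if (required != null && !required.equals(partition.get(column.getKey()))) {
        return false;
      }
    }
    return true;
  }

  // Returns the names of the matching partitions, in declaration order.
  static List<String> select(Map<String, String> spec,
                             Map<String, Map<String, String>> partitions) {
    List<String> selected = new ArrayList<>();
    for (Map.Entry<String, Map<String, String>> p : partitions.entrySet()) {
      if (matches(spec, p.getValue())) {
        selected.add(p.getKey());
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    System.out.println(select(spec("2008-04-09", "11"), table1())); // (ds='2008-04-09', hr=11)
    System.out.println(select(spec("2008-04-09", null), table1())); // (ds='2008-04-09', hr)
    System.out.println(select(spec(null, null), table1()));         // (ds, hr)
  }
}
```

The three calls in main correspond to the three PARTITION clauses used in the examples and select one, two, and four partitions respectively.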

For a non-partitioned table, you can issue the command:

ANALYZE TABLE Table1 COMPUTE STATISTICS;

to gather statistics of the table.

For a non-partitioned table, you can issue the command:

ANALYZE TABLE Table1 COMPUTE STATISTICS FOR COLUMNS;

to gather column statistics of the table (Hive 0.10.0 and later).

If Table1 is a partitioned table, then for basic statistics you have to specify partition specifications as above in the ANALYZE statement. Otherwise a semantic analyzer exception will be thrown.

However, for column statistics, if no partition specification is given in the ANALYZE statement, statistics for all partitions are computed.

You can view the stored statistics by issuing the DESCRIBE command. Statistics are stored in the Parameters array. Suppose you issue the analyze command for the whole table Table1, then issue the command:

DESCRIBE EXTENDED TABLE1;

then among the output, the following would be displayed:

... , parameters:{numPartitions=4, numFiles=16, numRows=2000, totalSize=16384, ...}, ....

If you issue the command:

DESCRIBE EXTENDED TABLE1 PARTITION(ds='2008-04-09', hr=11);

then among the output, the following would be displayed:

... , parameters:{numFiles=4, numRows=500, totalSize=4096, ...}, ....
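
The table-level figures in this example are consistent with summing the figures of four equally sized partitions. The sketch below checks that arithmetic. It assumes all four partitions hold the same values as the one shown, which the example does not state, and the class BasicStats is hypothetical.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical holder for the basic statistics shown in the DESCRIBE output.
class BasicStats {
  final long numFiles;
  final long numRows;
  final long totalSize;

  BasicStats(long numFiles, long numRows, long totalSize) {
    this.numFiles = numFiles;
    this.numRows = numRows;
    this.totalSize = totalSize;
  }

  // Adds up the per-partition figures field by field.
  static BasicStats sum(List<BasicStats> partitions) {
    long files = 0;
    long rows = 0;
    long size = 0;
    for (BasicStats p : partitions) {
      files += p.numFiles;
      rows += p.numRows;
      size += p.totalSize;
    }
    return new BasicStats(files, rows, size);
  }

  public static void main(String[] args) {
    // Assumption: every partition matches the one shown for (ds='2008-04-09', hr=11).
    BasicStats shown = new BasicStats(4, 500, 4096);
    List<BasicStats> partitions = Collections.nCopies(4, shown);
    BasicStats table = sum(partitions);
    System.out.println("numPartitions=" + partitions.size()
        + ", numFiles=" + table.numFiles
        + ", numRows=" + table.numRows
        + ", totalSize=" + table.totalSize);
  }
}
```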

If you issue the command:

ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS NOSCAN;

then only the number of files and the physical size in bytes are gathered, and only for partitions 3 and 4.
