Creating, Showing, Altering, and Dropping Tables

See Hive Data Definition Language for detailed information about creating, showing, altering, and dropping tables.

Creating Tables

An example statement that would create the page_view table mentioned above is:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

In this example the columns of the table are specified with the corresponding types. Comments can be attached both at the column level as well as at the table level. Additionally, the PARTITIONED BY clause defines the partitioning columns, which are different from the data columns and are actually not stored with the data. When specified in this way, the data in the files is assumed to be delimited with ASCII 001 (ctrl-A) as the field delimiter and newline as the row delimiter.

The field delimiter can be parametrized if the data is not in the above format, as illustrated in the following example:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;

The row delimiter currently cannot be changed, since it is determined not by Hive but by Hadoop.

It is also a good idea to bucket the tables on certain columns so that efficient sampling queries can be executed against the data set. If bucketing is absent, random sampling can still be done on the table but it is not efficient as the query has to scan all the data. The following example illustrates the case of the page_view table that is bucketed on the userid column:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;

In the example above, the table is clustered by a hash function of userid into 32 buckets. Within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the known ordering of the data while evaluating queries, increasing efficiency.
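
As an illustration of the sampling benefit, a query can restrict itself to a single bucket instead of scanning the whole table. A minimal sketch (the bucket number chosen here is arbitrary):

SELECT pv.viewTime, pv.userid
FROM page_view TABLESAMPLE(BUCKET 3 OUT OF 32 ON userid) pv
WHERE pv.dt = '2008-06-08';

This reads roughly 1/32 of the data, namely the rows whose userid hashes into bucket 3. Columns can also have complex types such as arrays and maps, in which case delimiters for collection items and map keys must be specified as well, as in the following definition: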

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                friends ARRAY<BIGINT>, properties MAP<STRING, STRING>
               , ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;

In this example the columns that comprise the table row are specified in a similar way as the definition of types. Comments can be attached both at the column level as well as at the table level. Additionally, the PARTITIONED BY clause defines the partitioning columns, which are different from the data columns and are actually not stored with the data. The CLUSTERED BY clause specifies which column to use for bucketing as well as how many buckets to create. The delimited row format specifies how the rows are stored in the Hive table. In the case of the delimited format, this specifies how the fields are terminated, how the items within collections (arrays or maps) are terminated, and how the map keys are terminated. STORED AS SEQUENCEFILE indicates that this data is stored in a binary format (using Hadoop SequenceFiles) on HDFS. The delimiter values shown in the ROW FORMAT clause above are the system defaults; note that the default STORED AS format is TEXTFILE unless hive.default.fileformat is configured otherwise.

Table names and column names are case insensitive.

Browsing Tables and Partitions

SHOW TABLES;

To list existing tables in the warehouse; there are many of these, likely more than you want to browse.

SHOW TABLES 'page.*';

To list tables with prefix 'page'. The pattern follows Java regular expression syntax (so the period is a wildcard).

SHOW PARTITIONS page_view;

To list partitions of a table. If the table is not a partitioned table then an error is thrown.

DESCRIBE page_view;

To list columns and column types of the table.

DESCRIBE EXTENDED page_view;

To list columns and all other properties of the table. This prints a lot of information, and not in a pretty format. Usually used for debugging.
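
In more recent Hive versions, DESCRIBE FORMATTED presents the same metadata in a tabular, more readable layout:

DESCRIBE FORMATTED page_view;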

DESCRIBE EXTENDED page_view PARTITION (dt='2008-08-08', country='US');

To list columns and all other properties of a partition. This also prints a lot of information and is usually used for debugging.

Altering Tables

To rename an existing table to a new name. If a table with the new name already exists, an error is returned:

ALTER TABLE old_table_name RENAME TO new_table_name;

To rename the columns of an existing table. Be sure to use the same column types, and to include an entry for each preexisting column:

ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);
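
For example, a sketch on a hypothetical table t1 with columns (a INT, b STRING, c DOUBLE), renaming c to c_renamed while keeping every column and its type:

ALTER TABLE t1 REPLACE COLUMNS (a INT, b STRING, c_renamed DOUBLE);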

To add columns to an existing table:

ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');

Note that a change in the schema (such as adding columns) preserves the schema for the old partitions of the table, in case it is a partitioned table. All queries that access these columns and run over the old partitions implicitly return a null value or the specified default values for these columns.
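
A short sketch of this behavior (the column name here is hypothetical): after adding a column to a partitioned table, selecting it over a partition that was written before the change returns NULL for every row.

ALTER TABLE page_view ADD COLUMNS (sessionid STRING COMMENT 'hypothetical new column');
SELECT sessionid FROM page_view WHERE dt = '2008-06-08';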

In later versions, the behavior of assuming certain values, as opposed to throwing an error when a column is not found in a particular partition, can be made configurable.

Dropping Tables and Partitions

Dropping tables is fairly trivial. A drop on the table would implicitly drop any indexes (this is a future feature) that would have been built on the table. The associated command is:

DROP TABLE pv_users;

To drop a partition, alter the table to drop the partition:

ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08');

Note that any data for this table or partitions will be dropped and may not be recoverable.

Loading Data

There are multiple ways to load data into Hive tables. The user can create an external table that points to a specified location within HDFS. In this particular usage, the user can copy a file into the specified location using the HDFS put or copy commands and create a table pointing to this location with all the relevant row format information. Once this is done, the user can transform the data and insert them into any other Hive table. For example, if the file /tmp/pv_2008-06-08.txt contains comma separated page views served on 2008-06-08, and this needs to be loaded into the page_view table in the appropriate partition, the following sequence of commands can achieve this:

That is, a portion of the data (one partition's worth) is read from an existing large table and loaded into the target table:

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User',
                country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';
 
hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
 
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';

In the example above, nulls are inserted for the array and map types in the destination table, but potentially these can also come from the external table if the proper row formats are specified.
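
For instance, a sketch of an alternative staging definition that parses the friends array and properties map directly from delimited text (the table name, location, and delimiter characters here are illustrative choices):

CREATE EXTERNAL TABLE page_view_stg2(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
                ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        COLLECTION ITEMS TERMINATED BY '|'
        MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view2';

With such a definition, the SELECT in the INSERT above could read pvs.friends and pvs.properties instead of null, null.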

This method is useful if there is already legacy data in HDFS on which the user wants to put some metadata so that the data can be queried and manipulated using Hive.

Additionally, the system also supports syntax that can load data from a file in the local file system directly into a Hive table, where the input data format is the same as the table format. If /tmp/pv_2008-06-08_us.txt already contains the data for US, then we do not need the additional filtering shown in the previous example. The load in this case can be done using the following syntax:

Loading a data file from the local disk:

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');

The path argument can take a directory (in which case all the files in the directory are loaded), a single file name, or a wildcard (in which case all the matching files are uploaded). If the argument is a directory, it cannot contain subdirectories. Similarly, the wildcard must match file names only.
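
For example, a sketch of a wildcard load (the file naming pattern is hypothetical):

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us_part*.txt' INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');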

In the case that the input file /tmp/pv_2008-06-08_us.txt is very large, the user may decide to do a parallel load of the data (using tools that are external to Hive). Once the file is in HDFS, the following syntax can be used to load the data into a Hive table:

Loading a data file from HDFS:

LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');

It is assumed that the array and map fields in the input files are null for these examples.
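
Note that LOAD DATA ... INTO TABLE appends to whatever is already in the target partition; adding the OVERWRITE keyword replaces the partition's existing contents instead:

LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' OVERWRITE INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');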

See Hive Data Manipulation Language for more information about loading data into Hive tables.
