Hive Data Manipulation Language
There are two primary ways of modifying data in Hive: loading files into tables, and inserting data into tables from queries.
Loading files into tables
Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.
Syntax
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Synopsis
- filepath can be:
  - a relative path, such as project/data1
  - an absolute path, such as /user/hive/project/data1
  - a full URI with scheme and (optionally) an authority, such as hdfs://namenode:9000/user/hive/project/data1
- The target being loaded to can be a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
- filepath can refer to a file (in which case Hive will move the file into the table) or it can be a directory (in which case Hive will move all the files within that directory into the table). In either case, filepath addresses a set of files.
- If the keyword LOCAL is specified, then:
  - the load command will look for filepath in the local file system. If a relative path is specified, it will be interpreted relative to the user's current working directory. The user can specify a full URI for local files as well - for example: file:///user/hive/project/data1
  - the load command will try to copy all the files addressed by filepath to the target filesystem. The target file system is inferred by looking at the location attribute of the table. The copied data files will then be moved to the table.
- If the keyword LOCAL is not specified, then Hive will either use the full URI of filepath, if one is specified, or will apply the following rules:
  - If scheme or authority are not specified, Hive will use the scheme and authority from the Hadoop configuration variable fs.default.name that specifies the Namenode URI.
  - If the path is not absolute, then Hive will interpret it relative to /user/<username>.
  - Hive will move the files addressed by filepath into the table (or partition).
- If the OVERWRITE keyword is used then the contents of the target table (or partition) will be deleted and replaced by the files referred to by filepath; otherwise the files referred by filepath will be added to the table.
- Note that if the target table (or partition) already has a file whose name collides with any of the filenames contained in filepath, then the existing file will be replaced with the new file.
Notes
- filepath cannot contain subdirectories.
- If the keyword LOCAL is not given, filepath must refer to files within the same filesystem as the table's (or partition's) location.
- Hive does some minimal checks to make sure that the files being loaded match the target table. Currently it checks that if the table is stored in sequencefile format, the files being loaded are also sequencefiles, and vice versa.
- Please read CompressedStorage if your datafile is compressed.
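Putting the pieces above together, a minimal sketch of a load might look like the following. The local file path and the page_view table (partitioned by dt and country) are assumptions used only for illustration:

-- assumed local staging file and partitioned target table; adjust names to your environment
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt'
OVERWRITE INTO TABLE page_view PARTITION (dt='2008-06-08', country='US');

Because OVERWRITE is specified, any existing files in that partition are replaced; dropping the keyword would instead add the file alongside the existing data.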
Inserting data into Hive Tables from queries
Query Results can be inserted into tables by using the insert clause.
Syntax
Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
Synopsis
- INSERT OVERWRITE will overwrite any existing data in the table or partition, unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0).
- INSERT INTO will append to the table or partition, keeping the existing data intact. (Note: INSERT INTO syntax is only available starting in version 0.8.)
- Inserts can be done to a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
- Multiple insert clauses (also known as Multi Table Insert) can be specified in the same query.
- The output of each of the select statements is written to the chosen table (or partition). Currently the OVERWRITE keyword is mandatory and implies that the contents of the chosen table or partition are replaced with the output of corresponding select statement.
- The output format and serialization class is determined by the table's metadata (as specified via DDL commands on the table).
Notes
- Multi Table Inserts minimize the number of data scans required. Hive can insert data into multiple tables by scanning the input data just once (and applying different query operators) to the input data.
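As a hedged sketch of a multi table insert (the staging table page_view_stg and the summary table page_url_counts are illustrative assumptions, reusing the column names from the dynamic partition example below), one scan of the source feeds two targets:

-- assumed staging and summary tables; both inserts run from a single scan of page_view_stg
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-06-08', country='US')
  SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
INSERT INTO TABLE page_url_counts
  SELECT pvs.page_url, count(1)
  GROUP BY pvs.page_url;

The first clause replaces one static partition of page_view, while the second appends aggregated rows to page_url_counts; both are populated from a single pass over the staging table.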
Dynamic Partition Inserts
Version information: This information reflects the situation in Hive 0.12; dynamic partition inserts were added in Hive 0.6.
With dynamic partition inserts, users can give partial partition specifications, which means just specifying the list of partition column names in the PARTITION clause. The column values are optional. If a partition column value is given, we call this a static partition; otherwise it is a dynamic partition. Each dynamic partition column has a corresponding input column from the select statement, meaning that the dynamic partition creation is determined by the value of the input column. The dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.
Dynamic Partition inserts are disabled by default. These are the relevant configuration properties for dynamic partition inserts:
| Configuration property | Default | Note |
|---|---|---|
| hive.exec.dynamic.partition | false | Needs to be set to true to enable dynamic partition inserts |
| hive.exec.dynamic.partition.mode | strict | In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions; in nonstrict mode all partitions are allowed to be dynamic |
| hive.exec.max.dynamic.partitions.pernode | 100 | Maximum number of dynamic partitions allowed to be created in each mapper/reducer node |
| hive.exec.max.dynamic.partitions | 1000 | Maximum number of dynamic partitions allowed to be created in total |
| hive.exec.max.created.files | 100000 | Maximum number of HDFS files created by all mappers/reducers in a MapReduce job |
| hive.error.on.empty.partition | false | Whether to throw an exception if dynamic partition insert generates empty results |
Example
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt
Here the country partition will be dynamically created by the last column from the SELECT clause (i.e. pvs.cnt). Note that the name is not used. In nonstrict mode the dt partition could also be dynamically created.
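Since dynamic partition inserts are disabled by default, the insert above would typically be preceded by session-level settings such as the following sketch (nonstrict mode is only needed if every partition column, including dt, is to be dynamic):

-- enable dynamic partitioning for the current session
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;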
Writing data into the filesystem from queries
Query results can be inserted into filesystem directories by using a slight variation of the syntax above:
Syntax
Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)
Synopsis
- Directory can be a full URI. If scheme or authority are not specified, Hive will use the scheme and authority from the Hadoop configuration variable fs.default.name that specifies the Namenode URI.
- If the LOCAL keyword is used, Hive will write data to the directory on the local file system.
- Data written to the filesystem is serialized as text with columns separated by ^A and rows separated by newlines. If any of the columns are not of primitive type, then those columns are serialized to JSON format.
Notes
- INSERT OVERWRITE statements to directories, local directories, and tables (or partitions) can all be used together within the same query.
- INSERT OVERWRITE statements to HDFS filesystem directories are the best way to extract large amounts of data from Hive. Hive can write to HDFS directories in parallel from within a map-reduce job.
- The directory is, as you would expect, overwritten; in other words, if the specified path exists, it is clobbered and replaced with the output.
- As of Hive 0.11.0 the separator used can be specified; in earlier versions it was always the ^A character (\001).
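For instance, a minimal sketch of exporting one day of data to a local directory (the output path is hypothetical, and the ROW FORMAT clause assumes Hive 0.11.0 or later):

-- export selected columns as comma-separated text to a local directory
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_export'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT page_url, ip FROM page_view WHERE dt='2008-06-08';

Without the ROW FORMAT clause, the output falls back to the default ^A-separated text described above.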