Hive简记
在大数据工作中难免遇到数据仓库(OLAP)架构,以及通过Hive SQL简化分布式计算的场景。所以想通过这篇博客对Hive使用有一个大致总结,希望道友多多指教!
摘要:
1.Hive安装
2.Hive DDL命令
3.Hive DML初步
4.Hive DML高级
5.Hive与HBASE,MongoDB等整合
6.Hive 优化与配置参数
附:SQL执行顺序
7.Hive 复杂数据类型
8.Hive group by聚合增强
内容:
1.Hive安装
依赖:mysql,jdk,hadoop
安装文档参考:官方文档;注意这里hive默认使用Derby数据库,只支持单用户登录。修改具体配置请参考官网说明:
Metadata Store
Metadata is in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default this location is ./metastore_db (see conf/hive-default.xml).
Right now, in the default configuration, this metadata can only be seen by one user at a time.
Metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName. Refer to JDO (or JPOX) documentation for more details on supported databases. The database schema is defined in JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model.
In the future, the metastore itself can be a standalone server.
If you want to run the metastore as a network server so it can be accessed from multiple nodes, see Hive Using Derby in Server Mode.
2.Hive DDL命令
建表语句:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later) [(col_name data_type [COMMENT col_comment], ... [constraint_specification])] [COMMENT table_comment] [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] --分区 [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] --分桶 [SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)] ON ((col_value, col_value, ...), (col_value, col_value, ...), ...) [STORED AS DIRECTORIES] [ --存储格式 [ROW FORMAT row_format] [STORED AS file_format] | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later) ] [LOCATION hdfs_path] --外部表指定存储路径 [TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later) [AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name --复制表 LIKE existing_table_or_view_name [LOCATION hdfs_path]; |
删除表:DROP TABLE [IF EXISTS] table_name [PURGE];
截断表:TRUNCATE TABLE table_name [PARTITION partition_spec];
查看表结构:
DESCRIBE [EXTENDED|FORMATTED] table_name[.col_name ( [.field_name] | [.'$elem$'] | [.'$key$'] | [.'$value$'] )* ]; 其他请查看官网DDL文档 LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement; INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES ( value [, value ...] ) [, ( value [, value ...] ) INSERT OVERWRITE [LOCAL] DIRECTORY directory1 [ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0) SELECT ... FROM ...insert overwrite local DIRECTORY '/tmp/out/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]
详细部分请参考官网DML部分(load/insert/update/delete/merge, import/export, explain plan)
Hive内置函数
|
int |
year(string date) |
Return the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970 |
|
string |
upper(string A) |
returns the string resulting from converting all characters of A to upper case, for example, upper('fOoBaR') results in 'FOOBAR' |
|
string |
ucase(string A) |
Same as upper |
|
string |
trim(string A) |
returns the string resulting from trimming spaces from both ends of A, for example, trim(' foobar ') results in 'foobar' |
|
string |
to_date(string timestamp) |
Return the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" |
|
string |
substr(string A, int start, int length) |
returns the substring of A starting from start position with the given length, for example, |
|
string |
substr(string |
returns the |
|
int |
size(Map<K.V>) |
returns the |
|
int |
size(Array<T>) |
returns the |
|
string |
rtrim(string |
returns the |
|
BIGINT |
round(double |
returns the |
|
string |
regexp_replace(string |
returns the |
|
double |
rand(), |
returns a |
|
int |
month(string |
Return the |
|
string |
ltrim(string |
returns the |
|
string |
lower(string |
returns the |
|
string |
lcase(string |
Same as lower |
|
string |
get_json_object(string |
Extract json |
|
string |
from_unixtime(int |
convert the |
|
BIGINT |
floor(double |
returns the |
|
int |
day(string |
Return the day |
|
string |
concat(string |
returns the |
|
BIGINT |
ceil(double a) |
returns the |
|
BIGINT |
count(*), count(expr), count(DISTINCT expr[, expr_.]) |
count(*)—Returns the total number of retrieved rows, including rows containing NULL values; count(expr)—Returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr])—Returns the number of rows for which the supplied expression(s) are unique and non-NULL. |
|
DOUBLE |
avg(col), avg(DISTINCT col) |
returns the average of the elements in the group or the average of the distinct values of the column in the group |
|
DOUBLE |
max(col) |
returns the maximum value of the column in the group |
|
DOUBLE |
min(col) |
returns the minimum value of the column in the group |
|
DOUBLE |
sum(col), sum(DISTINCT col) |
returns the sum of the elements in the group or the sum of the distinct values of the column in the group |
Built-in Aggregate Functions (UDAF)
The following built-in aggregate functions are supported in Hive:
|
Return Type |
Name(Signature) |
Description |
|---|---|---|
|
BIGINT |
count(*), count(expr), count(DISTINCT expr[, expr...]) |
count(*) - Returns the total number of retrieved rows, including rows containing NULL values. count(expr) - Returns the number of rows for which the supplied expression is non-NULL. count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL. Execution of this can be optimized with hive.optimize.distinct.rewrite. |
|
DOUBLE |
sum(col), sum(DISTINCT col) |
Returns the sum of the elements in the group or the sum of the distinct values of the column in the group. |
|
DOUBLE |
avg(col), avg(DISTINCT col) |
Returns the average of the elements in the group or the average of the distinct values of the column in the group. |
|
DOUBLE |
min(col) |
Returns the minimum of the column in the group. |
|
DOUBLE |
max(col) |
Returns the maximum value of the column in the group. |
|
DOUBLE |
variance(col), var_pop(col) |
Returns the variance of a numeric column in the group. |
|
DOUBLE |
var_samp(col) |
Returns the unbiased sample variance of a numeric column in the group. |
|
DOUBLE |
stddev_pop(col) |
Returns the standard deviation of a numeric column in the group. |
|
DOUBLE |
stddev_samp(col) |
Returns the unbiased sample standard deviation of a numeric column in the group. |
|
DOUBLE |
covar_pop(col1, col2) |
Returns the population covariance of a pair of numeric columns in the group. |
|
DOUBLE |
covar_samp(col1, col2) |
Returns the sample covariance of a pair of a numeric columns in the group. |
|
DOUBLE |
corr(col1, col2) |
Returns the Pearson coefficient of correlation of a pair of a numeric columns in the group. |
|
DOUBLE |
percentile(BIGINT col, p) |
Returns the exact pth percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral. |
|
array<double> |
percentile(BIGINT col, array(p1 [, p2]...)) |
Returns the exact percentiles p1, p2, ... of a column in the group (does not work with floating point types). pimust be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral. |
|
DOUBLE |
percentile_approx(DOUBLE col, p [, B]) |
Returns an approximate pth percentile of a numeric column (including floating point types) in the group. The B parameter controls approximation accuracy at the cost of memory. Higher values yield better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, this gives an exact percentile value. |
|
array<double> |
percentile_approx(DOUBLE col, array(p1 [, p2]...) [, B]) |
Same as above, but accepts and returns an array of percentile values instead of a single one. |
| double |
regr_avgx(independent, dependent) |
Equivalent to avg(dependent). As of Hive 2.2.0. |
| double |
regr_avgy(independent, dependent) |
Equivalent to avg(independent). As of Hive 2.2.0. |
| double |
regr_count(independent, dependent) |
Returns the number of non-null pairs used to fit the linear regression line. As of Hive 2.2.0. |
| double |
regr_intercept(independent, dependent) |
Returns the y-intercept of the linear regression line, i.e. the value of b in the equation dependent = a * independent + b. As of Hive 2.2.0. |
| double |
regr_r2(independent, dependent) |
Returns the coefficient of determination for the regression. As of Hive 2.2.0. |
| double |
regr_slope(independent, dependent) |
Returns the slope of the linear regression line, i.e. the value of a in the equation dependent = a * independent + b. As of Hive 2.2.0. |
| double |
regr_sxx(independent, dependent) |
Equivalent to regr_count(independent, dependent) * var_pop(dependent). As of Hive 2.2.0. |
| double |
regr_sxy(independent, dependent) |
Equivalent to regr_count(independent, dependent) * covar_pop(independent, dependent). As of Hive 2.2.0. |
| double | regr_syy(independent, dependent) |
Equivalent to regr_count(independent, dependent) * var_pop(independent). As of Hive 2.2.0. |
|
array<struct { |
histogram_numeric(col, b) |
Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights |
|
array |
collect_set(col) |
Returns a set of objects with duplicate elements eliminated. |
|
array |
collect_list(col) |
Returns a list of objects with duplicates. (As of Hive 0.13.0.) |
| INTEGER | ntile(INTEGER x) |
Divides an ordered partition into |
Built-in Table-Generating Functions (UDTF)
Normal user-defined functions, such as concat(), take in a single input row and output a single output row. In contrast, table-generating functions transform a single input row to multiple output rows.
|
Row-set columns types |
Name(Signature) |
Description |
|---|---|---|
|
T |
explode(ARRAY<T> a) |
Explodes an array to multiple rows. Returns a row-set with a single column (col), one row for each element from the array. |
|
Tkey,Tvalue |
explode(MAP<Tkey,Tvalue> m) |
Explodes a map to multiple rows. Returns a row-set with a two columns (key,value) , one row for each key-value pair from the input map. (As of Hive 0.8.0.). |
| int,T | posexplode(ARRAY<T> a) | Explodes an array to multiple rows with additional positional column of int type (position of items in the original array, starting with 0). Returns a row-set with two columns (pos,val), one row for each element from the array. |
|
T1,...,Tn |
inline(ARRAY<STRUCT<f1:T1,...,fn:Tn>> a) |
Explodes an array of structs to multiple rows. Returns a row-set with N columns (N = number of top level elements in the struct), one row per struct from the array. (As of Hive 0.10.) |
| T1,...,Tn/r | stack(int r,T1 V1,...,Tn/r Vn) | Breaks up n values V1,...,Vn into r rows. Each row will have n/r columns. r must be constant. |
|
string1,...,stringn |
json_tuple(string jsonStr,string k1,...,string kn) |
Takes JSON string and a set of n keys, and returns a tuple of n values. This is a more efficient version of the |
|
string 1,...,stringn |
parse_url_tuple(string urlStr,string p1,...,string pn) |
Takes URL string and a set of n URL parts, and returns a tuple of n values. This is similar to the |
- RANK
- ROW_NUMBER
- DENSE_RANK
- CUME_DIST
- PERCENT_RANK
- NTILE
更多例子参考官网窗口分析函数文档
select 1,tag_id
from (
select parse_json_array(get_json_object('{"data":[{"id":1082},{"id":1082},{"id":1082}]}','$.data'),'id') as ids
from dual
) a
lateral view explode(split(a.ids,'_')) s as tag_id
; <!--
+------+---------+--+
| _c0 | tag_id |
+------+---------+--+
| 1 | 1082 |
| 1 | 1082 |
| 1 | 1082 |
+------+---------+--+
-->
5.Hive与HBASE,MongoDB等整合
官网文档:HBaseIntegration
以下做简单步骤记录:
1.添加mongodb整合包和驱动包:
add jar /data/dmp/hive/lib/hive-mongo-1.0.0-jar-with-dependencies.jar;
add jar /data/dmp/hive/lib/mongo-java-driver-3.2.2.jar;
add jar /data/dmp/hive/lib/mongo-hadoop-hive-1.5.1.jar;
add jar /data/dmp/hive/lib/mongo-hadoop-core-1.5.1.jar;
2.创建hive外部表
DROP TABLE IF EXISTS mongodb_table;
CREATE EXTERNAL TABLE mongodb_table
(
id int
)
stored by 'com.mongodb.hadoop.hive.MongoStorageHandler'
with serdeproperties(
'mongo.columns.mapping'='{"id":"_id"}'
)
TBLPROPERTIES('mongo.uri'='mongodb://username:passwd@ip:port/db.table');
这里遇到一个问题:
If the username or password contains a colon (:) or an at-sign (@) then it must be urlencoded (state=,code=0)
按照说明,把用户名或者密码的特殊字符用url编码就可以了
6.Hive 优化与配置参数
附:SQL执行顺序
当一个查询语句同时出现了where,group by,having,order by的时候,执行顺序和编写顺序是:
1.执行where xx对全表数据做筛选,返回第1个结果集。 2.针对第1个结果集使用group by分组,返回第2个结果集。
3.针对第2个结果集中的每1组数据执行select xx,有几组就执行几次,返回第3个结果集。
4.针对第3个结集执行having xx进行筛选,返回第4个结果集。 5.针对第4个结果集排序。
7.Hive 复杂数据类型
7.1 array
7.2 map
7.3 struct
Complex Type Constructors
The following functions construct instances of complex types.
|
Constructor Function |
Operands |
Description |
|---|---|---|
|
map |
(key1, value1, key2, value2, ...) |
Creates a map with the given key/value pairs. |
|
struct |
(val1, val2, val3, ...) |
Creates a struct with the given field values. Struct field names will be col1, col2, .... |
|
named_struct |
(name1, val1, name2, val2, ...) |
Creates a struct with the given field names and values. (As of Hive 0.8.0.) |
|
array |
(val1, val2, ...) |
Creates an array with the given elements. |
|
create_union |
(tag, val1, val2, ...) |
Creates a union type with the value that is being pointed to by the tag parameter. |
Operators on Complex Types
The following operators provide mechanisms to access elements in Complex Types.
|
Operator |
Operand types |
Description |
|---|---|---|
|
A[n] |
A is an Array and n is an int |
Returns the nth element in the array A. The first element has index 0. For example, if A is an array comprising of ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar'. |
|
M[key] |
M is a Map<K, V> and key has type K |
Returns the value corresponding to the key in the map. For example, if M is a map comprising of {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all'] returns 'foobar'. |
|
S.x |
S is a struct |
Returns the x field of S. For example for the struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct. |
Collection Functions
The following built-in collection functions are supported in Hive:
|
Return Type |
Name(Signature) |
Description |
|---|---|---|
|
int |
size(Map<K.V>) |
Returns the number of elements in the map type. |
|
int |
size(Array<T>) |
Returns the number of elements in the array type. |
|
array<K> |
map_keys(Map<K.V>) |
Returns an unordered array containing the keys of the input map. |
|
array<V> |
map_values(Map<K.V>) |
Returns an unordered array containing the values of the input map. |
|
boolean |
array_contains(Array<T>, value) |
Returns TRUE if the array contains value. |
|
array<t> |
sort_array(Array<T>) |
Sorts the input array in ascending order according to the natural ordering of the array elements and returns it (as of version 0.9.0). |
8.Hive group by聚合增强
8.1.grouping sets
grouping sets子句都可以根据UNION连接的多个GROUP BY查询进行逻辑表示
```
SELECT a,b,SUM(c)FROM tab1 GROUP BY a,b GROUPING SETS((a,b),a,b,())
等价于
SELECT a,b,SUM(c)FROM tab1 GROUP BY a,b
union
SELECT a,null,SUM(c)FROM tab1 GROUP BY a,null
union
SELECT null,b,SUM(c)FROM tab1 GROUP BY null,b
union
SELECT null,null,SUM(c)FROM tab1
```
8.2.GROUPING__ID
注意是两个下划线相连,说明聚合结果是属于(grouping sets)哪一个子集的的。
```
SELECT key, value, GROUPING__ID,count(*)
FROM T1
GROUP BY key, value
GROUPING SETS((key,value),key,value)
;
等价于
SELECT key, value,1,count(*) -- 属于第1个GROUPING SETS子集,即(key,value)
FROM T1
GROUP BY key, value
union
SELECT key, NULL,2,count(*) -- 属于第2个GROUPING SETS子集,即key
FROM T1
GROUP BY key
union
SELECT NULL, value,3,count(*) -- 属于第3个GROUPING SETS子集,即value
FROM T1
GROUP BY value
```
8.3 WITH CUBE
CUBE是是group by字段的所有组合
```
GROUP BY a,b,c WITH CUBE
等同于
GROUP BY a,b,c GROUPING SETS((a,b,c),(a,b),(b,c), (a,c),(a),(b),(c),()
```
8.4 WITH ROLLUP
ROLLUP子句与GROUP BY一起用于计算维度的层次结构级别的聚合。
```
GROUP BY a,b,c,WITH ROLLUP
等同于
GROUP BY a,b,c GROUPING SETS((a,b,c),(a,b),(a),())。
```
官网文档:
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup
Hive简记的更多相关文章
- 初识Hadoop、Hive
2016.10.13 20:28 很久没有写随笔了,自打小宝出生后就没有写过新的文章.数次来到博客园,想开始新的学习历程,总是被各种琐事中断.一方面确实是最近的项目工作比较忙,各个集群频繁地上线加多版 ...
- Hive安装配置指北(含Hive Metastore详解)
个人主页: http://www.linbingdong.com 本文介绍Hive安装配置的整个过程,包括MySQL.Hive及Metastore的安装配置,并分析了Metastore三种配置方式的区 ...
- Hive on Spark安装配置详解(都是坑啊)
个人主页:http://www.linbingdong.com 简书地址:http://www.jianshu.com/p/a7f75b868568 简介 本文主要记录如何安装配置Hive on Sp ...
- HIVE教程
完整PDF下载:<HIVE简明教程> 前言 Hive是对于数据仓库进行管理和分析的工具.但是不要被“数据仓库”这个词所吓倒,数据仓库是很复杂的东西,但是如果你会SQL,就会发现Hive是那 ...
- 基于Ubuntu Hadoop的群集搭建Hive
Hive是Hadoop生态中的一个重要组成部分,主要用于数据仓库.前面的文章中我们已经搭建好了Hadoop的群集,下面我们在这个群集上再搭建Hive的群集. 1.安装MySQL 1.1安装MySQL ...
- hive
Hive Documentation https://cwiki.apache.org/confluence/display/Hive/Home 2016-12-22 14:52:41 ANTLR ...
- RangePartitioner 实现简记
摘要: 1.背景 2.rangeBounds 上边界数组源码走读 3.RangePartitioner的sketch 源码走读 4.determineBounds 源码走读 5.关于RangePart ...
- 深入浅出数据仓库中SQL性能优化之Hive篇
转自:http://www.csdn.net/article/2015-01-13/2823530 一个Hive查询生成多个Map Reduce Job,一个Map Reduce Job又有Map,R ...
- Hive读取外表数据时跳过文件行首和行尾
作者:Syn良子 出处:http://www.cnblogs.com/cssdongl 转载请注明出处 有时候用hive读取外表数据时,比如csv这种类型的,需要跳过行首或者行尾一些和数据无关的或者自 ...
随机推荐
- WinFom解决最小化最大化后重绘窗口造成闪烁的问题
网上两种方案(可协同) 1 设置双缓冲: SetStyle(ControlStyles.UserPaint, true); SetStyle(ControlStyles.AllPaintingInWm ...
- 《Java从入门到放弃》JavaSE入门篇:面向对象语法一(入门版)
前一次简单说明了一下面向对象编程的概念,今天我们就把这些概念通过Java语法来实现,然后看看效果. 来看第一个案例:定义女神类,再根据女神类创建三个女神对象,并使用女神对象的属性和方法. 第一步:定义 ...
- 深入浅出数据结构C语言版(15)——优先队列(堆)
在普通队列中,元素出队的顺序是由元素入队时间决定的,也就是谁先入队,谁先出队.但是有时候我们希望有这样的一个队列:谁先入队不重要,重要的是谁的"优先级高",优先级越高越先出队.这样 ...
- jmeter系列-------脚本编写格式
1.通常会将用户和服务器的一次交互(页面访问或者提交)请求放在一个简单控制器或者事务控制器,例如微课首页里面包含4个接口都放到简单控制器里 或者一个提交可能,会触发3个接口,那么这3个接口放到一个简单 ...
- 一起来学linux:shell script(一)关于变量
p { margin-bottom: 0.25cm; line-height: 120% } (一)首先来看什么是变量,在shell中打印出变量采用的是echo $path或者echo ${path} ...
- Java面试系列之HashMap大扫盲汇总
PS:整理的稍微有点急,不足之处,望各路道友指正,List相关可以查看前一篇随笔! HashMap的工作原理是近年来常见的Java面试题,几乎每个Java程序员都知道HashMap,都知道哪里要用Ha ...
- sql in 和 exist的区别
详见:http://blog.yemou.net/article/query/info/tytfjhfascvhzxcytp41 select * from A where id in(select ...
- TypeScript入门知识二(参数新特性)
一,参数类型 1.在参数的名称后面使用冒号来指定参数的类型,当赋值的不是指定类型数值时会报错. var myname: string = "zhang san"; 2.当你没有指定 ...
- 【ACM小白成长撸】--贪婪法解硬币找零问题
question:假设有一种货币,它有面值为1分.2分.5分和1角的硬币,最少需要多少个硬币来找出K分钱的零钱.按照贪婪法的思想,需要不断地使用面值最大的硬币.如果找零的值小于最大的硬币值,则尝试第二 ...
- 03-TypeScript中的强类型
在js中不能定义类型,而是根据赋值后,js运行时推断类型.在ts中支持强类型,强类型包括string.number(浮点型,不是整型).boolean.any(任意类型).Array<T> ...