Advanced Hive Usage

Creating Hive Tables and Data Types

https://cwiki.apache.org/confluence/display/Hive/Home

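Differences between managed tables and external tables

# Managed tables

1. A managed table (also called an internal table) has the table type MANAGED_TABLE;

2. By default its data is stored under /user/hive/warehouse, but a different directory can be specified with LOCATION;

3. Dropping the table deletes both the table data and the metadata;

# External tables

1. An external table has the table type EXTERNAL_TABLE;

2. The directory can be specified with LOCATION when the table is created;

3. Dropping the table deletes only the metadata; the table data is kept;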
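
The drop behaviour of the two table types can be illustrated with a toy model in plain Java. This is only a sketch, not Hive code: the class DropTableSketch, its two maps, and the method names are hypothetical stand-ins for the metastore and the data directories.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (not Hive code): one map plays the metastore, the other the data directories.
class DropTableSketch {
    enum TableType { MANAGED_TABLE, EXTERNAL_TABLE }

    // table name -> table type (stands in for the metastore entry)
    final Map<String, TableType> metastore = new HashMap<>();
    // table name -> data directory that still exists on the file system
    final Map<String, String> dataDirs = new HashMap<>();

    void createTable(String name, TableType type, String location) {
        metastore.put(name, type);
        dataDirs.put(name, location);
    }

    void dropTable(String name) {
        TableType type = metastore.remove(name);   // metadata is always removed
        if (type == TableType.MANAGED_TABLE) {
            dataDirs.remove(name);                 // data is removed only for managed tables
        }
    }

    public static void main(String[] args) {
        DropTableSketch s = new DropTableSketch();
        s.createTable("emp", TableType.MANAGED_TABLE, "/user/hive/warehouse/emp");
        s.createTable("emp_ext", TableType.EXTERNAL_TABLE, "/user/beifeng/hive/warehouse/emp_ext");
        s.dropTable("emp");
        s.dropTable("emp_ext");
        System.out.println("metadata left: " + s.metastore.keySet());
        System.out.println("data left: " + s.dataDirs.values());
    }
}
```

After both drops the model has no metadata left, while the directory of the external table is still present.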

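Creating and querying partitioned tables

Each partition of a partitioned table corresponds to a separate directory on HDFS, and that directory holds all the data files of the partition. Partitioning in Hive is simply splitting a table into directories: a large data set is divided into smaller data sets according to business needs.

# Directory layout in HDFS: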

/user/hive/warehouse/bf_log/

    /20150911/

        20150911.log

    /20150912/

        20150912.log

# Create a partitioned table

create EXTERNAL table IF NOT EXISTS default.emp_partition(

empno int,

ename string,

job string,

mgr int,

hiredate string,

sal double,

comm double,

deptno int

)

partitioned by (month string,day string)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ;

# Load data into the partitioned table

load data local inpath '/opt/datas/emp.txt' into table default.emp_partition partition (month='201509',day='13') ;

# Query

select * from emp_partition where month = '201509' and day = '13' ;
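
How a partition specification maps to a directory can be sketched in plain Java. This assumes the default warehouse directory and no LOCATION clause, so the table directory is /user/hive/warehouse/emp_partition; the class and method names are hypothetical, and the sketch ignores the escaping Hive applies to special characters in partition values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PartitionPathSketch {
    // Builds <tableDir>/<key1>=<value1>/<key2>=<value2>..., in partition-column order.
    static String partitionPath(String tableDir, Map<String, String> partitionSpec) {
        StringBuilder path = new StringBuilder(tableDir);
        for (Map.Entry<String, String> e : partitionSpec.entrySet()) {
            path.append('/').append(e.getKey()).append('=').append(e.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the order of "partitioned by (month string, day string)".
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("month", "201509");
        spec.put("day", "13");
        System.out.println(partitionPath("/user/hive/warehouse/emp_partition", spec));
    }
}
```

A query that filters on month and day only needs to read the files under that one directory.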

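Things to note about partitioned tables

Create an ordinary (non-partitioned) table, then put the data directly into the table directory on HDFS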

create table IF NOT EXISTS default.dept_nopart(

deptno int,

dname string,

loc string

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

# Upload the data directly to the table directory

dfs -put /opt/datas/dept.txt /user/hive/warehouse/dept_nopart ;

# The data can be queried normally

select * from dept_nopart ;

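Create a partitioned table, then put the data directly into a partition directory on HDFS. A query returns no rows from that directory until the partition is registered in the metastore, which can be done in either of the two ways below.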

create table IF NOT EXISTS default.dept_part(

deptno int,

dname string,

loc string

)

partitioned by (day string)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

## Method 1

dfs -mkdir -p /user/hive/warehouse/dept_part/day=20150913 ;

dfs -put /opt/datas/dept.txt /user/hive/warehouse/dept_part/day=20150913 ;

## Repair the partition metadata

msck repair table dept_part ;

## Method 2

dfs -mkdir -p /user/hive/warehouse/dept_part/day=20150914 ;

dfs -put /opt/datas/dept.txt /user/hive/warehouse/dept_part/day=20150914 ;

## Add the partition manually

alter table dept_part add partition(day='20150914');

show partitions dept_part ;
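
Why data that is put directly into a partition directory stays invisible until the partition is registered can be sketched with a toy model in plain Java. The class PartitionRepairSketch and its methods are hypothetical; they only mimic the relationship between directories on HDFS and partitions known to the metastore.

```java
import java.util.Set;
import java.util.TreeSet;

class PartitionRepairSketch {
    // Partition directories that physically exist under the table directory.
    final Set<String> directoriesOnHdfs = new TreeSet<>();
    // Partitions the metastore knows about; only these are visible to queries.
    final Set<String> registeredPartitions = new TreeSet<>();

    // Stands in for "dfs -mkdir" plus "dfs -put": changes the file system only.
    void putDirectly(String partitionDir) {
        directoriesOnHdfs.add(partitionDir);
    }

    // Stands in for "alter table ... add partition": registers one partition.
    void addPartition(String partitionDir) {
        registeredPartitions.add(partitionDir);
    }

    // Stands in for "msck repair table": registers every directory the metastore lacks.
    void repair() {
        registeredPartitions.addAll(directoriesOnHdfs);
    }

    public static void main(String[] args) {
        PartitionRepairSketch t = new PartitionRepairSketch();
        t.putDirectly("day=20150913");
        System.out.println("visible before repair: " + t.registeredPartitions);
        t.repair();
        System.out.println("visible after repair: " + t.registeredPartitions);
        t.putDirectly("day=20150914");
        t.addPartition("day=20150914");
        System.out.println("visible after add partition: " + t.registeredPartitions);
    }
}
```

In this model, repair picks up all unregistered directories at once, while addPartition registers exactly the one partition that is named.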

Common ways to load data into Hive tables

1) Load a local file into a Hive table

load data local inpath '/opt/datas/emp.txt' into table default.emp ;

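2) Load an HDFS file into a Hive table, replacing the existing data (overwrite removes the files already in the table)

load data inpath '/user/beifeng/hive/datas/emp.txt' overwrite into table default.emp ;

3) Load an HDFS file into a Hive table, appending to the existing data (the default when overwrite is omitted)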

load data inpath '/user/beifeng/hive/datas/emp.txt' into table default.emp ;

4) Load with insert after creating the table

create table default.emp_ci like emp ;

insert into table default.emp_ci select * from default.emp ;

5) Point the table at existing data with location when creating the table

create EXTERNAL table IF NOT EXISTS default.emp_ext(

empno int,

ename string

)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

location '/user/beifeng/hive/warehouse/emp_ext';
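
The difference between load data ... into table and load data ... overwrite into table can be modelled with a few lines of plain Java. This is a sketch only: LoadDataSketch and its file list are hypothetical, and the file names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class LoadDataSketch {
    // Files currently sitting in the table directory.
    final List<String> tableFiles = new ArrayList<>();

    // overwrite=false models "load data ... into table" (append);
    // overwrite=true models "load data ... overwrite into table" (replace).
    void load(String file, boolean overwrite) {
        if (overwrite) {
            tableFiles.clear();
        }
        tableFiles.add(file);
    }

    public static void main(String[] args) {
        LoadDataSketch emp = new LoadDataSketch();
        emp.load("emp_first.txt", false);
        emp.load("emp_second.txt", false);
        System.out.println("after two appends: " + emp.tableFiles);
        emp.load("emp_third.txt", true);
        System.out.println("after overwrite: " + emp.tableFiles);
    }
}
```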

Ways to export data from Hive

1) insert overwrite local directory: export to the local file system

# Without a row format

insert overwrite local directory '/opt/datas/hive_exp_emp'

select * from default.emp ;

# With ROW FORMAT

insert overwrite local directory '/opt/datas/hive_exp_emp2'

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'

select * from default.emp ;

2) Redirect the output of the hive command line to a file

bin/hive -e "select * from default.emp ;" > /opt/datas/exp_res.txt

3) Export to HDFS, then fetch the files with hdfs dfs -get

insert overwrite directory '/user/beifeng/hive/hive_exp_emp'

select * from default.emp ;

Common Hive queries

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select

Export / Import in Hive

Export

Exports the data of a Hive table, together with its metadata, to an external location.

EXPORT TABLE default.emp TO '/user/beifeng/hive/export/emp_exp' ;

export_target_path: a path on HDFS

Import

Imports previously exported data into a Hive table.

# Create a table with the same structure as default.emp

create table db_hive.emp like default.emp ;

import table db_hive.emp from '/user/beifeng/hive/export/emp_exp';

Sorting and distributing data in Hive

1) order by

# Global sort over all the data; only one reducer is used

select * from emp order by empno desc ;

2) sort by

# Sorts the data inside each reducer; the overall result set is not globally sorted

set mapreduce.job.reduces=3;

insert overwrite local directory '/opt/datas/sortby-res' select * from emp sort by empno asc ;

3) distribute by

# Distributes rows among reducers, similar to the partitioner in MapReduce; used together with sort by

insert overwrite local directory '/opt/datas/distby-res' select * from emp distribute by deptno sort by empno asc ;

# Note:

distribute by must come before sort by.

4) cluster by

# When distribute by and sort by use the same column, cluster by can be used instead

insert overwrite local directory '/opt/datas/cluster-res' select * from emp cluster by empno ;
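
The difference between a global sort and a per-reducer sort can be sketched in plain Java. The Row class, the sample rows, and the routing rule (non-negative hash of the key modulo the number of reducers) are simplifying assumptions made for illustration; this is not Hive's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ReducerSortSketch {
    // Minimal row: only the two columns used by the queries above.
    static final class Row {
        final int empno;
        final int deptno;
        Row(int empno, int deptno) { this.empno = empno; this.deptno = deptno; }
    }

    // Models "distribute by deptno sort by empno":
    // route each row by deptno, then sort inside each reducer.
    static List<List<Row>> distributeAndSort(List<Row> rows, int numReducers) {
        List<List<Row>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new ArrayList<>());
        }
        for (Row r : rows) {
            // Assumption: target reducer = non-negative hash of the key modulo reducer count.
            int target = (Integer.hashCode(r.deptno) & Integer.MAX_VALUE) % numReducers;
            reducers.get(target).add(r);
        }
        for (List<Row> part : reducers) {
            part.sort(Comparator.comparingInt(r -> r.empno));
        }
        return reducers;
    }

    // Models "order by empno": a single reducer sees every row, so the result is globally sorted.
    static List<Row> orderBy(List<Row> rows) {
        List<Row> all = new ArrayList<>(rows);
        all.sort(Comparator.comparingInt(r -> r.empno));
        return all;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row(7499, 30), new Row(7369, 20), new Row(7839, 10),
            new Row(7521, 30), new Row(7782, 10), new Row(7566, 20));
        List<List<Row>> parts = distributeAndSort(rows, 3);
        for (int i = 0; i < parts.size(); i++) {
            StringBuilder line = new StringBuilder("reducer " + i + ":");
            for (Row r : parts.get(i)) {
                line.append(" (").append(r.empno).append(", dept ").append(r.deptno).append(")");
            }
            System.out.println(line);
        }
    }
}
```

Each reducer's output is sorted by empno and holds complete departments, but concatenating the three outputs does not give a globally sorted result. In this model, cluster by corresponds to routing and sorting on the same column.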

Hive UDFs (user-defined functions)

Add the Hive dependencies to pom.xml (the Hadoop dependencies are also required)

    <!-- Hive Client -->

    <dependency>

        <groupId>org.apache.hive</groupId>

        <artifactId>hive-jdbc</artifactId>

        <version>${hive.version}</version>

    </dependency>

    <dependency>

        <groupId>org.apache.hive</groupId>

        <artifactId>hive-exec</artifactId>

        <version>${hive.version}</version>

    </dependency>

Creating Custom UDFs

Official documentation: https://cwiki.apache.org/confluence/display/Hive/HivePlugins

1) First, you need to create a new class that extends UDF, with one or more methods named evaluate.

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

import org.apache.hadoop.io.Text;

public final class Lower extends UDF {

    public Text evaluate(final Text s) {

        if (s == null) {

            return null;

        }

        return new Text(s.toString().toLowerCase());

    }

}

2) After compiling your code to a jar, you need to add this to the Hive classpath. See the section on deploying jars in the official documentation linked above.

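3) Two ways to use the UDF

# Method 1: add the jar to the Hive classpath of the current session

add jar /opt/datas/hiveudf.jar ;

# Create a temporary function; the class name must be the fully qualified name of the UDF class inside the jar (here the jar is assumed to contain com.beifeng.senior.hive.udf.LowerUDF, not the com.example.hive.udf.Lower sample above)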

create temporary function my_lower as "com.beifeng.senior.hive.udf.LowerUDF" ;

# Use it

select ename, my_lower(ename) lowername from emp limit 5 ;

# Method 2: create a permanent function from a jar stored on HDFS

CREATE FUNCTION self_lower AS 'com.beifeng.senior.hive.udf.LowerUDF' USING JAR 'hdfs://hadoop-senior.ibeifeng.com:8020/user/beifeng/hive/jars/hiveudf.jar';

select ename, self_lower(ename) lowername from emp limit 5 ;
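
The null handling of the evaluate method in the UDF sample above can be checked without a Hive runtime. The following sketch restates the same logic on plain String values; the class LowerLogicSketch is hypothetical, and it uses Locale.ROOT so that the result does not depend on the JVM's default locale, which is a small deviation from the sample.

```java
import java.util.Locale;

class LowerLogicSketch {
    // Same contract as evaluate(): null in, null out; otherwise lower-case the value.
    static String lower(String s) {
        if (s == null) {
            return null;
        }
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(lower("SMITH"));
        System.out.println(lower(null));
    }
}
```

Keeping the transformation in a small static method like this makes it easy to unit test before wrapping it in a class that extends UDF.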