[Hive_add_5] Join operations in Hive

0. Overview

　　This post walks through join operations in Hive.

1. Steps

　　1.0 Create the tables

　　With the hiveserver2 service running, enter the following commands in the Beeline client.

-- Create the customers table

create table customers(id int, name string, age int) row format delimited fields terminated by '\t';

-- Create the orders table

create table orders(oid int, oname string, oprice float, uid int) row format delimited fields terminated by '\t';

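　　1.1 Create and load the data

　　Creating the data files is omitted here. Each file must be tab-delimited, one row per line, with the columns in the same order as in the table definition.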
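　　Since the data files are not shown, here is a minimal Python sketch of how tab-delimited files matching the two table definitions could be generated. The rows are hypothetical sample values, not the data used in the original post, and the files are written to a temporary directory rather than to /home/centos/files.

```python
import os
import tempfile

# Hypothetical sample rows; the original post does not show its data.
# customers: (id, name, age)    orders: (oid, oname, oprice, uid)
customers = [(1, "tom", 20), (2, "jerry", 25), (3, "anna", 30)]
orders = [
    (101, "book", 12.5, 1),
    (102, "pen", 1.5, 1),
    (103, "bag", 40.0, 2),
    (104, "cup", 8.0, 9),  # uid 9 has no matching customer
]


def write_tsv(path, rows):
    # One row per line, fields separated by a tab, which is what
    # "row format delimited fields terminated by '\t'" expects.
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write("\t".join(str(value) for value in row) + "\n")


out_dir = tempfile.mkdtemp()
customers_path = os.path.join(out_dir, "customers.txt")
orders_path = os.path.join(out_dir, "orders.txt")
write_tsv(customers_path, customers)
write_tsv(orders_path, orders)
print("wrote", customers_path, "and", orders_path)
```

　　To use the files with the load commands in this post, copy them to the path named in `load data local inpath`.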

　　The load commands are as follows:

-- Load the customer data

load data local inpath '/home/centos/files/customers.txt' into table customers;

-- Load the order data

load data local inpath '/home/centos/files/orders.txt' into table orders;

　　1.2 Using join

-- Inner join

select a.id, a.name, b.oname, b.oprice from customers a inner join orders b on a.id=b.uid;

-- Left outer join

select a.id, a.name, b.oname, b.oprice from customers a left outer join orders b on a.id=b.uid;

-- Right outer join

select a.id, a.name, b.oname, b.oprice from customers a right outer join orders b on a.id=b.uid;

-- Full outer join

select a.id, a.name, b.oname, b.oprice from customers a full outer join orders b on a.id=b.uid;

2. Join types and optimization

　　2.1 Ordinary join

select a.id, a.name, b.oname, b.oprice from customers a inner join orders b on a.id=b.uid;

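　　a inner join b　　// returns only the row pairs that match on the join key (a ∩ b)

　　a left [outer] join b　　// returns the matched pairs plus every unmatched row of a, so at least as many rows as a

　　a right [outer] join b　　// returns the matched pairs plus every unmatched row of b, so at least as many rows as b

　　a full [outer] join b　　// returns the matched pairs plus the unmatched rows of both a and b

　　a cross join b　　// returns the Cartesian product, rows(a) * rows(b)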
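　　The row counts of each join type can be checked with a small simulation. The sketch below is plain Python, not Hive code, and runs on hypothetical sample rows in which one customer has two orders, one customer has none, and one order points to a missing customer.

```python
# Hypothetical sample rows, not the data from the original post.
customers = [(1, "tom"), (2, "jerry"), (3, "anna")]  # (id, name)
orders = [  # (oid, oname, uid)
    (101, "book", 1),
    (102, "pen", 1),
    (103, "bag", 2),
    (104, "cup", 9),
]


def join(left, right, how):
    """Nested-loop join on left.id == right.uid for the given join type."""
    rows = []
    matched_right = set()
    for l in left:
        hit = False
        for i, r in enumerate(right):
            if l[0] == r[2]:
                rows.append((l, r))
                matched_right.add(i)
                hit = True
        # Outer joins keep unmatched left rows, padded with None (NULL).
        if not hit and how in ("left", "full"):
            rows.append((l, None))
    # Right and full joins also keep unmatched right rows.
    if how in ("right", "full"):
        for i, r in enumerate(right):
            if i not in matched_right:
                rows.append((None, r))
    return rows


counts = {how: len(join(customers, orders, how))
          for how in ("inner", "left", "right", "full")}
counts["cross"] = len([(l, r) for l in customers for r in orders])
print(counts)
# {'inner': 3, 'left': 4, 'right': 4, 'full': 5, 'cross': 12}
```

　　Note that the left join returns 4 rows although customers has only 3: a customer with two orders appears twice.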

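　　2.2 Special join optimizations

　　map join

　　Small table + large table => the small table is shipped to every map task through the distributed cache and loaded into memory, and the join is done while iterating over all rows of the large table, so no reduce stage is needed.

　　In old Hive versions (before 0.7), joins ran on the reduce side (reduce-side join) by default.
　　To get a map-side join there, you had to request it with a hint. For a join b, where a is the small table and b is the large table, the hint names the small table:

　　/*+ MAPJOIN(small_table) */

　　SELECT /*+ MAPJOIN(a) */ a.id, a.name, b.oname, b.oprice from customers a inner join orders b on a.id=b.uid;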

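　　In newer Hive versions (hive.auto.convert.join was added in 0.7 and is on by default since 0.11), to get a map-side join:

　　jdbc:hive2://> SET hive.auto.convert.join=true;　　// convert to a map-side join automatically when one table is small enough
　　jdbc:hive2://> SET hive.mapjoin.smalltable.filesize=600000000;　　// upper size limit, in bytes, for the small table of a map-side join; the default is 25000000 (about 25 MB)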
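　　The idea behind a map-side join can be sketched in a few lines of Python. This is an illustration of the general hash-join technique, not Hive's actual implementation: the function name, the column positions, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict


def map_side_join(small_rows, big_rows, small_key, big_key):
    """Hash join: build an in-memory table from the small input, then
    stream the big input once and probe the table for each row."""
    # Build phase: in Hive this hash table is what gets distributed
    # to every map task.
    table = defaultdict(list)
    for row in small_rows:
        table[row[small_key]].append(row)
    # Probe phase: each big-table row is joined inside the "mapper",
    # so no shuffle or reduce step is needed.
    for big in big_rows:
        for small in table.get(big[big_key], ()):
            yield small + big


# Hypothetical sample rows, not the data from the original post.
customers = [(1, "tom"), (2, "jerry"), (3, "anna")]  # (id, name)
orders = [  # (oid, oname, oprice, uid)
    (101, "book", 12.5, 1),
    (102, "pen", 1.5, 1),
    (103, "bag", 40.0, 2),
    (104, "cup", 8.0, 9),
]

joined = list(map_side_join(customers, orders, small_key=0, big_key=3))
for row in joined:
    print(row)
```

　　The memory cost is driven by the small table alone, which is why the size limit on the small table matters.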

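　　common join

　　This is the reduce-side join. In each reducer, Hive buffers the rows of every table except the last one in the join and streams the last one, so the large table should be the streamed one. There are two ways to arrange that:
　　1. Add a hint that names the large table
　　/*+ STREAMTABLE(large_table) */

　　2. Put the large table on the right side of the join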
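　　Why the choice of streamed table matters can be illustrated with a simplified Python model of a reduce-side join. This is a sketch under stated assumptions, not Hive's implementation: everything here lives in ordinary lists, the function name is made up, and the sample rows are hypothetical. It only tracks how many rows per key would have to be held in memory.

```python
from collections import defaultdict


def reduce_side_join(buffered_rows, streamed_rows, buffered_key, streamed_key):
    """Group both inputs by join key (the shuffle), then join per key.
    Returns the joined rows and the largest per-key buffer that was needed."""
    groups = defaultdict(lambda: ([], []))
    for row in buffered_rows:
        groups[row[buffered_key]][0].append(row)
    for row in streamed_rows:
        groups[row[streamed_key]][1].append(row)

    out = []
    max_buffered = 0
    for key in sorted(groups):
        buffered, streamed = groups[key]
        # The buffered side must be fully held for this key, while the
        # streamed side can be read one row at a time.
        max_buffered = max(max_buffered, len(buffered))
        for s in streamed:
            for b in buffered:
                out.append(b + s)
    return out, max_buffered


# Hypothetical sample rows, not the data from the original post.
customers = [(1, "tom"), (2, "jerry"), (3, "anna")]  # (id, name)
orders = [  # (oid, oname, uid)
    (101, "book", 1),
    (102, "pen", 1),
    (103, "bag", 2),
    (104, "cup", 9),
]

rows_a, buf_a = reduce_side_join(customers, orders, 0, 2)  # stream orders
rows_b, buf_b = reduce_side_join(orders, customers, 2, 0)  # stream customers
print(len(rows_a), buf_a)  # 3 1
print(len(rows_b), buf_b)  # 3 2
```

　　Both orders produce the same number of joined rows, but streaming the table with more rows per key keeps the buffer smaller.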

　　2.3 Tests

　　Test tables: customers and orders

　　1. Write no hint and observe whether Hive runs a map-side join or a reduce-side join

SELECT a.id, a.name, b.oname, b.oprice from customers a inner join orders b on a.id=b.uid;

　　2. Add the map join hint and observe the effect

SELECT /*+ MAPJOIN(a) */ a.id, a.name, b.oname, b.oprice from customers a inner join orders b on a.id=b.uid;

　　3. Set automatic map join conversion to false

SET hive.auto.convert.join=false;

　　4. Add the reduce-side join hint and observe the result

SELECT /*+ STREAMTABLE(a) */ a.id, a.name, b.oname, b.oprice from customers a inner join orders b on a.id=b.uid;