HIVE JOIN

Overview

Hive implements several kinds of join:

Common (Reduce-side) Join
Broadcast (Map-side) Join
Bucket Map Join
Sort Merge Bucket Join
Skew Join

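This post covers the first two.

The first is the common join. As the name suggests, it is the most common join implementation, but it is not very flexible and its performance is not great.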

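A common join consists of a map phase, a shuffle phase, and a reduce phase. The map phase generates the join key and join value required by the join condition and writes them to intermediate files. The shuffle phase sorts these files by join key and merges records with the same key into the same file. The reduce phase performs the final merge and produces the result data.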
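
The three phases above can be sketched in a few lines of Python. This is only an illustrative model, not Hive's implementation: the function names (map_phase, shuffle_phase, reduce_phase, common_join) and the 0/1 table tag are assumptions made for the example, and the data is the sample data used later in this post.

```python
from collections import defaultdict

# Sample rows from the test_a and test_city tables used in this post.
TEST_A = [(1, 1, 1), (2, 2, 2), (3, 3, 3)]                      # (id, uid, city_id)
TEST_CITY = [(1, "beijing"), (2, "shanghai"), (3, "hangzhou")]  # (id, name)


def map_phase(test_a, test_city):
    """Emit (join_key, tag, value) records; the tag marks the source table."""
    for id_, uid, city_id in test_a:
        yield city_id, 0, (id_, uid)
    for id_, name in test_city:
        yield id_, 1, (name,)


def shuffle_phase(records):
    """Sort by join key and group the records that share a key."""
    groups = defaultdict(lambda: ([], []))
    for key, tag, value in sorted(records, key=lambda r: (r[0], r[1])):
        groups[key][tag].append(value)
    return groups


def reduce_phase(groups):
    """Perform a left outer join inside each key group."""
    out = []
    for key in sorted(groups):
        left, right = groups[key]
        for id_, uid in left:
            # A left row with no match still produces one row, with name = None.
            for (name,) in right or [(None,)]:
                out.append((id_, uid, name))
    return out


def common_join(test_a, test_city):
    return reduce_phase(shuffle_phase(map_phase(test_a, test_city)))


print(common_join(TEST_A, TEST_CITY))
```

The point of the sketch is that every row of both tables passes through the sort-and-group step, which is the cost the broadcast join avoids.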

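The second is the broadcast join. It eliminates the shuffle and reduce phases and performs the join in the map phase: the small table of the join is loaded into memory, and every mapper joins directly against that in-memory table data, so the whole join completes in the map phase.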

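How the small table gets into memory takes some care. The small table is first loaded into memory and then serialized to a hashtable file. When the map phase starts, this hashtable file is put into the distributed cache and distributed to the local disk of each mapper; each mapper then loads the hashtable file into memory and performs the join. With this optimization the small table only needs to be read once, and if many mappers run on the same machine, that machine needs only one copy of the hashtable file.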
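
The flow described above can be sketched roughly as follows. This is a simplified model and not Hive's code: build_hashtable_file, mapper and broadcast_join are hypothetical names, pickle stands in for Hive's own hashtable file format, and a temporary directory stands in for the distributed cache.

```python
import os
import pickle
import tempfile

# Sample rows from the test_a and test_city tables used in this post.
TEST_A = [(1, 1, 1), (2, 2, 2), (3, 3, 3)]                      # (id, uid, city_id)
TEST_CITY = [(1, "beijing"), (2, "shanghai"), (3, "hangzhou")]  # (id, name)


def build_hashtable_file(small_table, path):
    """Load the small table into a dict keyed on the join key and serialize it."""
    table = {}
    for id_, name in small_table:
        table.setdefault(id_, []).append(name)
    with open(path, "wb") as f:
        pickle.dump(table, f)


def mapper(split, hashtable_path):
    """One mapper: load the hashtable file, then probe it for each row of its split."""
    with open(hashtable_path, "rb") as f:
        table = pickle.load(f)
    out = []
    for id_, uid, city_id in split:
        # Left outer join: an unmatched row is kept with name = None.
        for name in table.get(city_id, [None]):
            out.append((id_, uid, name))
    return out


def broadcast_join(big_table, small_table, num_mappers=2):
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "hashtable.bin")
        build_hashtable_file(small_table, path)  # the small table is read once
        splits = [big_table[i::num_mappers] for i in range(num_mappers)]
        rows = [row for split in splits for row in mapper(split, path)]
    return sorted(rows, key=lambda r: r[:2])


print(broadcast_join(TEST_A, TEST_CITY))
```

Note that there is no sort or grouping of the big table here; each mapper works only on its own split and the shared hashtable file.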

Inspecting with EXPLAIN

Two tables were prepared: test_a and test_city.

The data in test_a:

test_a.id	test_a.uid	test_a.city_id
1	1	1
2	2	2
3	3	3

The data in test_city:

test_city.id	test_city.name
1	beijing
2	shanghai
3	hangzhou

LEFT JOIN

The SQL is:

explain
select a.id, a.uid, b.name
from
    temp.test_a as a
left join
    temp.test_city as b
on a.city_id = b.id;

Because the tables are small, a map-side join is used. The plan is as follows:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables: // read the data from files
        $hdt$_1:b
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_1:b
          TableScan // scan table test_city, reading it row by row
            alias: b
            Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
            Select Operator // select the columns
              expressions: id (type: bigint), name (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator // as I understand it, these are the keys used when the data is put into the distributed cache, but I am not sure
                keys:
                  0 _col2 (type: bigint)
                  1 _col0 (type: bigint)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), uid (type: bigint), city_id (type: bigint)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator // note that a map-side join is used here
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 _col2 (type: bigint)
                  1 _col0 (type: bigint)
                outputColumnNames: _col0, _col1, _col4
                Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: bigint), _col1 (type: bigint), _col4 (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

If you set

set hive.auto.convert.join=false;

the plan becomes a reduce-side join, which is the most widely used join implementation. The whole process consists of two parts:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree: // the map phase
          TableScan
            alias: a
            Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), uid (type: bigint), city_id (type: bigint)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator // map-side output: emits key/value pairs that are shuffled to the reduce phase
                key expressions: _col2 (type: bigint)
                sort order: +
                Map-reduce partition columns: _col2 (type: bigint)
                Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: bigint), _col1 (type: bigint)
          TableScan
            alias: b
            Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), name (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: bigint)
                sort order: +
                Map-reduce partition columns: _col0 (type: bigint)
                Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          keys:
            0 _col2 (type: bigint)
            1 _col0 (type: bigint)
          outputColumnNames: _col0, _col1, _col4
          Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col1 (type: bigint), _col4 (type: string)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
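
The choice between the two plans shown in this post can be pictured with a very rough decision rule. This is a deliberately simplified sketch, not Hive's planner logic: choose_join is a hypothetical helper, and the threshold is only loosely modeled on Hive's small-table size setting (hive.mapjoin.smalltable.filesize); both the value used here and the exact comparison are assumptions for illustration.

```python
# Assumed size threshold in bytes, for illustration only.
SMALL_TABLE_FILESIZE = 25_000_000


def choose_join(auto_convert, small_table_bytes, threshold=SMALL_TABLE_FILESIZE):
    """Pick a join strategy from the auto-convert flag and the small table's size."""
    if auto_convert and small_table_bytes <= threshold:
        return "map-side join"
    return "reduce-side join"


# test_city is reported as "Data size: 29" in the plans in this post.
print(choose_join(True, 29))
print(choose_join(False, 29))
```

With auto conversion enabled, the tiny test_city table qualifies for the map-side plan; with hive.auto.convert.join=false the sketch always falls back to the reduce-side plan, matching the two EXPLAIN outputs in this post.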
