HIVE JOIN

Overview

Hive implements several join strategies:

Common (Reduce-side) Join
Broadcast (Map-side) Join
Bucket Map Join
Sort Merge Bucket Join
Skew Join

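Only the first two are covered here.

The first is the common join. As the name suggests, it is the most common join implementation, but it is not very flexible and its performance is not great.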

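A common join consists of a map phase, a shuffle phase, and a reduce phase. The map phase produces the join key and join value required by the join condition and writes them to intermediate files. The shuffle phase sorts these files by join key and merges the records that share a key into a single file. The reduce phase performs the final merge and produces the result.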
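The three phases described above can be sketched in plain Python. This is only an illustration of the idea, not Hive's implementation: the function names, the numeric table tags, and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict

def map_phase(rows, tag, key_index):
    """Emit (join_key, (tag, row)) pairs; the tag records the source table."""
    return [(row[key_index], (tag, row)) for row in rows]

def shuffle_phase(pairs):
    """Group the pairs by join key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return sorted(groups.items(), key=lambda kv: kv[0])

def reduce_phase(groups):
    """Left outer join: pair each left row with every matching right row, or None."""
    out = []
    for _key, tagged_rows in groups:
        left = [r for t, r in tagged_rows if t == 0]
        right = [r for t, r in tagged_rows if t == 1]
        for l in left:
            for r in (right or [None]):
                out.append((l, r))
    return out

# Hypothetical rows: (id, uid, city_id) and (id, name).
users = [(1, 10, 1), (2, 20, 2), (3, 30, 9)]
cities = [(1, "beijing"), (2, "shanghai")]

pairs = map_phase(users, 0, 2) + map_phase(cities, 1, 0)
joined = reduce_phase(shuffle_phase(pairs))
```

Every row of both tables passes through the shuffle here, which is the cost that the map-side strategy avoids.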

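The second is the broadcast join. It eliminates the shuffle and reduce phases and completes the join in the map phase: the small table of the join is loaded into memory, and every mapper joins directly against that in-memory table.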

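How the small table gets into memory takes some care. First, the small table is read into memory and serialized to a hashtable file. When the map phase starts, this hashtable file is placed in the distributed cache and distributed to the local disk of each mapper; each mapper then loads the hashtable file into memory and performs the join. With this optimization the small table is read only once, and if several mappers run on the same machine, that machine needs only one copy of the hashtable file.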
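A rough Python sketch of this flow follows. It makes several assumptions: `pickle` stands in for whatever serialization Hive actually uses, a shared temporary path stands in for the distributed cache, and the function names and sample rows are invented for the example.

```python
import os
import pickle
import tempfile

def build_hashtable_file(small_rows, key_index, path):
    """Local task: hash the small table on the join key and serialize it to a file."""
    table = {}
    for row in small_rows:
        table.setdefault(row[key_index], []).append(row)
    with open(path, "wb") as f:
        pickle.dump(table, f)

def run_mapper(big_rows_split, key_index, path):
    """Mapper: load the hashtable file into memory, then probe it row by row."""
    with open(path, "rb") as f:
        table = pickle.load(f)
    out = []
    for row in big_rows_split:
        # [None] keeps unmatched rows of the big table (left outer join).
        for match in table.get(row[key_index], [None]):
            out.append((row, match))
    return out

# Hypothetical rows: the small table is (id, name); the big table is
# (id, uid, city_id), already divided into two input splits.
cities = [(1, "beijing"), (2, "shanghai")]
splits = [[(1, 10, 1), (2, 20, 2)], [(3, 30, 9)]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "small_table.hashtable")
    build_hashtable_file(cities, 0, path)
    results = [run_mapper(split, 2, path) for split in splits]
```

Each mapper produces its share of the joined output independently, so no shuffle or reduce step is needed.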

Inspecting the plan with EXPLAIN

Two tables are used: test_a and test_city.

The data in test_a:

test_a.id	test_a.uid	test_a.city_id
1	1	1
2	2	2
3	3	3

The data in test_city:

test_city.id	test_city.name
1	beijing
2	shanghai
3	hangzhou

LEFT JOIN

The SQL is:

explain
select a.id, a.uid, b.name
from
    temp.test_a as a
left join
    temp.test_city as b
on a.city_id = b.id;

Because the tables are small, Hive chooses a map-side join. The plan is:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables: // read the data from files
        $hdt$_1:b
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_1:b
          TableScan // scan table test_city, reading it row by row
            alias: b
            Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
            Select Operator // select the columns
              expressions: id (type: bigint), name (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator // my understanding is that these are the keys used when putting the data into the distributed cache, but I am not certain
                keys:
                  0 _col2 (type: bigint)
                  1 _col0 (type: bigint)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
            Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), uid (type: bigint), city_id (type: bigint)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator // note that a map-side join is used here
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 _col2 (type: bigint)
                  1 _col0 (type: bigint)
                outputColumnNames: _col0, _col1, _col4
                Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: bigint), _col1 (type: bigint), _col4 (type: string)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

If you set

set hive.auto.convert.join=false;

the query becomes a reduce-side join, which is the most commonly used join implementation. The plan now has two stages:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree: // map phase
          TableScan
            alias: a
            Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), uid (type: bigint), city_id (type: bigint)
              outputColumnNames: _col0, _col1, _col2
              Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator // map-side output operator: emits rows keyed by the join key to the reduce phase
                key expressions: _col2 (type: bigint)
                sort order: +
                Map-reduce partition columns: _col2 (type: bigint)
                Statistics: Num rows: 3 Data size: 15 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col0 (type: bigint), _col1 (type: bigint)
          TableScan
            alias: b
            Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: id (type: bigint), name (type: string)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: bigint)
                sort order: +
                Map-reduce partition columns: _col0 (type: bigint)
                Statistics: Num rows: 3 Data size: 29 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col1 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          keys:
            0 _col2 (type: bigint)
            1 _col0 (type: bigint)
          outputColumnNames: _col0, _col1, _col4
          Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col1 (type: bigint), _col4 (type: string)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 3 Data size: 16 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
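The two plans can be told apart by a few markers: the map-side plan contains a Map Join Operator inside the Map Operator Tree, while the reduce-side plan has a Join Operator under a Reduce Operator Tree. A small Python helper along these lines could classify a saved plan. It is a heuristic sketch based only on the two plans shown here; `classify_join` and the shortened plan strings are hypothetical, and other execution engines or Hive versions may print different operator names.

```python
def classify_join(plan_text):
    """Return 'map-side', 'reduce-side', or 'unknown' for an EXPLAIN plan string."""
    lines = [line.strip() for line in plan_text.splitlines()]
    # A Map Join Operator means the join is completed inside the mappers.
    if any(line.startswith("Map Join Operator") for line in lines):
        return "map-side"
    # A plain Join Operator together with a reduce tree means a common join.
    has_reduce_tree = any(line.startswith("Reduce Operator Tree:") for line in lines)
    has_join = any(line.startswith("Join Operator") for line in lines)
    if has_reduce_tree and has_join:
        return "reduce-side"
    return "unknown"

# Shortened, hypothetical excerpts in the shape of the plans above.
map_side_plan = """
  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: a
              Map Join Operator
"""
reduce_side_plan = """
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
      Reduce Operator Tree:
        Join Operator
"""

print(classify_join(map_side_plan))
print(classify_join(reduce_side_plan))
```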
