SQL里面通常都会用Join来连接两个表,做复杂的关联查询。比如用户表和订单表,能通过join得到某个用户购买的产品;或者某个产品被购买的人群....

Hive也支持这样的操作,而且由于Hive底层运行在hadoop上,因此有很多地方可以进行优化。比如小表到大表的连接操作、小表进行缓存、大表进行避免缓存等等...

下面就来看看hive里面的连接操作吧!其实跟SQL还是差不多的...

数据准备:创建数据-->创建表-->导入数据

首先创建两个原始数据的文件,这两个文件分别有三列,第一列是id、第二列是名称、第三列是另外一个表的id。通过第二列可以明显的看到两个表做连接查询的结果:

[xingoo@localhost tmp]$ cat aa.txt
1 a 3
2 b 4
3 c 1
[xingoo@localhost tmp]$ cat bb.txt
1 xxx 2
2 yyy 3
3 zzz 5

接下来创建两个表,需要注意的是表的字段分隔符为空格,另一个表可以直接基于当前的表创建。

hive> create table aa
> (a string,b string,c string)
> row format delimited
> fields terminated by ' ';
OK
Time taken: 0.19 seconds
hive> create table bb like aa;
OK
Time taken: 0.188 seconds

查看两个表的结构:

hive> describe aa;
OK
a string
b string
c string
Time taken: 0.068 seconds, Fetched: 3 row(s)
hive> describe bb;
OK
a string
b string
c string
Time taken: 0.045 seconds, Fetched: 3 row(s)

下面可以基于本地的文件,导入数据

hive> load data local inpath '/usr/tmp/aa.txt' overwrite into table aa;
Loading data to table test.aa
OK
Time taken: 0.519 seconds
hive> load data local inpath '/usr/tmp/bb.txt' overwrite into table bb;
Loading data to table test.bb
OK
Time taken: 0.321 seconds

内连接

内连接即基于on语句,仅列出表1和表2符合连接条件的数据。

hive> select * from aa a join bb b on a.c=b.a;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160824161233_f9ecefa2-e5d7-416d-8d90-e191937e7313
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2016-08-24 16:12:44 Starting to launch local task to process map join; maximum memory = 518979584
2016-08-24 16:12:47 Dump the side-table for tag: 0 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-12-33_145_337836390845333215-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
2016-08-24 16:12:47 Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-12-33_145_337836390845333215-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (332 bytes)
2016-08-24 16:12:47 End of local task; Time Taken: 3.425 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-24 16:12:50,222 Stage-3 map = 100%, reduce = 0%
Ended Job = job_local944389202_0007
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 1264 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
3 c 1 1 xxx 2
1 a 3 3 zzz 5
Time taken: 17.083 seconds, Fetched: 2 row(s)

左连接

左连接是显示左边的表的所有数据,如果有右边表与之对应,则显示;否则显示null

ive> select * from aa a left outer join bb b on a.c=b.a;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160824161637_6d540592-13fd-4f59-a2cf-0a91c0fc9533
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2016-08-24 16:16:48 Starting to launch local task to process map join; maximum memory = 518979584
2016-08-24 16:16:51 Dump the side-table for tag: 1 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-16-37_813_4572869866822819707-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
2016-08-24 16:16:51 Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-16-37_813_4572869866822819707-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (338 bytes)
2016-08-24 16:16:51 End of local task; Time Taken: 2.634 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-24 16:16:53,843 Stage-3 map = 100%, reduce = 0%
Ended Job = job_local1670258961_0008
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 1282 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1 a 3 3 zzz 5
2 b 4 NULL NULL NULL
3 c 1 1 xxx 2
Time taken: 16.048 seconds, Fetched: 3 row(s)

右连接

类似左连接,同理。

hive> select * from aa a right outer join bb b on a.c=b.a;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160824162227_5d0f0090-1a9b-4a3f-9e82-e93c4d180f4b
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2016-08-24 16:22:37 Starting to launch local task to process map join; maximum memory = 518979584
2016-08-24 16:22:40 Dump the side-table for tag: 0 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-22-27_619_7820027359528638029-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile20--.hashtable
2016-08-24 16:22:40 Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-22-27_619_7820027359528638029-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile20--.hashtable (332 bytes)
2016-08-24 16:22:40 End of local task; Time Taken: 2.368 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-24 16:22:43,060 Stage-3 map = 100%, reduce = 0%
Ended Job = job_local2001415675_0009
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 1306 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
3 c 1 1 xxx 2
NULL NULL NULL 2 yyy 3
1 a 3 3 zzz 5
Time taken: 15.483 seconds, Fetched: 3 row(s)

全连接

相当于表1和表2的数据都显示,如果没有对应的数据,则显示Null.

hive> select * from aa a full outer join bb b on a.c=b.a;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160824162252_c71b2fae-9768-4b9a-b5ad-c06d7cdb60fb
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2016-08-24 16:22:54,111 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1766586034_0010
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 4026 HDFS Write: 270 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
3 c 1 1 xxx 2
NULL NULL NULL 2 yyy 3
1 a 3 3 zzz 5
2 b 4 NULL NULL NULL
Time taken: 1.689 seconds, Fetched: 4 row(s)

左半开连接

这个比较特殊,SEMI-JOIN仅仅会显示表1的数据,即左边表的数据。但是效率会比左连接快,因为他会先拿到表1的数据,然后在表2中查找,只要查找到结果立马就返回数据。

hive> select * from aa a left semi join bb b on a.c=b.a;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160824162327_e7fc72a7-ef91-4d39-83bc-ff8159ea8816
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2016-08-24 16:23:37 Starting to launch local task to process map join; maximum memory = 518979584
2016-08-24 16:23:41 Dump the side-table for tag: 1 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-23-27_008_3026796648107813784-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile31--.hashtable
2016-08-24 16:23:41 Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-23-27_008_3026796648107813784-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile31--.hashtable (317 bytes)
2016-08-24 16:23:41 End of local task; Time Taken: 3.586 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-24 16:23:43,798 Stage-3 map = 100%, reduce = 0%
Ended Job = job_local521961878_0011
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 1366 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1 a 3
3 c 1
Time taken: 16.811 seconds, Fetched: 2 row(s)

笛卡尔积

笛卡尔积会针对表1和表2的每条数据做连接...

hive> select * from aa join bb;
Warning: Map Join MAPJOIN[9][bigTable=?] in task 'Stage-3:MAPRED' is a cross product
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160824162449_20e4b5ec-768f-48cf-a840-7d9ff360975f
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2016-08-24 16:25:00 Starting to launch local task to process map join; maximum memory = 518979584
2016-08-24 16:25:02 Dump the side-table for tag: 0 with group count: 1 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-24-49_294_2706432574075169306-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile40--.hashtable
2016-08-24 16:25:02 Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-24-49_294_2706432574075169306-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile40--.hashtable (305 bytes)
2016-08-24 16:25:02 End of local task; Time Taken: 2.892 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-24 16:25:05,677 Stage-3 map = 100%, reduce = 0%
Ended Job = job_local2068422373_0012
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 1390 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1 a 3 1 xxx 2
2 b 4 1 xxx 2
3 c 1 1 xxx 2
1 a 3 2 yyy 3
2 b 4 2 yyy 3
3 c 1 2 yyy 3
1 a 3 3 zzz 5
2 b 4 3 zzz 5
3 c 1 3 zzz 5

上面就是hive中的连接查询,其实与SQL一样的。

[Hadoop大数据]——Hive连接JOIN用例详解的更多相关文章

  1. Hive学习:Hive连接JOIN用例详解

    1 准备数据: 1.1 t_1 01 张三 02 李四 03 王五 04 马六 05 小七 06 二狗 1.2 t_2 01 11 03 33 04 44 06 66 07 77 08 88 1.3 ...

  2. [Hadoop大数据]——Hive初识

    Hive出现的背景 Hadoop提供了大数据的通用解决方案,比如存储提供了Hdfs,计算提供了MapReduce思想.但是想要写出MapReduce算法还是比较繁琐的,对于开发者来说,需要了解底层的h ...

  3. 大数据入门第七天——MapReduce详解(一)入门与简单示例

    一.概述 1.map-reduce是什么 Hadoop MapReduce is a software framework for easily writing applications which ...

  4. [Hadoop大数据]——Hive数据的导入导出

    Hive作为大数据环境下的数据仓库工具,支持基于hadoop以sql的方式执行mapreduce的任务,非常适合对大量的数据进行全量的查询分析. 本文主要讲述下hive载cli中如何导入导出数据: 导 ...

  5. [Hadoop大数据]——Hive部署入门教程

    Hive是为了解决hadoop中mapreduce编写困难,提供给熟悉sql的人使用的.只要你对SQL有一定的了解,就能通过Hive写出mapreduce的程序,而不需要去学习hadoop中的api. ...

  6. 大数据入门第七天——MapReduce详解(二)切片源码浅析与自定义patition

    一.mapTask并行度的决定机制 1.概述 一个job的map阶段并行度由客户端在提交job时决定 而客户端对map阶段并行度的规划的基本逻辑为: 将待处理数据执行逻辑切片(即按照一个特定切片大小, ...

  7. 单机,伪分布式,完全分布式-----搭建Hadoop大数据平台

    Hadoop大数据——随着计算机技术的发展,互联网的普及,信息的积累已经到了一个非常庞大的地步,信息的增长也在不断的加快.信息更是爆炸性增长,收集,检索,统计这些信息越发困难,必须使用新的技术来解决这 ...

  8. hadoop大数据平台安全基础知识入门

    概述 以 Hortonworks Data Platform (HDP) 平台为例 ,hadoop大数据平台的安全机制包括以下两个方面: 身份认证 即核实一个使用者的真实身份,一个使用者来使用大数据引 ...

  9. 【HADOOP】| 环境搭建:从零开始搭建hadoop大数据平台(单机/伪分布式)-下

    因篇幅过长,故分为两节,上节主要说明hadoop运行环境和必须的基础软件,包括VMware虚拟机软件的说明安装.Xmanager5管理软件以及CentOS操作系统的安装和基本网络配置.具体请参看: [ ...

随机推荐

  1. 代替jquery $.post 跨域提交数据的N种形式

    跨域的N种形式: 1.直接用jquery中$.getJSON进行跨域提交 优点:有返回值,可直接跨域: 缺点:数据量小: 提交方式:仅get (无$.postJSON) $.getJSON(" ...

  2. 《程序员的自我修养》读书笔记 - dllimport

    Math.c  { __declspec (dllexport)  double Add (xx, xx) {...}} MathApp.c  { __declspec(dllimport) doub ...

  3. Leetcode Palindrome Linked List

    Given a singly linked list, determine if it is a palindrome. Follow up:Could you do it in O(n) time ...

  4. jquery 的3D Carousel插件参数说明

    这个插件大家都很熟悉了,但是在网上找了很久找不到相关的资料,只有自己琢磨研究了一下.有些参数一眼都可以看出意思,在此我只说一下每个图片要想带一些扩展信息怎么处理. 1:首先需要创建一个ul对象,然后里 ...

  5. Ubuntu install codeblocks by ppa

    sudo add-apt-repository ppa:damien-moore/codeblocks-stable sudo apt-get update sudo apt-get install ...

  6. 应用TortoiseGit为github账号添加SSH keys

    每次同步或者上传代码到githun上的代码库时,需要每次都输入用户名和密码,这时我们设置一下SSH key就可以省去这些麻烦了.若果使用TortoiseGit作为github本地管理工具,Tortoi ...

  7. 关于html5新增的功能(百度)

    HTML5包含以下新增和更新功能: 1. 新增了一种HTML文档类型:<DOCTYPE html>   2. 新增了一些结构化标记的元素(<header>,<nav> ...

  8. oracle遍历表更新另一个表(一对多)

    declare cursor cur_test is select t.txt_desig, m.segment_id, s.code_type_direct, case when s.uom_dis ...

  9. java的值传递笔记

    1. 背景:开发小伙伴突然问我java是值传递还是引用传递,我说当然是值传递,只不过有时候传递一个对象时实际传递的是对象的地址值,所以让人容易产生一种引用传递的假象,貌似在李刚的疯狂java讲义有提到 ...

  10. @autowired和@resource的区别

    @Resource的作用相当于@Autowired,只不过@Autowired按byType自动注入,而@Resource默认按 byName自 动注入罢了.@Resource有两个属性是比较重要的, ...