Hive Joins

Join Syntax

Hive supports the following syntax for joining tables:

join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)
 
table_reference:
    table_factor
  | join_table
 
table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )
 
join_condition:
    ON equality_expression ( AND equality_expression )*
 
equality_expression:
    expression = expression

Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job. Also, more than two tables can be joined in Hive.

Hive 暂时只支持等值连接。

See Select Syntax for the context of this join syntax.

Version 0.13.0+: Implicit join notation

Icon

Implicit join notation is supported starting with Hive 0.13.0 (see HIVE-5558). This allows the FROM clause to join a comma-separated list of tables, omitting the JOIN keyword. For example:

SELECT * 
FROM table1 t1, table2 t2, table3 t3 
WHERE t1.id = t2.id AND t2.id = t3.id AND t1.zipcode = '02535';

Version 0.13.0+: Unqualified column references

Icon

Unqualified column references are supported in join conditions, starting with Hive 0.13.0 (see HIVE-6393). Hive attempts to resolve these against the inputs to a Join. If an unqualified column reference resolves to more than one table, Hive will flag it as an ambiguous reference.

For example:

CREATE TABLE a (k1 string, v1 string);
CREATE TABLE b (k2 string, v2 string);

SELECT k1, v1, k2, v2
FROM a JOIN b ON k1 = k2; 

Examples

Some salient points to consider when writing join queries are as follows:

  • Only equality joins are allowed e.g.

    SELECT a.* FROM a JOIN b ON (a.id = b.id)
    SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)

    are both valid joins, however

    SELECT a.* FROM a JOIN b ON (a.id <> b.id)

    is NOT allowed.

  • More than 2 tables can be joined in the same query e.g.

    SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

    is a valid join.

  • Hive converts joins over multiple tables into a single map/reduce job if for every table the same column is used in the join clauses e.g.

    SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

    is converted into a single map/reduce job as only key1 column for b is involved in the join. On the other hand

    SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

    is converted into two map/reduce jobs because key1 column from b is used in the first join condition and key2 column from b is used in the second one. The first map/reduce job joins a with b and the results are then joined with c in the second map/reduce job.

  • In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers where as the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest tables appear last in the sequence. e.g. in

    SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

    all the three tables are joined in a single map/reduce job and the values for a particular value of the key for tables a and b are buffered in the memory in the reducers. Then for each row retrieved from c, the join is computed with the buffered rows. Similarly for

    SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

    there are two map/reduce jobs involved in computing the join. The first of these joins a with b and buffers the values of a while streaming the values of b in the reducers. The second of one of these jobs buffers the results of the first join while streaming the values of c through the reducers.

  • In every map/reduce stage of the join, the table to be streamed can be specified via a hint. e.g. in

    SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

    all the three tables are joined in a single map/reduce job and the values for a particular value of the key for tables b and c are buffered in the memory in the reducers. Then for each row retrieved from a, the join is computed with the buffered rows. If the STREAMTABLE hint is omitted, Hive streams the rightmost table in the join.

  • LEFT, RIGHT, and FULL OUTER joins exist in order to provide more control over ON clauses for which there is no match. For example, this query:

    SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)

    will return a row for every row in a. This output row will be a.val,b.val when there is a b.key that equals a.key, and the output row will be a.val,NULL when there is no corresponding b.key. Rows from b which have no corresponding a.key will be dropped. The syntax "FROM a LEFT OUTER JOIN b" must be written on one line in order to understand how it works--a is to the LEFT of b in this query, and so all rows from a are kept; a RIGHT OUTER JOIN will keep all rows from b, and a FULL OUTER JOIN will keep all rows from a and all rows from b. OUTER JOIN semantics should conform to standard SQL specs.

  • Joins occur BEFORE WHERE CLAUSES. So, if you want to restrict the OUTPUT of a join, a requirement should be in the WHERE clause, otherwise it should be in the JOIN clause. A big point of confusion for this issue is partitioned tables:

    SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)
    WHERE a.ds='2009-07-07' AND b.ds='2009-07-07'

    will join a on b, producing a list of a.val and b.val. The WHERE clause, however, can also reference other columns of a and b that are in the output of the join, and then filter them out. However, whenever a row from the JOIN has found a key for a and no key for b, all of the columns of b will be NULL,including the ds column. This is to say, you will filter out all rows of join output for which there was no valid b.key, and thus you have outsmarted your LEFT OUTER requirement. In other words, the LEFT OUTER part of the join is irrelevant if you reference any column of b in the WHERE clause. Instead, when OUTER JOINing, use this syntax:

    SELECT a.val, b.val FROM a LEFT OUTER JOIN b
    ON (a.key=b.key AND b.ds='2009-07-07' AND a.ds='2009-07-07')

    ..the result is that the output of the join is pre-filtered, and you won't get post-filtering trouble for rows that have a valid a.key but no matching b.key. The same logic applies to RIGHT and FULL joins.

  • Joins are NOT commutative! Joins are left-associative regardless of whether they are LEFT or RIGHT joins.

    SELECT a.val1, a.val2, b.val, c.val
    FROM a
    JOIN b ON (a.key = b.key)
    LEFT OUTER JOIN c ON (a.key = c.key)

    ...first joins a on b, throwing away everything in a or b that does not have a corresponding key in the other table. The reduced table is then joined on c. This provides unintuitive results if there is a key that exists in both a and c but not b: The whole row (including a.val1, a.val2, and a.key) is dropped in the "a JOIN b" step because it is not in b. The result does not have a.key in it, so when it is LEFT OUTER JOINed with c, c.val does not make it in because there is no c.key that matches an a.key (because that row from a was removed). Similarly, if this were a RIGHT OUTER JOIN (instead of LEFT), we would end up with an even weirder effect: NULL, NULL, NULL, c.val, because even though we specified a.key=c.key as the join key, we dropped all rows of a that did not match the first JOIN.
    To achieve the more intuitive effect, we should instead do FROM c LEFT OUTER JOIN a ON (c.key = a.key) LEFT OUTER JOIN b ON (c.key = b.key).

  • LEFT SEMI JOIN implements the uncorrelated IN/EXISTS subquery semantics in an efficient way.(Hive0.13开始支持In,Not In /Exists, Not Exists等操作) As of Hive 0.13 the IN/NOT IN/EXISTS/NOT EXISTS operators are supported using subqueries so most of these JOINs don't have to be performed manually anymore. (所有这些子查询再也不必手动操作了)。The restrictions of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc. (SEMI JOIN的限制是,右边的表必须ON子句中而不是Where条件子句中)

    SELECT a.key, a.value
    FROM a
    WHERE a.key in
     (SELECT b.key
      FROM B);

    can be rewritten to:

    SELECT a.key, a.val
    FROM a LEFT SEMI JOIN b ON (a.key = b.key)
  • If all but one of the tables being joined are small, the join can be performed as a map only job. (如果所有的表都是小表,只有一个大表的时候,可以使用MapJoin)The query

    SELECT /*+ MAPJOIN(b) */ a.key, a.value
    FROM a JOIN b ON a.key = b.key

    does not need a reducer. For every mapper of A, B is read completely. The restriction is that a FULL/RIGHT OUTER JOIN b cannot be performed.

  • If the tables being joined are bucketized on the join columns, and the number of buckets in one table is a multiple of the number of buckets in the other table, the buckets can be joined with each other. If table A has 4 buckets and table B has 4 buckets, the following join

    SELECT /*+ MAPJOIN(b) */ a.key, a.value
    FROM a JOIN b ON a.key = b.key

    can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched. For the query above, the mapper processing bucket 1 for A will only fetch bucket 1 of B. It is not the default behavior, and is governed by the following parameter

    set hive.optimize.bucketmapjoin = true
  • If the tables being joined are sorted and bucketized on the join columns, and they have the same number of buckets, a sort-merge join can be performed. The corresponding buckets are joined with each other at the mapper. If both A and B have 4 buckets,

    SELECT /*+ MAPJOIN(b) */ a.key, a.value
    FROM A a JOIN B b ON a.key = b.key

    can be done on the mapper only. The mapper for the bucket for A will traverse the corresponding bucket for B. This is not the default behavior, and the following parameters need to be set:

    set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
    set hive.optimize.bucketmapjoin = true;
    set hive.optimize.bucketmapjoin.sortedmerge = true;

MapJoin Restrictions ??

  • If all but one of the tables being joined are small, the join can be performed as a map only job. The query

    SELECT /*+ MAPJOIN(b) */ a.key, a.value
    FROM a JOIN b ON a.key = b.key

    does not need a reducer. For every mapper of A, B is read completely.

  • The following is not supported.

    • Union Followed by a MapJoin
    • Lateral View Followed by a MapJoin
    • Reduce Sink (Group By/Join/Sort By/Cluster By/Distribute By) Followed by MapJoin
    • MapJoin Followed by Union
    • MapJoin Followed by Join
    • MapJoin Followed by MapJoin
  • The configuration variable hive.auto.convert.join (if set to true) automatically converts the joins to mapjoins at runtime if possible, and it should be used instead of the mapjoin hint. The mapjoin hint should only be used for the following query.

    • If all the inputs are bucketed or sorted, and the join should be converted to a bucketized map-side join or bucketized sort-merge join.
  • Consider the possibility of multiple mapjoins on different keys:

    select /*+MAPJOIN(smallTableTwo)*/ idOne, idTwo, value FROM
      ( select /*+MAPJOIN(smallTableOne)*/ idOne, idTwo, value FROM
        bigTable JOIN smallTableOne on (bigTable.idOne = smallTableOne.idOne)                                                  
      ) firstjoin                                                            
      JOIN                                                                 
      smallTableTwo ON (firstjoin.idTwo = smallTableTwo.idTwo)                      

    The above query is not supported. Without the mapjoin hint, the above query would be executed as 2 map-only jobs. If the user knows in advance that the inputs are small enough to fit in memory, the following configurable parameters can be used to make sure that the query executes in a single map-reduce job.

    • hive.auto.convert.join.noconditionaltask - Whether Hive enable the optimization about converting common join into mapjoin based on the input file size. If this paramater is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the specified size, the join is directly converted to a mapjoin (there is no conditional task).
    • hive.auto.convert.join.noconditionaltask.size - If hive.auto.convert.join.noconditionaltask is off, this parameter does not take affect. However, if it is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than this size, the join is directly converted to a mapjoin(there is no conditional task). The default is 10MB.

Join Optimization 连接优化

Predicate Pushdown in Outer Joins

See Hive Outer Join Behavior for information about predicate pushdown in outer joins.

Enhancements in Hive Version 0.11

See Join Optimization for information about enhancements to join optimization introduced in Hive version 0.11.0. The use of hints is de-emphasized in the enhanced optimizations (HIVE-3784 and related JIRAs).

 

[HIve - LanguageManual] Joins的更多相关文章

  1. [Hive - LanguageManual ] Windowing and Analytics Functions (待)

    LanguageManual WindowingAndAnalytics     Skip to end of metadata   Added by Lefty Leverenz, last edi ...

  2. [HIve - LanguageManual] LateralView

    Lateral View Syntax Description Example Multiple Lateral Views Outer Lateral Views Lateral View Synt ...

  3. [HIve - LanguageManual] Union

    Union Syntax select_statement UNION ALL select_statement UNION ALL select_statement ... UNION is use ...

  4. [HIve - LanguageManual] Join Optimization (不懂)

    Join Optimization Join Optimization Improvements to the Hive Optimizer Star Join Optimization Star S ...

  5. [HIve - LanguageManual] Hive Operators and User-Defined Functions (UDFs)

    Hive Operators and User-Defined Functions (UDFs) Hive Operators and User-Defined Functions (UDFs) Bu ...

  6. [Hive - LanguageManual] Import/Export

    LanguageManual ImportExport     Skip to end of metadata   Added by Carl Steinbach, last edited by Le ...

  7. [Hive - LanguageManual] DML: Load, Insert, Update, Delete

    LanguageManual DML Hive Data Manipulation Language Hive Data Manipulation Language Loading files int ...

  8. [Hive - LanguageManual] Alter Table/Partition/Column

    Alter Table/Partition/Column Alter Table Rename Table Alter Table Properties Alter Table Comment Add ...

  9. Hive LanguageManual DDL

    hive语法规则LanguageManual DDL SQL DML 和 DDL 数据操作语言 (DML) 和 数据定义语言 (DDL) 一.数据库 增删改都在文档里说得也很明白,不重复造车轮 二.表 ...

随机推荐

  1. C++:运算符重载函数之成员运算符重载函数

    5.2.3 成员运算符重载函数 在C++中可以把运算符重载函数定义为某个类的成员函数,称之为成员运算符重载函数. 1. 定义成员运算符重载函数的语法形式 (1)在类的内部,定义成员运算符重载函数的格式 ...

  2. Android title和actionbar的区别

    我想在一个页面的顶端放入两个按钮,应该用title还是actionbar.他们两个什么区别?分别该什么时候用? 答: android title 是UI上的一小部分,它支持Text和Color,你可以 ...

  3. Java API —— HashMap类 & LinkedHashMap类

    1.HashMap类 1)HashMap类概述         键是哈希表结构,可以保证键的唯一性 2)HashMap案例         HashMap<String,String>   ...

  4. 在CentOS 6.X 上面安装 Python 2.7.X

    在CentOS 6.X 上面安装 Python 2.7.X CentOS 6.X 自带的python版本是 2.6 , 由于工作需要,很多时候需要2.7版本.所以需要进行版本升级.由于一些系统工具和服 ...

  5. [转载]深入理解JAVA的接口和抽象类

    深入理解Java的接口和抽象类 对于面向对象编程来说,抽象是它的一大特征之一.在Java中,可以通过两种形式来体现OOP的抽象:接口和抽象类.这两者有太多相似的地方,又有太多不同的地方.很多人在初学的 ...

  6. hive学习笔记——表的基本的操作

    1.hive的数据加载方式 1.1.load data 这中方式一般用于初始化的时候 load data [local] inpath '...' [overwrite] into table t1 ...

  7. BZOJ 1297 迷路(矩阵)

    题目链接:http://61.187.179.132/JudgeOnline/problem.php?id=1297 题意:给出一个带权有向图,权值为1-9,顶点个数最多为10.从1出发恰好在T时刻到 ...

  8. firebug的使用方法和技巧(web开发调试工具)

    Firebug是firefox下的一个插件,能够调试所有网站语言,如Html,Css等,但FireBug最吸引我的就是javascript调试功 能,使用起来非常方便,而且在各种浏览器下都能使用(IE ...

  9. Windows JAVA 环境配置

    Java SE Development Kit Downloads http://www.oracle.com/technetwork/java/javase/overview/index.html ...

  10. yeoman开始项目

    使用 yeoman 构建项目之前,你需要安装这两个环境:node,ruby. 为什么需要使用node?因为我们需要使用grunt自动化工具,而grunt工具则是依赖node. 为什么需要使用ruby? ...