Hive Essential (4):DML-project,filter,join,union
1. Project data with SELECT
The most common use case for Hive is to query data in Hadoop. To achieve this, we write and execute a SELECT statement. The typical work done by a SELECT statement is to project the whole row (with SELECT * ) or specified columns (with SELECT column1, column2, ... ) from a table, with or without conditions. Most simple SELECT statements do not trigger a Yarn job. Instead, a fetch task is created that just reads the data directly, much like the hdfs dfs -cat command. The SELECT statement is quite often used with the FROM and DISTINCT keywords. The FROM keyword names the table that SELECT projects data from. The DISTINCT keyword used after SELECT ensures that only unique rows or unique combinations of columns are returned from the table. In addition, SELECT also supports columns combined with user-defined functions, IF() , or a CASE WHEN THEN ELSE END statement, and regular expressions. The following are examples of projecting data with a SELECT statement:
SELECT * FROM employee; -- Project the whole row
SELECT name FROM employee; -- Project specified columns
-- List all columns that match a Java regular expression
SET hive.support.quoted.identifiers = none; -- Enable this
SELECT `^work.*` FROM employee; -- All columns start with work
SELECT DISTINCT name, work_place FROM employee;
SELECT
CASE WHEN gender_age.gender = 'Female' THEN 'Ms.'
ELSE 'Mr.'
END as title,
name,
IF(array_contains(work_place, 'New York'), 'US', 'CA') as country
FROM employee;
Multiple SELECT statements can work together to build a complex query using nested queries or common table expressions (CTEs). A nested query, also called a subquery, is a query that projects data from the result of another query. Nested queries can be rewritten as CTEs with the WITH and AS keywords. When using nested queries, an alias must be given to the inner query (see t1 in the following example), or else Hive reports an exception. The following are a few examples of using nested queries in HQL:
--1. A nested query example with the mandatory alias:
SELECT
name, gender_age.gender as gender
FROM (
SELECT * FROM employee WHERE gender_age.gender = 'Male'
) t1; -- t1 here is mandatory
--2. A nested query can be rewritten with CTE as follows.
--This is the recommended way of writing a complex single HQL query
WITH t1 as (
SELECT * FROM employee WHERE gender_age.gender = 'Male'
)
SELECT name, gender_age.gender as gender
FROM t1;
In addition, a special SELECT followed by a constant expression can work without the FROM table clause. It returns the result of the expression. This is equivalent to querying a dummy table with one dummy record.
SELECT concat('1','+','3','=',cast((1 + 3) as string)) as res;
+-------+
| res |
+-------+
| 1+3=4 |
+-------+
2. Filtering data with conditions
It is quite common to narrow down the result set by using a condition clause, such as LIMIT , WHERE , IN / NOT IN , and EXISTS / NOT EXISTS . The LIMIT keyword caps the number of rows returned; without an ORDER BY , which rows come back is arbitrary. Compared with LIMIT , WHERE is a more powerful and generic condition clause that restricts the returned result set using expressions, functions, and nested queries, as in the following examples:
SELECT name FROM employee LIMIT 2;
SELECT name, work_place FROM employee WHERE name = 'Michael';
-- All the conditions can be used together after WHERE
SELECT name, work_place FROM employee WHERE name = 'Michael' LIMIT 1;
IN / NOT IN is used as an expression to check whether values belong to a set specified by IN or NOT IN . With effect from Hive v2.1.0, IN and NOT IN statements support more than one column.
SELECT name FROM employee WHERE gender_age.age in (27, 30);
-- With multiple columns support after v2.1.0
SELECT
name, gender_age
FROM employee
WHERE (gender_age.gender, gender_age.age) IN
(('Female', 27), ('Male', 27 + 3)); -- Also support expression
In addition, filtering data can also use a subquery in the WHERE clause with IN / NOT IN and EXISTS / NOT EXISTS . A subquery that uses EXISTS or NOT EXISTS must refer to both inner and outer expressions:
SELECT
name, gender_age.gender as gender
FROM
employee
WHERE name IN
(
SELECT
name
FROM
employee
WHERE
gender_age.gender = 'Male'
);
SELECT
name, gender_age.gender as gender
FROM
employee a
WHERE EXISTS (
SELECT *
FROM
employee b
WHERE
a.gender_age.gender = b.gender_age.gender AND
b.gender_age.gender = 'Male'
); -- This is like joining tables a and b on the gender column
There are additional restrictions for subqueries used in WHERE clauses:
- Subqueries can only appear on the right-hand side of WHERE clauses
- Nested subqueries are not allowed
- IN / NOT IN in subqueries only supports a single column, although it supports multiple columns in regular (non-subquery) expressions
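The examples above cover IN and EXISTS only. As a minimal sketch of the negated form, assuming the same employee table, a NOT EXISTS subquery still has to correlate the inner and outer tables (here on name):

```sql
-- Sketch: keep employees for whom no matching 'Male' row exists.
-- The predicate a.name = b.name ties the inner query to the outer one.
SELECT
  a.name, a.gender_age.gender as gender
FROM employee a
WHERE NOT EXISTS (
  SELECT *
  FROM employee b
  WHERE a.name = b.name
    AND b.gender_age.gender = 'Male'
);
```

This behaves like an anti-join: a row from a is returned only when the correlated inner query finds nothing.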
3. Linking data with JOIN
JOIN is used to link rows from two or more tables together. Hive supports most SQL JOIN operations, such as INNER JOIN and OUTER JOIN . In addition, HQL supports some special joins, such as MapJoin and Semi-Join . Earlier versions of Hive only supported equal joins. Since v2.2.0, unequal joins are also supported. However, be careful when using an unequal join, since it is likely to return many rows by producing a Cartesian product of the joined tables. When you want to restrict the output of a join, apply a WHERE clause after the join, as JOIN occurs before the WHERE clause. If possible, put filter conditions in the join conditions rather than in the WHERE clause so that data is filtered earlier. What's more, all types of left/right joins are not commutative and are always left/right associative, while INNER and FULL OUTER JOINs are both commutative and associative.
3.1 INNER JOIN
INNER JOIN or JOIN returns rows meeting the join conditions from both sides of joined tables. The JOIN keyword can also be replaced by comma-separated table names; this is called an implicit join . Here are examples of the HQL JOIN operation:
--1. First, prepare a table to join with and load data to it:
CREATE TABLE IF NOT EXISTS employee_hr (
name string,
employee_id int,
sin_number string,
start_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';
LOAD DATA INPATH '/tmp/hivedemo/data/employee_hr.txt'
OVERWRITE INTO TABLE employee_hr;
--2. Perform an INNER JOIN between two tables with equal and unequal join
--conditions, along with complex expressions as well as a post join WHERE
--condition. Usually, we need to add a table name or table alias before columns in
--the join condition, although Hive always tries to resolve them:
SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph
ON emp.name = emph.name; -- Equal Join
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091|
| Will | 527-948-090|
| Lucy | 577-928-094|
+-----------+------------------+
SELECT
emp.name, emph.sin_number
FROM employee emp
-- Unequal join supported since v2.2.0 returns more rows
JOIN employee_hr emph
ON emp.name != emph.name;
+----------+-----------------+
| emp.name | emph.sin_number |
+----------+-----------------+
| Michael | 527-948-090|
| Michael | 647-968-598|
| Michael | 577-928-094|
| Will | 547-968-091|
| Will | 647-968-598|
| Will | 577-928-094|
| Shelley | 547-968-091|
| Shelley | 527-948-090|
| Shelley | 647-968-598|
| Shelley | 577-928-094|
| Lucy | 547-968-091|
| Lucy | 527-948-090|
| Lucy | 647-968-598|
+----------+-----------------+
-- Join with complex expression in join condition
-- This is also the way to implement conditional join
-- Below, rows with name = 'Will' are conditionally ignored
SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph
ON IF(emp.name = 'Will', '', emp.name) = CASE WHEN emph.name = 'Will' THEN '' ELSE emph.name END;
+----------+-----------------+
| emp.name | emph.sin_number |
+----------+-----------------+
| Michael | 547-968-091|
| Lucy | 577-928-094|
+----------+-----------------+
-- Use where/limit to limit the output of join
SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
WHERE emp.name = 'Will';
+----------+-----------------+
| emp.name | emph.sin_number |
+----------+-----------------+
| Will | 527-948-090|
+----------+-----------------+
--3. The JOIN operation can be performed on more tables (such as table A, B, and C) with sequence joins.
--The tables can either join from A to B and B to C, or join from A to B and A to C
SELECT
emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emp.name = empi.name;
+-----------+-------------------+------------------+
| emp.name | empi.employee_id | emph.sin_number |
+-----------+-------------------+------------------+
| Michael | 100 | 547-968-091 |
| Will | 101 | 527-948-090 |
| Lucy | 103 | 577-928-094 |
+-----------+-------------------+------------------+
--4. Self-join is where one table joins itself. When doing such joins,
--a different alias should be given to distinguish the two uses of the same table
SELECT
emp.name -- Use alias before column name
FROM employee as emp
JOIN employee as emp_b -- Here, use a different alias
ON emp.name = emp_b.name;
+-----------+
| emp.name |
+-----------+
| Michael |
| Will |
| Shelley |
| Lucy |
+-----------+
--5. Perform an implicit join without using the JOIN keyword.
--This is only applicable to the INNER JOIN
SELECT
emp.name, emph.sin_number
FROM
employee emp, employee_hr emph -- Only applies for inner join
WHERE emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Will | 527-948-090 |
| Lucy | 577-928-094 |
+-----------+------------------+
--6. The join condition uses different columns, which will create an additional job
SELECT
emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emph.employee_id = empi.employee_id;
+-----------+-------------------+------------------+
| emp.name | empi.employee_id | emph.sin_number |
+-----------+-------------------+------------------+
| Michael | 100 | 547-968-091 |
| Will | 101 | 527-948-090 |
| Lucy | 103 | 577-928-094 |
+-----------+-------------------+------------------+
If JOIN uses different columns in its conditions, it will request an additional job to complete the join. If the JOIN operation uses the same column in the join conditions, it will join on this condition using one job.
When JOIN is performed between multiple tables, Yarn/MapReduce jobs are created to process the data in HDFS. Each of the jobs is called a stage. Usually, it is suggested to put the big table at the end of the JOIN statement for better performance and to avoid Out Of Memory (OOM) exceptions. This is because the last table in the JOIN sequence is usually streamed through the reducers, whereas the others are buffered in the reducers by default. Also, a hint, /*+ STREAMTABLE(table_name) */ , can be specified to advise which table should be streamed instead of the default choice, as in the following example:
SELECT /*+ STREAMTABLE(employee_hr) */
emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emph.employee_id = empi.employee_id;
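To see how many stages a multi-table join actually produces, the query can be prefixed with EXPLAIN . This is a minimal sketch reusing the tables from the examples above; the exact plan text depends on the Hive version and execution engine.

```sql
-- Sketch: print the execution plan instead of running the query.
-- The output lists the stage dependencies followed by each stage's plan.
EXPLAIN
SELECT
  emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emph.employee_id = empi.employee_id;
```

Comparing this plan with the one for the earlier query that joins both tables on emp.name is a simple way to check whether the extra join column introduces an additional stage.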
3.2 OUTER JOIN
Besides INNER JOIN , HQL also supports regular OUTER JOIN and FULL JOIN . The logic of such a join is the same as what's in the SQL. The following table summarizes the differences between common joins. Here, we assume table_m has m rows and table_n has n rows with one-to-one mapping.
| Join type | Logic | Rows returned |
| table_m JOIN table_n | This returns all rows matched in both tables. | m ∩ n |
| table_m LEFT JOIN table_n | This returns all rows in the left table and matched rows in the right table. If there is no match in the right table, it returns NULL in the right table. | m |
| table_m RIGHT JOIN table_n | This returns all rows in the right table and matched rows in the left table. If there is no match in the left table, it returns NULL in the left table. | n |
| table_m FULL JOIN table_n | This returns all rows in both tables and matched rows in both tables. If there is no match in the left or right table, it returns NULL instead. | m + n - m ∩ n |
| table_m CROSS JOIN table_n | This returns all row combinations in both tables to produce a Cartesian product. | m*n |
The following examples demonstrate the different OUTER JOINs:
SELECT
emp.name, emph.sin_number
FROM employee emp -- All rows in left table returned
LEFT JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Will | 527-948-090 |
| Shelley | NULL | -- NULL for mismatch
| Lucy | 577-928-094 |
+-----------+------------------+
SELECT
emp.name, emph.sin_number
FROM employee emp -- All rows in right table returned
RIGHT JOIN employee_hr emph
ON emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Will | 527-948-090 |
| NULL | 647-968-598 | -- NULL for mismatch
| Lucy | 577-928-094 |
+-----------+------------------+
4 rows selected (34.485 seconds)
SELECT
emp.name, emph.sin_number
FROM employee emp -- Rows from both side returned
FULL JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Lucy | 577-928-094 |
| Michael | 547-968-091 |
| Shelley | NULL | -- NULL for mismatch
| NULL | 647-968-598 | -- NULL for mismatch
| Will | 527-948-090 |
+-----------+------------------+
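With outer joins, whether a filter sits in the ON clause or in the WHERE clause changes the result, not just the performance. The following is a minimal sketch using the same two tables; the filter value 'Will' is only an illustration.

```sql
-- Filter in ON: every row of employee is still returned;
-- sin_number is NULL for all employees except the matched 'Will' row.
SELECT
  emp.name, emph.sin_number
FROM employee emp
LEFT JOIN employee_hr emph
  ON emp.name = emph.name AND emph.name = 'Will';

-- Filter in WHERE: applied after the join, so rows whose
-- right side is NULL are dropped and only the 'Will' match remains.
SELECT
  emp.name, emph.sin_number
FROM employee emp
LEFT JOIN employee_hr emph ON emp.name = emph.name
WHERE emph.name = 'Will';
```

For an INNER JOIN both placements return the same rows, so the choice there is only about filtering data earlier.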
The CROSS JOIN statement does not have a join condition. It can also be written as a JOIN without a condition or with an always-true condition, such as 1 = 1. In this way, we can join any datasets with cross joins. However, we only consider using such joins when we have to link data that is not naturally related, such as adding headers with a row count to a table. The following are three equivalent ways of writing CROSS JOIN :
SELECT
emp.name, emph.sin_number
FROM employee as emp
CROSS JOIN
employee_hr as emph;
SELECT
emp.name, emph.sin_number
FROM employee as emp
JOIN
employee_hr as emph;
SELECT
emp.name, emph.sin_number
FROM employee as emp
JOIN
employee_hr as emph
on 1=1;
Although earlier versions of Hive did not support unequal joins explicitly, there is a workaround using CROSS JOIN and WHERE , as in this example:
SELECT
emp.name, emph.sin_number
FROM employee emp
CROSS JOIN employee_hr emph
WHERE emp.name <> emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 527-948-090 |
| Michael | 647-968-598 |
| Michael | 577-928-094 |
| Will | 547-968-091 |
| Will | 647-968-598 |
| Will | 577-928-094 |
| Shelley | 547-968-091 |
| Shelley | 527-948-090 |
| Shelley | 647-968-598 |
| Shelley | 577-928-094 |
| Lucy | 547-968-091 |
| Lucy | 527-948-090 |
| Lucy | 647-968-598 |
+-----------+------------------+
3.3 Special joins
HQL also supports some special joins that we usually do not see in relational databases, such as MapJoin and Semi-join .
MapJoin means doing the join operation in the map phase only, without a reduce job. The MapJoin statement reads all the data from the small table into memory and broadcasts it to all mappers. During the map phase, the join is performed by comparing each row of the big table with the small tables against the join conditions. Because no reduce is needed, this kind of join usually has better performance. In newer versions of Hive, Hive automatically converts a join to a MapJoin at runtime if possible. However, you can also manually specify the broadcast table by providing a join hint, /*+ MAPJOIN(table_name) */ . In addition, MapJoin can be used for unequal joins to improve performance, since both MapJoin and WHERE are performed in the map phase. The following is an example of using a MapJoin hint with CROSS JOIN :
SELECT
/*+ MAPJOIN(employee) */ emp.name, emph.sin_number
FROM employee as emp
CROSS JOIN
employee_hr as emph
WHERE emp.name <> emph.name;
The MapJoin operation does not support the following:
- Using MapJoin after UNION ALL , LATERAL VIEW , GROUP BY / JOIN / SORT BY / CLUSTER BY / DISTRIBUTE BY
- Using MapJoin before UNION , JOIN , and another MapJoin
Bucket MapJoin is a special type of MapJoin that uses bucket columns (the columns specified by CLUSTERED BY in the CREATE TABLE statement) as the join condition. Instead of fetching the whole table, as done by the regular MapJoin , bucket MapJoin only fetches the required bucket data. To enable bucket MapJoin , we need to enable some settings and make sure the bucket numbers of the joined tables are multiples of each other. If both joined tables are sorted and bucketed with the same number of buckets, a sort-merge join can be performed instead of caching all small tables in memory:
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
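The settings above only take effect when the joined tables are actually bucketed on the join column. The following is a minimal sketch of what such tables might look like; the table names employee_id_bkt and employee_hr_bkt , their columns, and the bucket count of 4 are hypothetical and chosen only for illustration.

```sql
-- Hypothetical bucketed tables (names and columns are illustrative only).
-- Both are bucketed and sorted on employee_id with the same bucket count.
CREATE TABLE IF NOT EXISTS employee_id_bkt (
  name string,
  employee_id int
)
CLUSTERED BY (employee_id) SORTED BY (employee_id) INTO 4 BUCKETS;

CREATE TABLE IF NOT EXISTS employee_hr_bkt (
  name string,
  employee_id int,
  sin_number string,
  start_date date
)
CLUSTERED BY (employee_id) SORTED BY (employee_id) INTO 4 BUCKETS;

-- Join on the bucket column so that matching buckets can be paired up
SELECT
  i.name, h.sin_number
FROM employee_id_bkt i
JOIN employee_hr_bkt h ON i.employee_id = h.employee_id;
```

The tables would need to be populated with INSERT ... SELECT rather than LOAD DATA , so that Hive itself writes the rows into the bucket files.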
In addition, the LEFT SEMI JOIN statement is also a type of MapJoin . Since v0.13.0 of Hive, it is equivalent to a subquery with IN / EXISTS . However, it is not recommended for use since it is not part of standard SQL:
SELECT a.name
FROM employee as a
LEFT SEMI JOIN
employee_id as b
ON a.name = b.name;
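For comparison, here is a minimal sketch of the same semi-join written with the IN and EXISTS subquery forms, assuming the same employee and employee_id tables:

```sql
-- Sketch: IN subquery form of the LEFT SEMI JOIN above
SELECT a.name
FROM employee a
WHERE a.name IN (
  SELECT b.name
  FROM employee_id b
);

-- Sketch: EXISTS subquery form, correlated on name
SELECT a.name
FROM employee a
WHERE EXISTS (
  SELECT *
  FROM employee_id b
  WHERE a.name = b.name
);
```

As with LEFT SEMI JOIN , only columns from the left table a can appear in the outer SELECT list.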
4. Union
When we want to combine data with the same schema, we often use set operations. Regular set operations in relational databases are INTERSECT , MINUS , and UNION / UNION ALL . HQL only supports UNION and UNION ALL . The difference between them is that UNION ALL does not remove duplicate rows while UNION does. In addition, all unioned data must have the same column names and data types, or else an implicit conversion will be done, which may cause a runtime exception. If ORDER BY , SORT BY , CLUSTER BY , DISTRIBUTE BY , or LIMIT are used, they are applied to the whole result set after the union:
SELECT a.name as nm
FROM employee a
UNION ALL -- Use column alias to make the same name for union
SELECT b.name as nm
FROM employee_hr b;
+-----------+
|nm |
+-----------+
| Michael |
| Will |
| Shelley |
| Lucy |
| Michael |
| Will |
| Steven |
| Lucy |
+-----------+
SELECT a.name as nm FROM employee a
UNION -- UNION removes duplicate names and is slower
SELECT b.name as nm FROM employee_hr b;
+----------+
|nm |
+----------+
| Lucy |
| Michael |
| Shelley |
| Steven |
| Will |
+----------+
-- Order by applies to the unioned data
-- When you want to order only one data set,
-- Use order in the subquery
SELECT a.name as nm FROM employee a
UNION ALL
SELECT b.name as nm FROM employee_hr b
ORDER BY nm;
+----------+
|nm |
+----------+
| Lucy |
| Lucy |
| Michael |
| Michael |
| Shelley |
| Steven |
| Will |
| Will |
+----------+
For other set operations that HQL does not support yet, such as INTERSECT and MINUS , we can use JOIN or LEFT JOIN to implement them as follows:
-- Use join for set intersect
SELECT a.name
FROM employee a
JOIN employee_hr b
ON a.name = b.name;
+----------+
| a.name |
+----------+
| Michael |
| Will |
| Lucy |
+----------+
-- Use left join for set minus
SELECT a.name
FROM employee a
LEFT JOIN employee_hr b
ON a.name = b.name
WHERE b.name IS NULL;
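The join-based rewrites above keep duplicate rows, whereas INTERSECT and MINUS in most relational databases return distinct rows. A minimal sketch, assuming name is never NULL in either table, adds DISTINCT to get closer to set semantics:

```sql
-- Sketch: set intersect, one row per distinct name
SELECT DISTINCT a.name
FROM employee a
JOIN employee_hr b
ON a.name = b.name;

-- Sketch: set minus, one row per distinct name
SELECT DISTINCT a.name
FROM employee a
LEFT JOIN employee_hr b
ON a.name = b.name
WHERE b.name IS NULL;
```

With the sample data each name appears only once per table, so DISTINCT makes no visible difference there; it matters once either side contains repeated names.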