Hive Essential (4): DML - project, filter, join, union
1. Project data with SELECT
The most common use case for Hive is to query data in Hadoop. To achieve this, we need to write and execute a SELECT statement. The typical work done by the SELECT statement is to project the whole row (with SELECT *) or specified columns (with SELECT column1, column2, ...) from a table, with or without conditions. Most simple SELECT statements will not trigger a YARN job. Instead, a fetch task is created just to dump the data, much like the hdfs dfs -cat command.
The SELECT statement is quite often used with the FROM and DISTINCT keywords. The FROM keyword followed by a table name specifies where SELECT projects data from. The DISTINCT keyword used after SELECT ensures that only unique rows or unique combinations of columns are returned from the table. In addition, SELECT also supports columns combined with user-defined functions, IF(), or a CASE WHEN THEN ELSE END statement, and regular expressions. The following are examples of projecting data with a SELECT statement:
SELECT * FROM employee; -- Project the whole row
SELECT name FROM employee; -- Project specified columns
-- List all columns that match a Java regular expression
SET hive.support.quoted.identifiers = none; -- Enable this
SELECT `^work.*` FROM employee; -- All columns start with work
SELECT DISTINCT name, work_place FROM employee;
SELECT
CASE WHEN gender_age.gender = 'Female' THEN 'Ms.'
ELSE 'Mr.'
END as title,
name,
IF(array_contains(work_place, 'New York'), 'US', 'CA') as country
FROM employee;
Multiple SELECT statements can work together to build a complex query using nested queries or a common table expression (CTE). A nested query, which is also called a subquery, is a query projecting data from the result of another query. Nested queries can be rewritten as a CTE with the WITH and AS keywords. When using nested queries, an alias must be given to the inner query (see t1 in the following example), or else Hive will report an exception. The following are a few examples of using nested queries in HQL:
--1. A nested query example with the mandatory alias:
SELECT
name, gender_age.gender as gender
FROM (
SELECT * FROM employee WHERE gender_age.gender = 'Male'
) t1; -- t1 here is mandatory
--2. A nested query can be rewritten with CTE as follows.
--This is the recommended way of writing a complex single HQL query
WITH t1 as (
SELECT * FROM employee WHERE gender_age.gender = 'Male'
)
SELECT name, gender_age.gender as gender
FROM t1;
In addition, a special SELECT followed by a constant expression can work without the FROM table clause. It returns the result of the expression. This is equivalent to querying a dummy table with one dummy record.
SELECT concat('1','+','3','=',cast((1 + 3) as string)) as res;
+-------+
| res |
+-------+
| 1+3=4 |
+-------+
2. Filtering data with conditions
It is quite common to narrow down the result set by using a condition clause, such as LIMIT, WHERE, IN / NOT IN, and EXISTS / NOT EXISTS. The LIMIT keyword caps the number of rows returned; without an ORDER BY, which rows come back is arbitrary. Compared with LIMIT, WHERE is a more powerful and generic condition clause that limits the returned result set by expressions, functions, and nested queries, as in the following examples:
SELECT name FROM employee LIMIT 2;
SELECT name, work_place FROM employee WHERE name = 'Michael';
-- All the conditions can be used together, placed after WHERE
SELECT name, work_place FROM employee WHERE name = 'Michael' LIMIT 1;
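Because LIMIT on its own gives no guarantee about which rows come back, a common way to make the result repeatable is to pair it with ORDER BY. The following is a minimal sketch against the employee table used in this article; sorting by name is only an illustrative choice:

```sql
-- Sort first so that LIMIT always keeps the same two rows
SELECT name
FROM employee
ORDER BY name
LIMIT 2;
```

Note that ORDER BY forces a global sort, so unlike the plain LIMIT query this one will generally launch a job rather than a simple fetch task.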
IN / NOT IN is used as an expression to check whether values belong to a set specified by IN or NOT IN . With effect from Hive v2.1.0, IN and NOT IN statements support more than one column.
SELECT name FROM employee WHERE gender_age.age in (27, 30);
-- Multiple columns are supported since v2.1.0
SELECT
name, gender_age
FROM employee
WHERE (gender_age.gender, gender_age.age) IN
(('Female', 27), ('Male', 27 + 3)); -- Also support expression
In addition, filtering data can also use a subquery in the WHERE clause with IN / NOT IN and EXISTS / NOT EXISTS . A subquery that uses EXISTS or NOT EXISTS must refer to both inner and outer expressions:
SELECT
name, gender_age.gender as gender
FROM
employee
WHERE name IN
(
SELECT
name
FROM
employee
WHERE
gender_age.gender = 'Male'
);
SELECT
name, gender_age.gender as gender
FROM
employee a
WHERE EXISTS (
SELECT *
FROM
employee b
WHERE
a.gender_age.gender = b.gender_age.gender AND
b.gender_age.gender = 'Male'
); -- This is like joining tables a and b on the gender column
There are additional restrictions for subqueries used in WHERE clauses:
- Subqueries can only appear on the right-hand side of expressions in WHERE clauses
- Nested subqueries are not allowed
- IN / NOT IN with a subquery only supports a single column, although more columns are supported in regular (non-subquery) expressions
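The examples above only show IN and EXISTS. As a sketch of the negated forms, the queries below look for employees that have no matching row in employee_hr (a table that is created in the JOIN section later in this article). Assuming the sample data shown in this article, both should return only Shelley:

```sql
-- NOT EXISTS: the subquery refers to both the outer (a) and inner (b) table
SELECT a.name
FROM employee a
WHERE NOT EXISTS (
  SELECT b.name
  FROM employee_hr b
  WHERE a.name = b.name
);

-- NOT IN: the subquery returns a single column and is not correlated
SELECT a.name
FROM employee a
WHERE a.name NOT IN (
  SELECT b.name
  FROM employee_hr b
);
```

Keep in mind that, as in standard SQL, NOT IN returns no rows at all if the subquery result contains a NULL, so NOT EXISTS is often the safer choice.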
3. Linking data with JOIN
JOIN is used to link rows from two or more tables together. Hive supports most SQL JOIN operations, such as INNER JOIN and OUTER JOIN. In addition, HQL supports some special joins, such as MapJoin and Semi-Join. Earlier versions of Hive only supported equal joins. Since v2.2.0, unequal joins are also supported. However, be careful when using an unequal join unless you know what to expect, since it is likely to return many rows by producing a Cartesian product of the joined tables. When you want to restrict the output of a join, apply a WHERE clause after the join, as JOIN occurs before the WHERE clause. If possible, put filter conditions in the join conditions rather than in the WHERE conditions so that data is filtered earlier. What's more, all types of left/right joins are not commutative and are always left/right associative, while INNER and FULL OUTER JOINs are both commutative and associative.
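To illustrate the advice about filtering early, here is a sketch that moves a filter from the WHERE clause into the join condition. It uses the employee and employee_hr tables from the examples in this article; the filter value is only illustrative. For an INNER JOIN, both forms return the same rows:

```sql
-- Filter applied after the join
SELECT emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
WHERE emph.name = 'Will';

-- Same filter placed in the join condition
SELECT emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph
  ON emp.name = emph.name AND emph.name = 'Will';
```

This equivalence only holds for inner joins. With an outer join, a filter in the ON clause and the same filter in the WHERE clause can return different rows, so check the result before moving conditions around.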
3.1 INNER JOIN
INNER JOIN or JOIN returns rows meeting the join conditions from both sides of the joined tables. The JOIN keyword can also be replaced by comma-separated table names; this is called an implicit join. Here are examples of the HQL JOIN operation:
--1. First, prepare a table to join with and load data to it:
CREATE TABLE IF NOT EXISTS employee_hr (
name string,
employee_id int,
sin_number string,
start_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';
LOAD DATA INPATH '/tmp/hivedemo/data/employee_hr.txt'
OVERWRITE INTO TABLE employee_hr;
--2. Perform an INNER JOIN between two tables with equal and unequal join
--conditions, along with complex expressions as well as a post join WHERE
--condition. Usually, we need to add a table name or table alias before columns in
--the join condition, although Hive always tries to resolve them:
SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph
ON emp.name = emph.name; -- Equal Join
+-----------+------------------+
| emp.name  | emph.sin_number  |
+-----------+------------------+
| Michael   | 547-968-091      |
| Will      | 527-948-090      |
| Lucy      | 577-928-094      |
+-----------+------------------+
SELECT
emp.name, emph.sin_number
FROM employee emp
-- Unequal join supported since v2.2.0 returns more rows
JOIN employee_hr emph
ON emp.name != emph.name;
+----------+-----------------+
| emp.name | emph.sin_number |
+----------+-----------------+
| Michael  | 527-948-090     |
| Michael  | 647-968-598     |
| Michael  | 577-928-094     |
| Will     | 547-968-091     |
| Will     | 647-968-598     |
| Will     | 577-928-094     |
| Shelley  | 547-968-091     |
| Shelley  | 527-948-090     |
| Shelley  | 647-968-598     |
| Shelley  | 577-928-094     |
| Lucy     | 547-968-091     |
| Lucy     | 527-948-090     |
| Lucy     | 647-968-598     |
+----------+-----------------+
-- Join with complex expression in join condition
-- This is also the way to implement conditional join
-- Below, the row with name = 'Will' is conditionally ignored
SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph
ON IF(emp.name = 'Will', '1', emp.name) = CASE WHEN emph.name = 'Will' THEN '0' ELSE emph.name END;
+----------+-----------------+
| emp.name | emph.sin_number |
+----------+-----------------+
| Michael  | 547-968-091     |
| Lucy     | 577-928-094     |
+----------+-----------------+
-- Use where/limit to limit the output of join
SELECT
emp.name, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
WHERE emp.name = 'Will';
+----------+-----------------+
| emp.name | emph.sin_number |
+----------+-----------------+
| Will     | 527-948-090     |
+----------+-----------------+
--3. The JOIN operation can be performed on more tables (such as table A, B, and C) with sequence joins.
--The tables can either join from A to B and B to C, or join from A to B and A to C
SELECT
emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emp.name = empi.name;
+-----------+-------------------+------------------+
| emp.name | empi.employee_id | emph.sin_number |
+-----------+-------------------+------------------+
| Michael | 100 | 547-968-091 |
| Will | 101 | 527-948-090 |
| Lucy | 103 | 577-928-094 |
+-----------+-------------------+------------------+
--4. Self-join is where one table joins itself. When doing such joins,
--a different alias should be given to distinguish the same table
SELECT
emp.name -- Use alias before column name
FROM employee as emp
JOIN employee as emp_b -- Here, use a different alias
ON emp.name = emp_b.name;
+-----------+
| emp.name |
+-----------+
| Michael |
| Will |
| Shelley |
| Lucy |
+-----------+
--5. Perform an implicit join without using the JOIN keyword.
--This is only applicable to the INNER JOIN
SELECT
emp.name, emph.sin_number
FROM
employee emp, employee_hr emph -- Only applies for inner join
WHERE emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Will | 527-948-090 |
| Lucy | 577-928-094 |
+-----------+------------------+
--6. The join condition uses different columns, which will create an additional job
SELECT
emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emph.employee_id = empi.employee_id;
+-----------+-------------------+------------------+
| emp.name | empi.employee_id | emph.sin_number |
+-----------+-------------------+------------------+
| Michael | 100 | 547-968-091 |
| Will | 101 | 527-948-090 |
| Lucy | 103 | 577-928-094 |
+-----------+-------------------+------------------+
If JOIN uses different columns in its conditions, it will request an additional job to complete the join. If the JOIN operation uses the same column in the join conditions, it will join on this condition using one job.
When JOIN is performed between multiple tables, YARN/MapReduce jobs are created to process the data in HDFS. Each of the jobs is called a stage. Usually, it is suggested to put the big table at the end of the JOIN statement for better performance and to avoid Out Of Memory (OOM) exceptions. This is because the last table in the JOIN sequence is usually streamed through the reducers, whereas the others are buffered in the reducers by default. Also, a hint, /*+ STREAMTABLE(table_name) */, can be specified to override the default decision about which table is streamed, as in the following example:
SELECT /*+ STREAMTABLE(employee_hr) */
emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emph.employee_id = empi.employee_id;
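One way to check how many stages a join needs is to prefix the query with EXPLAIN, which prints the execution plan instead of running the query. The following is a sketch using the three-table join above; the exact plan text depends on the Hive version and execution engine, so no output is shown here:

```sql
-- Print the plan (stages and their dependencies) without executing the join
EXPLAIN
SELECT emp.name, empi.employee_id, emph.sin_number
FROM employee emp
JOIN employee_hr emph ON emp.name = emph.name
JOIN employee_id empi ON emph.employee_id = empi.employee_id;
```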
3.2 OUTER JOIN
Besides INNER JOIN , HQL also supports regular OUTER JOIN and FULL JOIN . The logic of such a join is the same as what's in the SQL. The following table summarizes the differences between common joins. Here, we assume table_m has m rows and table_n has n rows with one-to-one mapping.
| Join type | Logic | Rows returned |
| table_m JOIN table_n | This returns all rows matched in both tables. | m ∩ n |
| table_m LEFT JOIN table_n | This returns all rows in the left table and matched rows in the right table. If there is no match in the right table, it returns NULL in the right table. | m |
| table_m RIGHT JOIN table_n | This returns all rows in the right table and matched rows in the left table. If there is no match in the left table, it returns NULL in the left table. | n |
| table_m FULL JOIN table_n | This returns all rows in both tables and matched rows in both tables. If there is no match in the left or right table, it returns NULL instead. | m + n - m ∩ n |
| table_m CROSS JOIN table_n | This returns all row combinations in both the tables to produce a Cartesian product. | m*n |
The following examples demonstrate the different OUTER JOINs:
SELECT
emp.name, emph.sin_number
FROM employee emp -- All rows in left table returned
LEFT JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Will | 527-948-090 |
| Shelley | NULL | -- NULL for mismatch
| Lucy | 577-928-094 |
+-----------+------------------+
SELECT
emp.name, emph.sin_number
FROM employee emp -- All rows in right table returned
RIGHT JOIN employee_hr emph
ON emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 547-968-091 |
| Will | 527-948-090 |
| NULL | 647-968-598 | -- NULL for mismatch
| Lucy | 577-928-094 |
+-----------+------------------+
4 rows selected (34.485 seconds)
SELECT
emp.name, emph.sin_number
FROM employee emp -- Rows from both side returned
FULL JOIN employee_hr emph ON emp.name = emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Lucy | 577-928-094 |
| Michael | 547-968-091 |
| Shelley | NULL | -- NULL for mismatch
| NULL | 647-968-598 | -- NULL for mismatch
| Will | 527-948-090 |
+-----------+------------------+
The CROSS JOIN statement does not have a join condition. It can also be written using JOIN without a condition or with an always-true condition, such as 1 = 1. In this way, we can join any datasets with cross joins. However, we should only consider such joins when we have to link data that has no natural relationship, such as adding headers with a row count to a table. The following are three equivalent ways of writing CROSS JOIN:
SELECT
emp.name, emph.sin_number
FROM employee as emp
CROSS JOIN
employee_hr as emph;
SELECT
emp.name, emph.sin_number
FROM employee as emp
JOIN
employee_hr as emph;
SELECT
emp.name, emph.sin_number
FROM employee as emp
JOIN
employee_hr as emph
on 1=1;
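As a sketch of the "header with a row count" use case mentioned above, the query below attaches the total number of rows in employee to every row. The alias cnt and the column name total_rows are made up for this illustration:

```sql
-- The subquery yields a single row, which is paired with every employee row
SELECT emp.name, cnt.total_rows
FROM employee emp
CROSS JOIN (
  SELECT count(*) AS total_rows
  FROM employee
) cnt;
```

Because the subquery returns exactly one row, the Cartesian product here has the same number of rows as employee.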
Although Hive did not support unequal joins explicitly in earlier versions, there is a workaround using CROSS JOIN and WHERE, as in this example:
SELECT
emp.name, emph.sin_number
FROM employee emp
CROSS JOIN employee_hr emph
WHERE emp.name <> emph.name;
+-----------+------------------+
| emp.name | emph.sin_number |
+-----------+------------------+
| Michael | 527-948-090 |
| Michael | 647-968-598 |
| Michael | 577-928-094 |
| Will | 547-968-091 |
| Will | 647-968-598 |
| Will | 577-928-094 |
| Shelley | 547-968-091 |
| Shelley | 527-948-090 |
| Shelley | 647-968-598 |
| Shelley | 577-928-094 |
| Lucy | 547-968-091 |
| Lucy | 527-948-090 |
| Lucy | 647-968-598 |
+-----------+------------------+
3.3 Special joins
HQL also supports some special joins that we usually do not see in relational databases, such as MapJoin and Semi-join .
MapJoin means doing the join operation only in the map phase, without a reduce job. The MapJoin statement reads all the data from the small table into memory and broadcasts it to all mappers. During the map phase, the join operation is performed by comparing each row of data in the big table with the small tables against the join conditions. Because no reduce phase is needed, this kind of join usually performs better. In newer versions of Hive, Hive automatically converts a join to a MapJoin at runtime if possible. However, you can also manually specify the broadcast table by providing a join hint, /*+ MAPJOIN(table_name) */. In addition, MapJoin can be used for unequal joins to improve performance, since both MapJoin and WHERE are performed in the map phase. The following is an example of using a MapJoin hint with CROSS JOIN:
SELECT
/*+ MAPJOIN(employee) */ emp.name, emph.sin_number
FROM employee as emp
CROSS JOIN
employee_hr as emph
WHERE emp.name <> emph.name;
The MapJoin operation does not support the following:
- Using MapJoin after UNION ALL, LATERAL VIEW, GROUP BY / JOIN / SORT BY / CLUSTER BY / DISTRIBUTE BY
- Using MapJoin before UNION , JOIN , and another MapJoin
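The automatic conversion and the hint mentioned above are both controlled by configuration. The following is a sketch of the properties commonly involved. The threshold value shown is believed to be the default in recent Hive releases, but treat it as an assumption and verify it against your own version:

```sql
-- Let Hive convert a common join to a MapJoin when one side is small enough
SET hive.auto.convert.join = true;
-- Size threshold (in bytes) below which a table counts as small
SET hive.mapjoin.smalltable.filesize = 25000000;
-- MAPJOIN hints are ignored unless this is set to false
SET hive.ignore.mapjoin.hint = false;
```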
Bucket MapJoin is a special type of MapJoin that uses bucket columns (the columns specified by CLUSTERED BY in the CREATE TABLE statement) as the join condition. Instead of fetching the whole table, as the regular MapJoin does, bucket MapJoin only fetches the required bucket data. To enable bucket MapJoin, we need to enable some settings and make sure the bucket numbers of the joined tables are multiples of each other. If both joined tables are sorted and bucketed with the same number of buckets, a sort-merge join can be performed instead of caching all small tables in memory:
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
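For these settings to have any effect, the joined tables need to be bucketed (and, for the sort-merge case, sorted) on the join column. A minimal sketch follows; the table names employee_bucketed and employee_hr_bucketed and the bucket count of 4 are hypothetical:

```sql
-- Both tables are bucketed and sorted on the join column with the same bucket count
CREATE TABLE IF NOT EXISTS employee_bucketed (
  name string,
  employee_id int
)
CLUSTERED BY (employee_id) SORTED BY (employee_id ASC) INTO 4 BUCKETS;

CREATE TABLE IF NOT EXISTS employee_hr_bucketed (
  name string,
  employee_id int,
  sin_number string,
  start_date date
)
CLUSTERED BY (employee_id) SORTED BY (employee_id ASC) INTO 4 BUCKETS;

-- Join on the bucket column so that only matching buckets need to be read
SELECT a.name, b.sin_number
FROM employee_bucketed a
JOIN employee_hr_bucketed b ON a.employee_id = b.employee_id;
```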
In addition, the LEFT SEMI JOIN statement is also a type of MapJoin . It is the same as a subquery with IN / EXISTS after v0.13.0 of Hive. However, it is not recommended for use since it is not part of standard SQL:
SELECT a.name
FROM employee as a
LEFT SEMI JOIN
employee_id as b
ON a.name = b.name;
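For comparison, the same result can be expressed with a subquery, which is closer to standard SQL. This is a sketch that relies on the IN / EXISTS subquery support described in the filtering section:

```sql
-- Equivalent form using IN with an uncorrelated subquery
SELECT a.name
FROM employee a
WHERE a.name IN (
  SELECT b.name
  FROM employee_id b
);

-- Equivalent form using EXISTS with a correlated subquery
SELECT a.name
FROM employee a
WHERE EXISTS (
  SELECT b.name
  FROM employee_id b
  WHERE a.name = b.name
);
```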
4. Union
When we want to combine data with the same schema together, we often use set operations. Regular set operations in relational databases are INTERSECT, MINUS, and UNION / UNION ALL. HQL only supports UNION and UNION ALL. The difference between them is that UNION ALL does not remove duplicate rows while UNION does. In addition, all unioned data must have the same column names and data types, or else an implicit conversion will be done, which may cause a runtime exception. If ORDER BY, SORT BY, CLUSTER BY, DISTRIBUTE BY, or LIMIT are used, they are applied to the whole result set after the union:
SELECT a.name as nm
FROM employee a
UNION ALL -- Use column alias to make the same name for union
SELECT b.name as nm
FROM employee_hr b;
+-----------+
|nm |
+-----------+
| Michael |
| Will |
| Shelley |
| Lucy |
| Michael |
| Will |
| Steven |
| Lucy |
+-----------+
SELECT a.name as nm FROM employee a
UNION -- UNION removes duplicated names and is slower
SELECT b.name as nm FROM employee_hr b;
+----------+
|nm |
+----------+
| Lucy |
| Michael |
| Shelley |
| Steven |
| Will |
+----------+
-- ORDER BY applies to the unioned data
-- When you want to order only one data set,
-- use ORDER BY in a subquery
SELECT a.name as nm FROM employee a
UNION ALL
SELECT b.name as nm FROM employee_hr b
ORDER BY nm;
+----------+
|nm |
+----------+
| Lucy |
| Lucy |
| Michael |
| Michael |
| Shelley |
| Steven |
| Will |
| Will |
+----------+
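The comment in the last example suggests ordering inside a subquery when only one of the data sets should be sorted. A sketch of that pattern follows; the alias t and the LIMIT of 2 are illustrative choices:

```sql
-- ORDER BY and LIMIT are scoped to the employee branch only
SELECT t.nm AS nm
FROM (
  SELECT a.name AS nm
  FROM employee a
  ORDER BY nm
  LIMIT 2
) t
UNION ALL
SELECT b.name AS nm
FROM employee_hr b;
```

The ORDER BY and LIMIT apply only to the rows from employee. The order of the final unioned result is still not guaranteed unless an outer ORDER BY is added.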
For other set operations that HQL does not support yet, such as INTERSECT and MINUS, we can use a join or a left join to implement them as follows:
-- Use join for set intersect
SELECT a.name
FROM employee a
JOIN employee_hr b
ON a.name = b.name;
+----------+
| a.name |
+----------+
| Michael |
| Will |
| Lucy |
+----------+
-- Use left join for set minus
SELECT a.name
FROM employee a
LEFT JOIN employee_hr b
ON a.name = b.name
WHERE b.name IS NULL;
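One caveat: INTERSECT and MINUS in standard SQL return distinct rows, while a join repeats a row once per match. If the inputs can contain duplicate names, adding DISTINCT is a simple way to get set semantics. A sketch using the same tables:

```sql
-- Set intersect with duplicates removed
SELECT DISTINCT a.name
FROM employee a
JOIN employee_hr b ON a.name = b.name;

-- Set minus with duplicates removed
SELECT DISTINCT a.name
FROM employee a
LEFT JOIN employee_hr b ON a.name = b.name
WHERE b.name IS NULL;
```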