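Recently a BA user reported that two seemingly very similar statements return different result counts. This looked odd, and they wondered whether it was a Hive bug.

Query 1 returns 6071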

select count(distinct reviewid) as dis_reviewcnt
from
(select a.reviewid
from bi.dpods_dp_reviewreport a
left outer join bi.dpods_dp_reviewlog b
on a.reviewid=b.reviewid and b.hp_statdate='2013-07-24'
where to_date(a.feedadddate) >= '2013-07-01' and a.hp_statdate='2013-07-24'
) a

Query 2 returns 6443

select count(distinct reviewid) as dis_reviewcnt
from
(select a.reviewid
from bi.dpods_dp_reviewreport a
left outer join bi.dpods_dp_reviewlog b
on a.reviewid=b.reviewid and b.hp_statdate='2013-07-24' and a.hp_statdate='2013-07-24'
where to_date(a.feedadddate) >= '2013-07-01'
) a

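Query 2 returns 372 more distinct reviewids than Query 1, and those extra ids do not appear to exist in the left table of the subquery (that is, not in its hp_statdate='2013-07-24' partition).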

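The only difference between the two statements is where the partition filter on dpods_dp_reviewreport is placed (hp_statdate is the partition column): in the WHERE clause in Query 1, and in the ON clause in Query 2.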

At first glance the two should return the same data, but the catch lies precisely in the difference between WHERE and ON.

WHERE carries filter conditions. In Query 1, a.hp_statdate='2013-07-24' is picked up by the Partition Pruner, which prunes partitions before the table scan, so only rows in the '2013-07-24' partition are joined with dpods_dp_reviewlog.

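In Query 2, by contrast, the rows of all partitions are read and then joined with dpods_dp_reviewlog. According to the join condition, a left row is actually matched against the right table only when a.hp_statdate='2013-07-24'. Every other left row is still emitted, because this is a left outer join: it simply gets NULLs on the right side. Query 2 therefore returns the reviewids from all partitions (subject only to the feedadddate filter in WHERE), which is why its result differs from Query 1.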
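The difference can be reproduced on a toy dataset. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, on the assumption that SQLite follows the same standard left outer join semantics for WHERE versus ON. The table names report and log and all of the rows are invented for illustration. There is no partition pruning here, only the logical effect of where the predicate sits.

```python
import sqlite3

# Toy stand-ins for the two tables; names and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE report (reviewid INTEGER, hp_statdate TEXT);
CREATE TABLE log    (reviewid INTEGER, hp_statdate TEXT);
INSERT INTO report VALUES (1, '2013-07-24'), (2, '2013-07-24'), (3, '2013-07-23');
INSERT INTO log    VALUES (1, '2013-07-24'), (3, '2013-07-24');
""")

# Left-table predicate in WHERE: left rows from other dates are discarded.
where_sql = """
SELECT COUNT(DISTINCT a.reviewid)
FROM report a
LEFT OUTER JOIN log b
  ON a.reviewid = b.reviewid AND b.hp_statdate = '2013-07-24'
WHERE a.hp_statdate = '2013-07-24'
"""

# Same predicate in ON: it only decides whether a match is attempted,
# so reviewid 3 survives with NULLs on the right side.
on_sql = """
SELECT COUNT(DISTINCT a.reviewid)
FROM report a
LEFT OUTER JOIN log b
  ON a.reviewid = b.reviewid AND b.hp_statdate = '2013-07-24'
 AND a.hp_statdate = '2013-07-24'
"""

count_where = conn.execute(where_sql).fetchone()[0]
count_on = conn.execute(on_sql).fetchone()[0]
print(count_where, count_on)  # prints: 2 3
```

The WHERE variant counts only the two ids dated 2013-07-24, while the ON variant also counts reviewid 3 from the other date, mirroring the 6071 versus 6443 gap on a small scale.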

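This can be checked with an experiment: remove a.hp_statdate='2013-07-24' from the ON clause of Query 2, leave everything else unchanged, and run the statement. The distinct reviewid count is again 6443.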

select count(distinct reviewid) as dis_reviewcnt
from
(select a.reviewid
from bi.dpods_dp_reviewreport a
left outer join bi.dpods_dp_reviewlog b
on a.reviewid=b.reviewid and b.hp_statdate='2013-07-24'
where to_date(a.feedadddate) >= '2013-07-01'
) a
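The same experiment can be mimicked on toy data. This is a minimal sketch using Python's built-in sqlite3 as a stand-in for Hive. The helper distinct_left_ids, the table names and the rows are all hypothetical. It builds the join once with the left-table predicate in ON and once without it, then compares the distinct counts.

```python
import sqlite3

def distinct_left_ids(on_clause):
    """Count distinct left-table reviewids for a given ON clause (toy data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE report (reviewid INTEGER, hp_statdate TEXT);
        CREATE TABLE log    (reviewid INTEGER, hp_statdate TEXT);
        INSERT INTO report VALUES
            (1, '2013-07-24'), (2, '2013-07-24'),
            (2, '2013-07-23'), (3, '2013-07-23');
        INSERT INTO log VALUES (1, '2013-07-24'), (2, '2013-07-24');
    """)
    sql = (
        "SELECT COUNT(DISTINCT a.reviewid) "
        "FROM report a LEFT OUTER JOIN log b ON " + on_clause
    )
    count = conn.execute(sql).fetchone()[0]
    conn.close()
    return count

base = "a.reviewid = b.reviewid AND b.hp_statdate = '2013-07-24'"
with_left_pred = distinct_left_ids(base + " AND a.hp_statdate = '2013-07-24'")
without_left_pred = distinct_left_ids(base)
print(with_left_pred, without_left_pred)  # prints: 3 3
```

Both counts agree because a left outer join keeps every left row regardless of what the ON clause says. A predicate on the left table placed in ON can change which right rows are attached, but not which left keys appear in the output.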

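Query plan for Query 1 (the hp_statdate condition on a does not appear in the Filter Operator, presumably because partition pruning has already consumed it):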

ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewreport) a) (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewlog) b) (and (= (. (TOK_TABLE_OR_COL a) reviewid) (. (TOK_TABLE_OR_COL b) reviewid)) (= (. (TOK_TABLE_OR_COL b) hp_statdate) '2013-07-24')))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) reviewid))) (TOK_WHERE (and (>= (TOK_FUNCTION to_date (. (TOK_TABLE_OR_COL a) feedadddate)) '2013-07-01') (= (. (TOK_TABLE_OR_COL a) hp_statdate) '2013-07-24'))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL reviewid)) dis_reviewcnt))))

STAGE DEPENDENCIES:
Stage-5 is a root stage , consists of Stage-1
Stage-1
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage

STAGE PLANS:
Stage: Stage-5
Conditional Operator

Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:a
TableScan
alias: a
Filter Operator
predicate:
expr: (to_date(feedadddate) >= '2013-07-01')
type: boolean
Reduce Output Operator
key expressions:
expr: reviewid
type: int
sort order: +
Map-reduce partition columns:
expr: reviewid
type: int
tag: 0
value expressions:
expr: feedadddate
type: string
expr: reviewid
type: int
expr: hp_statdate
type: string
a:b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: reviewid
type: int
sort order: +
Map-reduce partition columns:
expr: reviewid
type: int
tag: 1
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col5} {VALUE._col8} {VALUE._col17}
1
handleSkewJoin: false
outputColumnNames: _col5, _col8, _col17
Select Operator
expressions:
expr: _col8
type: int
outputColumnNames: _col0
Select Operator
expressions:
expr: _col0
type: int
outputColumnNames: _col0
Group By Operator
aggregations:
expr: count(DISTINCT _col0)
bucketGroup: false
keys:
expr: _col0
type: int
mode: hash
outputColumnNames: _col0, _col1
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://10.2.6.102/tmp/hive-hadoop/hive_2013-07-26_18-10-59_408_7272696604651905662/-mr-10002
Reduce Output Operator
key expressions:
expr: _col0
type: int
sort order: +
tag: -1
value expressions:
expr: _col1
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(DISTINCT KEY._col0:0._col0)
bucketGroup: false
mode: mergepartial
outputColumnNames: _col0
Select Operator
expressions:
expr: _col0
type: bigint
outputColumnNames: _col0
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Stage: Stage-0
Fetch Operator
limit: -1

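Query plan for Query 2 (here the a.hp_statdate condition shows up under "filter predicates" on the Join Operator rather than as a filter at scan time):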

ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewreport) a) (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewlog) b) (and (and (= (. (TOK_TABLE_OR_COL a) reviewid) (. (TOK_TABLE_OR_COL b) reviewid)) (= (. (TOK_TABLE_OR_COL b) hp_statdate) '2013-07-24')) (= (. (TOK_TABLE_OR_COL a) hp_statdate) '2013-07-24')))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) reviewid))) (TOK_WHERE (>= (TOK_FUNCTION to_date (. (TOK_TABLE_OR_COL a) feedadddate)) '2013-07-01')))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL reviewid)) dis_reviewcnt))))

STAGE DEPENDENCIES:
Stage-5 is a root stage , consists of Stage-1
Stage-1
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage

STAGE PLANS:
Stage: Stage-5
Conditional Operator

Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:a
TableScan
alias: a
Filter Operator
predicate:
expr: (to_date(feedadddate) >= '2013-07-01')
type: boolean
Reduce Output Operator
key expressions:
expr: reviewid
type: int
sort order: +
Map-reduce partition columns:
expr: reviewid
type: int
tag: 0
value expressions:
expr: feedadddate
type: string
expr: reviewid
type: int
expr: hp_statdate
type: string
a:b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: reviewid
type: int
sort order: +
Map-reduce partition columns:
expr: reviewid
type: int
tag: 1
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col5} {VALUE._col8}
1
filter predicates:
0 {(VALUE._col17 = '2013-07-24')}
1
handleSkewJoin: false
outputColumnNames: _col5, _col8
Select Operator
expressions:
expr: _col8
type: int
outputColumnNames: _col0
Select Operator
expressions:
expr: _col0
type: int
outputColumnNames: _col0
Group By Operator
aggregations:
expr: count(DISTINCT _col0)
bucketGroup: false
keys:
expr: _col0
type: int
mode: hash
outputColumnNames: _col0, _col1
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
hdfs://10.2.6.102/tmp/hive-hadoop/hive_2013-07-26_18-13-32_879_3623450294049807419/-mr-10002
Reduce Output Operator
key expressions:
expr: _col0
type: int
sort order: +
tag: -1
value expressions:
expr: _col1
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(DISTINCT KEY._col0:0._col0)
bucketGroup: false
mode: mergepartial
outputColumnNames: _col0
Select Operator
expressions:
expr: _col0
type: bigint
outputColumnNames: _col0
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Stage: Stage-0
Fetch Operator
limit: -1
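To make the "filter predicates" entry in the Query 2 plan more concrete, here is a conceptual sketch in plain Python of a left outer join that takes a left-side filter. It is not Hive's actual Join Operator implementation. The function left_outer_join, its parameters and the sample rows are all made up for illustration. The point is only that a left row failing the filter is still emitted, just with nothing attached on the right.

```python
from collections import defaultdict

def left_outer_join(left_rows, right_rows, key, left_filter=None):
    """Left outer join over lists of dicts (conceptual sketch, not Hive code).

    left_filter plays the role of an ON-clause predicate on the left table:
    a left row that fails it is still emitted, just without right-side matches.
    """
    right_by_key = defaultdict(list)
    for r in right_rows:
        right_by_key[r[key]].append(r)

    out = []
    for l in left_rows:
        passes = left_filter is None or left_filter(l)
        matches = right_by_key.get(l[key], []) if passes else []
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))  # unmatched or filtered out: right side is NULL
    return out

left = [
    {"reviewid": 1, "hp_statdate": "2013-07-24"},
    {"reviewid": 2, "hp_statdate": "2013-07-23"},
]
right = [{"reviewid": 1}, {"reviewid": 2}]

joined = left_outer_join(
    left, right, "reviewid",
    left_filter=lambda row: row["hp_statdate"] == "2013-07-24",
)
for l, r in joined:
    print(l["reviewid"], r)
```

In this toy run reviewid 2 has a matching right row, yet it comes out paired with None because its hp_statdate fails the filter. Its id is still in the output, which is the behaviour that inflates the distinct count in Query 2.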

Reference:

http://blog.sina.com.cn/s/blog_6ff05a2c01010oxp.html
