HiveSQL Tuning for ETL (Where to Put the WHERE with LEFT JOIN)
1. Preface
Our company uses Hadoop to build its data warehouse, which inevitably means writing HiveSQL, and during ETL, execution speed becomes an unavoidable concern. I have personally had a few table joins run for an hour. One such query may seem tolerable, but repeated ETL runs multiply that into many wasted hours, so optimizing HiveSQL is unavoidable.
Note: this article only covers points to watch at the SQL level; it does not go into Hadoop, MapReduce, and so on. For Hive's compilation process, see: http://tech.meituan.com/hive-sql-to-mapreduce.html
2. Preparing the Data
Suppose we have two tables.
Sight (scenic spot) table: sight, 120,000 rows. Schema:
hive> desc sight;
OK
area string None
city string None
country string None
county string None
id string None
name string None
region string None
Sight order detail table: order_sight, 10.4 million rows. Schema:
hive> desc order_sight;
OK
create_time string None
id string None
order_id string None
sight_id bigint None
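To experiment without a Hadoop cluster, the two tables can be mimicked locally. The sketch below uses Python's sqlite3 as a stand-in for Hive; the inserted rows are hypothetical samples, not the original 120,000 and 10,400,000 records:

```python
import sqlite3

# In-memory stand-in for the two Hive tables described above
# (column layouts copied from the `desc` output).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sight (
    area TEXT, city TEXT, country TEXT, county TEXT,
    id TEXT, name TEXT, region TEXT
);
CREATE TABLE order_sight (
    create_time TEXT, id TEXT, order_id TEXT, sight_id INTEGER
);
""")

# Hypothetical sample rows, for local experimentation only.
conn.execute("INSERT INTO sight (id, name) VALUES ('9718', 'demo sight')")
conn.executemany(
    "INSERT INTO order_sight (create_time, order_id, sight_id) VALUES (?, ?, ?)",
    [("2015-10-10", "o1", 9718), ("2015-10-11", "o2", 9718)],
)
order_rows = conn.execute("SELECT COUNT(*) FROM order_sight").fetchone()[0]
print(order_rows)  # 2
```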
3. Analysis
3.1 The WHERE condition
Suppose we want every order id for sight id 9718 on 2015-10-10. The SQL can be written like this:
hive> select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10';
Number of reduce tasks not specified. Estimated from input data size.
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3562174, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3562174/
Kill Command = /home/q/hadoop/hadoop-2.2./bin/hadoop job -kill job_1434099279301_3562174
...
(per-stage progress trimmed; Cumulative CPU rises from 4.73 sec to 52.79 sec)
...
Ended Job = job_1434099279301_3562174
MapReduce Jobs Launched:
Job: Cumulative CPU: 52.79 sec   SUCCESS
OK
Time taken: 52.068 seconds
As you can see, the query takes 52 seconds. Now let's rewrite the SQL:
hive> select s.id,o.order_id from sight s left join (select order_id,sight_id from order_sight where create_time = '2015-10-10') o on o.sight_id=s.id where s.id=9718;
Number of reduce tasks not specified. Estimated from input data size.
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3562218, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3562218/
Kill Command = /home/q/hadoop/hadoop-2.2./bin/hadoop job -kill job_1434099279301_3562218
...
(per-stage progress trimmed; Cumulative CPU rises from 2.24 sec to 34.35 sec)
...
Ended Job = job_1434099279301_3562218
MapReduce Jobs Launched:
Job: Cumulative CPU: 34.35 sec   SUCCESS
OK
Time taken: 43.709 seconds
This version takes 43 seconds, somewhat faster. Of course, the point is not merely the roughly 20% speedup (I ran the comparison several times, and this run actually showed the smallest gap), but the reason behind it.
Comparing the two statements, the difference is that in the second one the filter on order_sight is moved into a subquery, so it is applied before the join. Accordingly, the second query's reduce time is clearly shorter than the first's.
Both queries break down into 8 map tasks and 1 reduce task. When the filter on the secondary table is written in the trailing WHERE clause, the filtering is deferred to the reduce stage, after the join. That single reduce task then has to process the full joined data, and its runtime, necessarily longer than that of the 8 parallel maps, dominates the total execution time.
Conclusion: in an outer join, if the filter on the secondary table is written in the WHERE clause, the full tables are joined first and the filter is applied only afterwards.
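A small local sketch makes the conclusion concrete. The following uses sqlite3 as a stand-in for Hive (table contents are hypothetical), and it also illustrates a related semantic point: with a LEFT JOIN, a WHERE filter on the secondary table discards the NULL-extended rows, so the two forms can return different results when a sight has no matching orders on that date.

```python
import sqlite3

# Minimal stand-in tables; one sight, and no orders on 2015-10-10.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sight (id INTEGER, name TEXT);
CREATE TABLE order_sight (order_id TEXT, sight_id INTEGER, create_time TEXT);
""")
conn.execute("INSERT INTO sight VALUES (9718, 'demo sight')")
conn.execute("INSERT INTO order_sight VALUES ('o1', 9718, '2015-10-09')")

# Form 1: filter on the secondary table in the trailing WHERE, after the join.
q1 = """SELECT s.id, o.order_id
        FROM sight s LEFT JOIN order_sight o ON o.sight_id = s.id
        WHERE s.id = 9718 AND o.create_time = '2015-10-10'"""

# Form 2: filter pushed into a subquery, before the join.
q2 = """SELECT s.id, o.order_id
        FROM sight s LEFT JOIN
             (SELECT order_id, sight_id FROM order_sight
              WHERE create_time = '2015-10-10') o
             ON o.sight_id = s.id
        WHERE s.id = 9718"""

rows1 = conn.execute(q1).fetchall()
rows2 = conn.execute(q2).fetchall()
print(rows1)  # [] -- the WHERE drops the NULL-extended row
print(rows2)  # [(9718, None)] -- the left-side row is kept
```

So besides shifting work out of the reduce-side join, pushing the filter into the subquery preserves unmatched rows from the primary table; the trailing-WHERE form effectively degrades the LEFT JOIN into an inner join.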