1. Introduction

The company uses Hadoop to build its data warehouse, which inevitably means writing HiveQL, and during ETL execution speed becomes a problem you cannot dodge. I have had a join across a few tables run for a whole hour; that may sound tolerable in isolation, but an ETL pipeline repeats such steps many times, the hours add up, and a great deal of time is wasted. Optimizing HiveQL is therefore unavoidable.

Note: this article only covers points worth watching at the SQL level; it does not go into Hadoop, MapReduce, and so on. For how Hive compiles SQL into MapReduce, see: http://tech.meituan.com/hive-sql-to-mapreduce.html

2. Preparing the Data

Suppose we have two tables.

Scenic-spot table: sight, 120,000 rows, with the following schema:

hive> desc sight;
OK
area string None
city string None
country string None
county string None
id string None
name string None
region string None

Scenic-spot order detail table: order_sight, 10.4 million rows, with the following schema:

hive> desc order_sight;
OK
create_time string None
id string None
order_id string None
sight_id bigint None
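
For reference, here is a minimal DDL sketch that would produce tables with the schemas shown above. The column names and types are taken from the desc output; the delimiter and storage format are assumptions, not details from the original environment.

-- hypothetical DDL; ROW FORMAT and STORED AS are assumed
create table sight (
  area    string,
  city    string,
  country string,
  county  string,
  id      string,
  name    string,
  region  string
)
row format delimited fields terminated by '\t'
stored as textfile;

create table order_sight (
  create_time string,
  id          string,
  order_id    string,
  sight_id    bigint
)
row format delimited fields terminated by '\t'
stored as textfile;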

3. Analysis

3.1 The WHERE condition

Suppose we want every order id for scenic spot id 9718 on 2015-10-10. The SQL would be written like this:

hive> select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3562174, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3562174/
Kill Command = /home/q/hadoop/hadoop-2.2./bin/hadoop job -kill job_1434099279301_3562174
Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 1
... (periodic map/reduce progress lines omitted; Cumulative CPU climbs from 4.73 sec to 52.79 sec as the map and then the reduce phase complete) ...
MapReduce Total cumulative CPU time: 52 seconds 790 msec
Ended Job = job_1434099279301_3562174
MapReduce Jobs Launched:
Job 0: Map: 8  Reduce: 1   Cumulative CPU: 52.79 sec   HDFS Read: HDFS Write: SUCCESS
Total MapReduce CPU Time Spent: 52 seconds 790 msec
OK
Time taken: 52.068 seconds, Fetched: row(s)

As you can see, the query takes 52 seconds. Now let's write the same SQL differently:

hive> select s.id,o.order_id from sight s left join (select order_id,sight_id from order_sight where create_time = '2015-10-10') o on o.sight_id=s.id where s.id=9718;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1434099279301_3562218, Tracking URL = http://l-hdpm4.data.cn5.qunar.com:9981/proxy/application_1434099279301_3562218/
Kill Command = /home/q/hadoop/hadoop-2.2./bin/hadoop job -kill job_1434099279301_3562218
Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 1
... (periodic map/reduce progress lines omitted; Cumulative CPU climbs from 2.24 sec to 34.35 sec) ...
MapReduce Total cumulative CPU time: 34 seconds 350 msec
Ended Job = job_1434099279301_3562218
MapReduce Jobs Launched:
Job 0: Map: 8  Reduce: 1   Cumulative CPU: 34.35 sec   HDFS Read: HDFS Write: SUCCESS
Total MapReduce CPU Time Spent: 34 seconds 350 msec
OK
Time taken: 43.709 seconds, Fetched: row(s)

This version takes 43 seconds, somewhat faster. The point is not simply that it is roughly 20% faster (I ran the comparison several times, and this run actually shows the smallest gap), but to understand why.

Looking at the two statements, the only difference is in the second one: I moved the condition on the left-joined table into a subquery. The execution changes accordingly, and the reduce time of the second query is clearly shorter than that of the first.

The reason is that both statements are compiled into 8 map tasks and 1 reduce task. When the condition on the left-joined table is written in the WHERE clause, the filtering happens together with the join in the reduce stage, and that single reduce task inevitably takes far longer than the 8 parallel map tasks, so the overall run time stretches out. In the rewritten query the create_time filter on order_sight is applied in the map stage, so much less data ever reaches the reducer.
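
A quick way to confirm where the predicate is evaluated is to prefix each statement with EXPLAIN and look for the Filter Operator that carries create_time = '2015-10-10': in the first query it appears after the Join Operator (that is, in the reduce stage), while in the second it sits in the map-side scan of order_sight. The exact plan text differs between Hive versions, so treat this as a sketch of what to look for rather than literal output.

hive> explain select s.id,o.order_id from sight s left join order_sight o on o.sight_id=s.id where s.id=9718 and o.create_time = '2015-10-10';
hive> explain select s.id,o.order_id from sight s left join (select order_id,sight_id from order_sight where create_time = '2015-10-10') o on o.sight_id=s.id where s.id=9718;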

Conclusion: with an outer join, if the filter condition on the secondary (joined) table is written in the WHERE clause, the full tables are joined first and the filter is applied only afterwards; filtering the secondary table in a subquery before the join avoids that wasted work.
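
As a generic template, the pattern looks like the sketch below. The table and column names (main_table, detail_table, main_id, dt) are made up for illustration. One caveat worth noting: in the slow form, the WHERE condition on the joined table also discards the NULL-extended rows the LEFT JOIN would otherwise keep, so it effectively behaves like an inner join, while the subquery form filters first and preserves the left-join semantics.

-- Slow: the filter on the secondary table sits in WHERE, so the full
-- tables are joined in the reduce stage first and filtered afterwards.
select a.id, b.order_id
from main_table a
left join detail_table b on b.main_id = a.id
where a.id = 9718 and b.dt = '2015-10-10';

-- Faster: shrink the secondary table in a subquery before the join;
-- the dt filter then runs in the map stage.
select a.id, b.order_id
from main_table a
left join (select main_id, order_id from detail_table
           where dt = '2015-10-10') b
  on b.main_id = a.id
where a.id = 9718;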
