Hive SQL optimization with DISTRIBUTE BY and SORT BY

I have recently been optimizing Hive SQL.

Below is a query that sorts the data, groups it, and takes the first record of each group:

INSERT OVERWRITE TABLE t_wa_funnel_distinct_temp PARTITION (pt='${SRCTIME}')
SELECT
bussiness_id,
cookie_id,
session_id,
funnel_id,
group_first(funnel_name) funnel_name,
step_id,
group_first(step_name) step_name,
group_first(log_type) log_type,
group_first(url_pattern) url_pattern,
group_first(url) url,
group_first(refer) refer,
group_first(log_time) log_time,
group_first(is_new_visitor) is_new_visitor,
group_first(is_mobile_traffic) is_mobile_traffic,
group_first(is_bounce) is_bounce,
group_first(campaign_name) campaign_name,
group_first(group_name) group_name,
group_first(slot_name) slot_name,
group_first(source_type) source_type,
group_first(next_page) next_page,
group_first(continent) continent,
group_first(sub_continent_region) sub_continent_region,
group_first(country) country,
group_first(region) region,
group_first(city) city,
group_first(language) language,
group_first(browser) browser,
group_first(os) os,
group_first(screen_color) screen_color,
group_first(screen_resolution) screen_resolution,
group_first(flash_version) flash_version,
group_first(java) java,
group_first(host) host
FROM
( SELECT *
FROM r_wa_funnel
WHERE pt='${SRCTIME}'
ORDER BY bussiness_id, cookie_id, session_id, funnel_id, step_id, log_time ASC
) t1
GROUP BY pt, bussiness_id, cookie_id, session_id, funnel_id, step_id;

group_first: a user-defined function that returns the first value of a field within each group.

${SRCTIME}: passed in by the external Oozie scheduler and used as the time partition, accurate to the hour, e.g. 2011.11.01.21

The SQL above was run on Hive with SRCTIME = 2011.11.01.21. The 2011.11.01.21 hourly partition contains 10,435,486 records.

Run time:

The job output shows that the reduce phase uses only one reducer. This is because ORDER BY is a global sort, which Hive can only perform with a single reducer.

The business requirement only needs the rows grouped by bussiness_id, cookie_id, session_id, funnel_id, step_id and sorted by log_time in ascending order within each group.

So we can use Hive's DISTRIBUTE BY and SORT BY instead. This makes full use of the Hadoop cluster's resources, because the sort on log_time is done locally in multiple reducers.
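As a minimal sketch of the two clauses on their own, the query below routes every row of a group to the same reducer and then sorts inside each reducer. The reducer count of 8 is an arbitrary value chosen for illustration, and listing the group keys ahead of log_time in SORT BY is my own variation (not part of the original query) so that each group's rows stay contiguous within a reducer.

```sql
-- Assumption: 8 is an arbitrary reducer count; left unset, Hive estimates it from input size.
SET mapred.reduce.tasks = 8;

SELECT *
FROM r_wa_funnel
WHERE pt = '2011.11.01.21'
-- Rows with the same key values are sent to the same reducer.
DISTRIBUTE BY bussiness_id, cookie_id, session_id, funnel_id, step_id
-- Each reducer sorts only its own rows; there is no global order across reducers.
SORT BY bussiness_id, cookie_id, session_id, funnel_id, step_id, log_time ASC;
```

Unlike ORDER BY, the output is ordered only within each reducer's file, which is enough when the goal is the earliest record per group.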

The optimized Hive code:

INSERT OVERWRITE TABLE t_wa_funnel_distinct PARTITION (pt='2011.11.01.21')
SELECT
bussiness_id,
cookie_id,
session_id,
funnel_id,
group_first(funnel_name) funnel_name,
step_id,
group_first(step_name) step_name,
group_first(log_type) log_type,
group_first(url_pattern) url_pattern,
group_first(url) url,
group_first(refer) refer,
group_first(log_time) log_time,
group_first(is_new_visitor) is_new_visitor,
group_first(is_mobile_traffic) is_mobile_traffic,
group_first(is_bounce) is_bounce,
group_first(campaign_name) campaign_name,
group_first(group_name) group_name,
group_first(slot_name) slot_name,
group_first(source_type) source_type,
group_first(next_page) next_page,
group_first(continent) continent,
group_first(sub_continent_region) sub_continent_region,
group_first(country) country,
group_first(region) region,
group_first(city) city,
group_first(language) language,
group_first(browser) browser,
group_first(os) os,
group_first(screen_color) screen_color,
group_first(screen_resolution) screen_resolution,
group_first(flash_version) flash_version,
group_first(java) java,
group_first(host) host
FROM
( SELECT *
FROM r_wa_funnel
WHERE pt='2011.11.01.21'
DISTRIBUTE BY bussiness_id, cookie_id, session_id, funnel_id, step_id
SORT BY log_time ASC
) t1
GROUP BY bussiness_id, cookie_id, session_id, funnel_id, step_id;

Run time:

The first query takes 6:43 to run, while the optimized one takes only 0:35 (35 seconds), a substantial performance improvement.
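For readers on Hive 0.11 or later, which added windowing functions, the same "first record per group" result can be sketched with ROW_NUMBER(). This is an alternative I am suggesting, not the approach used above. It does not need the custom group_first function and does not depend on the order in which rows reach an aggregate. Only a few of the columns are listed for brevity, and the aliases t, ranked and rn are illustrative names.

```sql
SELECT bussiness_id, cookie_id, session_id, funnel_id, step_id,
       funnel_name, step_name, log_time   -- remaining columns omitted for brevity
FROM (
  SELECT t.*,
         -- Number the rows of each group from the earliest log_time onwards.
         ROW_NUMBER() OVER (
           PARTITION BY bussiness_id, cookie_id, session_id, funnel_id, step_id
           ORDER BY log_time ASC
         ) AS rn
  FROM r_wa_funnel t
  WHERE pt = '2011.11.01.21'
) ranked
WHERE rn = 1;   -- keep only the first row of each group
```

The PARTITION BY clause plays the role of DISTRIBUTE BY, and the ORDER BY inside the window plays the role of SORT BY, so the work is still spread across multiple reducers.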
