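Hive statement execution order

MySQL statement execution order

Order in which the clauses are written:

select ... from ... where ... group by ... having ... order by ...

    or, in Hive only (MySQL does not accept the from-first form):

from ... select ...

Order in which the clauses are executed:

from ... where ... group by ... having ... select ... order by ...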
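
The logical order above can be sketched in plain Python as a pipeline over in-memory rows. This only illustrates the clause order (from, where, group by, having, select, order by), not how MySQL or Hive is implemented; the sample rows, the `run_query` helper, and the thresholds are made up for the example.

```python
from collections import defaultdict

# Made-up sample rows standing in for a table (city, device, cnt).
rows = [
    {"city": "bj", "device": "pc", "cnt": 3},
    {"city": "bj", "device": "pc", "cnt": 4},
    {"city": "sh", "device": "mobile", "cnt": 1},
    {"city": "sh", "device": "pc", "cnt": 0},
    {"city": "gz", "device": "pc", "cnt": 9},
]

def run_query(table):
    # from: read the table
    source = list(table)
    # where: row-level filter (cnt > 0)
    filtered = [r for r in source if r["cnt"] > 0]
    # group by: bucket the remaining rows by city
    groups = defaultdict(list)
    for r in filtered:
        groups[r["city"]].append(r)
    # having: group-level filter (sum(cnt) >= 5)
    kept = {k: g for k, g in groups.items() if sum(r["cnt"] for r in g) >= 5}
    # select: build the output columns
    projected = [{"city": k, "total": sum(r["cnt"] for r in g)} for k, g in kept.items()]
    # order by: sort the final result (total, descending)
    return sorted(projected, key=lambda r: r["total"], reverse=True)

result = run_query(rows)
print(result)
```

Because having runs after group by and before select, it can filter on an aggregate, while where only ever sees individual rows.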

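Hive statement execution order

Rough order, as it shows up in the execution plan below (the Select Operator that picks the needed columns runs before the Group By Operator):

from ... where ... select ... group by ... having ... order by ...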

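Viewing the execution plan with explain

Both Hive and MySQL can show a statement's execution plan with explain, which makes the execution order visible. For example: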

    explain

    select city,ad_type,device,sum(cnt) as cnt

    from tb_pmp_raw_log_basic_analysis

    where day = '2016-05-28' and type = 0 and media = 'sohu' and (deal_id = '' or deal_id = '-' or deal_id is NULL)

    group by city,ad_type,device

The execution plan looks like this:

STAGE DEPENDENCIES:

  Stage-1 is a root stage

  Stage-0 is a root stage

STAGE PLANS:

  Stage: Stage-1

    Map Reduce

      Map Operator Tree:

          TableScan

            alias: tb_pmp_raw_log_basic_analysis

            Statistics: Num rows: 8195357 Data size: 580058024 Basic stats: COMPLETE Column stats: NONE

            Filter Operator

              predicate: (((deal_id = '') or (deal_id = '-')) or deal_id is null) (type: boolean)

              Statistics: Num rows: 8195357 Data size: 580058024 Basic stats: COMPLETE Column stats: NONE

              Select Operator

                expressions: city (type: string), ad_type (type: string), device (type: string), cnt (type: bigint)

                outputColumnNames: city, ad_type, device, cnt

                Statistics: Num rows: 8195357 Data size: 580058024 Basic stats: COMPLETE Column stats: NONE

                Group By Operator

                  aggregations: sum(cnt)

                  keys: city (type: string), ad_type (type: string), device (type: string)

                  mode: hash

                  outputColumnNames: _col0, _col1, _col2, _col3

                  Statistics: Num rows: 8195357 Data size: 580058024 Basic stats: COMPLETE Column stats: NONE

                  Reduce Output Operator

                    key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)

                    sort order: +++

                    Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string)

                    Statistics: Num rows: 8195357 Data size: 580058024 Basic stats: COMPLETE Column stats: NONE

                    value expressions: _col3 (type: bigint)

      Reduce Operator Tree:

        Group By Operator

          aggregations: sum(VALUE._col0)

          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string)

          mode: mergepartial

          outputColumnNames: _col0, _col1, _col2, _col3

          Statistics: Num rows: 4097678 Data size: 290028976 Basic stats: COMPLETE Column stats: NONE

          Select Operator

            expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: bigint)

            outputColumnNames: _col0, _col1, _col2, _col3

            Statistics: Num rows: 4097678 Data size: 290028976 Basic stats: COMPLETE Column stats: NONE

            File Output Operator

              compressed: false

              Statistics: Num rows: 4097678 Data size: 290028976 Basic stats: COMPLETE Column stats: NONE

              table:

                  input format: org.apache.hadoop.mapred.TextInputFormat

                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0

    Fetch Operator

      limit: -1

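The operators in detail:

**Stage-1, map phase (Map Operator Tree)**

        TableScan: the from step, which loads the table. Its description includes the row count, data size, and so on.

        Filter Operator: the where step, which filters rows. Its description includes the predicate plus the row count and data size.

        Select Operator: picks the columns. Its description lists the expressions with their types, the output column names, and the statistics.

        Group By Operator: grouping. aggregations lists the functions to compute, keys lists the grouping columns, and outputColumnNames lists the output columns, which by default get fixed aliases such as _col0. mode: hash means this is a map-side partial aggregation held in a hash table.

        Reduce Output Operator: emits the map output to the shuffle. key expressions and sort order define how the rows are sorted, and Map-reduce partition columns decide which reducer each row is sent to.

**Stage-1, reduce phase (Reduce Operator Tree)**

        Group By Operator: the overall grouping and aggregation, which merges the partial results of the map tasks on the reduce side. The description is similar; mode: mergepartial means it merges the map-side partial results, whereas the map side uses mode: hash.

        Select Operator: the final column selection for the output.

        File Output Operator: writes the result to a temporary file. Its description covers whether the output is compressed, the input/output formats, and the serde.

        Stage-0 is not a second MapReduce job. It is a Fetch Operator that returns the result to the client, and it is where a clause such as limit 100 would show up; limit: -1 means no limit.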
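
The two Group By operators can be pictured with a small Python simulation: a partial aggregation per input split (mode: hash), a routing step by key (the role of the Reduce Output Operator), and a merge of the partial results (mode: mergepartial). This is a sketch under simplifying assumptions, not Hive's implementation: sorting is left out, and the function names, the sample splits, and the use of Python's built-in `hash` are made up for illustration.

```python
from collections import defaultdict

def map_side_partial(split):
    """Map-side Group By (mode: hash): aggregate within one input split."""
    partial = defaultdict(int)
    for key, cnt in split:
        partial[key] += cnt
    return dict(partial)

def shuffle(partials, num_reducers):
    """Route every partial result to a reducer chosen by the hash of its key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for key, value in partial.items():
            buckets[hash(key) % num_reducers][key].append(value)
    return buckets

def reduce_side_merge(bucket):
    """Reduce-side Group By (mode: mergepartial): merge the partial sums per key."""
    return {key: sum(values) for key, values in bucket.items()}

# Two made-up input splits: (city, ad_type, device) keys with a cnt value.
split_a = [(("bj", "banner", "pc"), 2), (("bj", "banner", "pc"), 3), (("sh", "video", "pc"), 1)]
split_b = [(("bj", "banner", "pc"), 5), (("sh", "video", "pc"), 4)]

partials = [map_side_partial(s) for s in (split_a, split_b)]
final = {}
for bucket in shuffle(partials, num_reducers=2):
    final.update(reduce_side_merge(bucket))
print(final)
```

In this toy input the map side shrinks five rows to four partial rows before the shuffle; on real data with many repeated keys the reduction is far larger, which is why the map-side aggregation matters.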

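Summary

1. A complex HQL statement can produce several stages, and a stage is usually, but not always, a separate MapReduce job (Stage-0 above is only a fetch). The plan description shows what each step does.

2. The statistics in the plan are estimates made without actually running the query, so they may be inaccurate.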
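
For longer plans, listing the operators per stage makes the step order easier to read. The helper below is a hypothetical convenience script, not part of Hive: it assumes the text layout of the explain output shown above, and `sample_plan` is a trimmed copy of that plan.

```python
import re

def operators_by_stage(plan_text):
    """Return {stage name: [operator names in plan order]} from explain text."""
    stages = {}
    current = None
    for line in plan_text.splitlines():
        stripped = line.strip()
        stage = re.match(r"Stage: (Stage-\d+)$", stripped)
        if stage:
            current = stage.group(1)
            stages[current] = []
        elif current and re.fullmatch(r"TableScan|[A-Za-z ]+ Operator", stripped):
            stages[current].append(stripped)
    return stages

sample_plan = """
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: tb_pmp_raw_log_basic_analysis
            Filter Operator
              predicate: (((deal_id = '') or (deal_id = '-')) or deal_id is null) (type: boolean)
              Select Operator
                Group By Operator
                  mode: hash
                  Reduce Output Operator
      Reduce Operator Tree:
        Group By Operator
          mode: mergepartial
          Select Operator
            File Output Operator
  Stage: Stage-0
    Fetch Operator
      limit: -1
"""

plan_ops = operators_by_stage(sample_plan)
for stage, ops in plan_ops.items():
    print(stage, "->", ", ".join(ops))
```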

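MapReduce jobs generated by group by

Hive statements are often best written as nested subqueries, so that the data is processed in stages and its volume shrinks step by step. This can cost extra time, though, so the nesting needs to be designed carefully.

group by is itself a form of data reduction. It can cut the data volume considerably and is especially effective for deduplication. However, the MapReduce job that group by generates is not always under your control, and it is not obvious at which stage it is best placed. The map-side partial aggregation in particular does a lot to reduce the data.

Above all, for Hadoop MapReduce the problem is not scarcity but uneven distribution: data skew is the biggest bottleneck of an MR computation. In Hive, partitions, buckets, distribute by, and similar settings can be used to control how data is distributed to the reducers.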
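
To see why skew hurts, the following sketch counts how many rows each reducer would receive when rows are partitioned by a hash of the key and one key dominates the input. Everything here is an assumption made for illustration: the data is made up, and `zlib.crc32` is only a deterministic stand-in, not the hash function Hive actually uses.

```python
import zlib
from collections import Counter

def reducer_for(key, num_reducers):
    # Deterministic stand-in for a hash partitioner (not Hive's real hash).
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_loads(keys, num_reducers):
    """Count the rows that land on each reducer."""
    loads = Counter({r: 0 for r in range(num_reducers)})
    for key in keys:
        loads[reducer_for(key, num_reducers)] += 1
    return loads

# Made-up input: one hot key accounts for 90% of the rows.
keys = ["sohu"] * 9000 + [f"media_{i}" for i in range(1000)]
loads = reducer_loads(keys, num_reducers=4)
print(sorted(loads.values(), reverse=True))
```

All rows of the hot key land on the same reducer, so that reducer handles at least 9000 of the 10000 rows while the others share the rest. Adding reducers does not help; the keys themselves have to be spread differently.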

So, can the MapReduce job generated by group by be optimized?

Compare the following two pieces of code.

Code 1

explain

select advertiser_id,crt_id,ad_place_id,channel,ad_type,rtb_type,media,count(1) as cnt

from (

  select

    split(all,'\\\\|~\\\\|')[41] as advertiser_id,

    split(all,'\\\\|~\\\\|')[11] as crt_id,

    split(all,'\\\\|~\\\\|')[8] as ad_place_id,

    split(all,'\\\\|~\\\\|')[34] as channel,

    split(all,'\\\\|~\\\\|')[42] as ad_type,

    split(all,'\\\\|~\\\\|')[43] as rtb_type,

    split(split(all,'\\\\|~\\\\|')[5],'/')[1] as media

  from tb_pmp_raw_log_bid_tmp tb

) a

group by advertiser_id,crt_id,ad_place_id,channel,ad_type,rtb_type,media;

Code 2

 explain

  select

    split(all,'\\\\|~\\\\|')[41] as advertiser_id,

    split(all,'\\\\|~\\\\|')[11] as crt_id,

    split(all,'\\\\|~\\\\|')[8] as ad_place_id,

    split(all,'\\\\|~\\\\|')[34] as channel,

    split(all,'\\\\|~\\\\|')[42] as ad_type,

    split(all,'\\\\|~\\\\|')[43] as rtb_type,

    split(split(all,'\\\\|~\\\\|')[5],'/')[1] as media

  from tb_pmp_raw_log_bid_tmp tb

  group by split(all,'\\\\|~\\\\|')[41],split(all,'\\\\|~\\\\|')[11],split(all,'\\\\|~\\\\|')[8],split(all,'\\\\|~\\\\|')[34],split(all,'\\\\|~\\\\|')[42],split(all,'\\\\|~\\\\|')[43],split(split(all,'\\\\|~\\\\|')[5],'/')[1]

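Which is better: running the subquery first and then group by, or applying group by directly?

From my own tests, the first form seems somewhat better on small data, while the second may be better on large data. As for what counts as large: anything below the TB scale is small data.

The two execution plans are compared below. Note that code 1 also computes count(1) while code 2 only deduplicates the keys; apart from that, the steps and the estimated data volumes are essentially the same.

group by should definitely be used, but whether it sits inside or outside the subquery, earlier or later, makes little difference.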
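
The equivalence of the two forms can be checked on a toy table. The sketch below uses Python's built-in sqlite3 module rather than Hive, so it only shows that both forms return the same groups, not how Hive plans them; the `raw_log` table, its contents, and the `substr` expression (standing in for the split calls) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_log (line TEXT)")
conn.executemany(
    "INSERT INTO raw_log VALUES (?)",
    [("a|1",), ("a|2",), ("b|3",), ("a|4",), ("b|5",)],
)

# Form 1: derive the column in a subquery, then group in the outer query.
nested = conn.execute(
    """
    SELECT k, COUNT(1) AS cnt
    FROM (SELECT substr(line, 1, 1) AS k FROM raw_log) a
    GROUP BY k
    ORDER BY k
    """
).fetchall()

# Form 2: group directly by the expression.
direct = conn.execute(
    """
    SELECT substr(line, 1, 1) AS k, COUNT(1) AS cnt
    FROM raw_log
    GROUP BY substr(line, 1, 1)
    ORDER BY k
    """
).fetchall()

print(nested, direct)
```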

Plan for code 1

STAGE DEPENDENCIES:

  Stage-1 is a root stage

  Stage-0 is a root stage

STAGE PLANS:

  Stage: Stage-1

    Map Reduce

      Map Operator Tree:

          TableScan

            alias: tb

            Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE

            Select Operator

              expressions: split(all, '\\|~\\|')[41] (type: string), split(all, '\\|~\\|')[11] (type: string), split(all, '\\|~\\|')[8] (type: string), split(all, '\\|~\\|')[34] (type: string), split(all, '\\|~\\|')[42] (type: string), split(all, '\\|~\\|')[43] (type: string), split(split(all, '\\|~\\|')[5], '/')[1] (type: string)

              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6

              Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE

              Group By Operator

                aggregations: count(1)

                keys: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string)

                mode: hash

                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7

                Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE

                Reduce Output Operator

                  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string)

                  sort order: +++++++

                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string)

                  Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE

                  value expressions: _col7 (type: bigint)

      Reduce Operator Tree:

        Group By Operator

          aggregations: count(VALUE._col0)

          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: string), KEY._col5 (type: string), KEY._col6 (type: string)

          mode: mergepartial

          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7

          Statistics: Num rows: 563288391 Data size: 56328839118 Basic stats: COMPLETE Column stats: NONE

          Select Operator

            expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: bigint)

            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7

            Statistics: Num rows: 563288391 Data size: 56328839118 Basic stats: COMPLETE Column stats: NONE

            File Output Operator

              compressed: false

              Statistics: Num rows: 563288391 Data size: 56328839118 Basic stats: COMPLETE Column stats: NONE

              table:

                  input format: org.apache.hadoop.mapred.TextInputFormat

                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0

    Fetch Operator

      limit: -1

Plan for code 2

STAGE DEPENDENCIES:

  Stage-1 is a root stage

  Stage-0 is a root stage

STAGE PLANS:

  Stage: Stage-1

    Map Reduce

      Map Operator Tree:

          TableScan

            alias: tb

            Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE

            Select Operator

              expressions: all (type: string)

              outputColumnNames: all

              Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE

              Group By Operator

                keys: split(all, '\\|~\\|')[41] (type: string), split(all, '\\|~\\|')[11] (type: string), split(all, '\\|~\\|')[8] (type: string), split(all, '\\|~\\|')[34] (type: string), split(all, '\\|~\\|')[42] (type: string), split(all, '\\|~\\|')[43] (type: string), split(split(all, '\\|~\\|')[5], '/')[1] (type: string)

                mode: hash

                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6

                Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE

                Reduce Output Operator

                  key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string)

                  sort order: +++++++

                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string)

                  Statistics: Num rows: 1126576783 Data size: 112657678336 Basic stats: COMPLETE Column stats: NONE

      Reduce Operator Tree:

        Group By Operator

          keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string), KEY._col3 (type: string), KEY._col4 (type: string), KEY._col5 (type: string), KEY._col6 (type: string)

          mode: mergepartial

          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6

          Statistics: Num rows: 563288391 Data size: 56328839118 Basic stats: COMPLETE Column stats: NONE

          Select Operator

            expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string)

            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6

            Statistics: Num rows: 563288391 Data size: 56328839118 Basic stats: COMPLETE Column stats: NONE

            File Output Operator

              compressed: false

              Statistics: Num rows: 563288391 Data size: 56328839118 Basic stats: COMPLETE Column stats: NONE

              table:

                  input format: org.apache.hadoop.mapred.TextInputFormat

                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0

    Fetch Operator

      limit: -1
