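Preface: my own takeaways on designing Hive QL
1. When a query gets complex, handle it in steps: split one complex piece of logic into several simple sub-steps.
2. But whatever can be merged should be merged. For example, several CONCAT calls at the same level belong in a single SELECT.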

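In other words: when fields are processed in parallel at the same level, put them in one Hive QL statement. When fields depend on each other's earlier processing (checks, filling in values, calculations), run the work in steps: prepare each field separately first, then apply the simple logic step by step.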

I. Scenario: handling region data in the logs (country, province, city)
select city_id,province_id,country_id
from wizad_mdm_cleaned_hdfs
where city_id = '' or country_id = '' or province_id = ''
group by city_id,province_id,country_id
II. The logs turn out to contain empty values (columns in query order: city_id, province_id, country_id):
38              1
73 1
75 1
64 81
76 1
(all empty)
77
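
To judge how much each pattern matters, the query above could also report a row count per combination. A minimal sketch, assuming the same table and columns; the alias `row_cnt` is made up for illustration:

```sql
-- Count log rows for every (city, province, country) combination
-- that has at least one empty field.
SELECT city_id, province_id, country_id, COUNT(*) AS row_cnt
FROM wizad_mdm_cleaned_hdfs
WHERE city_id = '' OR country_id = '' OR province_id = ''
GROUP BY city_id, province_id, country_id;
```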
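III. Define the filtering logic
if country_id = ''
    if province_id != ''
        if city_id = '' then CONCAT('region_','1','_',province_id)
        else CONCAT('region_','1','_',province_id,'_',city_id)
    else
        if city_id != '' then CONCAT('region_','1','_',parent_region_id,'_',city_id)
        else ''
else
    if province_id = ''
        if city_id != '' then CONCAT('region_',country_id,'_',parent_region_id,'_',city_id)
        else CONCAT('region_',country_id)
    else
        if city_id = '' then CONCAT('region_',country_id,'_',province_id)
        else CONCAT('region_',country_id,'_',province_id,'_',city_id)
IV. Hive QL implementation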
SET mapred.queue.names=queue3;
SET mapred.reduce.tasks=14;
DROP TABLE IF EXISTS test_lmj_mdm_tmp1;
CREATE TABLE test_lmj_mdm_tmp1 AS
SELECT
  guid,
  -- region key: a missing country defaults to '1', a missing province to the city's parent region
  (CASE country_id
     WHEN '' THEN (CASE WHEN province_id = ''
                        THEN IF(city_id = '', '', CONCAT('region_','1','_',parent_region_id,'_',city_id))
                        ELSE IF(city_id = '', CONCAT('region_','1','_',province_id), CONCAT('region_','1','_',province_id,'_',city_id))
                   END)
     ELSE (CASE WHEN province_id = ''
                THEN IF(city_id = '', CONCAT('region_',country_id), CONCAT('region_',country_id,'_',parent_region_id,'_',city_id))
                ELSE IF(city_id = '', CONCAT('region_',country_id,'_',province_id), CONCAT('region_',country_id,'_',province_id,'_',city_id))
           END)
   END) AS region,
  (CASE connection_type WHEN '2' THEN CONCAT('carrier_','wifi') ELSE CONCAT('carrier_',c.element_id) END) AS carrier,
  SUM(CASE WHEN logtype = '1' THEN 1 ELSE 0 END) AS imp_pv,
  SUM(CASE WHEN logtype = '2' THEN 1 ELSE 0 END) AS clk_pv
FROM wizad_mdm_cleaned_hdfs a
LEFT OUTER JOIN wizad_mdm_dev_lmj_ad_campaign_industry_brand b
  ON (a.wizad_ad_id = b.ad_id)
LEFT OUTER JOIN (SELECT * FROM wizad_mdm_dev_lmj_mapping_table_analytics WHERE TYPE = '7') c
  ON (a.adn_id = c.ad_network_id AND a.carrier_id = c.mapping_id)
LEFT OUTER JOIN wizad_mdm_dev_lmj_app_category_analytics d
  ON (a.app_category_id = d.adn_category)
LEFT OUTER JOIN (SELECT region_template_id, parent_region_id FROM wizad_mdm_dev_lmj_region_template) e
  ON (a.city_id = e.region_template_id)
WHERE a.day = '2015-01-01'
-- GROUP BY repeats the SELECT expressions verbatim, without the aliases
GROUP BY guid,
  (CASE country_id
     WHEN '' THEN (CASE WHEN province_id = ''
                        THEN IF(city_id = '', '', CONCAT('region_','1','_',parent_region_id,'_',city_id))
                        ELSE IF(city_id = '', CONCAT('region_','1','_',province_id), CONCAT('region_','1','_',province_id,'_',city_id))
                   END)
     ELSE (CASE WHEN province_id = ''
                THEN IF(city_id = '', CONCAT('region_',country_id), CONCAT('region_',country_id,'_',parent_region_id,'_',city_id))
                ELSE IF(city_id = '', CONCAT('region_',country_id,'_',province_id), CONCAT('region_',country_id,'_',province_id,'_',city_id))
           END)
   END),
  (CASE connection_type WHEN '2' THEN CONCAT('carrier_','wifi') ELSE CONCAT('carrier_',c.element_id) END);
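V. Analysis of the Hive QL statement

The example above uses CASE and IF; for the syntax, see the final section {VII. CONDITIONAL FUNCTIONS IN HIVE}.

Notes:

1. A special form of CASE: nothing follows CASE itself, and each WHEN carries a full condition instead, e.g. case when a=1 then true else false end;

2. Any transformed field extracted in SELECT must also appear in GROUP BY, such as the CASE expression for carrier; otherwise Hive rejects the query because the expression is neither grouped nor aggregated. A GROUP BY expression cannot be given an alias (AS), while a SELECT expression can. The same holds when time is processed with substring: the expression appears in both SELECT and GROUP BY.

3. In a LEFT OUTER JOIN, join against a subquery that filters the table first; this reduces the memory the join needs.

4. Whether IF is available depends on the Hive version.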
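
Notes 2 to 4 can be combined into one pattern. The sketch below is a hedged illustration, not the original job: it computes the carrier expression once in an inner query so that the outer GROUP BY can use the alias instead of repeating the CASE, it uses CASE WHEN where IF might be unavailable, and the joined subquery keeps only the rows and columns the join needs. Table and column names come from the example above; the aliases `t` and `is_imp`, the trimmed column list, and the assumption that `guid`, `connection_type` and `logtype` belong to table `a` are mine.

```sql
SELECT t.guid,
       t.carrier,
       SUM(t.is_imp) AS imp_pv
FROM (
  SELECT a.guid,
         -- transformed field, computed once and given an alias
         CASE WHEN a.connection_type = '2' THEN 'carrier_wifi'
              ELSE CONCAT('carrier_', c.element_id)
         END AS carrier,
         -- CASE WHEN equivalent of IF(logtype = '1', 1, 0)
         CASE WHEN a.logtype = '1' THEN 1 ELSE 0 END AS is_imp
  FROM wizad_mdm_cleaned_hdfs a
  LEFT OUTER JOIN (
    -- filter and project before the join
    SELECT ad_network_id, mapping_id, element_id
    FROM wizad_mdm_dev_lmj_mapping_table_analytics
    WHERE type = '7'
  ) c
  ON (a.adn_id = c.ad_network_id AND a.carrier_id = c.mapping_id)
  WHERE a.day = '2015-01-01'
) t
GROUP BY t.guid, t.carrier;
```

The trade-off is one extra level of nesting in exchange for writing each expression only once.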

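VI. Refactoring the Hive QL design
Beginners like me tend to design complicated logic and end up with monstrous statements.
In practice, an experienced person who faces overly complex logic splits it into steps. A senior colleague who is strong in SQL refactored the example above into two steps:
- 1) First fill in a reasonable value for each field separately (fill in what can be filled, leave the rest empty)
- 2) Then, when building region, simply filter out the records with invalid values
6.1 Step 1 statement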
DROP TABLE IF EXISTS test_lmj_mdm_tmp;
CREATE TABLE test_lmj_mdm_tmp AS
SELECT
  guid,
  CONCAT('adn_',adn_id) AS adn,
  CONCAT('time_',substr(createtime,12,2)) AS hour,
  CONCAT('os_',os_id) AS os,
  -- country: '' when all three levels are missing, '1' when only the country is missing
  case when (country_id = '' or country_id = 'NULL' or country_id is null)
        and (province_id = '' or province_id = 'NULL' or province_id is null)
        and (city_id = '' or city_id = 'NULL' or city_id is null)
       then ''
       when (country_id = '' or country_id = 'NULL' or country_id is null)
        and ((province_id <> '' and province_id <> 'NULL' and province_id is not null)
          or (city_id <> '' and city_id <> 'NULL' and city_id is not null))
       then '1'
       else country_id end as country_id,
  -- province: fall back to the parent region of the city when it is known
  case when (province_id = '' or province_id = 'NULL' or province_id is null)
        and e.parent_region_id <> '' and e.parent_region_id <> 'NULL' and e.parent_region_id is not null
       then e.parent_region_id
       else province_id end as province_id,
  city_id,
  CONCAT('campaign_',b.campaign_id) AS campaign,
  CONCAT('interest_',b.industry_id) AS interest,
  CONCAT('brand_',b.brand_id) AS brand,
  (CASE connection_type WHEN '2' THEN CONCAT('carrier_','wifi') ELSE CONCAT('carrier_',c.element_id) END) AS carrier,
  CONCAT('appcategory_',d.wizad_category) AS appcategory,
  uid,
  SUM(CASE WHEN logtype = '1' THEN 1 ELSE 0 END) AS imp_pv,
  SUM(CASE WHEN logtype = '2' THEN 1 ELSE 0 END) AS clk_pv
FROM ${clean_log_table} a
left outer join wizad_mdm_dev_lmj_ad_campaign_industry_brand b
  ON (a.wizad_ad_id = b.ad_id)
left outer join (SELECT * FROM wizad_mdm_dev_lmj_mapping_table_analytics WHERE TYPE = '7') c
  ON (a.adn_id = c.ad_network_id AND a.carrier_id = c.mapping_id)
left outer join wizad_mdm_dev_lmj_app_category_analytics d
  ON (a.app_category_id = d.adn_category)
left outer join (select region_template_id, parent_region_id from wizad_mdm_dev_lmj_region_template) e
  ON (a.city_id = e.region_template_id)
WHERE a.day < '${pt}' and a.day >= '${time_span}'
-- GROUP BY repeats every non-aggregated SELECT expression, without the aliases
GROUP BY guid,
  CONCAT('adn_',adn_id),
  CONCAT('time_',substr(createtime,12,2)),
  CONCAT('os_',os_id),
  case when (country_id = '' or country_id = 'NULL' or country_id is null)
        and (province_id = '' or province_id = 'NULL' or province_id is null)
        and (city_id = '' or city_id = 'NULL' or city_id is null)
       then ''
       when (country_id = '' or country_id = 'NULL' or country_id is null)
        and ((province_id <> '' and province_id <> 'NULL' and province_id is not null)
          or (city_id <> '' and city_id <> 'NULL' and city_id is not null))
       then '1'
       else country_id end,
  case when (province_id = '' or province_id = 'NULL' or province_id is null)
        and e.parent_region_id <> '' and e.parent_region_id <> 'NULL' and e.parent_region_id is not null
       then e.parent_region_id
       else province_id end,
  city_id,
  CONCAT('campaign_',b.campaign_id),
  CONCAT('interest_',b.industry_id),
  CONCAT('brand_',b.brand_id),
  (CASE connection_type WHEN '2' THEN CONCAT('carrier_','wifi') ELSE CONCAT('carrier_',c.element_id) END),
  CONCAT('appcategory_',d.wizad_category),
  uid;
6.2 Step 2 statement
SELECT guid,
  CONCAT('region_',country_id,'_',province_id,
    (case when city_id <> '' and city_id <> 'NULL' and city_id is not null then concat('_',city_id) else '' end)) AS fixeddim,
  uid,
  SUM(imp_pv) AS pv
FROM test_lmj_mdm_tmp
where imp_pv > 0
  and country_id <> ''
  and country_id <> 'NULL'
  and country_id is not null
  and province_id <> ''
  and province_id <> 'NULL'
  and province_id is not null
GROUP BY guid,
  CONCAT('region_',country_id,'_',province_id,
    (case when city_id <> '' and city_id <> 'NULL' and city_id is not null then concat('_',city_id) else '' end)),
  uid;
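
The two statements repeat the same three-way emptiness test ('', 'NULL', IS NULL) many times. One possible simplification, offered as a sketch and not as the colleague's actual code, is to turn all three spellings into a real NULL once in a subquery and then rely on IS NULL and COALESCE. It uses the fact that a CASE without ELSE returns NULL when no WHEN matches. The alias `n` is hypothetical, and only the region fields are shown. Two small differences from step 1 are assumed to be acceptable: a missing province with no parent region becomes '', and a missing city becomes NULL. Both are still treated as missing by the step 2 filters.

```sql
SELECT n.guid,
       -- default the country to '1' only when a province or city is known
       CASE WHEN n.country_id IS NOT NULL THEN n.country_id
            WHEN COALESCE(n.province_id, n.city_id) IS NOT NULL THEN '1'
            ELSE ''
       END AS country_id,
       -- fall back to the city's parent region, then to ''
       COALESCE(n.province_id, n.parent_region_id, '') AS province_id,
       n.city_id
FROM (
  -- normalize '', 'NULL' and NULL to NULL; a non-matching CASE yields NULL
  SELECT a.guid,
         CASE WHEN a.country_id <> '' AND a.country_id <> 'NULL' THEN a.country_id END AS country_id,
         CASE WHEN a.province_id <> '' AND a.province_id <> 'NULL' THEN a.province_id END AS province_id,
         CASE WHEN a.city_id <> '' AND a.city_id <> 'NULL' THEN a.city_id END AS city_id,
         CASE WHEN e.parent_region_id <> '' AND e.parent_region_id <> 'NULL' THEN e.parent_region_id END AS parent_region_id
  FROM ${clean_log_table} a
  LEFT OUTER JOIN (SELECT region_template_id, parent_region_id
                   FROM wizad_mdm_dev_lmj_region_template) e
  ON (a.city_id = e.region_template_id)
) n;
```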

The following is quoted from the web.

VII. CONDITIONAL FUNCTIONS IN HIVE

Hive supports three types of conditional functions. These functions are listed below:

IF( Test Condition, True Value, False Value )

The IF condition evaluates the "Test Condition" and if the "Test Condition" is true, then it returns the "True Value". Otherwise, it returns the False Value. Example: IF(1=1, 'working', 'not working') returns 'working'

COALESCE( value1, value2, … )

The COALESCE function returns the first non-NULL value from the list of values. If all the values in the list are NULL, then it returns NULL. Example: COALESCE(NULL,NULL,5,NULL,4) returns 5

CASE Statement

The syntax for the case statement is:

    CASE [ expression ]
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2
    ...
    WHEN conditionn THEN resultn
    ELSE result END

Here expression is optional. It is the value that you are comparing to the list of conditions. (ie: condition1, condition2, … conditionn).

All the conditions must be of same datatype. Conditions are evaluated in the order listed. Once a condition is found to be true, the case statement will return the result and not evaluate the conditions any further.

Source: http://www.folkstalk.com/2011/11/conditional-functions-in-hive.html

All the results must be of same datatype. This is the value returned once a condition is found to be true.

If no condition is found to be true, then the case statement will return the value in the ELSE clause. If the ELSE clause is omitted and no condition is found to be true, then the case statement will return NULL.

Example:

    CASE Fruit
    WHEN 'APPLE' THEN 'The owner is APPLE'
    WHEN 'ORANGE' THEN 'The owner is ORANGE'
    ELSE 'It is another Fruit'
    END

The other form of CASE is

    CASE
    WHEN Fruit = 'APPLE' THEN 'The owner is APPLE'
    WHEN Fruit = 'ORANGE' THEN 'The owner is ORANGE'
    ELSE 'It is another Fruit'
    END
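
The three functions can be tried on literals before they go into a real query. A small sketch, assuming a Hive release that accepts SELECT without a FROM clause (on older releases the same expressions would have to be selected from some existing table):

```sql
-- Expected per the descriptions above:
--   first column:  'working'  (the test condition is true)
--   second column: 5          (first non-NULL value in the list)
--   third column:  NULL       (no WHEN matches and there is no ELSE)
SELECT IF(1 = 1, 'working', 'not working'),
       COALESCE(NULL, NULL, 5, NULL, 4),
       CASE WHEN 1 = 2 THEN 'a' END;
```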
