Hand-Written Hive SQL Examples

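1-Describe in detail the steps, and the keywords involved, for importing a structured text file student.txt into a Hive table

Assume student.txt has three columns: id, name, gender
1-Create the database: create database student_info;
2-Create the Hive table student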

create external table student_info.student(
    id string comment 'student id',
    name string comment 'student name',
    gender string comment 'student gender'
) comment 'student info table'
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/root/student';

3-Load the data

load data local inpath '/root/student.txt' into table student_info.student;

4-Enter the Hive CLI and check the loaded rows
select * from student_info.student limit 10;
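Step 4 only looks at rows. To inspect the table definition itself, a minimal sketch (assuming the student_info.student table from step 2 exists) could be:

```sql
-- Show columns, comments, location and storage format of the table.
describe formatted student_info.student;

-- Sanity-check that the load produced rows.
select count(*) as row_cnt from student_info.student;
```

Note that load data local inpath copies the file from the local filesystem, while load data inpath (without local) takes an HDFS path and moves the file into the table directory.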

Key point: you must be able to write this code by hand

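2-Implement the following with HQL

2-1-Create the tables

Create the employee basic information table (EmployeeInfo) with the fields (employee ID, employee name, ID card number, gender, age, department, position, hire time, departure time). The partition column is the hire time, the line delimiter is "\n" and the field delimiter is "\t". The departments are Administration, Finance, R&D and Teaching; the corresponding positions include administration manager, administration specialist, finance manager, finance specialist, R&D engineer, test engineer, implementation engineer, lecturer, teaching assistant, class advisor, and so on. Time values look like: 2018-05-10 11:00:00
Create the employee income table (IncomeInfo) with the fields (employee ID, employee name, income amount, month the income belongs to, income type, time the salary was paid). The partition column is the time the salary was paid. There are four income types: salary (薪资), bonus (奖金), company benefits (公司福利) and fine (罚款). Time values look like: 2018-05-10 11:00:00.

Note: time values have the form 2018-05-10 11:00:00, so the column needs processing before it is used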

Create the employee basic information table

create external table test.employee_info(
    id string comment 'employee id',
    name string comment 'employee name',
    identity_card string comment 'ID card number',
    gender string comment 'gender',
    age int comment 'age',
    department string comment 'department',
    post string comment 'position',
    hire_date string comment 'hire time',
    departure_date string comment 'departure time'
) comment 'employee basic information table'
partitioned by (day string comment 'employee hire date')
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/root/employee';

Create the employee income table

create external table test.income_info(
    id string comment 'employee id',
    name string comment 'employee name',
    income_data string comment 'income amount',
    income_month string comment 'month the income belongs to',
    income_type string comment 'income type',
    income_datetime string comment 'time the salary was paid'
) comment 'employee income table'
partitioned by (day string comment 'date the salary was paid')
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile
location '/user/root/income';
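Since both tables are partitioned by a day string while the raw timestamps look like 2018-05-10 11:00:00, one way to handle the time note above is to use only the date part as the partition value. A sketch, where the local file path is hypothetical and the file is assumed to hold only rows paid on that day:

```sql
-- to_date keeps only the date part of a timestamp string.
select to_date('2018-05-10 11:00:00');

-- Load one day's file into its own static partition.
-- '/root/income_20180510.txt' is a hypothetical path used for illustration.
load data local inpath '/root/income_20180510.txt'
into table test.income_info
partition (day = '2018-05-10');

-- List the partitions to confirm the new one is registered.
show partitions test.income_info;
```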

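2-2 Using HQL, compute the company's total employee cost for each year, sorted by year in descending order.

The key is to process time values such as 2018-05-10 11:00:00 with built-in functions.
The full income_info table has to be read and aggregated by year. Because the income types include fines, the fine amounts must be subtracted from the money paid out to employees.
No join is used. With large data volumes, aim to produce the result in a single pass over the data.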

select
    income_year,
    (income_data - nvl(penalty_data, 0)) as company_cost
from
(
    -- total paid-out amount and total fines per year, e.g. one row: 2019 500 10
    select
        income_year,
        sum(case when income_type != '罚款' then data_total else 0 end) as income_data,
        sum(case when income_type = '罚款' then data_total else 0 end) as penalty_data
    from
    (
        -- amount per year and income type ('罚款' is the fine type)
        select
            year(to_date(income_datetime)) as income_year,
            income_type,
            sum(income_data) as data_total
        from test.income_info
        group by year(to_date(income_datetime)), income_type
    ) tmp_a
    group by tmp_a.income_year
) as temp
order by income_year desc;
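The same result can probably be reached with one level of aggregation by treating fines as negative amounts. This is a sketch rather than the author's version; the explicit cast is an assumption added because income_data is declared as string:

```sql
-- Fines ('罚款') count as negative amounts; every other type counts as positive.
select
    year(to_date(income_datetime)) as income_year,
    sum(case
            when income_type = '罚款' then -cast(income_data as double)
            else cast(income_data as double)
        end) as company_cost
from test.income_info
group by year(to_date(income_datetime))
order by income_year desc;
```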

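2-3 Using HQL, compute each department's total employee cost for each year, sorted by year descending and then by department cost ascending.

Keep it to a single pass over the income data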

-- join on id to get the department for each income row
select
    income_year, department,
    (sum(case when income_type != '罚款' then income_data else 0 end) - sum(case when income_type = '罚款' then income_data else 0 end)) as department_cost
from
(
    -- first aggregate each employee's income by year and income type
    select
        id, year(to_date(income_datetime)) as income_year, income_type, sum(income_data) as income_data
    from test.income_info
    group by year(to_date(income_datetime)), id, income_type
) temp_a
inner join test.employee_info b
on temp_a.id = b.id
group by department, income_year
order by income_year desc, department_cost asc;

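2-4 Using HQL, compute each department's total employee cost over all history, ranked by total cost in descending order; when values tie, leave no gaps in the ranking.

Adapt the intermediate result from 2-3
Note that all historical data is included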

select department, department_cost, dense_rank() over (order by department_cost desc) as cost_rank
from
(
    -- join on id to get the department for each income row
    select
        department,
        (sum(case when income_type != '罚款' then income_data else 0 end) - sum(case when income_type = '罚款' then income_data else 0 end)) as department_cost
    from
    (
        -- first aggregate each employee's income by income type
        select
            id, income_type, sum(income_data) as income_data
        from test.income_info
        group by id, income_type
    ) temp_a
    inner join test.employee_info b
    on temp_a.id = b.id
    group by department
) tmp_c;

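2-5 Using HQL, create and populate an employee salary change table with: employee ID, employee name, this month's salary, this month's pay time, last month's salary, last month's pay time. The partition column is this month's pay time.

This looks like a case for dynamic partition insert, though how to write it is not obvious at first
Create the table first, then use insert into table **** select ***
Employees who left or joined must be taken into account, hence the full join
Full join the two data sets and filter out rows where day is null
The concat, year, month and to_date built-in functions are needed
This question has quite a few things to consider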

create external table test.income_dynamic(
    id string comment 'employee id',
    name string comment 'employee name',
    income_data_current string comment 'salary this month',
    income_datetime_current string comment 'salary pay time this month',
    income_data_last string comment 'salary last month',
    income_datetime_last string comment 'salary pay time last month'
) comment 'employee income dynamic table'
partitioned by (day string comment 'pay date of this month salary')
row format delimited fields terminated by '\t'
lines terminated by '\n'
stored as textfile
-- use a separate directory: /user/root/income already belongs to test.income_info
location '/user/root/income_dynamic';

-- ------------------------------------------------------------------------------
-- Dynamic partition insert
-- Dynamic partitioning must be enabled, and nonstrict mode is needed because
-- no static partition value is given
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Insert statement, using a full join
insert into table test.income_dynamic partition(day)
select
    (case when id_a is not null then id_a else id_b end) as id,
    (case when name_a is not null then name_a else name_b end) as name,
    income_data, income_datetime, income_data_b, income_datetime_b, day
from
    (
    -- all salary rows, tagged with their own year and month
    select
        id as id_a, name as name_a, income_data, income_datetime, day,
        concat(year(to_date(day)), month(to_date(day))) as day_flag
    from test.income_info
    where income_type = '薪资'
    ) tmp_a
full outer join
    (
    -- the same rows with the pay date shifted forward by one month
    select
        id as id_b, name as name_b, income_data as income_data_b, income_datetime as income_datetime_b,
        concat(year(add_months(to_date(day), 1)), month(add_months(to_date(day), 1))) as month_flag
    from test.income_info
    where income_type = '薪资'
    ) tmp_b
on tmp_a.day_flag = tmp_b.month_flag
and tmp_a.id_a = tmp_b.id_b
-- keep only rows that have a current-month salary; the partition value cannot be null
where day is not null
;

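2-6 Using HQL: in terms of raises, who got the largest salary increase in May 2018, and who got the largest percentage increase?

This is fairly simple when built on 2-5: either reuse just the select part of 2-5, or query the table produced in 2-5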
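The source gives no query for this question. A possible sketch on top of the test.income_dynamic table from 2-5 follows; it assumes the day partition value starts with yyyy-MM (for example 2018-05-10), casts the string amounts to double, and returns only one row even when several employees tie:

```sql
-- Largest absolute raise in May 2018.
select id, name,
       cast(income_data_current as double) - cast(income_data_last as double) as raise_amount
from test.income_dynamic
where substr(day, 1, 7) = '2018-05'
  and income_data_last is not null
order by raise_amount desc
limit 1;

-- Largest relative raise in May 2018 (skip rows with no usable previous salary).
select id, name,
       (cast(income_data_current as double) - cast(income_data_last as double))
           / cast(income_data_last as double) as raise_ratio
from test.income_dynamic
where substr(day, 1, 7) = '2018-05'
  and cast(income_data_last as double) > 0
order by raise_ratio desc
limit 1;
```

If ties matter, replacing limit 1 with a dense_rank() filter would keep every employee who shares the top value.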

Hive row/column conversion

I. Rows to a column

1. Problem

How can Hive turn

a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6

into:

a       b       1,2,3
c       d       4,5,6

-------------------------------------------------------------------------------------------

2. Data

test.txt

a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6

-------------------------------------------------------------------------------------------

3. Answer

1. Create the table

drop table if exists tmp_jiangzl_test;
create table tmp_jiangzl_test
(
    col1 string,
    col2 string,
    col3 string
)
row format delimited fields terminated by '\t'
stored as textfile;

-- load the data
load data local inpath '/home/jiangzl/shell/test.txt' into table tmp_jiangzl_test;

2. Query

-- collect_set gathers the col3 values of each group into an array (duplicates removed)
select col1, col2, concat_ws(',', collect_set(col3))
from tmp_jiangzl_test
group by col1, col2;
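collect_set drops duplicate values and does not promise any element order. If duplicates must be kept and the output should be sorted, a variant using collect_list and sort_array (both Hive built-ins) might look like this; it is a sketch against the same table, and the sort is a string sort because col3 is a string:

```sql
-- Keep duplicates with collect_list, then sort the array before joining it.
select col1, col2, concat_ws(',', sort_array(collect_list(col3))) as col3_list
from tmp_jiangzl_test
group by col1, col2;
```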

---------------------------------------------------------------------------------------

collect_set/concat_ws syntax reference: https://blog.csdn.net/waiwai3/article/details/79071544

https://blog.csdn.net/yeweiouyang/article/details/41286469   [Hive] Using concat_ws to merge multiple rows into one

---------------------------------------------------------------------------------------

II. Column to rows

1. Problem

How can Hive turn

a       b       1,2,3
c       d       4,5,6

into:

a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6

---------------------------------------------------------------------------------------------

2. Answer

1. Create the table

drop table if exists tmp_jiangzl_test;
create table tmp_jiangzl_test
(
    col1 string,
    col2 string,
    col3 string
)
row format delimited fields terminated by '\t'
stored as textfile;

2. Query

-- split turns col3 into an array; explode emits one row per array element
select col1, col2, col5
from tmp_jiangzl_test a
lateral view explode(split(col3, ',')) b as col5;

---------------------------------------------------------------------------------------

lateral view syntax reference:

https://blog.csdn.net/clerk0324/article/details/58600284
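When the position of each element inside the original list also matters, Hive's posexplode can take the place of explode in the same lateral view. A small sketch against the table above (pos is zero-based):

```sql
-- posexplode emits the element index (pos) together with the element itself.
select col1, col2, pos, col5
from tmp_jiangzl_test a
lateral view posexplode(split(col3, ',')) b as pos, col5;
```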

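Hive wordcount

1. Create the database

create database wordcount;
use wordcount;

2. Create the external table

create external table word_data(line string) row format delimited fields terminated by ',' location '/home/hadoop/worddata';

3. Map the data to the table (the table's location is already this directory, so the files are visible to the table even without this load)

load data inpath '/home/hadoop/worddata' into table word_data;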

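4. Assume the data is stored in Hadoop under /home/hadoop/worddata, which holds some word files whose content looks roughly like:

hello man
what are you doing now
my running
hello
kevin
hi man

Running the HQL above creates the table word_data, whose rows are the lines of these files, each stored in the column line. select * from word_data; shows the data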

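5. Following the MapReduce approach, each line has to be split into words. This uses a Hive built-in table-generating function (UDTF): explode(array). It takes an array and turns one row into multiple rows:

create table words(word string);
insert into table words select explode(split(line, " ")) as word from word_data;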

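6. Check the contents of the words table

OK
hello
man
what
are
you
doing
now
my
running
hello
kevin
hi
man

split is the splitting function and works like Java's split. Here it splits on spaces, so after the HQL statement runs the words table holds individual words only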

7. Count the words with group by

    select word, count(*) from wordcount.words group by word;

wordcount.words is database name.table name; the word in group by word is the word string column created by create table words(word string)

Result:

are     1
doing   1
hello   2
hi      1
kevin   1
man     2
my      1
now     1
running 1
what    1
you     1
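The intermediate words table is not strictly required. As a sketch, the split and the count can be combined in one query with lateral view, run in the database that holds word_data; the ordering clause is an addition for readability:

```sql
-- Explode each line into words and count them in a single statement.
select w.word, count(*) as cnt
from word_data d
lateral view explode(split(d.line, ' ')) w as word
group by w.word
order by cnt desc, word;
```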

Hive Top N

rank() over()
dense_rank() over()
row_number() over()
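The three functions differ only in how they number ties: row_number() gives every row a distinct number (1, 2, 3), rank() repeats the number for ties and then skips (1, 1, 3), and dense_rank() repeats it without skipping (1, 1, 2). A typical top-N-per-group sketch, here the three largest salary payments per month from the test.income_info table defined earlier (the cast is an assumption because income_data is a string column):

```sql
-- Number the salary rows inside each month from highest to lowest, keep the first 3.
select income_month, id, name, income_data, rn
from (
    select
        income_month, id, name, income_data,
        row_number() over (
            partition by income_month
            order by cast(income_data as double) desc
        ) as rn
    from test.income_info
    where income_type = '薪资'
) t
where rn <= 3;
```

Swapping row_number() for dense_rank() would keep every row that ties for one of the top three amounts.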

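Finding the order ids in a given state

Given an order table, find the users who have only ever bought flour (the key point is customers who bought flour and nothing else)
eg: order: order_id, buyer_id, order_time.....

The goal is a single pass over the data, that is, one O(n) scan of the table with no self-join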

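select buyer_id
from
(
    -- product_name is an assumed column name; the schema above does not list the product column
    -- '面粉' is flour; flag counts the buyer's orders for anything else
    select buyer_id, sum(case when product_name = '面粉' then 0 else 1 end) as flag
    from `order`   -- order is a reserved word, so it is quoted with backticks
    group by buyer_id
) as tmp
where flag = 0;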
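The subquery can likely be folded into a HAVING clause. In this sketch product_name is a hypothetical column holding the purchased item, since the schema above elides the remaining columns, and the table name is quoted because order is a reserved word:

```sql
-- A buyer qualifies when none of their orders is for something other than flour ('面粉').
select buyer_id
from `order`
group by buyer_id
having sum(case when product_name = '面粉' then 0 else 1 end) = 0;
```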

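How many mutual-follow pairs are there on Weibo

In the Weibo follower table, how many pairs of users follow each other? For example: A-->B; B-->A; A and B follow each other and count as one pair.
Table structure: id, keep_id, time.... (id, keep_id can serve as a composite primary key)
Implement it with Hive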

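-- each mutual pair yields two flags with count 2 ("A,B" and "B,A"), hence the division by 2
select count(*) / 2 as weibo_relation_number
from
(
    select flag
    from
    (
        -- the ',' separator keeps ids such as 1 + 23 and 12 + 3 from colliding
        select concat(id, ',', keep_id) as flag from weibo_relation where id <> keep_id
        union all  -- merge everything together; do not deduplicate early
        select concat(keep_id, ',', id) as flag from weibo_relation where id <> keep_id
    ) tmp
    group by flag
    having count(flag) = 2
) pairs;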
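An alternative that avoids building string flags is a self-join, at the price of a join on a possibly large table. This sketch assumes (id, keep_id) is unique, as the table description states:

```sql
-- Match each A-->B row with its reverse B-->A row.
-- The id < keep_id filter counts every pair once and drops self-follows.
select count(*) as weibo_relation_number
from weibo_relation a
join weibo_relation b
  on a.id = b.keep_id
 and a.keep_id = b.id
where a.id < a.keep_id;
```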

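How many things did the people who bought bananas buy

This is a classic question: how many things did the people who bought bananas buy
It reuses the data and table structure of the previous question, so read it as: how many people in total do the followers of C follow
Read carefully, the followed users must be counted with deduplication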

select count(distinct keep_id) as total_keep_id
from weibo_relation
where id in
    (select id from weibo_relation where keep_id = 'C');
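On Hive versions where IN subqueries are unavailable or slow, the same filter can be written as a left semi join. A sketch under the same table assumptions, using 'C' as the followed user:

```sql
-- Keep only rows whose follower (id) also follows C, then count distinct followed users.
select count(distinct a.keep_id) as total_keep_id
from weibo_relation a
left semi join (
    select id from weibo_relation where keep_id = 'C'
) c
on a.id = c.id;
```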

