Hive函数:LAG,LEAD,FIRST_VALUE,LAST_VALUE
参考自大数据田地:http://lxw1234.com/archives/2015/04/190.htm
测试数据准备:
create external table test_data (
cookieid string,
createtime string, --页面访问时间
url string --被访问页面
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile location '/user/jc_rc_ftp/test_data'; select * from test_data l;
+-------------+----------------------+---------+--+
| l.cookieid | l.createtime | l.url |
+-------------+----------------------+---------+--+
| cookie1 | 2015-04-10 10:00:02 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 |
| cookie1 | 2015-04-10 10:50:05 | url6 |
| cookie1 | 2015-04-10 11:00:00 | url7 |
| cookie1 | 2015-04-10 10:10:00 | url4 |
| cookie1 | 2015-04-10 10:50:01 | url5 |
| cookie2 | 2015-04-10 10:00:02 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 |
| cookie2 | 2015-04-10 10:50:05 | url66 |
| cookie2 | 2015-04-10 11:00:00 | url77 |
| cookie2 | 2015-04-10 10:10:00 | url44 |
| cookie2 | 2015-04-10 10:50:01 | url55 |
+-------------+----------------------+---------+--+
LAG
LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值
第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM test_data;
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookieid | createtime | url | rn | last_1_time | last_2_time |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | 1970-01-01 00:00:00 | NULL |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | 2015-04-10 10:00:00 | NULL |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 2015-04-10 10:00:02 | 2015-04-10 10:00:00 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | 2015-04-10 10:03:04 | 2015-04-10 10:00:02 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | 2015-04-10 10:10:00 | 2015-04-10 10:03:04 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | 2015-04-10 10:50:01 | 2015-04-10 10:10:00 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | 2015-04-10 10:50:05 | 2015-04-10 10:50:01 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | 1970-01-01 00:00:00 | NULL |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | 2015-04-10 10:00:00 | NULL |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 2015-04-10 10:00:02 | 2015-04-10 10:00:00 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | 2015-04-10 10:03:04 | 2015-04-10 10:00:02 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | 2015-04-10 10:10:00 | 2015-04-10 10:03:04 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | 2015-04-10 10:50:01 | 2015-04-10 10:10:00 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | 2015-04-10 10:50:05 | 2015-04-10 10:50:01 |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
LEAD
与LAG相反
LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值
第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM test_data;
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookieid | createtime | url | rn | next_1_time | next_2_time |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | 2015-04-10 10:00:02 | 2015-04-10 10:03:04 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | 2015-04-10 10:03:04 | 2015-04-10 10:10:00 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 2015-04-10 10:10:00 | 2015-04-10 10:50:01 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | 2015-04-10 10:50:01 | 2015-04-10 10:50:05 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | 2015-04-10 10:50:05 | 2015-04-10 11:00:00 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | 2015-04-10 11:00:00 | NULL |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | 1970-01-01 00:00:00 | NULL |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | 2015-04-10 10:00:02 | 2015-04-10 10:03:04 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | 2015-04-10 10:03:04 | 2015-04-10 10:10:00 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 2015-04-10 10:10:00 | 2015-04-10 10:50:01 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | 2015-04-10 10:50:01 | 2015-04-10 10:50:05 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | 2015-04-10 10:50:05 | 2015-04-10 11:00:00 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | 2015-04-10 11:00:00 | NULL |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | 1970-01-01 00:00:00 | NULL |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
FIRST_VALUE
取分组内排序后,截止到当前行,第一个值
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM test_data; +-----------+----------------------+---------+-----+---------+--+
| cookieid | createtime | url | rn | first1 |
+-----------+----------------------+---------+-----+---------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url1 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | url1 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url1 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url1 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url1 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url1 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url11 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | url11 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url11 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url11 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url11 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url11 |
+-----------+----------------------+---------+-----+---------+--+
LAST_VALUE
取分组内排序后,截止到当前行,最后一个值
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid | createtime | url | rn | last1 |
+-----------+----------------------+---------+-----+---------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url2 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 1url3 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url4 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url5 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url6 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url22 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 1url33 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url44 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url55 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url66 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 |
+-----------+----------------------+---------+-----+---------+--+ SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last1
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid | createtime | url | rn | last1 |
+-----------+----------------------+---------+-----+---------+--+
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url6 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url5 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url4 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 1url3 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url66 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url55 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url44 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 1url33 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 |
+-----------+----------------------+---------+-----+---------+--+
如果不指定ORDER BY,则默认按照记录在文件中的偏移量进行排序,会出现错误的结果
SELECT cookieid,
createtime,
url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM test_data;
+-----------+----------------------+---------+---------+--+
| cookieid | createtime | url | first2 |
+-----------+----------------------+---------+---------+--+
| cookie1 | 2015-04-10 10:00:02 | url2 | url2 |
| cookie1 | 2015-04-10 10:50:01 | url5 | url2 |
| cookie1 | 2015-04-10 10:10:00 | url4 | url2 |
| cookie1 | 2015-04-10 11:00:00 | url7 | url2 |
| cookie1 | 2015-04-10 10:50:05 | url6 | url2 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 | url2 |
| cookie2 | 2015-04-10 10:50:01 | url55 | url55 |
| cookie2 | 2015-04-10 10:10:00 | url44 | url55 |
| cookie2 | 2015-04-10 11:00:00 | url77 | url55 |
| cookie2 | 2015-04-10 10:50:05 | url66 | url55 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | url55 |
| cookie2 | 2015-04-10 10:00:00 | url11 | url55 |
| cookie2 | 2015-04-10 10:00:02 | url22 | url55 |
+-----------+----------------------+---------+---------+--+
SELECT cookieid,
createtime,
url,
LAST_VALUE(url) OVER(PARTITION BY cookieid) AS last2
FROM test_data;
+-----------+----------------------+---------+--------+--+
| cookieid | createtime | url | last2 |
+-----------+----------------------+---------+--------+--+
| cookie1 | 2015-04-10 10:00:02 | url2 | url1 |
| cookie1 | 2015-04-10 10:50:01 | url5 | url1 |
| cookie1 | 2015-04-10 10:10:00 | url4 | url1 |
| cookie1 | 2015-04-10 11:00:00 | url7 | url1 |
| cookie1 | 2015-04-10 10:50:05 | url6 | url1 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | url1 |
| cookie1 | 2015-04-10 10:00:00 | url1 | url1 |
| cookie2 | 2015-04-10 10:50:01 | url55 | url22 |
| cookie2 | 2015-04-10 10:10:00 | url44 | url22 |
| cookie2 | 2015-04-10 11:00:00 | url77 | url22 |
| cookie2 | 2015-04-10 10:50:05 | url66 | url22 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 | url22 |
| cookie2 | 2015-04-10 10:00:02 | url22 | url22 |
+-----------+----------------------+---------+--------+--+
14 rows selected (78.058 seconds)
如果想要取分组内排序后最后一个值,则需要变通一下:
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM test_data
ORDER BY cookieid,createtime;
+-----------+----------------------+---------+-----+---------+--------+--+
| cookieid | createtime | url | rn | last1 | last2 |
+-----------+----------------------+---------+-----+---------+--------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 | url7 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url2 | url7 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 1url3 | url7 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url4 | url7 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url5 | url7 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url6 | url7 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 | url7 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 | url77 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url22 | url77 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 1url33 | url77 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url44 | url77 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url55 | url77 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url66 | url77 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 | url77 |
+-----------+----------------------+---------+-----+---------+--------+--+
Hive函数:LAG,LEAD,FIRST_VALUE,LAST_VALUE的更多相关文章
- pandas实现hive的lag和lead函数 以及 first_value和last_value函数
lag和lead VS shift 该函数的格式如下: 第一个参数为列名, 第二个参数为往上第n行(可选,默认为1), 第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL ...
- Hive 窗口函数LEAD LAG FIRST_VALUE LAST_VALUE
窗口函数(window functions)对多行进行操作,并为查询中的每一行返回一个值. OVER()子句能将窗口函数与其他分析函数(analytical functions)和报告函数(repor ...
- oracle listagg函数、lag函数、lead函数 实例
Oracle大师Thomas Kyte在他的经典著作中,反复强调过一个实现需求方案选取顺序: “如果你可以使用一句SQL解决的需求,就使用一句SQL:如果不可以,就考虑PL/SQL是否可以:如果PL/ ...
- hive函数参考手册
hive函数参考手册 原文见:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF 1.内置运算符1.1关系运算符 运 ...
- Hive函数以及自定义函数讲解(UDF)
Hive函数介绍HQL内嵌函数只有195个函数(包括操作符,使用命令show functions查看),基本能够胜任基本的hive开发,但是当有较为复杂的需求的时候,可能需要进行定制的HQL函数开发. ...
- 大数据入门第十一天——hive详解(三)hive函数
一.hive函数 1.内置运算符与内置函数 函数分类: 查看函数信息: DESC FUNCTION concat; 常用的分析函数之rank() row_number(),参考:https://www ...
- Hadoop生态圈-Hive函数
Hadoop生态圈-Hive函数 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任.
- Hive(四)hive函数与hive shell
一.hive函数 1.hive内置函数 (1)内容较多,见< Hive 官方文档> https://cwiki.apache.org/confluence/displ ...
- Hive入门笔记---2.hive函数大全
Hive函数大全–完整版 现在虽然有很多SQL ON Hadoop的解决方案,像Spark SQL.Impala.Presto等等,但就目前来看,在基于Hadoop的大数据分析平台.数据仓库中,Hiv ...
随机推荐
- git 使用方式
一.常用操作命令 1.初始化操作 git config --global user.name '<name>' # 设置提交者名称 git config --global user.ema ...
- python编程中的if __name__ == 'main与windows中使用多进程
if __name__ == 'main 一个python的文件有两种使用的方法,第一是直接作为程序执行,第二是import到其他的python程序中被调用(模块重用)执行. 因此if __name_ ...
- 51ak带你看MYSQL5.7源码2:编译现有的代码
从事DBA工作多年 MYSQL源码也是头一次接触 尝试记录下自己看MYSQL5.7源码的历程 目录: 51ak带你看MYSQL5.7源码1:main入口函数 51ak带你看MYSQL5.7源码2:编译 ...
- JQ 判断 浏览器打开的设备类型
<script> $(document).ready(function(){ var ua = navigator.userAgent; var ipad = ua.match(/(iPa ...
- Vue解析四之注册变量
判断监听的变量,如果undefined可以用$set来注册一个变量. 另外click可以是表达式,不一定必须是一个方法.
- oracle10g 基于linux6安装问题收集
1.[oracle@rsyslogserver database]$ dbca -silent -responseFile /home/oracle/database/dbca.rsp No comm ...
- linux小白成长之路6————安装Java+Apache(httpd)+Tomcat
[内容指引] 安装Java环境: 查看JDK版本: 安装Apache(httpd); 安装Tomcat: 设置服务开机启动. 1.安装Java环境 指令: yum intall java-1.8.0* ...
- supervisor进程管理工具的使用
supervisor是一款进程管理工具,当想让应用随着开机启动,或者在应用崩溃之后自启动的时候,supervisor就派上了用场. 广泛应用于服务器中,用于引导控制程序的启动 安装好superviso ...
- 想不到的:js中加号操作符
研究js加号操作符的时候,无意中试验了一个 console.log({} + "str");//NaN 发现结果居然是NaN,这让我百思不得其解. 我查阅资料,js高级编程里是这样 ...
- 2018.3.28html学习笔记
<!DOCTYPE html><html> <head> <meta charset="UTF-8"> ...