Adapted from 大数据田地: http://lxw1234.com/archives/2015/04/190.htm

Preparing the test data:

create external table test_data (
cookieid string,
createtime string, -- page visit time
url string -- visited page
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile location '/user/jc_rc_ftp/test_data';

select * from test_data l;
+-------------+----------------------+---------+--+
| l.cookieid | l.createtime | l.url |
+-------------+----------------------+---------+--+
| cookie1 | 2015-04-10 10:00:02 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 |
| cookie1 | 2015-04-10 10:50:05 | url6 |
| cookie1 | 2015-04-10 11:00:00 | url7 |
| cookie1 | 2015-04-10 10:10:00 | url4 |
| cookie1 | 2015-04-10 10:50:01 | url5 |
| cookie2 | 2015-04-10 10:00:02 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 |
| cookie2 | 2015-04-10 10:50:05 | url66 |
| cookie2 | 2015-04-10 11:00:00 | url77 |
| cookie2 | 2015-04-10 10:10:00 | url44 |
| cookie2 | 2015-04-10 10:50:01 | url55 |
+-------------+----------------------+---------+--+

LAG

LAG(col, n, DEFAULT) returns the value of col from the row n positions above the current row within the window.

The first argument is the column name; the second is the offset n upward (optional, default 1); the third is the default value, returned when there is no row n positions above the current one (NULL if not specified).

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM test_data;
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookieid | createtime | url | rn | last_1_time | last_2_time |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | 1970-01-01 00:00:00 | NULL |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | 2015-04-10 10:00:00 | NULL |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 2015-04-10 10:00:02 | 2015-04-10 10:00:00 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | 2015-04-10 10:03:04 | 2015-04-10 10:00:02 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | 2015-04-10 10:10:00 | 2015-04-10 10:03:04 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | 2015-04-10 10:50:01 | 2015-04-10 10:10:00 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | 2015-04-10 10:50:05 | 2015-04-10 10:50:01 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | 1970-01-01 00:00:00 | NULL |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | 2015-04-10 10:00:00 | NULL |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 2015-04-10 10:00:02 | 2015-04-10 10:00:00 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | 2015-04-10 10:03:04 | 2015-04-10 10:00:02 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | 2015-04-10 10:10:00 | 2015-04-10 10:03:04 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | 2015-04-10 10:50:01 | 2015-04-10 10:10:00 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | 2015-04-10 10:50:05 | 2015-04-10 10:50:01 |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
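The LAG semantics above can be reproduced outside of Hive. The following is a minimal sketch using SQLite (3.25+), which supports the same window-function syntax; the data is a three-row subset of the test_data sample, so this only illustrates the semantics and is not a Hive session.

```python
import sqlite3

# In-memory table mirroring a subset of test_data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (cookieid TEXT, createtime TEXT, url TEXT)")
conn.executemany("INSERT INTO test_data VALUES (?, ?, ?)", [
    ("cookie1", "2015-04-10 10:00:02", "url2"),
    ("cookie1", "2015-04-10 10:00:00", "url1"),
    ("cookie1", "2015-04-10 10:03:04", "1url3"),
])

# Same LAG call as the Hive query: the first row of the partition has no
# predecessor, so it falls back to the supplied default.
result = conn.execute("""
    SELECT createtime, url,
           LAG(createtime, 1, '1970-01-01 00:00:00')
               OVER (PARTITION BY cookieid ORDER BY createtime) AS last_1_time
    FROM test_data
    ORDER BY cookieid, createtime
""").fetchall()
for row in result:
    print(row)
```

As in the Hive output, only the first row per cookieid receives the default value.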

LEAD

The opposite of LAG: LEAD(col, n, DEFAULT) returns the value of col from the row n positions below the current row within the window.
The first argument is the column name; the second is the offset n downward (optional, default 1); the third is the default value, returned when there is no row n positions below the current one (NULL if not specified).

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM test_data;
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookieid | createtime | url | rn | next_1_time | next_2_time |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | 2015-04-10 10:00:02 | 2015-04-10 10:03:04 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | 2015-04-10 10:03:04 | 2015-04-10 10:10:00 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 2015-04-10 10:10:00 | 2015-04-10 10:50:01 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | 2015-04-10 10:50:01 | 2015-04-10 10:50:05 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | 2015-04-10 10:50:05 | 2015-04-10 11:00:00 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | 2015-04-10 11:00:00 | NULL |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | 1970-01-01 00:00:00 | NULL |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | 2015-04-10 10:00:02 | 2015-04-10 10:03:04 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | 2015-04-10 10:03:04 | 2015-04-10 10:10:00 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 2015-04-10 10:10:00 | 2015-04-10 10:50:01 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | 2015-04-10 10:50:01 | 2015-04-10 10:50:05 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | 2015-04-10 10:50:05 | 2015-04-10 11:00:00 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | 2015-04-10 11:00:00 | NULL |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | 1970-01-01 00:00:00 | NULL |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
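The mirror-image behavior of LEAD can be sketched the same way, again with SQLite (3.25+) standing in for Hive on a three-row subset of the sample data; here it is the last row of the partition that falls back to the default.

```python
import sqlite3

# In-memory table mirroring a subset of test_data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (cookieid TEXT, createtime TEXT, url TEXT)")
conn.executemany("INSERT INTO test_data VALUES (?, ?, ?)", [
    ("cookie1", "2015-04-10 10:00:02", "url2"),
    ("cookie1", "2015-04-10 10:00:00", "url1"),
    ("cookie1", "2015-04-10 10:03:04", "1url3"),
])

# Same LEAD call as the Hive query: the last row of the partition has no
# successor, so it falls back to the supplied default.
result = conn.execute("""
    SELECT createtime, url,
           LEAD(createtime, 1, '1970-01-01 00:00:00')
               OVER (PARTITION BY cookieid ORDER BY createtime) AS next_1_time
    FROM test_data
    ORDER BY cookieid, createtime
""").fetchall()
for row in result:
    print(row)
```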

FIRST_VALUE

Returns the first value in the partition after ordering, up to and including the current row.

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid | createtime | url | rn | first1 |
+-----------+----------------------+---------+-----+---------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url1 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | url1 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url1 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url1 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url1 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url1 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url11 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | url11 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url11 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url11 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url11 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url11 |
+-----------+----------------------+---------+-----+---------+--+

LAST_VALUE

Returns the last value in the partition after ordering, up to and including the current row. With the default window frame (UNBOUNDED PRECEDING to CURRENT ROW), this is simply the current row's own value, as the output below shows.

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid | createtime | url | rn | last1 |
+-----------+----------------------+---------+-----+---------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url2 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 1url3 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url4 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url5 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url6 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url22 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 1url33 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url44 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url55 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url66 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 |
+-----------+----------------------+---------+-----+---------+--+

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last1
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid | createtime | url | rn | last1 |
+-----------+----------------------+---------+-----+---------+--+
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url6 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url5 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url4 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 1url3 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url66 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url55 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url44 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 1url33 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 |
+-----------+----------------------+---------+-----+---------+--+

If ORDER BY is not specified, rows are processed in the order of their offsets in the underlying file, which produces incorrect (effectively arbitrary) results:

SELECT cookieid,
createtime,
url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM test_data;
+-----------+----------------------+---------+---------+--+
| cookieid | createtime | url | first2 |
+-----------+----------------------+---------+---------+--+
| cookie1 | 2015-04-10 10:00:02 | url2 | url2 |
| cookie1 | 2015-04-10 10:50:01 | url5 | url2 |
| cookie1 | 2015-04-10 10:10:00 | url4 | url2 |
| cookie1 | 2015-04-10 11:00:00 | url7 | url2 |
| cookie1 | 2015-04-10 10:50:05 | url6 | url2 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 | url2 |
| cookie2 | 2015-04-10 10:50:01 | url55 | url55 |
| cookie2 | 2015-04-10 10:10:00 | url44 | url55 |
| cookie2 | 2015-04-10 11:00:00 | url77 | url55 |
| cookie2 | 2015-04-10 10:50:05 | url66 | url55 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | url55 |
| cookie2 | 2015-04-10 10:00:00 | url11 | url55 |
| cookie2 | 2015-04-10 10:00:02 | url22 | url55 |
+-----------+----------------------+---------+---------+--+
SELECT cookieid,
createtime,
url,
LAST_VALUE(url) OVER(PARTITION BY cookieid) AS last2
FROM test_data;
+-----------+----------------------+---------+--------+--+
| cookieid | createtime | url | last2 |
+-----------+----------------------+---------+--------+--+
| cookie1 | 2015-04-10 10:00:02 | url2 | url1 |
| cookie1 | 2015-04-10 10:50:01 | url5 | url1 |
| cookie1 | 2015-04-10 10:10:00 | url4 | url1 |
| cookie1 | 2015-04-10 11:00:00 | url7 | url1 |
| cookie1 | 2015-04-10 10:50:05 | url6 | url1 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | url1 |
| cookie1 | 2015-04-10 10:00:00 | url1 | url1 |
| cookie2 | 2015-04-10 10:50:01 | url55 | url22 |
| cookie2 | 2015-04-10 10:10:00 | url44 | url22 |
| cookie2 | 2015-04-10 11:00:00 | url77 | url22 |
| cookie2 | 2015-04-10 10:50:05 | url66 | url22 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 | url22 |
| cookie2 | 2015-04-10 10:00:02 | url22 | url22 |
+-----------+----------------------+---------+--------+--+
14 rows selected (78.058 seconds)

To get the last value of the whole partition after ordering, a workaround is needed:

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM test_data
ORDER BY cookieid,createtime;
+-----------+----------------------+---------+-----+---------+--------+--+
| cookieid | createtime | url | rn | last1 | last2 |
+-----------+----------------------+---------+-----+---------+--------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 | url7 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url2 | url7 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 1url3 | url7 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url4 | url7 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url5 | url7 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url6 | url7 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 | url7 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 | url77 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url22 | url77 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 1url33 | url77 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url44 | url77 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url55 | url77 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url66 | url77 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 | url77 |
+-----------+----------------------+---------+-----+---------+--------+--+
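An alternative to the FIRST_VALUE(... DESC) workaround above is an explicit window frame, which lets LAST_VALUE see the entire partition instead of stopping at the current row. Hive accepts the same ROWS BETWEEN clause; this sketch demonstrates it with SQLite (3.25+) on a three-row subset of the sample data.

```python
import sqlite3

# In-memory table mirroring a subset of test_data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (cookieid TEXT, createtime TEXT, url TEXT)")
conn.executemany("INSERT INTO test_data VALUES (?, ?, ?)", [
    ("cookie1", "2015-04-10 10:00:00", "url1"),
    ("cookie1", "2015-04-10 10:00:02", "url2"),
    ("cookie1", "2015-04-10 11:00:00", "url7"),
])

# Widening the frame to the full partition makes LAST_VALUE return the
# final value of the group on every row.
result = conn.execute("""
    SELECT url,
           LAST_VALUE(url) OVER (
               PARTITION BY cookieid ORDER BY createtime
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS last_in_group
    FROM test_data
    ORDER BY createtime
""").fetchall()
for row in result:
    print(row)
```

Every row now reports url7 as the group's last value, matching the last2 column produced by the FIRST_VALUE(... DESC) trick.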
