参考自大数据田地:http://lxw1234.com/archives/2015/04/190.htm

测试数据准备:

create external table test_data (
cookieid string,
createtime string, --页面访问时间
url string --被访问页面
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile location '/user/jc_rc_ftp/test_data'; select * from test_data l;
+-------------+----------------------+---------+--+
| l.cookieid | l.createtime | l.url |
+-------------+----------------------+---------+--+
| cookie1 | 2015-04-10 10:00:02 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 |
| cookie1 | 2015-04-10 10:50:05 | url6 |
| cookie1 | 2015-04-10 11:00:00 | url7 |
| cookie1 | 2015-04-10 10:10:00 | url4 |
| cookie1 | 2015-04-10 10:50:01 | url5 |
| cookie2 | 2015-04-10 10:00:02 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 |
| cookie2 | 2015-04-10 10:50:05 | url66 |
| cookie2 | 2015-04-10 11:00:00 | url77 |
| cookie2 | 2015-04-10 10:10:00 | url44 |
| cookie2 | 2015-04-10 10:50:01 | url55 |
+-------------+----------------------+---------+--+

LAG
LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值

第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM test_data;
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookieid | createtime | url | rn | last_1_time | last_2_time |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | 1970-01-01 00:00:00 | NULL |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | 2015-04-10 10:00:00 | NULL |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 2015-04-10 10:00:02 | 2015-04-10 10:00:00 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | 2015-04-10 10:03:04 | 2015-04-10 10:00:02 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | 2015-04-10 10:10:00 | 2015-04-10 10:03:04 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | 2015-04-10 10:50:01 | 2015-04-10 10:10:00 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | 2015-04-10 10:50:05 | 2015-04-10 10:50:01 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | 1970-01-01 00:00:00 | NULL |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | 2015-04-10 10:00:00 | NULL |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 2015-04-10 10:00:02 | 2015-04-10 10:00:00 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | 2015-04-10 10:03:04 | 2015-04-10 10:00:02 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | 2015-04-10 10:10:00 | 2015-04-10 10:03:04 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | 2015-04-10 10:50:01 | 2015-04-10 10:10:00 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | 2015-04-10 10:50:05 | 2015-04-10 10:50:01 |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+

LEAD

与LAG相反
LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值
第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM test_data;
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookieid | createtime | url | rn | next_1_time | next_2_time |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | 2015-04-10 10:00:02 | 2015-04-10 10:03:04 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | 2015-04-10 10:03:04 | 2015-04-10 10:10:00 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 2015-04-10 10:10:00 | 2015-04-10 10:50:01 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | 2015-04-10 10:50:01 | 2015-04-10 10:50:05 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | 2015-04-10 10:50:05 | 2015-04-10 11:00:00 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | 2015-04-10 11:00:00 | NULL |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | 1970-01-01 00:00:00 | NULL |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | 2015-04-10 10:00:02 | 2015-04-10 10:03:04 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | 2015-04-10 10:03:04 | 2015-04-10 10:10:00 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 2015-04-10 10:10:00 | 2015-04-10 10:50:01 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | 2015-04-10 10:50:01 | 2015-04-10 10:50:05 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | 2015-04-10 10:50:05 | 2015-04-10 11:00:00 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | 2015-04-10 11:00:00 | NULL |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | 1970-01-01 00:00:00 | NULL |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+

FIRST_VALUE

取分组内排序后,截止到当前行,第一个值

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM test_data; +-----------+----------------------+---------+-----+---------+--+
| cookieid | createtime | url | rn | first1 |
+-----------+----------------------+---------+-----+---------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url1 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | url1 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url1 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url1 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url1 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url1 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url11 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | url11 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url11 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url11 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url11 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url11 |
+-----------+----------------------+---------+-----+---------+--+

LAST_VALUE

取分组内排序后,截止到当前行,最后一个值

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid | createtime | url | rn | last1 |
+-----------+----------------------+---------+-----+---------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url2 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 1url3 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url4 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url5 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url6 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url22 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 1url33 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url44 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url55 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url66 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 |
+-----------+----------------------+---------+-----+---------+--+ SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last1
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid | createtime | url | rn | last1 |
+-----------+----------------------+---------+-----+---------+--+
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url6 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url5 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url4 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 1url3 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url66 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url55 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url44 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 1url33 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 |
+-----------+----------------------+---------+-----+---------+--+

如果不指定ORDER BY,则默认按照记录在文件中的偏移量进行排序,会出现错误的结果

SELECT cookieid,
createtime,
url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM test_data;
+-----------+----------------------+---------+---------+--+
| cookieid | createtime | url | first2 |
+-----------+----------------------+---------+---------+--+
| cookie1 | 2015-04-10 10:00:02 | url2 | url2 |
| cookie1 | 2015-04-10 10:50:01 | url5 | url2 |
| cookie1 | 2015-04-10 10:10:00 | url4 | url2 |
| cookie1 | 2015-04-10 11:00:00 | url7 | url2 |
| cookie1 | 2015-04-10 10:50:05 | url6 | url2 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 | url2 |
| cookie2 | 2015-04-10 10:50:01 | url55 | url55 |
| cookie2 | 2015-04-10 10:10:00 | url44 | url55 |
| cookie2 | 2015-04-10 11:00:00 | url77 | url55 |
| cookie2 | 2015-04-10 10:50:05 | url66 | url55 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | url55 |
| cookie2 | 2015-04-10 10:00:00 | url11 | url55 |
| cookie2 | 2015-04-10 10:00:02 | url22 | url55 |
+-----------+----------------------+---------+---------+--+
SELECT cookieid,
createtime,
url,
LAST_VALUE(url) OVER(PARTITION BY cookieid) AS last2
FROM test_data;
+-----------+----------------------+---------+--------+--+
| cookieid | createtime | url | last2 |
+-----------+----------------------+---------+--------+--+
| cookie1 | 2015-04-10 10:00:02 | url2 | url1 |
| cookie1 | 2015-04-10 10:50:01 | url5 | url1 |
| cookie1 | 2015-04-10 10:10:00 | url4 | url1 |
| cookie1 | 2015-04-10 11:00:00 | url7 | url1 |
| cookie1 | 2015-04-10 10:50:05 | url6 | url1 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | url1 |
| cookie1 | 2015-04-10 10:00:00 | url1 | url1 |
| cookie2 | 2015-04-10 10:50:01 | url55 | url22 |
| cookie2 | 2015-04-10 10:10:00 | url44 | url22 |
| cookie2 | 2015-04-10 11:00:00 | url77 | url22 |
| cookie2 | 2015-04-10 10:50:05 | url66 | url22 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 | url22 |
| cookie2 | 2015-04-10 10:00:02 | url22 | url22 |
+-----------+----------------------+---------+--------+--+
14 rows selected (78.058 seconds)

如果想要取分组内排序后最后一个值,则需要变通一下:

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM test_data
ORDER BY cookieid,createtime;
+-----------+----------------------+---------+-----+---------+--------+--+
| cookieid | createtime | url | rn | last1 | last2 |
+-----------+----------------------+---------+-----+---------+--------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 | url7 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url2 | url7 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 1url3 | url7 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url4 | url7 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url5 | url7 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url6 | url7 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 | url7 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 | url77 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url22 | url77 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 1url33 | url77 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url44 | url77 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url55 | url77 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url66 | url77 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 | url77 |
+-----------+----------------------+---------+-----+---------+--------+--+

Hive函数:LAG,LEAD,FIRST_VALUE,LAST_VALUE的更多相关文章

  1. pandas实现hive的lag和lead函数 以及 first_value和last_value函数

    lag和lead VS shift 该函数的格式如下: 第一个参数为列名, 第二个参数为往上第n行(可选,默认为1), 第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL ...

  2. Hive 窗口函数LEAD LAG FIRST_VALUE LAST_VALUE

    窗口函数(window functions)对多行进行操作,并为查询中的每一行返回一个值. OVER()子句能将窗口函数与其他分析函数(analytical functions)和报告函数(repor ...

  3. oracle listagg函数、lag函数、lead函数 实例

    Oracle大师Thomas Kyte在他的经典著作中,反复强调过一个实现需求方案选取顺序: “如果你可以使用一句SQL解决的需求,就使用一句SQL:如果不可以,就考虑PL/SQL是否可以:如果PL/ ...

  4. hive函数参考手册

    hive函数参考手册 原文见:https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF 1.内置运算符1.1关系运算符 运 ...

  5. Hive函数以及自定义函数讲解(UDF)

    Hive函数介绍HQL内嵌函数只有195个函数(包括操作符,使用命令show functions查看),基本能够胜任基本的hive开发,但是当有较为复杂的需求的时候,可能需要进行定制的HQL函数开发. ...

  6. 大数据入门第十一天——hive详解(三)hive函数

    一.hive函数 1.内置运算符与内置函数 函数分类: 查看函数信息: DESC FUNCTION concat; 常用的分析函数之rank() row_number(),参考:https://www ...

  7. Hadoop生态圈-Hive函数

    Hadoop生态圈-Hive函数 作者:尹正杰 版权声明:原创作品,谢绝转载!否则将追究法律责任.

  8. Hive(四)hive函数与hive shell

    一.hive函数 1.hive内置函数 (1)内容较多,见< Hive 官方文档>            https://cwiki.apache.org/confluence/displ ...

  9. Hive入门笔记---2.hive函数大全

    Hive函数大全–完整版 现在虽然有很多SQL ON Hadoop的解决方案,像Spark SQL.Impala.Presto等等,但就目前来看,在基于Hadoop的大数据分析平台.数据仓库中,Hiv ...

随机推荐

  1. 第一周CoreIDRAW课总结

      一. 问:这节课学到了什么知识? 答:这节课我学到了很多知识,作为初学者的我,老师给我讲了CorelDEAW的启动,安装,基础知识和应用与用途. 基础知识 位图:是由无数个像素点构成的图像,又叫点 ...

  2. Python进阶_类与实例

    上一节将到面对对象必须先抽象模型,之后直接利用模型.这一节我们来具体理解一下这句话的意思. 面对对象最重要的概念就是类(class)和实例(instance),必须牢记类是抽象的模板,比如studen ...

  3. Runtime的使用

    一.RunTime简介 RunTime简称运行时.OC就是运行时机制,也就是在运行时候的一些机制,其中最主要的是消息机制. 对于C语言,函数的调用在编译的时候会决定调用哪个函数. 对于OC的函数,属于 ...

  4. linux(ubuntu)环境下安装及配置JDK

    安装完IDEA之后遇到了问题,发现jdk安装完之后配置环境变量好困难,下面总结一下我的安装及配置方式: JDK下载链接:http://download.oracle.com/otn-pub/java/ ...

  5. MYSQL数据库学习九 数据的操作

    9.1 插入数据记录 1. 插入完整或部分数据记录: INSERT INTO table_name(field1,field2,field3,...fieldn) VALUES(value1,valu ...

  6. Find The Multiply

    Find The Multiply poj-1426 题目大意:给你一个正整数n,求任意一个正整数m,使得n|m且m在十进制下的每一位都是0或1. 注释:n<=200. 想法:看网上的题解全是b ...

  7. 定位bug的姿势对吗?

    举个例子来说明 WEB页面上数据显示错误,本来应该显示38,  结果显示35,这个时候你怎么去定位这个问题出在哪里? 1.通过fiddler抓包工具(或者其他抓包工具), 分析接口返回的数据是35还是 ...

  8. hostPath Volume - 每天5分钟玩转 Docker 容器技术(148)

    hostPath Volume 的作用是将 Docker Host 文件系统中已经存在的目录 mount 给 Pod 的容器.大部分应用都不会使用 hostPath Volume,因为这实际上增加了 ...

  9. 【Alpha 阶段】后期测试及补充(第十一、十二周)

    [Alpha 阶段]动态成果展示 修复了一些bug后,关于游戏的一些动态图展示如下: 终极版需求规格说明书和代码规范 经过一些细微的图片和格式的调整,完成了本学期的最终版本: [markdown版说明 ...

  10. beat冲刺计划安排

    1. 团队成员 组长:郭晓迪 组员:钟平辉 柳政宇 徐航 曾瑞 2. 主要计划安排如下: 3. 详细日程任务安排