Adapted from the Lxw BigData Field blog: http://lxw1234.com/archives/2015/04/190.htm

Preparing the test data:

create external table test_data (
cookieid string,
createtime string, -- page visit time
url string -- visited page
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile location '/user/jc_rc_ftp/test_data';

select * from test_data l;
+-------------+----------------------+---------+--+
| l.cookieid | l.createtime | l.url |
+-------------+----------------------+---------+--+
| cookie1 | 2015-04-10 10:00:02 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 |
| cookie1 | 2015-04-10 10:50:05 | url6 |
| cookie1 | 2015-04-10 11:00:00 | url7 |
| cookie1 | 2015-04-10 10:10:00 | url4 |
| cookie1 | 2015-04-10 10:50:01 | url5 |
| cookie2 | 2015-04-10 10:00:02 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 |
| cookie2 | 2015-04-10 10:50:05 | url66 |
| cookie2 | 2015-04-10 11:00:00 | url77 |
| cookie2 | 2015-04-10 10:10:00 | url44 |
| cookie2 | 2015-04-10 10:50:01 | url55 |
+-------------+----------------------+---------+--+

LAG

LAG(col,n,DEFAULT) returns the value of col from the n-th row before the current row within the window.

The first argument is the column name; the second is the offset n (optional, defaults to 1); the third is the default value returned when the row n positions back does not exist (NULL if not specified).

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM test_data;
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookieid | createtime | url | rn | last_1_time | last_2_time |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | 1970-01-01 00:00:00 | NULL |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | 2015-04-10 10:00:00 | NULL |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 2015-04-10 10:00:02 | 2015-04-10 10:00:00 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | 2015-04-10 10:03:04 | 2015-04-10 10:00:02 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | 2015-04-10 10:10:00 | 2015-04-10 10:03:04 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | 2015-04-10 10:50:01 | 2015-04-10 10:10:00 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | 2015-04-10 10:50:05 | 2015-04-10 10:50:01 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | 1970-01-01 00:00:00 | NULL |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | 2015-04-10 10:00:00 | NULL |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 2015-04-10 10:00:02 | 2015-04-10 10:00:00 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | 2015-04-10 10:03:04 | 2015-04-10 10:00:02 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | 2015-04-10 10:10:00 | 2015-04-10 10:03:04 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | 2015-04-10 10:50:01 | 2015-04-10 10:10:00 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | 2015-04-10 10:50:05 | 2015-04-10 10:50:01 |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
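As a quick local sanity check (an illustration added here, not part of the original article), the same LAG semantics can be reproduced with SQLite through Python's sqlite3 module, since SQLite (3.25+) supports the same window-function syntax:

```python
import sqlite3

# Illustrative only: a small in-memory copy of the test data.
# Window functions require SQLite >= 3.25, bundled with recent Pythons.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (cookieid TEXT, createtime TEXT, url TEXT)")
conn.executemany("INSERT INTO test_data VALUES (?, ?, ?)", [
    ("cookie1", "2015-04-10 10:00:02", "url2"),
    ("cookie1", "2015-04-10 10:00:00", "url1"),
    ("cookie1", "2015-04-10 10:03:04", "1url3"),
])

result = conn.execute("""
    SELECT createtime,
           LAG(createtime, 1, '1970-01-01 00:00:00')
               OVER (PARTITION BY cookieid ORDER BY createtime) AS last_1_time
    FROM test_data
    ORDER BY createtime
""").fetchall()
for row in result:
    print(row)
# The first row falls back to the default; each later row sees the previous createtime.
```

Note that the rows were inserted out of order, yet LAG still follows the ORDER BY inside the OVER clause, matching the Hive output above.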

LEAD

The opposite of LAG.
LEAD(col,n,DEFAULT) returns the value of col from the n-th row after the current row within the window.
The first argument is the column name; the second is the offset n (optional, defaults to 1); the third is the default value returned when the row n positions ahead does not exist (NULL if not specified).

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM test_data;
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookieid | createtime | url | rn | next_1_time | next_2_time |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | 2015-04-10 10:00:02 | 2015-04-10 10:03:04 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | 2015-04-10 10:03:04 | 2015-04-10 10:10:00 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 2015-04-10 10:10:00 | 2015-04-10 10:50:01 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | 2015-04-10 10:50:01 | 2015-04-10 10:50:05 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | 2015-04-10 10:50:05 | 2015-04-10 11:00:00 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | 2015-04-10 11:00:00 | NULL |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | 1970-01-01 00:00:00 | NULL |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | 2015-04-10 10:00:02 | 2015-04-10 10:03:04 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | 2015-04-10 10:03:04 | 2015-04-10 10:10:00 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 2015-04-10 10:10:00 | 2015-04-10 10:50:01 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | 2015-04-10 10:50:01 | 2015-04-10 10:50:05 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | 2015-04-10 10:50:05 | 2015-04-10 11:00:00 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | 2015-04-10 11:00:00 | NULL |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | 1970-01-01 00:00:00 | NULL |
+-----------+----------------------+---------+-----+----------------------+----------------------+--+
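A common application of LEAD is turning consecutive events into durations, e.g. how long each page was viewed before the next click. A minimal sketch (illustrative, checked here in SQLite via Python's sqlite3; a Hive version would use the same LEAD call with unix_timestamp in place of strftime):

```python
import sqlite3

# Illustrative only: SQLite >= 3.25 for window-function support.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (cookieid TEXT, createtime TEXT, url TEXT)")
conn.executemany("INSERT INTO test_data VALUES (?, ?, ?)", [
    ("cookie1", "2015-04-10 10:00:00", "url1"),
    ("cookie1", "2015-04-10 10:00:02", "url2"),
    ("cookie1", "2015-04-10 10:03:04", "1url3"),
])

# Seconds until the next page view; the last row has no successor, hence NULL.
result = conn.execute("""
    SELECT url,
           CAST(strftime('%s', LEAD(createtime)
                    OVER (PARTITION BY cookieid ORDER BY createtime)) AS INTEGER)
             - CAST(strftime('%s', createtime) AS INTEGER) AS secs_on_page
    FROM test_data
    ORDER BY createtime
""").fetchall()
print(result)
```

The subtraction propagates NULL for the final row of each cookieid, which is usually the desired behavior for "time on last page is unknown".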

FIRST_VALUE

Returns the first value in the partition after ordering, up to the current row.

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid | createtime | url | rn | first1 |
+-----------+----------------------+---------+-----+---------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url1 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | url1 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url1 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url1 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url1 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url1 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url11 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | url11 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url11 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url11 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url11 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url11 |
+-----------+----------------------+---------+-----+---------+--+

LAST_VALUE

Returns the last value in the partition after ordering, up to the current row. With Hive's default window frame when ORDER BY is present (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), this is simply the current row's own value:

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid | createtime | url | rn | last1 |
+-----------+----------------------+---------+-----+---------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url2 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 1url3 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url4 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url5 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url6 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url22 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 1url33 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url44 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url55 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url66 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 |
+-----------+----------------------+---------+-----+---------+--+

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last1
FROM test_data;
+-----------+----------------------+---------+-----+---------+--+
| cookieid | createtime | url | rn | last1 |
+-----------+----------------------+---------+-----+---------+--+
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url6 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url5 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url4 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 1url3 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url66 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url55 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url44 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 1url33 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 |
+-----------+----------------------+---------+-----+---------+--+

If ORDER BY is omitted, the rows are taken in whatever order they are read from the file (effectively by file offset), so the result is arbitrary and generally wrong:

SELECT cookieid,
createtime,
url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM test_data;
+-----------+----------------------+---------+---------+--+
| cookieid | createtime | url | first2 |
+-----------+----------------------+---------+---------+--+
| cookie1 | 2015-04-10 10:00:02 | url2 | url2 |
| cookie1 | 2015-04-10 10:50:01 | url5 | url2 |
| cookie1 | 2015-04-10 10:10:00 | url4 | url2 |
| cookie1 | 2015-04-10 11:00:00 | url7 | url2 |
| cookie1 | 2015-04-10 10:50:05 | url6 | url2 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | url2 |
| cookie1 | 2015-04-10 10:00:00 | url1 | url2 |
| cookie2 | 2015-04-10 10:50:01 | url55 | url55 |
| cookie2 | 2015-04-10 10:10:00 | url44 | url55 |
| cookie2 | 2015-04-10 11:00:00 | url77 | url55 |
| cookie2 | 2015-04-10 10:50:05 | url66 | url55 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | url55 |
| cookie2 | 2015-04-10 10:00:00 | url11 | url55 |
| cookie2 | 2015-04-10 10:00:02 | url22 | url55 |
+-----------+----------------------+---------+---------+--+
SELECT cookieid,
createtime,
url,
LAST_VALUE(url) OVER(PARTITION BY cookieid) AS last2
FROM test_data;
+-----------+----------------------+---------+--------+--+
| cookieid | createtime | url | last2 |
+-----------+----------------------+---------+--------+--+
| cookie1 | 2015-04-10 10:00:02 | url2 | url1 |
| cookie1 | 2015-04-10 10:50:01 | url5 | url1 |
| cookie1 | 2015-04-10 10:10:00 | url4 | url1 |
| cookie1 | 2015-04-10 11:00:00 | url7 | url1 |
| cookie1 | 2015-04-10 10:50:05 | url6 | url1 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | url1 |
| cookie1 | 2015-04-10 10:00:00 | url1 | url1 |
| cookie2 | 2015-04-10 10:50:01 | url55 | url22 |
| cookie2 | 2015-04-10 10:10:00 | url44 | url22 |
| cookie2 | 2015-04-10 11:00:00 | url77 | url22 |
| cookie2 | 2015-04-10 10:50:05 | url66 | url22 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | url22 |
| cookie2 | 2015-04-10 10:00:00 | url11 | url22 |
| cookie2 | 2015-04-10 10:00:02 | url22 | url22 |
+-----------+----------------------+---------+--------+--+
14 rows selected (78.058 seconds)

To get the last value of the whole partition after ordering, a workaround is needed, such as FIRST_VALUE over the reversed sort order:

SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM test_data
ORDER BY cookieid,createtime;
+-----------+----------------------+---------+-----+---------+--------+--+
| cookieid | createtime | url | rn | last1 | last2 |
+-----------+----------------------+---------+-----+---------+--------+--+
| cookie1 | 2015-04-10 10:00:00 | url1 | 1 | url1 | url7 |
| cookie1 | 2015-04-10 10:00:02 | url2 | 2 | url2 | url7 |
| cookie1 | 2015-04-10 10:03:04 | 1url3 | 3 | 1url3 | url7 |
| cookie1 | 2015-04-10 10:10:00 | url4 | 4 | url4 | url7 |
| cookie1 | 2015-04-10 10:50:01 | url5 | 5 | url5 | url7 |
| cookie1 | 2015-04-10 10:50:05 | url6 | 6 | url6 | url7 |
| cookie1 | 2015-04-10 11:00:00 | url7 | 7 | url7 | url7 |
| cookie2 | 2015-04-10 10:00:00 | url11 | 1 | url11 | url77 |
| cookie2 | 2015-04-10 10:00:02 | url22 | 2 | url22 | url77 |
| cookie2 | 2015-04-10 10:03:04 | 1url33 | 3 | 1url33 | url77 |
| cookie2 | 2015-04-10 10:10:00 | url44 | 4 | url44 | url77 |
| cookie2 | 2015-04-10 10:50:01 | url55 | 5 | url55 | url77 |
| cookie2 | 2015-04-10 10:50:05 | url66 | 6 | url66 | url77 |
| cookie2 | 2015-04-10 11:00:00 | url77 | 7 | url77 | url77 |
+-----------+----------------------+---------+-----+---------+--------+--+
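Alternatively, the window frame can be widened explicitly so that LAST_VALUE sees the whole partition rather than stopping at the current row; Hive's windowing syntax accepts the same ROWS BETWEEN frame clause. A sketch (illustrative, verified here in SQLite through Python's sqlite3):

```python
import sqlite3

# Illustrative only: SQLite >= 3.25 for window-function support.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_data (cookieid TEXT, createtime TEXT, url TEXT)")
conn.executemany("INSERT INTO test_data VALUES (?, ?, ?)", [
    ("cookie1", "2015-04-10 10:00:00", "url1"),
    ("cookie1", "2015-04-10 10:00:02", "url2"),
    ("cookie1", "2015-04-10 11:00:00", "url7"),
])

# Widening the frame to the whole partition makes LAST_VALUE return the
# partition's true last value instead of the current row's value.
result = conn.execute("""
    SELECT url,
           LAST_VALUE(url) OVER (
               PARTITION BY cookieid ORDER BY createtime
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
           ) AS last_url
    FROM test_data
    ORDER BY createtime
""").fetchall()
print(result)
# Every row reports url7, the last value of the partition.
```

This avoids the second sort direction needed by the FIRST_VALUE(...ORDER BY createtime DESC) trick above.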
