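Spark: cumulative history + grand totals + rows to columns (pivot)
Accumulating history in Spark relies mainly on window functions, while computing totals across all values requires the rollup function.
1 Use cases:
1. We need to compute each user's total usage duration (accumulated over history).
2. The front-end page needs to query across several dimensions, such as product and region.
3. The table header to display looks like: product, 2015-04, 2015-05, 2015-06
2 Raw data (the queries below use the column name pcode for product_code):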
product_code |event_date |duration |
-------------|-----------|---------|
1438 |2016-05-13 |165 |
1438 |2016-05-14 |595 |
1438 |2016-05-15 |105 |
1629 |2016-05-13 |12340 |
1629 |2016-05-14 |13850 |
1629 |2016-05-15 |227 |
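3 Implementing the business scenarios
3.1 Scenario 1: cumulative history
As the source data shows, we already have each day's usage duration. When aggregating, we want the 14th to include the 13th, the 15th to include the 14th and the 13th, and so on.
3.1.1 Spark SQL implementation
// Spark SQL: use a window function to accumulate historical data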
sqlContext.sql(
"""
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration
from userlogs_date
""").show
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
3.1.2 DataFrame implementation
// Use the over function provided by Column, passing in a window specification
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions.sum
val first_2_now_window = Window.partitionBy("pcode").orderBy("event_date")
df_userlogs_date.select(
$"pcode",
$"event_date",
sum($"duration").over(first_2_now_window).as("sum_duration")
).show
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
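3.1.3 Extension: accumulating over a range
In real business logic, accumulation is usually more complex than the example above: for instance, accumulating the previous N days, or from N days before to N days after. The following sections implement these cases.
3.1.3.1 Accumulate all history: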
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration from userlogs_date
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and current row) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(Long.MinValue,0)
Window.partitionBy("pcode").orderBy("event_date")
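The four forms above give the same result for this data. (With an ORDER BY and no explicit frame, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so the results can differ from the ROWS form only when event_date has duplicate values within a partition.)
3.1.3.2 Accumulate the previous N days (rows), assuming N=3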
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and current row) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3,0)
3.1.3.3 Accumulate the previous N days and the following M days, assuming N=3 and M=5
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and 5 following ) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3,5)
3.1.3.4 Accumulate all rows in the partition
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and unbounded following ) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(Long.MinValue,Long.MaxValue)
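Summary:
preceding: accumulates the N rows before the current row (within the partition). To start from the first row of the partition, use unbounded. N is the offset backward from the current row.
following: the opposite of preceding; accumulates the N rows after the current row (within the partition). To accumulate through the end of the partition, use unbounded. N is the offset forward from the current row.
current row: as the name suggests, the current row, with offset 0.
Note: the N-preceding, M-following, and current row bounds are all inclusive, i.e. the row at that offset is included in the sum.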
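The frame rules summarized above can be modeled without Spark. The following is a minimal sketch in plain Scala, assuming a single partition whose rows are already sorted by event_date; `rowsBetweenSum` is a hypothetical helper written for illustration and is not part of the Spark API.

```scala
// Plain-Scala model of sum(duration) over (... rows between <start> and <end>)
// for ONE partition that is already sorted by event_date.
// `rowsBetweenSum` is a hypothetical helper, not a Spark function.
def rowsBetweenSum(durations: Vector[Long], start: Long, end: Long): Vector[Long] = {
  val n = durations.length.toLong
  // Offsets larger than the partition behave like "unbounded", so clamp them.
  // This also handles Long.MinValue / Long.MaxValue without overflow.
  val s = math.max(-n, math.min(n, start))
  val e = math.max(-n, math.min(n, end))
  durations.indices.toVector.map { i =>
    val lo = math.max(0L, i + s)    // first row of the frame (inclusive)
    val hi = math.min(n - 1, i + e) // last row of the frame (inclusive)
    (lo to hi).map(j => durations(j.toInt)).sum
  }
}

// Durations of pcode 1438 from the sample data, in event_date order.
val d1438 = Vector(165L, 595L, 105L)
println(rowsBetweenSum(d1438, Long.MinValue, 0))             // Vector(165, 760, 865)
println(rowsBetweenSum(d1438, -1, 0))                        // Vector(165, 760, 700)
println(rowsBetweenSum(d1438, -1, 1))                        // Vector(760, 865, 700)
println(rowsBetweenSum(d1438, Long.MinValue, Long.MaxValue)) // Vector(865, 865, 865)
```

Both frame bounds are inclusive in this model, which is why an offset of 0 on either side still includes the current row.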
3.1.3.5 Measured results
Cumulative history (the current day and all earlier days in the partition), form 1: select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
Cumulative history (the current day and all earlier days in the partition), form 2: select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and current row) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
Current day plus the previous day: select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 1 preceding and current row) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 700|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 14077|
+-----+----------+------------+
Current day, previous day, and next day: select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 1 preceding and 1 following ) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 760|
| 1438|2016-05-14| 865|
| 1438|2016-05-15| 700|
| 1629|2016-05-13| 26190|
| 1629|2016-05-14| 26417|
| 1629|2016-05-15| 14077|
+-----+----------+------------+
All rows in the partition (the current day and everything before and after it): select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and unbounded following ) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 865|
| 1438|2016-05-14| 865|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 26417|
| 1629|2016-05-14| 26417|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
3.2 Scenario 2: grand totals
3.2.1 Spark SQL implementation
// Spark SQL: use rollup to add "all" subtotal and total rows
sqlContext.sql(
"""
select pcode,event_date,sum(duration) as sum_duration
from userlogs_date_1
group by pcode,event_date with rollup
order by pcode,event_date
""").show()
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| null| null| 27282|
| 1438| null| 865|
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 595|
| 1438|2016-05-15| 105|
| 1629| null| 26417|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 13850|
| 1629|2016-05-15| 227|
+-----+----------+------------+
3.2.2 DataFrame implementation
// Use the rollup function provided by DataFrame to compute "all" totals across several dimensions
df_userlogs_date.rollup($"pcode", $"event_date").agg(sum($"duration")).orderBy($"pcode", $"event_date").show()
+-----+----------+-------------+
|pcode|event_date|sum(duration)|
+-----+----------+-------------+
| null| null| 27282|
| 1438| null| 865|
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 595|
| 1438|2016-05-15| 105|
| 1629| null| 26417|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 13850|
| 1629|2016-05-15| 227|
+-----+----------+-------------+
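To make the null rows in the rollup output easier to read, here is a minimal plain-Scala sketch of what rollup(pcode, event_date) computes: the same sum at three grouping levels. The `Log` case class and the `rollupSum` helper are hypothetical names introduced for illustration; `None` stands in for the null that Spark prints for a rolled-up column.

```scala
// Hypothetical record type mirroring the sample data.
case class Log(pcode: String, eventDate: String, duration: Long)

val logs = Seq(
  Log("1438", "2016-05-13", 165), Log("1438", "2016-05-14", 595), Log("1438", "2016-05-15", 105),
  Log("1629", "2016-05-13", 12340), Log("1629", "2016-05-14", 13850), Log("1629", "2016-05-15", 227)
)

// rollup(pcode, event_date) aggregates at three levels:
// (pcode, event_date), (pcode), and the grand total ().
def rollupSum(rows: Seq[Log]): Map[(Option[String], Option[String]), Long] = {
  val detail  = rows.groupBy(r => (Option(r.pcode), Option(r.eventDate)))
  val byPcode = rows.groupBy(r => (Option(r.pcode), Option.empty[String]))
  val total   = rows.groupBy(_ => (Option.empty[String], Option.empty[String]))
  (detail ++ byPcode ++ total).map { case (key, group) => key -> group.map(_.duration).sum }
}

val rolled = rollupSum(logs)
println(rolled((None, None)))         // 27282, the grand total row
println(rolled((Some("1438"), None))) // 865, the subtotal row for pcode 1438
```

Note that rollup is hierarchical: it produces subtotals per pcode but not per event_date alone.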
3.3 Rows to columns -> pivot
In the Spark version used here, pivot has no SQL syntax yet, so use the DataFrame API for now (a PIVOT clause was added to Spark SQL in later releases).
val userlogs_date_all = sqlContext.sql("select dcode, pcode,event_date,sum(duration) as duration from userlogs group by dcode, pcode,event_date ")
userlogs_date_all.registerTempTable("userlogs_date_all")
val dates = userlogs_date_all.select($"event_date").map(row => row.getAs[String]("event_date")).distinct().collect().toList
userlogs_date_all.groupBy($"dcode", $"pcode").pivot("event_date", dates).sum("duration").na.fill(0).show
+-----------------+-----+----------+----------+----------+----------+
| dcode|pcode|--|--|--|--|
+-----------------+-----+----------+----------+----------+----------+
| F2429186| | | | | |
| AI2342441| | | | | |
| A320018711| | | | | |
| H2635817| | | | | |
| D0288196| | | | | |
| Y0242218| | | | | |
| H2392574| | | | | |
| D2245588| | | | | |
| Y2514906| | | | | |
| H2540419| | | | | |
| R2231926| | | | | |
| H2684591| | | | | |
| A2548470| | | | | |
| GH000309| | | | | |
| H2293216| | | | | |
| R2170601| | | | | |
|B2365238;B2559538| | | | | |
| BQ005465| | | | | |
| AH2180324| | | | | |
| H0279306| | | | | |
+-----------------+-----+----------+----------+----------+----------+
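Since the output above has its values blanked out, the following plain-Scala sketch shows the shape that pivot produces, using the sample data from section 2. It is a simplification that groups by pcode only (the dcode column is not in the sample data) and sorts the date columns, which the Spark code above does not do; `Usage` and `pivotSum` are hypothetical names introduced for illustration.

```scala
// Hypothetical record type mirroring the sample data.
case class Usage(pcode: String, eventDate: String, duration: Long)

val usage = Seq(
  Usage("1438", "2016-05-13", 165), Usage("1438", "2016-05-14", 595), Usage("1438", "2016-05-15", 105),
  Usage("1629", "2016-05-13", 12340), Usage("1629", "2016-05-14", 13850), Usage("1629", "2016-05-15", 227)
)

// Pivot: one output row per pcode, one column per distinct event_date,
// each cell holding the summed duration (0 where a combination is absent,
// which plays the role of na.fill(0)).
def pivotSum(rows: Seq[Usage]): (List[String], Map[String, List[Long]]) = {
  val dates = rows.map(_.eventDate).distinct.sorted.toList
  val table = rows.groupBy(_.pcode).map { case (pcode, group) =>
    val byDate = group.groupBy(_.eventDate).map { case (d, xs) => d -> xs.map(_.duration).sum }
    pcode -> dates.map(d => byDate.getOrElse(d, 0L))
  }
  (dates, table)
}

val pivoted = pivotSum(usage)
println(("pcode" :: pivoted._1).mkString(" | "))
pivoted._2.toSeq.sortBy(_._1).foreach { case (pcode, cells) =>
  println((pcode :: cells.map(_.toString)).mkString(" | "))
}
```

Collecting the distinct dates first, as the Spark code does with its `dates` list, fixes the set of output columns before any aggregation happens.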
Appendix
Below are the official API descriptions of these two functions:
org.apache.spark.sql.DataFrame.scala
def rollup(col1: String, cols: String*): GroupedData
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns rolluped by department and group.
df.rollup("department", "group").avg()
// Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
def rollup(cols: Column*): GroupedData
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
df.rollup($"department", $"group").avg()
// Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map(
"salary" -> "avg",
"age" -> "max"
))
org.apache.spark.sql.Column.scala
def over(window: WindowSpec): Column
Define a windowing column.
val w = Window.partitionBy("name").orderBy("id")
df.select(
sum("price").over(w.rangeBetween(Long.MinValue, 2)),
avg("price").over(w.rowsBetween(0, 4))
)