In Spark, accumulating history mainly relies on window functions, while computing grand totals across all dimensions uses the rollup function.

1 Application scenarios

  1. We need to report each user's total usage time (accumulating history).

  2. The front-end page needs to be queried by multiple dimensions, such as product, region, and so on.

  3. The table header to be displayed looks like: product, 2015-04, 2015-05, 2015-06.

2 Raw data:

product_code |event_date |duration |
-------------|-----------|---------|
1438 |2016-05-13 |165 |
1438 |2016-05-14 |595 |
1438 |2016-05-15 |105 |
1629 |2016-05-13 |12340 |
1629 |2016-05-14 |13850 |
1629 |2016-05-15 |227 |
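
To make the examples below reproducible, here is a minimal sketch that builds the sample data as a DataFrame and registers it as the temp table queried later. It assumes a Spark 1.x SQLContext named sqlContext, as used throughout this post, and renames product_code to pcode to match the queries below.

import sqlContext.implicits._

// Build the sample rows from section 2 as a DataFrame (pcode = product_code).
val df_userlogs_date = Seq(
  ("1438", "2016-05-13", 165L),
  ("1438", "2016-05-14", 595L),
  ("1438", "2016-05-15", 105L),
  ("1629", "2016-05-13", 12340L),
  ("1629", "2016-05-14", 13850L),
  ("1629", "2016-05-15", 227L)
).toDF("pcode", "event_date", "duration")

// Register it under the name the SQL examples below query.
df_userlogs_date.registerTempTable("userlogs_date")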

3 Implementing the business scenarios

3.1 Scenario 1: accumulating history

As the source data shows, we already have each user's daily usage time. When reporting, we want the 14th to also include the 13th, the 15th to include both the 14th and the 13th, and so on.

3.1.1 Spark SQL implementation

// Spark SQL: use a window function to accumulate historical data
sqlContext.sql(
"""
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration
from userlogs_date
""").show
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+

3.1.2 DataFrame implementation

// Use Column's over function and pass in the window specification
import org.apache.spark.sql.expressions._
val first_2_now_window = Window.partitionBy("pcode").orderBy("event_date")
df_userlogs_date.select(
$"pcode",
$"event_date",
sum($"duration").over(first_2_now_window).as("sum_duration")
).show
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
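
The same window spec can be reused with other aggregates. A small sketch of ours (avg, count, and max here are illustrative choices, not part of the original post):

import org.apache.spark.sql.functions.{avg, count, max}

// Reuse the window defined above with other aggregate functions.
df_userlogs_date.select(
  $"pcode",
  $"event_date",
  avg($"duration").over(first_2_now_window).as("avg_duration"),
  count($"duration").over(first_2_now_window).as("days_so_far"),
  max($"duration").over(first_2_now_window).as("max_duration")
).show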

3.1.3 Extension: accumulating over a range of rows

In real business, the accumulation logic is far more complex than the above, for example accumulating the previous N days, or the previous N days through the following N days. Let's implement these:

3.1.3.1 Accumulate all history:

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration from userlogs_date
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and current row) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(Long.MinValue,0)
Window.partitionBy("pcode").orderBy("event_date")
The four forms above are completely equivalent.

3.1.3.2 Accumulate the previous N days, assuming N = 3
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and current row) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3,0) 

3.1.3.3 Accumulate the previous N days and the following M days, assuming N = 3, M = 5

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and 5 following ) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3,5)
3.1.3.4 Accumulate all rows in the partition
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and unbounded following ) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(Long.MinValue,Long.MaxValue)

To summarize:

preceding: accumulates the previous N rows (within the partition). Use unbounded to start from the first row of the partition. N is the offset backwards from the current row.
following: the opposite of preceding; accumulates the following N rows (within the partition). Use unbounded to accumulate through the end of the partition. N is the offset forwards from the current row.
current row: as the name says, the current row, with offset 0.
Note: the previous-N, following-M, and current-row bounds above all include the row at that offset; a small helper sketch follows below.
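
As a convenience, the frame bounds above can be wrapped in a helper. This is a sketch of ours (the name lastNnextM is not a Spark API), assuming the Spark 1.x Long-based rowsBetween used throughout this section.

import org.apache.spark.sql.expressions.{Window, WindowSpec}

// Frame covering the previous n rows through the following m rows;
// the current row is always included because offset 0 falls inside [-n, m].
def lastNnextM(n: Long, m: Long): WindowSpec =
  Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-n, m)

// Previous 3 rows up to the current row (section 3.1.3.2):
val prev3 = lastNnextM(3, 0)
// Previous 3 rows through the following 5 rows (section 3.1.3.3):
val prev3next5 = lastNnextM(3, 5)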

3.1.3.5 Test results

Accumulate history (the current day and everything before it within the partition), form 1: select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration from userlogs_date

+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
Accumulate history (the current day and everything before it within the partition), form 2: select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and current row) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
Accumulate today and yesterday: select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 1 preceding and current row) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 700|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 14077|
+-----+----------+------------+
Accumulate today, yesterday, and tomorrow: select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 1 preceding and 1 following ) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 760|
| 1438|2016-05-14| 865|
| 1438|2016-05-15| 700|
| 1629|2016-05-13| 26190|
| 1629|2016-05-14| 26417|
| 1629|2016-05-15| 14077|
+-----+----------+------------+
Accumulate everything in the partition (the current day plus everything before and after it): select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and unbounded following ) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 865|
| 1438|2016-05-14| 865|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 26417|
| 1629|2016-05-14| 26417|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
3.2 Scenario 2: grand totals

3.2.1 Spark SQL implementation

// Spark SQL: use rollup to add the "all" totals
sqlContext.sql(
"""
select pcode,event_date,sum(duration) as sum_duration
from userlogs_date_1
group by pcode,event_date with rollup
order by pcode,event_date
""").show() +-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| null| null| 27282|
| 1438| null| 865|
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 595|
| 1438|2016-05-15| 105|
| 1629| null| 26417|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 13850|
| 1629|2016-05-15| 227|
+-----+----------+------------+

3.2.2 DataFrame implementation

// Use the DataFrame rollup function for multi-dimensional "all" totals
df_userlogs_date.rollup($"pcode", $"event_date").agg(sum($"duration")).orderBy($"pcode", $"event_date").show
+-----+----------+-------------+
|pcode|event_date|sum(duration)|
+-----+----------+-------------+
| null| null| 27282|
| 1438| null| 865|
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 595|
| 1438|2016-05-15| 105|
| 1629| null| 26417|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 13850|
| 1629|2016-05-15| 227|
+-----+----------+-------------+
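
In the rollup output, the null cells mark the subtotal and grand-total rows. For a front-end table it is often nicer to label them explicitly; the following is a sketch of ours using coalesce/lit (the "ALL" label is our choice, not part of the original output).

import org.apache.spark.sql.functions.{coalesce, lit, sum}

// Replace the nulls produced by rollup with an explicit "ALL" label.
// Assumes import sqlContext.implicits._ for the $ column syntax.
df_userlogs_date
  .rollup($"pcode", $"event_date")
  .agg(sum($"duration").as("sum_duration"))
  .select(
    coalesce($"pcode", lit("ALL")).as("pcode"),
    coalesce($"event_date", lit("ALL")).as("event_date"),
    $"sum_duration")
  .orderBy("pcode", "event_date")
  .show()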

3.3 Rows to columns -> pivot

pivot does not have SQL syntax yet, so for now we use the DataFrame API.
val userlogs_date_all = sqlContext.sql("select dognum as dcode, pcode, event_date, sum(duration) as duration from userlogs group by dognum, pcode, event_date")
userlogs_date_all.registerTempTable("userlogs_date_all")
val dates = userlogs_date_all.select($"event_date").map(row => row.getAs[String]("event_date")).distinct().collect().toList
userlogs_date_all.groupBy($"dcode", $"pcode").pivot("event_date", dates).sum("duration").na.fill(0).show
+-----------------+-----+----------+----------+----------+----------+
| dcode|pcode|--|--|--|--|
+-----------------+-----+----------+----------+----------+----------+
| F2429186| | | | | |
| AI2342441| | | | | |
| A320018711| | | | | |
| H2635817| | | | | |
| D0288196| | | | | |
| Y0242218| | | | | |
| H2392574| | | | | |
| D2245588| | | | | |
| Y2514906| | | | | |
| H2540419| | | | | |
| R2231926| | | | | |
| H2684591| | | | | |
| A2548470| | | | | |
| GH000309| | | | | |
| H2293216| | | | | |
| R2170601| | | | | |
|B2365238;B2559538| | | | | |
| BQ005465| | | | | |
| AH2180324| | | | | |
| H0279306| | | | | |
+-----------------+-----+----------+----------+----------+----------+
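
On the small sample table from section 2, the same pivot produces exactly the header layout described in section 1 (product, then one column per date). A sketch, with the date list written out explicitly so the column order is fixed:

// One row per product, one column per event_date; values are the daily durations.
df_userlogs_date
  .groupBy($"pcode")
  .pivot("event_date", Seq("2016-05-13", "2016-05-14", "2016-05-15"))
  .sum("duration")
  .na.fill(0)
  .show()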

Appendix

Below is the official API documentation for these two functions:

org.apache.spark.sql.DataFrame

def rollup(col1: String, cols: String*): GroupedData
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).

// Compute the average for all numeric columns rolluped by department and group.
df.rollup("department", "group").avg()

// Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))

def rollup(cols: Column*): GroupedData
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

df.rollup($"department", $"group").avg()

// Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))

org.apache.spark.sql.Column

def over(window: WindowSpec): Column
Define a windowing column.

val w = Window.partitionBy("name").orderBy("id")
df.select(
  sum("price").over(w.rangeBetween(Long.MinValue, 2)),
  avg("price").over(w.rowsBetween(0, 4))
)
