In Spark, accumulating history mainly relies on window functions, while computing grand totals across all dimensions uses the rollup function.

1 Application scenarios

  1. We need to report each user's total usage time (accumulating history).

  2. The front-end page needs to be queried by multiple dimensions, such as product, region, and so on.

  3. The table header to be displayed looks like: product, 2015-04, 2015-05, 2015-06.

2 Raw data:

product_code |event_date |duration |
-------------|-----------|---------|
1438 |2016-05-13 |165 |
1438 |2016-05-14 |595 |
1438 |2016-05-15 |105 |
1629 |2016-05-13 |12340 |
1629 |2016-05-14 |13850 |
1629 |2016-05-15 |227 |
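
To make the examples below reproducible, here is a minimal sketch that builds the sample data as a DataFrame and registers it as the temp table queried later. It assumes a Spark 1.x SQLContext named sqlContext, as used throughout this post, and renames product_code to pcode to match the queries below.

import sqlContext.implicits._

// Build the sample rows from section 2 as a DataFrame (pcode = product_code).
val df_userlogs_date = Seq(
  ("1438", "2016-05-13", 165L),
  ("1438", "2016-05-14", 595L),
  ("1438", "2016-05-15", 105L),
  ("1629", "2016-05-13", 12340L),
  ("1629", "2016-05-14", 13850L),
  ("1629", "2016-05-15", 227L)
).toDF("pcode", "event_date", "duration")

// Register it under the name the SQL examples below query.
df_userlogs_date.registerTempTable("userlogs_date")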

3 Implementing the business scenarios

3.1 Scenario 1: accumulating history

As the source data shows, we already have each user's daily usage time. When reporting, we want the 14th to also include the 13th, the 15th to include both the 14th and the 13th, and so on.

3.1.1 Spark SQL implementation

// Spark SQL: use a window function to accumulate historical data
sqlContext.sql(
"""
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration
from userlogs_date
""").show
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+

3.1.2 DataFrame implementation

// Use Column's over function and pass in the window specification
import org.apache.spark.sql.expressions._
val first_2_now_window = Window.partitionBy("pcode").orderBy("event_date")
df_userlogs_date.select(
$"pcode",
$"event_date",
sum($"duration").over(first_2_now_window).as("sum_duration")
).show
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
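
The same window spec can be reused with other aggregates. A small sketch of ours (avg, count, and max here are illustrative choices, not part of the original post):

import org.apache.spark.sql.functions.{avg, count, max}

// Reuse the window defined above with other aggregate functions.
df_userlogs_date.select(
  $"pcode",
  $"event_date",
  avg($"duration").over(first_2_now_window).as("avg_duration"),
  count($"duration").over(first_2_now_window).as("days_so_far"),
  max($"duration").over(first_2_now_window).as("max_duration")
).show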

3.1.3 Extension: accumulating over a range of rows

In real business, the accumulation logic is far more complex than the above, for example accumulating the previous N days, or the previous N days through the following N days. Let's implement these:

3.1.3.1 Accumulate all history:

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration from userlogs_date
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and current row) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(Long.MinValue,0)
Window.partitionBy("pcode").orderBy("event_date")
The four forms above are completely equivalent.

3.1.3.2 Accumulate the previous N days, assuming N = 3
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and current row) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3,0) 

3.1.3.3 Accumulate the previous N days and the following M days, assuming N = 3, M = 5

select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 3 preceding and 5 following ) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-3,5)
3.1.3.4 Accumulate all rows in the partition
select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and unbounded following ) as sum_duration from userlogs_date
Window.partitionBy("pcode").orderBy("event_date").rowsBetween(Long.MinValue,Long.MaxValue)

To summarize:

preceding: accumulates the previous N rows (within the partition). Use unbounded to start from the first row of the partition. N is the offset backwards from the current row.
following: the opposite of preceding; accumulates the following N rows (within the partition). Use unbounded to accumulate through the end of the partition. N is the offset forwards from the current row.
current row: as the name says, the current row, with offset 0.
Note: the previous-N, following-M, and current-row bounds above all include the row at that offset; a small helper sketch follows below.
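
As a convenience, the frame bounds above can be wrapped in a helper. This is a sketch of ours (the name lastNnextM is not a Spark API), assuming the Spark 1.x Long-based rowsBetween used throughout this section.

import org.apache.spark.sql.expressions.{Window, WindowSpec}

// Frame covering the previous n rows through the following m rows;
// the current row is always included because offset 0 falls inside [-n, m].
def lastNnextM(n: Long, m: Long): WindowSpec =
  Window.partitionBy("pcode").orderBy("event_date").rowsBetween(-n, m)

// Previous 3 rows up to the current row (section 3.1.3.2):
val prev3 = lastNnextM(3, 0)
// Previous 3 rows through the following 5 rows (section 3.1.3.3):
val prev3next5 = lastNnextM(3, 5)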

3.1.3.5 Test results

Accumulate history (the current day and everything before it within the partition), form 1: select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc) as sum_duration from userlogs_date

+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
Accumulate history (the current day and everything before it within the partition), form 2: select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and current row) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
Accumulate today and yesterday: select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 1 preceding and current row) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 760|
| 1438|2016-05-15| 700|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 26190|
| 1629|2016-05-15| 14077|
+-----+----------+------------+
Accumulate today, yesterday, and tomorrow: select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between 1 preceding and 1 following ) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 760|
| 1438|2016-05-14| 865|
| 1438|2016-05-15| 700|
| 1629|2016-05-13| 26190|
| 1629|2016-05-14| 26417|
| 1629|2016-05-15| 14077|
+-----+----------+------------+
Accumulate everything in the partition (the current day plus everything before and after it): select pcode,event_date,sum(duration) over (partition by pcode order by event_date asc rows between unbounded preceding and unbounded following ) as sum_duration from userlogs_date
+-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| 1438|2016-05-13| 865|
| 1438|2016-05-14| 865|
| 1438|2016-05-15| 865|
| 1629|2016-05-13| 26417|
| 1629|2016-05-14| 26417|
| 1629|2016-05-15| 26417|
+-----+----------+------------+
3.2 Scenario 2: grand totals

3.2.1 Spark SQL implementation

// Spark SQL: use rollup to add the "all" totals
sqlContext.sql(
"""
select pcode,event_date,sum(duration) as sum_duration
from userlogs_date_1
group by pcode,event_date with rollup
order by pcode,event_date
""").show() +-----+----------+------------+
|pcode|event_date|sum_duration|
+-----+----------+------------+
| null| null| 27282|
| 1438| null| 865|
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 595|
| 1438|2016-05-15| 105|
| 1629| null| 26417|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 13850|
| 1629|2016-05-15| 227|
+-----+----------+------------+

3.2.2 DataFrame implementation

// Use the DataFrame rollup function for multi-dimensional "all" totals
df_userlogs_date.rollup($"pcode", $"event_date").agg(sum($"duration")).orderBy($"pcode", $"event_date").show
+-----+----------+-------------+
|pcode|event_date|sum(duration)|
+-----+----------+-------------+
| null| null| 27282|
| 1438| null| 865|
| 1438|2016-05-13| 165|
| 1438|2016-05-14| 595|
| 1438|2016-05-15| 105|
| 1629| null| 26417|
| 1629|2016-05-13| 12340|
| 1629|2016-05-14| 13850|
| 1629|2016-05-15| 227|
+-----+----------+-------------+
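
In the rollup output, the null cells mark the subtotal and grand-total rows. For a front-end table it is often nicer to label them explicitly; the following is a sketch of ours using coalesce/lit (the "ALL" label is our choice, not part of the original output).

import org.apache.spark.sql.functions.{coalesce, lit, sum}

// Replace the nulls produced by rollup with an explicit "ALL" label.
// Assumes import sqlContext.implicits._ for the $ column syntax.
df_userlogs_date
  .rollup($"pcode", $"event_date")
  .agg(sum($"duration").as("sum_duration"))
  .select(
    coalesce($"pcode", lit("ALL")).as("pcode"),
    coalesce($"event_date", lit("ALL")).as("event_date"),
    $"sum_duration")
  .orderBy("pcode", "event_date")
  .show()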

3.3 Rows to columns -> pivot

pivot does not have SQL syntax yet, so for now we use the DataFrame API.
val userlogs_date_all = sqlContext.sql("select dognum as dcode, pcode, event_date, sum(duration) as duration from userlogs group by dognum, pcode, event_date")
userlogs_date_all.registerTempTable("userlogs_date_all")
val dates = userlogs_date_all.select($"event_date").map(row => row.getAs[String]("event_date")).distinct().collect().toList
userlogs_date_all.groupBy($"dcode", $"pcode").pivot("event_date", dates).sum("duration").na.fill(0).show
+-----------------+-----+----------+----------+----------+----------+
| dcode|pcode|--|--|--|--|
+-----------------+-----+----------+----------+----------+----------+
| F2429186| | | | | |
| AI2342441| | | | | |
| A320018711| | | | | |
| H2635817| | | | | |
| D0288196| | | | | |
| Y0242218| | | | | |
| H2392574| | | | | |
| D2245588| | | | | |
| Y2514906| | | | | |
| H2540419| | | | | |
| R2231926| | | | | |
| H2684591| | | | | |
| A2548470| | | | | |
| GH000309| | | | | |
| H2293216| | | | | |
| R2170601| | | | | |
|B2365238;B2559538| | | | | |
| BQ005465| | | | | |
| AH2180324| | | | | |
| H0279306| | | | | |
+-----------------+-----+----------+----------+----------+----------+
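
On the small sample table from section 2, the same pivot produces exactly the header layout described in section 1 (product, then one column per date). A sketch, with the date list written out explicitly so the column order is fixed:

// One row per product, one column per event_date; values are the daily durations.
df_userlogs_date
  .groupBy($"pcode")
  .pivot("event_date", Seq("2016-05-13", "2016-05-14", "2016-05-15"))
  .sum("duration")
  .na.fill(0)
  .show()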

Appendix

Below is the official API documentation for these two functions:

org.apache.spark.sql.DataFrame

def rollup(col1: String, cols: String*): GroupedData
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).

// Compute the average for all numeric columns rolluped by department and group.
df.rollup("department", "group").avg()

// Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))

def rollup(cols: Column*): GroupedData
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

df.rollup($"department", $"group").avg()

// Compute the max age and average salary, rolluped by department and gender.
df.rollup($"department", $"gender").agg(Map(
  "salary" -> "avg",
  "age" -> "max"
))

org.apache.spark.sql.Column

def over(window: WindowSpec): Column
Define a windowing column.

val w = Window.partitionBy("name").orderBy("id")
df.select(
  sum("price").over(w.rangeBetween(Long.MinValue, 2)),
  avg("price").over(w.rowsBetween(0, 4))
)
