第8章 Spark SQL实战

8.1 数据说明

数据集是货品交易数据集。

每个订单可能包含多个货品,每个订单可以产生多次交易,不同的货品有不同的单价。

8.2 加载数据

tbStock:

scala> case class tbStock(ordernumber:String,locationid:String,dateid:String) extends Serializable
defined class tbStock scala> val tbStockRdd = spark.sparkContext.textFile("tbStock.txt")
tbStockRdd: org.apache.spark.rdd.RDD[String] = tbStock.txt MapPartitionsRDD[1] at textFile at <console>:23 scala> val tbStockDS = tbStockRdd.map(_.split(",")).map(attr=>tbStock(attr(0),attr(1),attr(2))).toDS
tbStockDS: org.apache.spark.sql.Dataset[tbStock] = [ordernumber: string, locationid: string ... 1 more field] scala> tbStockDS.show()
+------------+----------+---------+
| ordernumber|locationid| dataid|
+------------+----------+---------+
|BYSL00000893| ZHAO|2007-8-23|
|BYSL00000897| ZHAO|2007-8-24|
|BYSL00000898| ZHAO|2007-8-25|
|BYSL00000899| ZHAO|2007-8-26|
|BYSL00000900| ZHAO|2007-8-26|
|BYSL00000901| ZHAO|2007-8-27|
|BYSL00000902| ZHAO|2007-8-27|
|BYSL00000904| ZHAO|2007-8-28|
|BYSL00000905| ZHAO|2007-8-28|
|BYSL00000906| ZHAO|2007-8-28|
|BYSL00000907| ZHAO|2007-8-29|
|BYSL00000908| ZHAO|2007-8-30|
|BYSL00000909| ZHAO| 2007-9-1|
|BYSL00000910| ZHAO| 2007-9-1|
|BYSL00000911| ZHAO|2007-8-31|
|BYSL00000912| ZHAO| 2007-9-2|
|BYSL00000913| ZHAO| 2007-9-3|
|BYSL00000914| ZHAO| 2007-9-3|
|BYSL00000915| ZHAO| 2007-9-4|
|BYSL00000916| ZHAO| 2007-9-4|
+------------+----------+---------+
only showing top 20 rows

tbStockDetail:

scala> case class tbStockDetail(ordernumber:String, rownum:Int, itemid:String, number:Int, price:Double, amount:Double) extends Serializable
defined class tbStockDetail scala> val tbStockDetailRdd = spark.sparkContext.textFile("tbStockDetail.txt")
tbStockDetailRdd: org.apache.spark.rdd.RDD[String] = tbStockDetail.txt MapPartitionsRDD[13] at textFile at <console>:23 scala> val tbStockDetailDS = tbStockDetailRdd.map(_.split(",")).map(attr=> tbStockDetail(attr(0),attr(1).trim().toInt,attr(2),attr(3).trim().toInt,attr(4).trim().toDouble, attr(5).trim().toDouble)).toDS
tbStockDetailDS: org.apache.spark.sql.Dataset[tbStockDetail] = [ordernumber: string, rownum: int ... 4 more fields] scala> tbStockDetailDS.show()
+------------+------+--------------+------+-----+------+
| ordernumber|rownum| itemid|number|price|amount|
+------------+------+--------------+------+-----+------+
|BYSL00000893| 0|FS527258160501| -1|268.0|-268.0|
|BYSL00000893| 1|FS527258169701| 1|268.0| 268.0|
|BYSL00000893| 2|FS527230163001| 1|198.0| 198.0|
|BYSL00000893| 3|24627209125406| 1|298.0| 298.0|
|BYSL00000893| 4|K9527220210202| 1|120.0| 120.0|
|BYSL00000893| 5|01527291670102| 1|268.0| 268.0|
|BYSL00000893| 6|QY527271800242| 1|158.0| 158.0|
|BYSL00000893| 7|ST040000010000| 8| 0.0| 0.0|
|BYSL00000897| 0|04527200711305| 1|198.0| 198.0|
|BYSL00000897| 1|MY627234650201| 1|120.0| 120.0|
|BYSL00000897| 2|01227111791001| 1|249.0| 249.0|
|BYSL00000897| 3|MY627234610402| 1|120.0| 120.0|
|BYSL00000897| 4|01527282681202| 1|268.0| 268.0|
|BYSL00000897| 5|84126182820102| 1|158.0| 158.0|
|BYSL00000897| 6|K9127105010402| 1|239.0| 239.0|
|BYSL00000897| 7|QY127175210405| 1|199.0| 199.0|
|BYSL00000897| 8|24127151630206| 1|299.0| 299.0|
|BYSL00000897| 9|G1126101350002| 1|158.0| 158.0|
|BYSL00000897| 10|FS527258160501| 1|198.0| 198.0|
|BYSL00000897| 11|ST040000010000| 13| 0.0| 0.0|
+------------+------+--------------+------+-----+------+
only showing top 20 rows

tbDate:

scala> case class tbDate(dateid:String, years:Int, theyear:Int, month:Int, day:Int, weekday:Int, week:Int, quarter:Int, period:Int, halfmonth:Int) extends Serializable
defined class tbDate scala> val tbDateRdd = spark.sparkContext.textFile("tbDate.txt")
tbDateRdd: org.apache.spark.rdd.RDD[String] = tbDate.txt MapPartitionsRDD[20] at textFile at <console>:23 scala> val tbDateDS = tbDateRdd.map(_.split(",")).map(attr=> tbDate(attr(0),attr(1).trim().toInt, attr(2).trim().toInt,attr(3).trim().toInt, attr(4).trim().toInt, attr(5).trim().toInt, attr(6).trim().toInt, attr(7).trim().toInt, attr(8).trim().toInt, attr(9).trim().toInt)).toDS
tbDateDS: org.apache.spark.sql.Dataset[tbDate] = [dateid: string, years: int ... 8 more fields] scala> tbDateDS.show()
+---------+------+-------+-----+---+-------+----+-------+------+---------+
| dateid| years|theyear|month|day|weekday|week|quarter|period|halfmonth|
+---------+------+-------+-----+---+-------+----+-------+------+---------+
| 2003-1-1|200301| 2003| 1| 1| 3| 1| 1| 1| 1|
| 2003-1-2|200301| 2003| 1| 2| 4| 1| 1| 1| 1|
| 2003-1-3|200301| 2003| 1| 3| 5| 1| 1| 1| 1|
| 2003-1-4|200301| 2003| 1| 4| 6| 1| 1| 1| 1|
| 2003-1-5|200301| 2003| 1| 5| 7| 1| 1| 1| 1|
| 2003-1-6|200301| 2003| 1| 6| 1| 2| 1| 1| 1|
| 2003-1-7|200301| 2003| 1| 7| 2| 2| 1| 1| 1|
| 2003-1-8|200301| 2003| 1| 8| 3| 2| 1| 1| 1|
| 2003-1-9|200301| 2003| 1| 9| 4| 2| 1| 1| 1|
|2003-1-10|200301| 2003| 1| 10| 5| 2| 1| 1| 1|
|2003-1-11|200301| 2003| 1| 11| 6| 2| 1| 2| 1|
|2003-1-12|200301| 2003| 1| 12| 7| 2| 1| 2| 1|
|2003-1-13|200301| 2003| 1| 13| 1| 3| 1| 2| 1|
|2003-1-14|200301| 2003| 1| 14| 2| 3| 1| 2| 1|
|2003-1-15|200301| 2003| 1| 15| 3| 3| 1| 2| 1|
|2003-1-16|200301| 2003| 1| 16| 4| 3| 1| 2| 2|
|2003-1-17|200301| 2003| 1| 17| 5| 3| 1| 2| 2|
|2003-1-18|200301| 2003| 1| 18| 6| 3| 1| 2| 2|
|2003-1-19|200301| 2003| 1| 19| 7| 3| 1| 2| 2|
|2003-1-20|200301| 2003| 1| 20| 1| 4| 1| 2| 2|
+---------+------+-------+-----+---+-------+----+-------+------+---------+
only showing top 20 rows

注册表:

scala> tbStockDS.createOrReplaceTempView("tbStock")

scala> tbDateDS.createOrReplaceTempView("tbDate")

scala> tbStockDetailDS.createOrReplaceTempView("tbStockDetail")

8.3 计算所有订单中每年的销售单数、销售总额

统计所有订单中每年的销售单数、销售总额

三个表连接后以count(distinct a.ordernumber)计销售单数,sum(b.amount)计销售总额

SELECT c.theyear, COUNT(DISTINCT a.ordernumber), SUM(b.amount)
FROM tbStock a
JOIN tbStockDetail b ON a.ordernumber = b.ordernumber
JOIN tbDate c ON a.dateid = c.dateid
GROUP BY c.theyear
ORDER BY c.theyear
spark.sql("SELECT c.theyear, COUNT(DISTINCT a.ordernumber), SUM(b.amount) FROM tbStock a JOIN tbStockDetail b ON a.ordernumber = b.ordernumber JOIN tbDate c ON a.dateid = c.dateid GROUP BY c.theyear ORDER BY c.theyear").show

结果如下:

+-------+---------------------------+--------------------+
|theyear|count(DISTINCT ordernumber)| sum(amount)|
+-------+---------------------------+--------------------+
| 2004| 1094| 3268115.499199999|
| 2005| 3828|1.3257564149999991E7|
| 2006| 3772|1.3680982900000006E7|
| 2007| 4885|1.6719354559999993E7|
| 2008| 4861| 1.467429530000001E7|
| 2009| 2619| 6323697.189999999|
| 2010| 94| 210949.65999999997|
+-------+---------------------------+--------------------+

8.4 计算所有订单每年最大金额订单的销售额

目标:统计每年最大金额订单的销售额:

1)统计每年,每个订单一共有多少销售额

SELECT a.dateid, a.ordernumber, SUM(b.amount) AS SumOfAmount
FROM tbStock a
JOIN tbStockDetail b ON a.ordernumber = b.ordernumber
GROUP BY a.dateid, a.ordernumber
spark.sql("SELECT a.dateid, a.ordernumber, SUM(b.amount) AS SumOfAmount FROM tbStock a JOIN tbStockDetail b ON a.ordernumber = b.ordernumber GROUP BY a.dateid, a.ordernumber").show

结果如下:

+----------+------------+------------------+
| dateid| ordernumber| SumOfAmount|
+----------+------------+------------------+
| 2008-4-9|BYSL00001175| 350.0|
| 2008-5-12|BYSL00001214| 592.0|
| 2008-7-29|BYSL00011545| 2064.0|
| 2008-9-5|DGSL00012056| 1782.0|
| 2008-12-1|DGSL00013189| 318.0|
|2008-12-18|DGSL00013374| 963.0|
| 2009-8-9|DGSL00015223| 4655.0|
| 2009-10-5|DGSL00015585| 3445.0|
| 2010-1-14|DGSL00016374| 2934.0|
| 2006-9-24|GCSL00000673|3556.1000000000004|
| 2007-1-26|GCSL00000826| 9375.199999999999|
| 2007-5-24|GCSL00001020| 6171.300000000002|
| 2008-1-8|GCSL00001217| 7601.6|
| 2008-9-16|GCSL00012204| 2018.0|
| 2006-7-27|GHSL00000603| 2835.6|
|2006-11-15|GHSL00000741| 3951.94|
| 2007-6-6|GHSL00001149| 0.0|
| 2008-4-18|GHSL00001631| 12.0|
| 2008-7-15|GHSL00011367| 578.0|
| 2009-5-8|GHSL00014637| 1797.6|
+----------+------------+------------------+

2)以上一步查询结果为基础表,和表tbDate使用dateid join,求出每年最大金额订单的销售额

SELECT theyear, MAX(c.SumOfAmount) AS SumOfAmount
FROM (SELECT a.dateid, a.ordernumber, SUM(b.amount) AS SumOfAmount
FROM tbStock a
JOIN tbStockDetail b ON a.ordernumber = b.ordernumber
GROUP BY a.dateid, a.ordernumber
) c
JOIN tbDate d ON c.dateid = d.dateid
GROUP BY theyear
ORDER BY theyear DESC
spark.sql("SELECT theyear, MAX(c.SumOfAmount) AS SumOfAmount FROM (SELECT a.dateid, a.ordernumber, SUM(b.amount) AS SumOfAmount FROM tbStock a JOIN tbStockDetail b ON a.ordernumber = b.ordernumber GROUP BY a.dateid, a.ordernumber ) c JOIN tbDate d ON c.dateid = d.dateid GROUP BY theyear ORDER BY theyear DESC").show

结果如下:

+-------+------------------+
|theyear| SumOfAmount|
+-------+------------------+
| 2010|13065.280000000002|
| 2009|25813.200000000008|
| 2008| 55828.0|
| 2007| 159126.0|
| 2006| 36124.0|
| 2005|38186.399999999994|
| 2004| 23656.79999999997|
+-------+------------------+

8.5 计算所有订单中每年最畅销货品

目标:统计每年最畅销货品(哪个货品销售额amount在当年最高,哪个就是最畅销货品)

第一步、求出每年每个货品的销售额

SELECT c.theyear, b.itemid, SUM(b.amount) AS SumOfAmount
FROM tbStock a
JOIN tbStockDetail b ON a.ordernumber = b.ordernumber
JOIN tbDate c ON a.dateid = c.dateid
GROUP BY c.theyear, b.itemid
spark.sql("SELECT c.theyear, b.itemid, SUM(b.amount) AS SumOfAmount FROM tbStock a JOIN tbStockDetail b ON a.ordernumber = b.ordernumber JOIN tbDate c ON a.dateid = c.dateid GROUP BY c.theyear, b.itemid").show

结果如下:

+-------+--------------+------------------+
|theyear| itemid| SumOfAmount|
+-------+--------------+------------------+
| 2004|43824480810202| 4474.72|
| 2006|YA214325360101| 556.0|
| 2006|BT624202120102| 360.0|
| 2007|AK215371910101|24603.639999999992|
| 2008|AK216169120201|29144.199999999997|
| 2008|YL526228310106|16073.099999999999|
| 2009|KM529221590106| 5124.800000000001|
| 2004|HT224181030201|2898.6000000000004|
| 2004|SG224308320206| 7307.06|
| 2007|04426485470201|14468.800000000001|
| 2007|84326389100102| 9134.11|
| 2007|B4426438020201| 19884.2|
| 2008|YL427437320101|12331.799999999997|
| 2008|MH215303070101| 8827.0|
| 2009|YL629228280106| 12698.4|
| 2009|BL529298020602| 2415.8|
| 2009|F5127363019006| 614.0|
| 2005|24425428180101| 34890.74|
| 2007|YA214127270101| 240.0|
| 2007|MY127134830105| 11099.92|
+-------+--------------+------------------+

第二步、在第一步的基础上,统计每年单个货品中的最大金额

SELECT d.theyear, MAX(d.SumOfAmount) AS MaxOfAmount
FROM (SELECT c.theyear, b.itemid, SUM(b.amount) AS SumOfAmount
FROM tbStock a
JOIN tbStockDetail b ON a.ordernumber = b.ordernumber
JOIN tbDate c ON a.dateid = c.dateid
GROUP BY c.theyear, b.itemid
) d
GROUP BY d.theyear
spark.sql("SELECT d.theyear, MAX(d.SumOfAmount) AS MaxOfAmount FROM (SELECT c.theyear, b.itemid, SUM(b.amount) AS SumOfAmount FROM tbStock a JOIN tbStockDetail b ON a.ordernumber = b.ordernumber JOIN tbDate c ON a.dateid = c.dateid GROUP BY c.theyear, b.itemid ) d GROUP BY d.theyear").show

结果如下:

+-------+------------------+
|theyear| MaxOfAmount|
+-------+------------------+
| 2007| 70225.1|
| 2006| 113720.6|
| 2004|53401.759999999995|
| 2009| 30029.2|
| 2005|56627.329999999994|
| 2010| 4494.0|
| 2008| 98003.60000000003|
+-------+------------------+

第三步、用最大销售额和统计好的每个货品的销售额join,以及用年join,集合得到最畅销货品那一行信息

SELECT DISTINCT e.theyear, e.itemid, f.MaxOfAmount
FROM (SELECT c.theyear, b.itemid, SUM(b.amount) AS SumOfAmount
FROM tbStock a
JOIN tbStockDetail b ON a.ordernumber = b.ordernumber
JOIN tbDate c ON a.dateid = c.dateid
GROUP BY c.theyear, b.itemid
) e
JOIN (SELECT d.theyear, MAX(d.SumOfAmount) AS MaxOfAmount
FROM (SELECT c.theyear, b.itemid, SUM(b.amount) AS SumOfAmount
FROM tbStock a
JOIN tbStockDetail b ON a.ordernumber = b.ordernumber
JOIN tbDate c ON a.dateid = c.dateid
GROUP BY c.theyear, b.itemid
) d
GROUP BY d.theyear
) f ON e.theyear = f.theyear
AND e.SumOfAmount = f.MaxOfAmount
ORDER BY e.theyear
spark.sql("SELECT DISTINCT e.theyear, e.itemid, f.maxofamount FROM (SELECT c.theyear, b.itemid, SUM(b.amount) AS sumofamount FROM tbStock a JOIN tbStockDetail b ON a.ordernumber = b.ordernumber JOIN tbDate c ON a.dateid = c.dateid GROUP BY c.theyear, b.itemid ) e JOIN (SELECT d.theyear, MAX(d.sumofamount) AS maxofamount FROM (SELECT c.theyear, b.itemid, SUM(b.amount) AS sumofamount FROM tbStock a JOIN tbStockDetail b ON a.ordernumber = b.ordernumber JOIN tbDate c ON a.dateid = c.dateid GROUP BY c.theyear, b.itemid ) d GROUP BY d.theyear ) f ON e.theyear = f.theyear AND e.sumofamount = f.maxofamount ORDER BY e.theyear").show

结果如下:

+-------+--------------+------------------+
|theyear| itemid| maxofamount|
+-------+--------------+------------------+
| 2004|JY424420810101|53401.759999999995|
| 2005|24124118880102|56627.329999999994|
| 2006|JY425468460101| 113720.6|
| 2007|JY425468460101| 70225.1|
| 2008|E2628204040101| 98003.60000000003|
| 2009|YL327439080102| 30029.2|
| 2010|SQ429425090101| 4494.0|
+-------+--------------+------------------+

第8章 Spark SQL实战的更多相关文章

  1. 第7章 Spark SQL 的运行原理(了解)

    第7章 Spark SQL 的运行原理(了解) 7.1 Spark SQL运行架构 Spark SQL对SQL语句的处理和关系型数据库类似,即词法/语法解析.绑定.优化.执行.Spark SQL会先将 ...

  2. 第1章 Spark SQL概述

    第1章 Spark SQL概述 1.1 什么是Spark SQL Spark SQL是Spark用来处理结构化数据的一个模块,它提供了一个编程抽象叫做DataFrame并且作为分布式SQL查询引擎的作 ...

  3. Spark SQL实战

    一.程序 package sparklearning import org.apache.log4j.Logger import org.apache.spark.SparkConf import o ...

  4. 大数据技术之_19_Spark学习_03_Spark SQL 应用解析 + Spark SQL 概述、解析 、数据源、实战 + 执行 Spark SQL 查询 + JDBC/ODBC 服务器

    第1章 Spark SQL 概述1.1 什么是 Spark SQL1.2 RDD vs DataFrames vs DataSet1.2.1 RDD1.2.2 DataFrame1.2.3 DataS ...

  5. Spark SQL知识点大全与实战

    Spark SQL概述 1.什么是Spark SQL Spark SQL是Spark用于结构化数据(structured data)处理的Spark模块. 与基本的Spark RDD API不同,Sp ...

  6. Spark SQL知识点与实战

    Spark SQL概述 1.什么是Spark SQL Spark SQL是Spark用于结构化数据(structured data)处理的Spark模块. 与基本的Spark RDD API不同,Sp ...

  7. 以慕课网日志分析为例-进入大数据Spark SQL的世界

    下载地址.请联系群主 第1章 初探大数据 本章将介绍为什么要学习大数据.如何学好大数据.如何快速转型大数据岗位.本项目实战课程的内容安排.本项目实战课程的前置内容介绍.开发环境介绍.同时为大家介绍项目 ...

  8. 以某课网日志分析为例 进入大数据 Spark SQL 的世界

    第1章 初探大数据 本章将介绍为什么要学习大数据.如何学好大数据.如何快速转型大数据岗位.本项目实战课程的内容安排.本项目实战课程的前置内容介绍.开发环境介绍.同时为大家介绍项目中涉及的Hadoop. ...

  9. Spark SQL 源代码分析系列

    从决定写Spark SQL文章的源代码分析,到现在一个月的时间,一个又一个几乎相同的结束很快,在这里也做了一个综合指数,方便阅读,下面是读取顺序 :) 第一章 Spark SQL源代码分析之核心流程 ...

随机推荐

  1. EF批量插入太慢?那是你的姿势不对

    大概所有的程序员应该都接触过批量插入的场景,我也相信任何的程序员都能写出可正常运行的批量插入的代码.但怎样实现一个高效.快速插入的批量插入功能呢? 由于每个人的工作履历,工作年限的不同,在实现这样的一 ...

  2. 初识Http

     HTTP是一个用在计算机世界里的协议,它确立了一种计算机之间交流通信的规范,以及相关的各种控制 和错误处理方式.  HTTP专门用来在两点之间传输数据,不能用于广播.寻址或路由. HTTP传输的是文 ...

  3. C#结合SMTP实现邮件报警通知

    写在前面 C#是微软推出的一门面向对象的通用型编程语言,它除了可以开发PC软件.网站(借助 http://ASP.NET)和APP(基于 Windows Phone),还能作为游戏脚本,编写游戏逻辑. ...

  4. hostapd阅读(openwrt)-3

    从官网下载相对而言比较干净的源码版本http://w1.fi/hostapd/,然后将其移植到openwrt下,方便在源码阅读时候进行调试编译,移植的过程总结如下心得. 1. openwrt编译与cl ...

  5. php 导出数据到excel类

    原文链接地址:http://www.oschina.net/code/snippet_212240_21885 标注:在使用时一定要屏蔽掉//$bodyVal = $this->charset( ...

  6. 动态规划入门(dp)

    dp的基本思想,是把大问题转化成一个个小问题,然后递归解决. 所以本质思想的话还是递归. dp最重要的是要找到状态转移方程,也就是把大问题化解的过程. 举个例子 一个数字金字塔 在上面的数字三角形中寻 ...

  7. PHP xml_parse_into_struct() 函数

    定义和用法 xml_parse_into_struct() 函数把 XML 数据解析到数组中. 该函数把 XML 数据解析到 2 个数组中:高佣联盟 www.cgewang.com Value 数组 ...

  8. Hadoop学习之NCDC天气数据获取

    期望目的 下载<Hadoop权威教程>里用到的NCDC天气数据,供后续在此数据基础上跑mapred程序. 操作过程 步骤一.编写简单的shell脚本,下载数据文件到本地文件系统 已知NCD ...

  9. Python实现微信读书辅助工具

    [TOC] ##项目来源 这个有意思的项目是我从GitHub上找来的,起因是在不久前微信读书突然就设置了非会员书架数目上限,我总想做点什么来表达我的不满,想到可否用爬虫来获取某一本书的内容, 但是我技 ...

  10. intel:spectre&Meltdown侧信道攻击(二)

    上面一篇介绍了spectre&meltdown基本原理和简单的demo方案,今天继续学习一下该漏洞发现团队原始的POC:https://spectreattack.com/spectre.pd ...