59. Using Spark Streaming with Spark SQL: a real-time top-3 hot products example

I. Real-time top-3 hot products example

1. Overview

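One of Spark Streaming's greatest strengths is that it integrates with Spark Core and Spark SQL. Earlier sections showed, through operators such as transform and foreachRDD, how to run Spark Core batch operations on the RDDs inside a DStream. This section shows how to combine the RDDs inside a DStream with Spark SQL.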

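Example: every 10 seconds, count the clicks on each product in each category over the last 60 seconds, then find the top 3 most popular products in each category.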
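The "every 10 seconds, over the last 60 seconds" requirement is a sliding window. As a rough illustration only, the plain-Java sketch below (no Spark involved) sums per-second click counts over such a window. The class name SlidingWindowSketch, the method windowTotals, and the idea of one array slot per 1-second batch are assumptions made for this sketch; in this simplified model the first few windows simply hold less than 60 seconds of data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of a 60-second window that slides every 10 seconds.
// Hypothetical helper for illustration; it is not part of the Spark API.
final class SlidingWindowSketch {

    static final int WINDOW_SECONDS = 60;
    static final int SLIDE_SECONDS = 10;

    // clicksPerSecond[i] holds the clicks received during second i + 1,
    // i.e. one 1-second batch. The result has one total per slide: the
    // clicks seen in the (at most) 60 seconds ending at that slide.
    static List<Integer> windowTotals(int[] clicksPerSecond) {
        List<Integer> totals = new ArrayList<>();
        for (int end = SLIDE_SECONDS; end <= clicksPerSecond.length; end += SLIDE_SECONDS) {
            int start = Math.max(0, end - WINDOW_SECONDS);
            int sum = 0;
            for (int i = start; i < end; i++) {
                sum += clicksPerSecond[i];
            }
            totals.add(sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // 80 seconds of traffic with exactly one click per second.
        int[] clicks = new int[80];
        Arrays.fill(clicks, 1);
        // prints [10, 20, 30, 40, 50, 60, 60, 60]
        System.out.println(windowTotals(clicks));
    }
}
```

Once 60 seconds have elapsed, every total covers exactly six slides' worth of data, which is why consecutive results overlap by 50 seconds.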

2. Java example

package cn.spark.study.streaming;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

/**
 * Real-time top-3 hot products, computed by combining Spark Streaming with Spark SQL.
 * @author Administrator
 */
public class Top3HotProduct {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("Top3HotProduct");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Input log format: user product category, separated by spaces, for example:
        // leo iphone mobile_phone

        // A side note on why these Spark Streaming examples all read from a socket:
        // it is simply convenient. In production, Kafka is the most common source.
        // Different sources differ only in the few lines that create the input DStream;
        // the real-time computation that follows is the same. Both ways of using Kafka
        // as a source were covered and practiced earlier, so switching to Kafka is easy.

        // Get the input stream.
        JavaReceiverInputDStream<String> productClickLogsDStream = jssc.socketTextStream("spark1", 9999);

        // Map every click to (category_product, 1) so that reduceByKey can later be
        // applied over a window to count clicks per product per category.
        JavaPairDStream<String, Integer> categoryProductPairsDStream = productClickLogsDStream
                .mapToPair(new PairFunction<String, String, Integer>() {

                    private static final long serialVersionUID = 1L;

                    @Override
                    public Tuple2<String, Integer> call(String productClickLog)
                            throws Exception {
                        String[] productClickLogSplited = productClickLog.split(" ");
                        return new Tuple2<String, Integer>(productClickLogSplited[2] + "_" +
                                productClickLogSplited[1], 1);
                    }

                });

        // Window operation: every 10 seconds, run reduceByKey over the last 60 seconds
        // of data to get the click count of each product in each category.
        JavaPairDStream<String, Integer> categoryProductCountsDStream =
                categoryProductPairsDStream.reduceByKeyAndWindow(
                        new Function2<Integer, Integer, Integer>() {

                            private static final long serialVersionUID = 1L;

                            @Override
                            public Integer call(Integer v1, Integer v2) throws Exception {
                                return v1 + v2;
                            }

                        }, Durations.seconds(60), Durations.seconds(10));

        // For each window's click counts, use Spark SQL inside foreachRDD
        // to compute the top 3 hot products per category.
        categoryProductCountsDStream.foreachRDD(new Function<JavaPairRDD<String,Integer>, Void>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Void call(JavaPairRDD<String, Integer> categoryProductCountsRDD) throws Exception {
                // Convert the RDD to a JavaRDD<Row>.
                JavaRDD<Row> categoryProductCountRowRDD = categoryProductCountsRDD.map(
                        new Function<Tuple2<String,Integer>, Row>() {

                            private static final long serialVersionUID = 1L;

                            @Override
                            public Row call(Tuple2<String, Integer> categoryProductCount)
                                    throws Exception {
                                // The key is category_product. Split at the last underscore so
                                // that a category such as mobile_phone stays intact (this assumes
                                // the product name itself contains no underscore).
                                String key = categoryProductCount._1;
                                int separatorIndex = key.lastIndexOf('_');
                                String category = key.substring(0, separatorIndex);
                                String product = key.substring(separatorIndex + 1);
                                Integer count = categoryProductCount._2;
                                return RowFactory.create(category, product, count);
                            }

                        });

                // Build the schema and convert to a DataFrame.
                List<StructField> structFields = new ArrayList<StructField>();
                structFields.add(DataTypes.createStructField("category", DataTypes.StringType, true));
                structFields.add(DataTypes.createStructField("product", DataTypes.StringType, true));
                structFields.add(DataTypes.createStructField("click_count", DataTypes.IntegerType, true));
                StructType structType = DataTypes.createStructType(structFields);

                // A HiveContext is used because the query below relies on the
                // row_number() window function.
                HiveContext hiveContext = new HiveContext(categoryProductCountsRDD.context());

                DataFrame categoryProductCountDF = hiveContext.createDataFrame(
                        categoryProductCountRowRDD, structType);

                // Register the 60-second click counts as a temporary table.
                categoryProductCountDF.registerTempTable("product_click_log");

                // Query the temporary table for the 3 most-clicked products in each category.
                DataFrame top3ProductDF = hiveContext.sql(
                        "SELECT category,product,click_count "
                        + "FROM ("
                            + "SELECT "
                                + "category,"
                                + "product,"
                                + "click_count,"
                                + "row_number() OVER (PARTITION BY category ORDER BY click_count DESC) rank "
                            + "FROM product_click_log"
                        + ") tmp "
                        + "WHERE rank<=3");

                // In a production setting the result would usually not be printed. It would
                // be saved to a Redis cache or a MySQL database, and a J2EE application
                // would then display, query, and chart the data.
                top3ProductDF.show();

                return null;
            }

        });

        jssc.start();
        jssc.awaitTermination();
        jssc.close();
    }

}
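One detail of the category_product key deserves a check: the sample log line "leo iphone mobile_phone" has a category that itself contains an underscore, so splitting the key on every underscore returns "mobile" and "phone" instead of "mobile_phone" and "iphone". The sketch below shows one possible way around this, splitting at the last underscore. The class name CompositeKeySketch and the method splitKey are hypothetical, and the approach assumes product names never contain an underscore; a separator that cannot occur in the data would be another option.

```java
import java.util.Arrays;

// Plain-Java sketch of parsing a "category_product" key.
// Assumption: the product name contains no underscore.
final class CompositeKeySketch {

    // Returns {category, product}, splitting at the LAST underscore only.
    static String[] splitKey(String key) {
        int separatorIndex = key.lastIndexOf('_');
        if (separatorIndex < 0) {
            throw new IllegalArgumentException("no separator in key: " + key);
        }
        return new String[] {
            key.substring(0, separatorIndex),
            key.substring(separatorIndex + 1)
        };
    }

    public static void main(String[] args) {
        String key = "mobile_phone" + "_" + "iphone";
        // Splitting on every underscore breaks the category apart.
        System.out.println(Arrays.toString(key.split("_")));   // prints [mobile, phone, iphone]
        // Splitting at the last underscore keeps it whole.
        System.out.println(Arrays.toString(splitKey(key)));    // prints [mobile_phone, iphone]
    }
}
```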

3. Scala example

package cn.spark.study.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.hive.HiveContext

/**
 * @author Administrator
 */
object Top3HotProduct {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
        .setMaster("local[2]")
        .setAppName("Top3HotProduct")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Input log format: user product category, for example: leo iphone mobile_phone
    val productClickLogsDStream = ssc.socketTextStream("spark1", 9999)

    // Map every click to (category_product, 1).
    val categoryProductPairsDStream = productClickLogsDStream
        .map { productClickLog =>
          val fields = productClickLog.split(" ")
          (fields(2) + "_" + fields(1), 1)
        }

    // Every 10 seconds, count clicks per key over the last 60 seconds.
    val categoryProductCountsDStream = categoryProductPairsDStream.reduceByKeyAndWindow(
        (v1: Int, v2: Int) => v1 + v2,
        Seconds(60),
        Seconds(10))

    categoryProductCountsDStream.foreachRDD(categoryProductCountsRDD => {
      val categoryProductCountRowRDD = categoryProductCountsRDD.map(tuple => {
        // Split at the last underscore so that a category such as mobile_phone
        // stays intact (assumes the product name contains no underscore).
        val separatorIndex = tuple._1.lastIndexOf('_')
        val category = tuple._1.substring(0, separatorIndex)
        val product = tuple._1.substring(separatorIndex + 1)
        val count = tuple._2
        Row(category, product, count)
      })

      val structType = StructType(Array(
          StructField("category", StringType, true),
          StructField("product", StringType, true),
          StructField("click_count", IntegerType, true)))

      val hiveContext = new HiveContext(categoryProductCountsRDD.context)

      val categoryProductCountDF = hiveContext.createDataFrame(categoryProductCountRowRDD, structType)

      // Register the window's click counts as a temporary table.
      categoryProductCountDF.registerTempTable("product_click_log")

      // Keep the 3 most-clicked products in each category.
      val top3ProductDF = hiveContext.sql(
            "SELECT category,product,click_count "
            + "FROM ("
              + "SELECT "
                + "category,"
                + "product,"
                + "click_count,"
                + "row_number() OVER (PARTITION BY category ORDER BY click_count DESC) rank "
              + "FROM product_click_log"
            + ") tmp "
            + "WHERE rank<=3")

      top3ProductDF.show()
    })

    ssc.start()
    ssc.awaitTermination()
  }

}
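The ranking step that both versions express in SQL (row_number() within each category, keeping rows whose rank is at most 3) can also be illustrated without Spark. The following is a minimal plain-Java sketch under stated assumptions, not the Spark implementation: the class name Top3RankingSketch and the method top3PerCategory are hypothetical, malformed lines are skipped, and ties are broken by product name here, whereas the SQL query orders only by click_count and so leaves the order of ties unspecified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of "count per (category, product), then top 3 per category".
final class Top3RankingSketch {

    // Each log line is "user product category". The result maps each category
    // to at most three "product=count" strings, highest count first.
    static Map<String, List<String>> top3PerCategory(List<String> clickLogs) {
        // Step 1: count clicks per product within each category.
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String log : clickLogs) {
            String[] fields = log.split(" ");
            if (fields.length < 3) {
                continue; // skip malformed lines
            }
            counts.computeIfAbsent(fields[2], k -> new HashMap<>())
                  .merge(fields[1], 1, Integer::sum);
        }
        // Step 2: rank products inside each category and keep the first three.
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            List<Map.Entry<String, Integer>> ranked = new ArrayList<>(entry.getValue().entrySet());
            ranked.sort((a, b) -> {
                int byCount = Integer.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            });
            List<String> top = new ArrayList<>();
            for (int i = 0; i < Math.min(3, ranked.size()); i++) {
                top.add(ranked.get(i).getKey() + "=" + ranked.get(i).getValue());
            }
            result.put(entry.getKey(), top);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> logs = Arrays.asList(
                "leo iphone mobile_phone",
                "jack iphone mobile_phone",
                "tom xiaomi mobile_phone",
                "amy huawei mobile_phone",
                "bob oppo mobile_phone",
                "bob huawei mobile_phone",
                "leo thinkpad laptop");
        // prints {laptop=[thinkpad=1], mobile_phone=[huawei=2, iphone=2, oppo=1]}
        System.out.println(top3PerCategory(logs));
    }
}
```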
