59. Using Spark Streaming with Spark SQL: real-time top-3 hot products example

I. Real-time top-3 hot products example

1. Overview

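Spark Streaming's greatest strength is that it integrates with Spark Core and Spark SQL. Earlier sections used operators such as transform and foreachRDD to run Spark Core batch operations on the RDDs inside a DStream. This section shows how to combine the RDDs inside a DStream with Spark SQL.

Example: every 10 seconds, count the clicks on each product in each category over the last 60 seconds, then find the top 3 most-clicked products in each category.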
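
To make the "last 60 seconds, every 10 seconds" part of the requirement concrete, here is a minimal plain-Java sketch of a sliding window count. It does not use Spark: the class SlidingWindowSketch, the Click type, and the sample events (including the product name xiaomi) are all invented for illustration, and the half-open window boundary is a simplifying assumption rather than a statement about Spark's internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of Spark: counts keys inside one sliding window. */
final class SlidingWindowSketch {

    /** One click event: the second at which it arrived and its category_product key. */
    static final class Click {
        final long second;
        final String key;

        Click(long second, String key) {
            this.second = second;
            this.key = key;
        }
    }

    /** Counts clicks per key with windowEnd - windowLength <= second < windowEnd. */
    static Map<String, Integer> countWindow(List<Click> clicks, long windowEnd, long windowLength) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Click c : clicks) {
            if (c.second >= windowEnd - windowLength && c.second < windowEnd) {
                counts.merge(c.key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample events
        List<Click> clicks = new ArrayList<>();
        clicks.add(new Click(5, "mobile_phone_iphone"));
        clicks.add(new Click(42, "mobile_phone_iphone"));
        clicks.add(new Click(65, "mobile_phone_xiaomi"));
        // Evaluate a 60-second window every 10 seconds, as in the requirement
        for (long end = 60; end <= 80; end += 10) {
            System.out.println("window ending at " + end + "s: " + countWindow(clicks, end, 60));
        }
    }
}
```

Consecutive windows overlap by 50 seconds, so the same click is counted in several successive results until it falls out of the window.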

2. Java example

package cn.spark.study.streaming;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

/**
 * Real-time top-3 hot products statistics, integrated with Spark SQL
 * @author Administrator
 *
 */
public class Top3HotProduct {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("Top3HotProduct");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Input log format, one click per line, fields separated by spaces:
        // leo iphone mobile_phone
        // The code below reads field 1 as the product and field 2 as the category.

        // A side note on why the Spark Streaming examples all read from a socket:
        // it is simply convenient. In production the most common source is Kafka,
        // but a socket is fine for practice and does not affect what we learn.
        // Input sources differ only in the few lines that create the input DStream;
        // the core is the real-time computation that follows. Both ways of using
        // Kafka as a source were covered and coded in earlier sections, so once the
        // examples are understood, switching to Kafka is straightforward.

        // Get the input stream
        JavaReceiverInputDStream<String> productClickLogsDStream = jssc.socketTextStream("spark1", 9999);

        // Map each click to the form (category_product, 1), so that the window
        // operation below can apply reduceByKey to the data in each window and
        // count the clicks on each product in each category
        JavaPairDStream<String, Integer> categoryProductPairsDStream = productClickLogsDStream
                .mapToPair(new PairFunction<String, String, Integer>() {
                    private static final long serialVersionUID = 1L;

                    @Override
                    public Tuple2<String, Integer> call(String productClickLog)
                            throws Exception {
                        String[] productClickLogSplited = productClickLog.split(" ");
                        return new Tuple2<String, Integer>(productClickLogSplited[2] + "_" +
                                productClickLogSplited[1], 1);
                    }
                });

        // Apply the window operation: every 10 seconds, run reduceByKey over the
        // data of the last 60 seconds to get the click count of each product in
        // each category within those 60 seconds
        JavaPairDStream<String, Integer> categoryProductCountsDStream =
                categoryProductPairsDStream.reduceByKeyAndWindow(
                        new Function2<Integer, Integer, Integer>() {
                            private static final long serialVersionUID = 1L;

                            @Override
                            public Integer call(Integer v1, Integer v2) throws Exception {
                                return v1 + v2;
                            }
                        }, Durations.seconds(60), Durations.seconds(10));

        // For the per-category, per-product click counts of the last 60 seconds,
        // use Spark SQL inside foreachRDD to compute the top-3 hot products
        categoryProductCountsDStream.foreachRDD(new Function<JavaPairRDD<String,Integer>, Void>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Void call(JavaPairRDD<String, Integer> categoryProductCountsRDD) throws Exception {
                // Convert the RDD to a JavaRDD<Row>
                JavaRDD<Row> categoryProductCountRowRDD = categoryProductCountsRDD.map(
                        new Function<Tuple2<String,Integer>, Row>() {
                            private static final long serialVersionUID = 1L;

                            @Override
                            public Row call(Tuple2<String, Integer> categoryProductCount)
                                    throws Exception {
                                // Split on the LAST underscore. A category such as
                                // mobile_phone contains "_" itself, so split("_")[0]
                                // would yield "mobile" and split("_")[1] "phone".
                                // This assumes product names contain no underscore.
                                String key = categoryProductCount._1;
                                int sep = key.lastIndexOf('_');
                                String category = key.substring(0, sep);
                                String product = key.substring(sep + 1);
                                Integer count = categoryProductCount._2;
                                return RowFactory.create(category, product, count);
                            }
                        });

                // Build the schema and convert the rows to a DataFrame
                List<StructField> structFields = new ArrayList<StructField>();
                structFields.add(DataTypes.createStructField("category", DataTypes.StringType, true));
                structFields.add(DataTypes.createStructField("product", DataTypes.StringType, true));
                structFields.add(DataTypes.createStructField("click_count", DataTypes.IntegerType, true));
                StructType structType = DataTypes.createStructType(structFields);

                // Note: this constructs a HiveContext on every invocation; reusing a
                // single instance across batches is usually preferable
                HiveContext hiveContext = new HiveContext(categoryProductCountsRDD.context());
                DataFrame categoryProductCountDF = hiveContext.createDataFrame(
                        categoryProductCountRowRDD, structType);

                // Register the 60-second click counts per category and product as a temporary table
                categoryProductCountDF.registerTempTable("product_click_log");

                // Query the temporary table for the 3 most-clicked products in each category
                DataFrame top3ProductDF = hiveContext.sql(
                        "SELECT category,product,click_count "
                        + "FROM ("
                            + "SELECT "
                                + "category,"
                                + "product,"
                                + "click_count,"
                                + "row_number() OVER (PARTITION BY category ORDER BY click_count DESC) rank "
                            + "FROM product_click_log"
                        + ") tmp "
                        + "WHERE rank<=3");

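                // In a production setting the result would normally not just be printed.
                // It would be saved to a Redis cache or a MySQL database, and a J2EE
                // system would then handle display, queries, and chart reports.
                top3ProductDF.show();
                return null;
            }
        });

        jssc.start();
        jssc.awaitTermination();
        jssc.close();
    }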

}
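
The job packs category and product into a single string key joined by an underscore, and later splits that key apart again. Because the category in the sample log line (mobile_phone) itself contains an underscore, the way the key is split matters. The following plain-Java sketch compares two approaches. It needs no Spark, and the class and method names are hypothetical, chosen only for this illustration.

```java
/** Hypothetical helper for illustration; these names do not appear in the job above. */
final class CategoryProductKeySketch {

    /** Builds the category_product key from a log line: field 1 is the product, field 2 the category. */
    static String toKey(String logLine) {
        String[] fields = logLine.split(" ");
        return fields[2] + "_" + fields[1];
    }

    /** Naive split: breaks the key at every underscore and keeps the first two pieces. */
    static String[] splitOnEveryUnderscore(String key) {
        String[] parts = key.split("_");
        return new String[] { parts[0], parts[1] };
    }

    /** Splits at the last underscore; correct only if product names contain no underscore. */
    static String[] splitOnLastUnderscore(String key) {
        int sep = key.lastIndexOf('_');
        return new String[] { key.substring(0, sep), key.substring(sep + 1) };
    }

    public static void main(String[] args) {
        String key = toKey("leo iphone mobile_phone");
        System.out.println(key);                                             // mobile_phone_iphone
        System.out.println(String.join(" | ", splitOnEveryUnderscore(key))); // mobile | phone
        System.out.println(String.join(" | ", splitOnLastUnderscore(key)));  // mobile_phone | iphone
    }
}
```

If product names may also contain underscores, neither approach is safe. In that case a separator that cannot occur in the fields, or a two-field key instead of a concatenated string, would be the more robust choice.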

3. Scala example

package cn.spark.study.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.hive.HiveContext

/**
 * @author Administrator
 */
object Top3HotProduct {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
        .setMaster("local[2]")
        .setAppName("Top3HotProduct")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Input lines look like: leo iphone mobile_phone
    val productClickLogsDStream = ssc.socketTextStream("spark1", 9999)

    // Map each click to (category_product, 1)
    val categoryProductPairsDStream = productClickLogsDStream
        .map { productClickLog =>
          val fields = productClickLog.split(" ")
          (fields(2) + "_" + fields(1), 1)
        }

    // Every 10 seconds, sum the counts per key over the last 60 seconds
    val categoryProductCountsDStream = categoryProductPairsDStream.reduceByKeyAndWindow(
        (v1: Int, v2: Int) => v1 + v2,
        Seconds(60),
        Seconds(10))

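    categoryProductCountsDStream.foreachRDD(categoryProductCountsRDD => {
      val categoryProductCountRowRDD = categoryProductCountsRDD.map(tuple => {
        // Split on the last underscore: a category such as mobile_phone contains
        // "_" itself, so split("_")(0) would yield "mobile".
        // This assumes product names contain no underscore.
        val sep = tuple._1.lastIndexOf('_')
        val category = tuple._1.substring(0, sep)
        val product = tuple._1.substring(sep + 1)
        val count = tuple._2
        Row(category, product, count)
      })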

      val structType = StructType(Array(
          StructField("category", StringType, true),
          StructField("product", StringType, true),
          StructField("click_count", IntegerType, true)))

      val hiveContext = new HiveContext(categoryProductCountsRDD.context)
      val categoryProductCountDF = hiveContext.createDataFrame(categoryProductCountRowRDD, structType)
      categoryProductCountDF.registerTempTable("product_click_log")

      val top3ProductDF = hiveContext.sql(
            "SELECT category,product,click_count "
            + "FROM ("
              + "SELECT "
                + "category,"
                + "product,"
                + "click_count,"
                + "row_number() OVER (PARTITION BY category ORDER BY click_count DESC) rank "
              + "FROM product_click_log"
            + ") tmp "
            + "WHERE rank<=3")

      top3ProductDF.show()
    })

    ssc.start()
    ssc.awaitTermination()
  }

}
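
Both versions rank products with row_number() OVER (PARTITION BY category ORDER BY click_count DESC) and keep the rows whose rank is at most 3. For readers less familiar with window functions, here is a rough plain-Java equivalent of that query. It is only an illustration of the logic, not how Spark SQL executes it. The class TopNPerCategorySketch, its ProductCount type, and the sample rows (products and counts) are invented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical plain-Java illustration of "top N per category"; not Spark code. */
final class TopNPerCategorySketch {

    /** One row of the product_click_log table: category, product, click_count. */
    static final class ProductCount {
        final String category;
        final String product;
        final int clickCount;

        ProductCount(String category, String product, int clickCount) {
            this.category = category;
            this.product = product;
            this.clickCount = clickCount;
        }
    }

    /** Returns, per category, the names of the n products with the highest click count. */
    static Map<String, List<String>> topN(List<ProductCount> rows, int n) {
        // PARTITION BY category
        Map<String, List<ProductCount>> byCategory = new TreeMap<>();
        for (ProductCount row : rows) {
            byCategory.computeIfAbsent(row.category, k -> new ArrayList<>()).add(row);
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<ProductCount>> entry : byCategory.entrySet()) {
            List<ProductCount> group = entry.getValue();
            // ORDER BY click_count DESC
            group.sort(Comparator.comparingInt((ProductCount r) -> r.clickCount).reversed());
            // WHERE rank <= n
            List<String> top = new ArrayList<>();
            for (int i = 0; i < group.size() && i < n; i++) {
                top.add(group.get(i).product);
            }
            result.put(entry.getKey(), top);
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample rows
        List<ProductCount> rows = new ArrayList<>();
        rows.add(new ProductCount("mobile_phone", "iphone", 5));
        rows.add(new ProductCount("mobile_phone", "huawei", 3));
        rows.add(new ProductCount("mobile_phone", "xiaomi", 9));
        rows.add(new ProductCount("mobile_phone", "oppo", 1));
        rows.add(new ProductCount("laptop", "thinkpad", 2));
        // prints {laptop=[thinkpad], mobile_phone=[xiaomi, iphone, huawei]}
        System.out.println(topN(rows, 3));
    }
}
```

Like row_number(), this keeps exactly n rows per category even when click counts tie. In this sketch tied products keep their input order, whereas the order among ties in the SQL query is not specified by the ORDER BY clause.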
