Two approaches to computing a complex TopN with Spark

Suppose you have product-category data as a pair RDD (categoryId, clickCount_orderCount_payCount). How would you compute the Top 5 with Spark?

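Assuming the code is written in Java, there are two approaches:

1. Reduce the data to an RDD(categoryObject), where categoryObject implements java.lang.Comparable, then call top(5) to get the top N.

2. Convert the data to a PairRDD(categoryKey, info), where categoryKey implements scala.math.Ordered, then sort with sortByKey in descending order and call take(5).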

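Notes:

1) In Spark's Java API, top(n) compares elements through java.lang.Comparable.

2) For sortByKey, this article makes the key implement scala.math.Ordered. Ordered itself extends java.lang.Comparable, so compare and compareTo on the key must agree with each other.

Approach 2 costs more time and space than approach 1, because it sorts the entire data set. It is the better choice only when the fully sorted result is needed anyway and the top N falls out as a by-product.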
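
To make the cost difference concrete, here is a plain-Java sketch (no Spark involved) of selecting the top N with a bounded min-heap instead of sorting everything. The class and method names are hypothetical, and this is only an illustration of the general idea, not a copy of Spark's internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class TopNSketch {
    // Keeps at most n elements in a min-heap, so memory stays O(n)
    // and each element costs O(log n) instead of a full O(m log m) sort.
    static <T extends Comparable<T>> List<T> topN(Iterable<T> items, int n) {
        List<T> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        PriorityQueue<T> heap = new PriorityQueue<>(n); // smallest element on top
        for (T item : items) {
            if (heap.size() < n) {
                heap.offer(item);
            } else if (item.compareTo(heap.peek()) > 0) {
                heap.poll();       // evict the current smallest
                heap.offer(item);
            }
        }
        result.addAll(heap);
        result.sort(Collections.reverseOrder()); // largest first
        return result;
    }

    public static void main(String[] args) {
        List<Long> counts = Arrays.asList(7L, 42L, 3L, 19L, 88L, 1L, 56L);
        System.out.println(topN(counts, 3)); // prints [88, 56, 42]
    }
}
```

A full sort would order all seven values before discarding four of them; the heap never holds more than three.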

Implementation of approach 1:

CategoryObject:

package com.stan.core.spark.userAction;

import java.io.Serializable;

public class ComparableCategoryObject
    implements Comparable<ComparableCategoryObject>, Serializable {

    String categoryId;
    Long clickCategoryCount;
    Long orderCategoryCount;
    Long defrayCategoryCount;

    // Compare pay count first, then order count, then click count.
    // Comparing field by field avoids the pitfalls of a weighted sum:
    // a lower-priority field can no longer outweigh a higher-priority one,
    // and no nonzero difference is truncated to 0.
    @Override
    public int compareTo(ComparableCategoryObject o) {
        int byDefray = Long.compare(this.defrayCategoryCount, o.defrayCategoryCount);
        if (byDefray != 0) {
            return byDefray;
        }
        int byOrder = Long.compare(this.orderCategoryCount, o.orderCategoryCount);
        if (byOrder != 0) {
            return byOrder;
        }
        return Long.compare(this.clickCategoryCount, o.clickCategoryCount);
    }

    @Override
    public String toString() {
        return "ComparableCategoryObject{" +
                "categoryId='" + categoryId + '\'' +
                ", clickCategoryCount=" + clickCategoryCount +
                ", orderCategoryCount=" + orderCategoryCount +
                ", defrayCategoryCount=" + defrayCategoryCount +
                '}';
    }

    public String getCategoryId() {
        return categoryId;
    }

    public void setCategoryId(String categoryId) {
        this.categoryId = categoryId;
    }

    public Long getClickCategoryCount() {
        return clickCategoryCount;
    }

    public void setClickCategoryCount(Long clickCategoryCount) {
        this.clickCategoryCount = clickCategoryCount;
    }

    public Long getOrderCategoryCount() {
        return orderCategoryCount;
    }

    public void setOrderCategoryCount(Long orderCategoryCount) {
        this.orderCategoryCount = orderCategoryCount;
    }

    public Long getDefrayCategoryCount() {
        return defrayCategoryCount;
    }

    public void setDefrayCategoryCount(Long defrayCategoryCount) {
        this.defrayCategoryCount = defrayCategoryCount;
    }
}
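
When a compareTo has to honor a multi-field priority, comparing field by field is safer than folding the differences into one weighted number. The sketch below is plain Java with a hypothetical Counts class standing in for the category object; it contrasts a field-by-field comparator with a weighted sum of the form pay diff * 10000 + order diff * 100 + click diff reduced by % 1000.

```java
import java.util.Comparator;

class CompareSketch {
    // Hypothetical minimal stand-in for the three category counts.
    static final class Counts {
        final long click;
        final long order;
        final long pay;

        Counts(long click, long order, long pay) {
            this.click = click;
            this.order = order;
            this.pay = pay;
        }
    }

    // Field by field: pay count, then order count, then click count.
    static final Comparator<Counts> LEXICOGRAPHIC =
            Comparator.comparingLong((Counts c) -> c.pay)
                    .thenComparingLong(c -> c.order)
                    .thenComparingLong(c -> c.click);

    // Weighted sum folded into an int, shown only for contrast.
    static int weighted(Counts a, Counts b) {
        long n = (a.pay - b.pay) * 10000
                + (a.order - b.order) * 100
                + (a.click - b.click);
        return (int) (n % 1000);
    }

    public static void main(String[] args) {
        // a has one more payment than b, yet 10000 % 1000 == 0 reports "equal".
        Counts a = new Counts(0, 0, 2);
        Counts b = new Counts(0, 0, 1);
        System.out.println(weighted(a, b));                   // prints 0
        System.out.println(LEXICOGRAPHIC.compare(a, b) > 0);  // prints true

        // d has more payments than c, yet c's 201 extra orders outweigh it.
        Counts c = new Counts(0, 201, 1);
        Counts d = new Counts(0, 0, 2);
        System.out.println(weighted(c, d));                   // prints 100
        System.out.println(LEXICOGRAPHIC.compare(c, d) < 0);  // prints true
    }
}
```

In both cases the weighted version disagrees with the stated priority, which would put the wrong categories into the top N.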

Calling code:

        // Requires: java.util.List, scala.Tuple2,
        // org.apache.spark.api.java.JavaRDD, org.apache.spark.api.java.function.Function.
        // categoryId2allCount is the JavaPairRDD<String, String> described above.

        // 1. Wrap each record in a ComparableCategoryObject
        JavaRDD<ComparableCategoryObject> comparableCategoryObjectJavaRDD =
                categoryId2allCount.map(
                        new Function<Tuple2<String, String>, ComparableCategoryObject>() {
                            @Override
                            public ComparableCategoryObject call(Tuple2<String, String> stringStringTuple2) throws Exception {
                                String categoryId = stringStringTuple2._1;
                                String allCount = stringStringTuple2._2;
                                // allCount has the form clickCount_orderCount_payCount
                                String[] tmpAllCountSplited = allCount.split("_");
                                Long clickCount = Long.valueOf(tmpAllCountSplited[0]);
                                Long orderCount = Long.valueOf(tmpAllCountSplited[1]);
                                Long defrayCount = Long.valueOf(tmpAllCountSplited[2]);
                                ComparableCategoryObject comparableCategoryObject =
                                        new ComparableCategoryObject();
                                comparableCategoryObject.setCategoryId(categoryId);
                                comparableCategoryObject.setClickCategoryCount(clickCount);
                                comparableCategoryObject.setOrderCategoryCount(orderCount);
                                comparableCategoryObject.setDefrayCategoryCount(defrayCount);
                                return comparableCategoryObject;
                            }
                        }
                );

        // 2. top(5) returns the five largest elements according to compareTo
        List<ComparableCategoryObject> top5Categories = comparableCategoryObjectJavaRDD.top(5);
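
The value string clickCount_orderCount_payCount is split by hand in the calling code. A small helper can do the parsing in one place and reject malformed input early. This is a minimal sketch; the CategoryCounts class and its parse method are hypothetical names, not part of the original project.

```java
class CategoryCounts {
    final long click;
    final long order;
    final long pay;

    CategoryCounts(long click, long order, long pay) {
        this.click = click;
        this.order = order;
        this.pay = pay;
    }

    // Parses a value of the form "click_order_pay".
    // Throws IllegalArgumentException if there are not exactly three parts
    // or if a part is not a valid long (NumberFormatException is a subclass).
    static CategoryCounts parse(String allCount) {
        String[] parts = allCount.split("_");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "expected click_order_pay, got: " + allCount);
        }
        return new CategoryCounts(
                Long.parseLong(parts[0]),
                Long.parseLong(parts[1]),
                Long.parseLong(parts[2]));
    }

    public static void main(String[] args) {
        CategoryCounts c = parse("120_30_8");
        System.out.println(c.click + " " + c.order + " " + c.pay); // prints 120 30 8
    }
}
```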

Implementation of approach 2:

CategoryKey:

package com.stan.core.spark.userAction;

// java.io.Serializable is available regardless of the Scala version in use
import java.io.Serializable;

import scala.math.Ordered;

/**
 * Sort key ordered by the priority
 * (defrayCategoryCount, orderCategoryCount, clickCategoryCount).
 */
public class ComparableCategoryKey
        // Comparable on the Scala side, so the RDD can be sorted by this key
        implements Ordered<ComparableCategoryKey>, Serializable {

    String categoryId;
    Long clickCategoryCount;
    Long orderCategoryCount;
    Long defrayCategoryCount;

    /**
     * Compares two keys field by field.
     *
     * Priority: pay count first; if equal, order count; if still equal, click count.
     * The fields are compared one at a time rather than folded into a single
     * weighted number (pay diff * 10000 + order diff * 100 + click diff), because
     * a large difference in a lower-priority field can outweigh a higher-priority
     * one, and reducing the sum with % 1000 can turn a nonzero result into 0.
     *
     * @param other the key to compare with
     * @return a negative value, zero, or a positive value if this key is
     *         less than, equal to, or greater than other
     */
    @Override
    public int compare(ComparableCategoryKey other) {
        int byDefray = Long.compare(this.defrayCategoryCount, other.defrayCategoryCount);
        if (byDefray != 0) {
            return byDefray;
        }
        int byOrder = Long.compare(this.orderCategoryCount, other.orderCategoryCount);
        if (byOrder != 0) {
            return byOrder;
        }
        return Long.compare(this.clickCategoryCount, other.clickCategoryCount);
    }

    @Override
    public boolean $less(ComparableCategoryKey other) {
        return compare(other) < 0;
    }

    @Override
    public boolean $greater(ComparableCategoryKey other) {
        return compare(other) > 0;
    }

    @Override
    public boolean $less$eq(ComparableCategoryKey other) {
        return compare(other) <= 0;
    }

    @Override
    public boolean $greater$eq(ComparableCategoryKey other) {
        return compare(other) >= 0;
    }

    // Must agree with compare, since Ordered extends java.lang.Comparable
    @Override
    public int compareTo(ComparableCategoryKey other) {
        return compare(other);
    }

    public String getCategoryId() {
        return categoryId;
    }

    public void setCategoryId(String categoryId) {
        this.categoryId = categoryId;
    }

    public Long getClickCategoryCount() {
        return clickCategoryCount;
    }

    public void setClickCategoryCount(Long clickCategoryCount) {
        this.clickCategoryCount = clickCategoryCount;
    }

    public Long getOrderCategoryCount() {
        return orderCategoryCount;
    }

    public void setOrderCategoryCount(Long orderCategoryCount) {
        this.orderCategoryCount = orderCategoryCount;
    }

    public Long getDefrayCategoryCount() {
        return defrayCategoryCount;
    }

    public void setDefrayCategoryCount(Long defrayCategoryCount) {
        this.defrayCategoryCount = defrayCategoryCount;
    }

    @Override
    public String toString() {
        return "ComparableCategoryKey{" +
                "categoryId='" + categoryId + '\'' +
                ", clickCategoryCount=" + clickCategoryCount +
                ", orderCategoryCount=" + orderCategoryCount +
                ", defrayCategoryCount=" + defrayCategoryCount +
                '}';
    }
}

Calling code:

        // Requires: java.util.List, scala.Tuple2,
        // org.apache.spark.api.java.JavaPairRDD, org.apache.spark.api.java.function.PairFunction.

        // 1. Wrap each record as (categoryKey, info)
        JavaPairRDD<ComparableCategoryKey, String> comparableCategory2AllCountRDD =
                categoryId2allCount.mapToPair(
                        new PairFunction<Tuple2<String, String>, ComparableCategoryKey, String>() {
                            @Override
                            public Tuple2<ComparableCategoryKey, String> call(Tuple2<String, String> stringStringTuple2) throws Exception {
                                String categoryId = stringStringTuple2._1;
                                String allCount = stringStringTuple2._2;
                                // allCount has the form clickCount_orderCount_payCount
                                String[] tmpAllCountSplited = allCount.split("_");
                                Long clickCount = Long.valueOf(tmpAllCountSplited[0]);
                                Long orderCount = Long.valueOf(tmpAllCountSplited[1]);
                                Long defrayCount = Long.valueOf(tmpAllCountSplited[2]);
                                ComparableCategoryKey comparableCategoryWithAllCount =
                                        new ComparableCategoryKey();
                                comparableCategoryWithAllCount.setCategoryId(categoryId);
                                comparableCategoryWithAllCount.setClickCategoryCount(clickCount);
                                comparableCategoryWithAllCount.setOrderCategoryCount(orderCount);
                                comparableCategoryWithAllCount.setDefrayCategoryCount(defrayCount);
                                return new Tuple2<>(comparableCategoryWithAllCount, allCount);
                            }
                        }
                );

        // 2. Sort by key. RDDs are immutable, so sortByKey returns a new RDD
        //    that must be kept; false requests descending order (largest first).
        JavaPairRDD<ComparableCategoryKey, String> sortedCategory2AllCountRDD =
                comparableCategory2AllCountRDD.sortByKey(false);

        // 3. Take the first five
        List<Tuple2<ComparableCategoryKey, String>> top5Categories = sortedCategory2AllCountRDD.take(5);
