Spark: How to parallelize sc.parallelize(List(item1, item2)).collect().foreach(row => {})?

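Scenario:

1) Several data scenarios are defined. The job iterates over all of them, computes statistics over the data that matches each scenario's conditions in turn, and stores the results in Hive.

2) The existing code is as follows: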

    case class IndoorOTTCalibrateBuildingVecotrLegend(oid: Int, minHeight: Int, maxHeight: Int, minGridIDCount: Int, maxGridIDCount: Int, heightType: Int) extends Serializable

    // Building ranges: scenarios are split by grid count (area) and building height (shopping malls and similar)
    val buildingHeightLegends = List(
      IndoorOTTCalibrateBuildingVecotrLegend(1, 1, 30, 1, 21, BuildingCalibrateHeightType.HeightType1.toString.toInt),
      IndoorOTTCalibrateBuildingVecotrLegend(2, 1, 30, 21, 45, BuildingCalibrateHeightType.HeightType2.toString.toInt),
      IndoorOTTCalibrateBuildingVecotrLegend(3, 1, 30, 45, 100, BuildingCalibrateHeightType.HeightType3.toString.toInt),
      IndoorOTTCalibrateBuildingVecotrLegend(4, 30, 50, 1, 21, BuildingCalibrateHeightType.HeightType4.toString.toInt),
      IndoorOTTCalibrateBuildingVecotrLegend(5, 30, 50, 21, 45, BuildingCalibrateHeightType.HeightType5.toString.toInt),
      IndoorOTTCalibrateBuildingVecotrLegend(6, 30, 50, 45, 100, BuildingCalibrateHeightType.HeightType6.toString.toInt),
      IndoorOTTCalibrateBuildingVecotrLegend(7, 50, 5000, 1, 100, BuildingCalibrateHeightType.HeightType7.toString.toInt)
    )

    // parallelize(...).collect() only returns the list to the driver, so this loop handles one scenario at a time
    spark.sparkContext.parallelize(buildingHeightLegends).collect().foreach(buildingHeightLegend => {
      generateSampleBySenceType(spark, p_city, p_hour_start, p_hour_end, p_fpb_day, p_day_sample, linkLossCalibrateParameter, buildingHeightLegend)
    })

Note:

generateSampleBySenceType() contains a query of this form:

    spark.sql(s"""
      |xxx
      |where t10.height >= ${buildingHeightLegend.minHeight} and t10.height < ${buildingHeightLegend.maxHeight}
      |and t10.gridcount >= ${buildingHeightLegend.minGridIDCount} and t10.gridcount < ${buildingHeightLegend.maxGridIDCount}
      |""".stripMargin)

Suppose the code is changed to:

    val buildingHeightLegends_df = spark.sqlContext.createDataFrame(buildingHeightLegends)
    buildingHeightLegends_df.createOrReplaceTempView("temp_buildingheightlegends")

    spark.sql(s"""|select * from temp_buildingheightlegends""".stripMargin).repartition(buildingHeightLegends.length).foreachPartition(rows => {
      for (row <- rows) {
        // Row.getAs looks fields up by exact name, so use the case class field names
        val buildingHeightLegend = new IndoorOTTCalibrateBuildingVecotrLegend(
          row.getAs[Int]("oid"),
          row.getAs[Int]("minHeight"),
          row.getAs[Int]("maxHeight"),
          row.getAs[Int]("minGridIDCount"),
          row.getAs[Int]("maxGridIDCount"),
          row.getAs[Int]("heightType"))

        generateSampleBySenceType(spark, p_city, p_hour_start, p_hour_end, p_fpb_day, p_day_sample, linkLossCalibrateParameter, buildingHeightLegend)
      }
    })

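This fails: the SQL call inside generateSampleBySenceType() throws an exception because the SparkSession is null. The closure passed to foreachPartition runs on the executors, while a usable SparkSession exists only on the driver, so spark.sql cannot be called from inside it.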
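One alternative that this post does not pursue is to keep every call on the driver but submit the calls from separate threads, since Spark's scheduler accepts jobs from multiple threads of the same application. The sketch below shows only the threading pattern in plain Scala: `DriverSideConcurrency`, `runScene` and `runAll` are hypothetical names, and `runScene` is a stand-in for generateSampleBySenceType(), which in the real job would call spark.sql on the driver.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object DriverSideConcurrency {
  // Hypothetical stand-in for generateSampleBySenceType(...): the real function
  // would run spark.sql(...) on the driver and write its result to Hive.
  def runScene(sceneId: Int): String = s"scene-$sceneId done"

  // Start one Future per scenario, then block until all of them have finished.
  // Future.sequence keeps the results in the same order as the input list.
  def runAll(sceneIds: List[Int]): List[String] = {
    val jobs = sceneIds.map(id => Future(runScene(id)))
    Await.result(Future.sequence(jobs), 1.minute)
  }

  def main(args: Array[String]): Unit =
    runAll((1 to 7).toList).foreach(println)
}
```

The one-minute timeout and the global execution context are placeholders; a real job would likely need a longer timeout and a dedicated thread pool sized to the number of scenarios.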

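Fix:

Register buildingHeightLegends as the temporary table temp_buildingHeightLegends and remove the outer foreach. Then, inside generateSampleBySenceType(), cross join temp_buildingHeightLegends with the other result sets, so that a single query covers all scenarios.

The test code (SQL Server syntax) is as follows: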

    -- Scenario table
    CREATE TABLE [dbo].[test_senceitems](
        [sencetype] [int] NULL,
        [minheight] [int] NULL,
        [maxheight] [int] NULL,
        [mingridcount] [int] NULL,
        [maxgridcount] [int] NULL
    )

    INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (1, 1, 30, 1, 21)
    INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (2, 1, 30, 21, 45)
    INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (3, 1, 30, 45, 100)
    INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (4, 30, 50, 1, 21)
    INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (5, 30, 50, 21, 45)
    INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (6, 30, 50, 45, 100)
    INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (7, 50, 5000, 1, 100)

    -- Business table to be filtered and aggregated
    CREATE TABLE [dbo].[test_grid](
        [gridid] [nvarchar](50) NULL,
        [height] [int] NULL,
        [gridcount] [int] NULL
    )

    INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g1', 8, 23)
    INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g2', 3, 87)
    INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g3', 4, 34)
    INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g4', 30, 54)
    INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g5', 32, 32)
    INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g6', 32, 20)
    INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g7', 120, 34)
    INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g8', 89, 54)
    INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g9', 9, 16)

The sql(s"""|""".stripMargin) call inside generateSampleBySenceType() is then replaced with a query along these lines:

    select t10.*, t11.*
    from test_grid t10
    cross join test_senceitems t11
    where t10.height >= t11.minheight and t10.height < t11.maxheight
    and t10.gridcount >= t11.mingridcount and t10.gridcount < t11.maxgridcount
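As a quick sanity check of the join condition, the following plain-Scala sketch (no Spark involved) applies the same half-open range predicates to the test rows listed earlier. The names `SceneJoinCheck`, `Scene`, `Grid` and `matches` are hypothetical and exist only for this illustration; the data is copied from the two test tables.

```scala
object SceneJoinCheck {
  final case class Scene(sceneType: Int, minHeight: Int, maxHeight: Int, minGridCount: Int, maxGridCount: Int)
  final case class Grid(gridId: String, height: Int, gridCount: Int)

  // Rows of test_senceitems
  val scenes: List[Scene] = List(
    Scene(1, 1, 30, 1, 21), Scene(2, 1, 30, 21, 45), Scene(3, 1, 30, 45, 100),
    Scene(4, 30, 50, 1, 21), Scene(5, 30, 50, 21, 45), Scene(6, 30, 50, 45, 100),
    Scene(7, 50, 5000, 1, 100))

  // Rows of test_grid
  val grids: List[Grid] = List(
    Grid("g1", 8, 23), Grid("g2", 3, 87), Grid("g3", 4, 34),
    Grid("g4", 30, 54), Grid("g5", 32, 32), Grid("g6", 32, 20),
    Grid("g7", 120, 34), Grid("g8", 89, 54), Grid("g9", 9, 16))

  // Cross join filtered by the same predicates as the SQL:
  // min <= value < max for both height and grid count.
  def matches: List[(String, Int)] =
    for {
      g <- grids
      s <- scenes
      if g.height >= s.minHeight && g.height < s.maxHeight
      if g.gridCount >= s.minGridCount && g.gridCount < s.maxGridCount
    } yield (g.gridId, s.sceneType)

  def main(args: Array[String]): Unit =
    matches.foreach { case (gridId, sceneType) => println(s"$gridId -> scenario $sceneType") }
}
```

With this test data every grid falls into exactly one scenario (for example, g1 lands in scenario 2 and g7 in scenario 7), which is what the single cross-join query relies on to replace the seven separate per-scenario queries.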
