Spark: how to replace sc.parallelize(List(item1,item2)).collect().foreach(row=>{}) with parallel execution?

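Code scenario:

1) Several data scenarios are defined. The job iterates over all of them: for each scenario in turn, it computes statistics over the data that satisfies that scenario's conditions and stores the result in Hive.

2) The existing code is as follows: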

    case class IndoorOTTCalibrateBuildingVecotrLegend(oid: Int, minHeight: Int, maxHeight: Int, minGridIDCount: Int, maxGridIDCount: Int, heightType: Int) extends Serializable

    // Instantiate the building ranges: scenarios are split by grid count (area) and building height (shopping malls and similar)

    val buildingHeightLegends = List(

      IndoorOTTCalibrateBuildingVecotrLegend(1, 1, 30, 1, 21, BuildingCalibrateHeightType.HeightType1.toString.toInt),

      IndoorOTTCalibrateBuildingVecotrLegend(2, 1, 30, 21, 45, BuildingCalibrateHeightType.HeightType2.toString.toInt),

      IndoorOTTCalibrateBuildingVecotrLegend(3, 1, 30, 45, 100, BuildingCalibrateHeightType.HeightType3.toString.toInt),

      IndoorOTTCalibrateBuildingVecotrLegend(4, 30, 50, 1, 21, BuildingCalibrateHeightType.HeightType4.toString.toInt),

      IndoorOTTCalibrateBuildingVecotrLegend(5, 30, 50, 21, 45, BuildingCalibrateHeightType.HeightType5.toString.toInt),

      IndoorOTTCalibrateBuildingVecotrLegend(6, 30, 50, 45, 100, BuildingCalibrateHeightType.HeightType6.toString.toInt),

      IndoorOTTCalibrateBuildingVecotrLegend(7, 50, 5000, 1, 100, BuildingCalibrateHeightType.HeightType7.toString.toInt)

    )

    spark.sparkContext.parallelize(buildingHeightLegends).collect().foreach(buildingHeightLegend => {

      generateSampleBySenceType(spark, p_city, p_hour_start, p_hour_end, p_fpb_day, p_day_sample, linkLossCalibrateParameter, buildingHeightLegend)

    })

Note:

generateSampleBySenceType() contains the following:

spark.sql(s"""
|xxx
|where t10.height>=${buildingHeightLegend.minHeight} and t10.height<${buildingHeightLegend.maxHeight}
|and t10.gridcount>=${buildingHeightLegend.minGridIDCount} and t10.gridcount<${buildingHeightLegend.maxGridIDCount}
|""".stripMargin)

If the code is changed to:

    val buildingHeightLegends_df = spark.sqlContext.createDataFrame(buildingHeightLegends)

    buildingHeightLegends_df.createOrReplaceTempView("temp_buildingheightlegends")

    sql(s"""|select * from temp_buildingheightlegends""".stripMargin).repartition(buildingHeightLegends.length).foreachPartition(rows => {

      for (row <- rows) {

        val buildingHeightLegend = new IndoorOTTCalibrateBuildingVecotrLegend(

          row.getAs[Int]("oid"),

          row.getAs[Int]("minHeight"),

          row.getAs[Int]("maxHeight"),

          row.getAs[Int]("minGridIDCount"),

          row.getAs[Int]("maxGridIDCount"),

          row.getAs[Int]("heightType"))

        generateSampleBySenceType(spark, p_city, p_hour_start, p_hour_end, p_fpb_day, p_day_sample, linkLossCalibrateParameter, buildingHeightLegend)

      }

    })

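then the SQL call inside generateSampleBySenceType() throws an exception because the SparkSession is null. The body of foreachPartition runs on the executors, whereas the SparkSession only exists on the driver, so spark.sql cannot be called from inside it.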
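One alternative, not part of the original post: the original parallelize(...).collect().foreach loop brings the list back to the driver and runs the scenarios one after another. Spark accepts jobs submitted concurrently from several driver threads, so the per-scenario calls could be started from Scala Futures on the driver instead. The following is only a minimal sketch under stated assumptions: `Legend` is a trimmed, hypothetical stand-in for IndoorOTTCalibrateBuildingVecotrLegend, and `runScenario` is a hypothetical stub standing in for generateSampleBySenceType(spark, ..., legend), so that the example runs without a cluster.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ConcurrentScenarios {
  // Hypothetical, trimmed stand-in for the post's legend case class.
  case class Legend(oid: Int, minHeight: Int, maxHeight: Int)

  // Hypothetical stub for generateSampleBySenceType(spark, ..., legend).
  // In a real job this would call spark.sql(...) on the driver; here it
  // only returns a label so the sketch runs without Spark.
  def runScenario(legend: Legend): String = s"scenario-${legend.oid}"

  // Starts one Future per scenario on a fixed-size driver-side thread pool
  // and waits for all of them. Results come back in input order.
  def runAll(legends: List[Legend], parallelism: Int): List[String] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = legends.map(legend => Future(runScenario(legend)))
      Await.result(Future.sequence(futures), 1.hour)
    } finally {
      pool.shutdown()
    }
  }

  def main(args: Array[String]): Unit = {
    val legends = List(Legend(1, 1, 30), Legend(4, 30, 50), Legend(7, 50, 5000))
    println(runAll(legends, parallelism = 3))
    // prints: List(scenario-1, scenario-4, scenario-7)
  }
}
```

The closure passed to Future still runs on the driver, so the SparkSession stays usable inside it. Whether the jobs actually overlap depends on the resources available to the application.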

Fix:

Register buildingHeightLegends as a temporary view named temp_buildingHeightLegends, remove the outer foreach, and then, inside generateSampleBySenceType(), cross join temp_buildingHeightLegends with the other result sets:

The test code is as follows:

-- Scenario table

CREATE TABLE [dbo].[test_senceitems](

    [sencetype] [int] NULL,

    [minheight] [int] NULL,

    [maxheight] [int] NULL,

    [mingridcount] [int] NULL,

    [maxgridcount] [int] NULL

)

INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (1, 1, 30, 1, 21)

INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (2, 1, 30, 21, 45)

INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (3, 1, 30, 45, 100)

INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (4, 30, 50, 1, 21)

INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (5, 30, 50, 21, 45)

INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (6, 30, 50, 45, 100)

INSERT [dbo].[test_senceitems] ([sencetype], [minheight], [maxheight], [mingridcount], [maxgridcount]) VALUES (7, 50, 5000, 1, 100)

-- Business table to be filtered and aggregated

CREATE TABLE [dbo].[test_grid](

    [gridid] [nvarchar](50) NULL,

    [height] [int] NULL,

    [gridcount] [int] NULL

) 

INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g1', 8, 23)

INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g2', 3, 87)

INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g3', 4, 34)

INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g4', 30, 54)

INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g5', 32, 32)

INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g6', 32, 20)

INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g7', 120, 34)

INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g8', 89, 54)

INSERT [dbo].[test_grid] ([gridid], [height], [gridcount]) VALUES (N'g9', 9, 16)

Inside generateSampleBySenceType(), replace the sql(s"""|""".stripMargin) code with something like the following:

select t10.*,t11.*

from test_grid t10

cross join test_senceitems t11

where t10.height>=t11.minheight and t10.height<t11.maxheight

and t10.gridcount>=t11.mingridcount and t10.gridcount<t11.maxgridcount
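The same idea can be sketched in Spark itself. The example below is a minimal sketch, assuming a local SparkSession and the spark-sql dependency on the classpath; `SenceItem`, `GridRow`, and `CrossJoinSketch` are hypothetical names introduced for illustration, and the rows are copied from the test tables above. It registers both data sets as temporary views and evaluates all scenarios in a single query, so no per-scenario loop is needed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical stand-ins mirroring the two test tables above.
case class SenceItem(sencetype: Int, minheight: Int, maxheight: Int,
                     mingridcount: Int, maxgridcount: Int)
case class GridRow(gridid: String, height: Int, gridcount: Int)

object CrossJoinSketch {
  // Registers both data sets as temp views and matches every grid
  // against every scenario in one distributed query.
  def matchScenarios(spark: SparkSession): DataFrame = {
    import spark.implicits._

    Seq(
      SenceItem(1, 1, 30, 1, 21), SenceItem(2, 1, 30, 21, 45),
      SenceItem(3, 1, 30, 45, 100), SenceItem(4, 30, 50, 1, 21),
      SenceItem(5, 30, 50, 21, 45), SenceItem(6, 30, 50, 45, 100),
      SenceItem(7, 50, 5000, 1, 100)
    ).toDF().createOrReplaceTempView("test_senceitems")

    Seq(
      GridRow("g1", 8, 23), GridRow("g2", 3, 87), GridRow("g3", 4, 34),
      GridRow("g4", 30, 54), GridRow("g5", 32, 32), GridRow("g6", 32, 20),
      GridRow("g7", 120, 34), GridRow("g8", 89, 54), GridRow("g9", 9, 16)
    ).toDF().createOrReplaceTempView("test_grid")

    spark.sql(
      """select t11.sencetype, t10.gridid
        |from test_grid t10
        |cross join test_senceitems t11
        |where t10.height >= t11.minheight and t10.height < t11.maxheight
        |and t10.gridcount >= t11.mingridcount and t10.gridcount < t11.maxgridcount
        |""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cross-join-sketch")
      .getOrCreate()
    try {
      // One row per scenario with the number of grids that fall into it.
      matchScenarios(spark).groupBy("sencetype").count().orderBy("sencetype").show()
    } finally {
      spark.stop()
    }
  }
}
```

On the test data above each of the nine grids falls into exactly one scenario, so the joined result has nine rows. Because the scenario table is tiny, the join is cheap, and the scenario column can be used afterwards to group or partition the results written to Hive.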
