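Over the past couple of days a colleague and I have been working out how to reduce the number of records in a table. Each record contains: objectid (serving cell), gridid (the grid the record belongs to), height, rsrp (serving-cell RSRP), n_objectid (neighbor cell) and n_rsrp (neighbor-cell RSRP).

In these records one serving cell has many neighbor-cell entries. The records are merged by grouping in two steps:

1) First group by objectid, gridid, height (the code below also includes rsrp in the key) and collect all neighbor information into collections;

2) On top of the result of 1), group by objectid and store gridid, height, rsrp, array(n_objectid), array(n_rsrp) as collections.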
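
To make the target shape concrete, here is a minimal sketch of the two grouping steps on plain Scala collections rather than on Spark. The `Rec` and `GridEntry` case classes, the `Double` field types and the `mergeByCell` name are assumptions made for illustration only; the sketch also assumes that rsrp is constant within one (objectid, gridid, height) group.

```scala
// Hypothetical record type mirroring the columns described above.
final case class Rec(objectid: String, gridid: String, height: String,
                     rsrp: Double, nObjectid: String, nRsrp: Double)

// One (gridid, height) entry of a serving cell together with its neighbor lists.
final case class GridEntry(gridid: String, height: String, rsrp: Double,
                           nObjectids: Seq[String], nRsrps: Seq[Double])

object GroupingSketch {
  def mergeByCell(recs: Seq[Rec]): Map[String, Seq[GridEntry]] = {
    // Step 1: group by (objectid, gridid, height) and collect the neighbor info.
    val perGrid: Seq[(String, GridEntry)] =
      recs.groupBy(r => (r.objectid, r.gridid, r.height)).toSeq.map {
        case ((obj, grid, h), rs) =>
          // Assumption: rsrp is the same for every record of the group.
          obj -> GridEntry(grid, h, rs.head.rsrp, rs.map(_.nObjectid), rs.map(_.nRsrp))
      }
    // Step 2: group the step-1 result by objectid; sort for a stable order.
    perGrid.groupBy(_._1).map { case (obj, entries) =>
      obj -> entries.map(_._2).sortBy(e => (e.gridid, e.height))
    }
  }
}
```

Each serving cell ends up as a single entry whose value holds one `GridEntry` per (gridid, height), which is the record reduction the two Spark aggregations below aim for.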

Approach 1: store the neighbor data as array<array<string>>, i.e. nested arrays of plain values instead of tuples

[my@sdd983 tommyduan_service]$ /app/my/fi_client/spark2/Spark2x/spark/bin/spark-shell
-- ::, | WARN | main | Unable to load native-hadoop library for your platform... using builtin-java classes where applicable | org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:)
-- ::, | WARN | main | In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN). | org.apache.spark.internal.Logging$class.logWarning(Logging.scala:)
-- ::, | WARN | main | Detected deprecated memory fraction settings: [spark.shuffle.memoryFraction, spark.storage.memoryFraction, spark.storage.unrollFraction]. As of Spark 1.6, execution and storage memory management are unified. All memory fractions used in the old model are now deprecated and no longer read. If you wish to use the old memory management, you may explicitly enable `spark.memory.useLegacyMode` (not recommended). | org.apache.spark.internal.Logging$class.logWarning(Logging.scala:)
Spark context Web UI available at http://192.168.143.332:23799
Spark context available as 'sc' (master = local[*], app id = local-).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.
      /_/

Using Scala version 2.11. (Java HotSpot(TM) -Bit Server VM, Java 1.8.0_72)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import spark.sql
import spark.sql
scala> import spark.implicits._
import spark.implicits._
scala> sql("use my_hive_db")
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
res0: org.apache.spark.sql.DataFrame = []
scala> var fpb_df = sql(
| s"""|select gridid,height,objectid,n_objectid,rsrp,(rsrp-n_rsrp) as rsrp_dis
| |from fpd_tabke
| |where p_city= and p_day= limit
| |""".stripMargin)
fpb_df: org.apache.spark.sql.DataFrame = [gridid: string, height: string ... more fields]
scala> var fpb_groupby_obj_grid_height_df1 = fpb_df.groupBy("objectid", "gridid", "height", "rsrp").agg(
| collect_list("n_objectid").alias("n_objectid1"),
| collect_list("rsrp_dis").alias("rsrp_dis1")
| ).select(col("objectid"), col("gridid"), col("height"), col("rsrp"), col("n_objectid1").alias("n_objectid"), col("rsrp_dis1").alias("rsrp_dis"))
fpb_groupby_obj_grid_height_df1: org.apache.spark.sql.DataFrame = [objectid: string, gridid: string ... more fields]
scala> var fpb_groupby_obj_df1 = fpb_groupby_obj_grid_height_df1.groupBy("objectid").agg(
| collect_list("gridid").alias("gridid1"),
| collect_list("height").alias("height1"),
| collect_list("rsrp").alias("rsrp1"),
| collect_list("n_objectid").alias("n_objectid1"),
| collect_list("rsrp_dis").alias("rsrp_dis1")
| ).select(col("objectid"), col("gridid1").alias("gridid"), col("height1").alias("height"), col("rsrp1").alias("rsrp"), col("n_objectid1").alias("n_objectid"),col("rsrp_dis1").alias("rsrp_dis"))
fpb_groupby_obj_df1: org.apache.spark.sql.DataFrame = [objectid: string, gridid: array<string> ... more fields]
scala> fpb_groupby_obj_df1.map(s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[String]](4), s.getSeq[Seq[Double]](5))).show
+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
| _1| _2| _3| _4| _5| _6|
+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
||[2676906_708106, ...|[, , , , , ...|[-130.399994, -...|[WrappedArray(...|[WrappedArray(0.0...|
+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
scala> fpb_groupby_obj_df1.map(s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[String]](4), s.getSeq[Seq[Double]](5))).schema
res4: org.apache.spark.sql.types.StructType =
StructType(
StructField(_1,StringType,true),
StructField(_2,ArrayType(StringType,true),true),
StructField(_3,ArrayType(StringType,true),true),
StructField(_4,ArrayType(StringType,true),true),
StructField(_5,ArrayType(ArrayType(StringType,true),true),true),
StructField(_6,ArrayType(ArrayType(DoubleType,false),true),true)
)
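
With Approach 1 the neighbor ids and the RSRP differences live in two parallel nested arrays, so a consumer has to pair them up again by position. A minimal sketch of that step in plain Scala follows; the `ZipSketch` object and the `zipNeighbours` name are hypothetical, and the sketch assumes the two arrays line up position by position because both were collected in the same aggregation.

```scala
object ZipSketch {
  // Pair each neighbor id with its RSRP difference, one inner list per
  // (gridid, height) entry. Assumes both inputs are aligned by position.
  def zipNeighbours(ids: Seq[Seq[String]],
                    dis: Seq[Seq[Double]]): Seq[Seq[(String, Double)]] =
    ids.zip(dis).map { case (idList, disList) => idList.zip(disList) }
}
```

For a row read as in the transcript above, this would be called on the `_5` and `_6` fields.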

Approach 2: store the neighbor data as array<array<(string,double)>>; reading it back fails.

scala> sql("use my_hive_db")
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
res0: org.apache.spark.sql.DataFrame = []
scala> var fpb_df = sql(
| s"""|select gridid,height,objectid,n_objectid,rsrp,(rsrp-n_rsrp) as rsrp_dis
| |from fpd_tabke
| |where p_city= and p_day= limit
| |""".stripMargin)
fpb_df: org.apache.spark.sql.DataFrame = [gridid: string, height: string ... more fields]
scala> var fpb_groupby_obj_grid_height_df2 = fpb_df.map(s =>
(s.getAs[String]("objectid"), s.getAs[String]("gridid"), s.getAs[String]("height"), s.getAs[String]("rsrp"), (s.getAs[String]("n_objectid"), s.getAs[Double]("rsrp_dis")))
).toDF("objectid", "gridid", "height", "rsrp", "neighbour").groupBy("objectid", "gridid", "height", "rsrp").agg(
collect_list("neighbour").alias("neighbour1")
).select(col("objectid"), col("gridid"), col("height"), col("rsrp"), col("neighbour1").alias("neighbour"))
scala> var fpb_groupby_obj_df2 = fpb_groupby_obj_grid_height_df2.groupBy("objectid").agg(
collect_list("gridid").alias("gridid1"),
collect_list("height").alias("height1"),
collect_list("rsrp").alias("rsrp1"),
collect_list("neighbour").alias("neighbour1")
).select(col("objectid"), col("gridid1").alias("gridid"), col("height1").alias("height"), col("rsrp1").alias("rsrp"), col("neighbour1").alias("neighbour"))
scala> val encoder = Encoders.tuple(
| Encoders.STRING,
| Encoders.javaSerialization[Seq[String]],
| Encoders.javaSerialization[Seq[String]],
| Encoders.javaSerialization[Seq[String]],
| Encoders.javaSerialization[Seq[Seq[(String, Double)]]]
| )
encoder: org.apache.spark.sql.Encoder[(String, Seq[String], Seq[String], Seq[String], Seq[Seq[(String, Double)]])] = class[_1[]: string, _2[]: binary, _3[]: binary, _4[]: binary, _5[]: binary]
scala> fpb_groupby_obj_df2.show
+---------+--------------------+--------------------+--------------------+--------------------+
| objectid| gridid| height| rsrp| neighbour|
+---------+--------------------+--------------------+--------------------+--------------------+
||[2676906_708106, ...|[, , , , , ...|[-130.399994, -...|[WrappedArray([...|
+---------+--------------------+--------------------+--------------------+--------------------+
scala> fpb_groupby_obj_df2.map { s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[(String, Double)]](4)) }(encoder).show
+---------+--------------------+--------------------+--------------------+--------------------+
| value| _2| _3| _4| _5|
+---------+--------------------+--------------------+--------------------+--------------------+
||[AC ED ...|[AC ED ...|[AC ED ...|[AC ED ...|
+---------+--------------------+--------------------+--------------------+--------------------+
scala> fpb_groupby_obj_df2.map(s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[(String, Double)]](4))).show()
[Stage :======================================================>( + ) / ]-- ::, | ERROR | Executor task launch worker for task | Exception in task 0.0 in stage 7.0 (TID ) | org.apache.spark.internal.Logging$class.logError(Logging.scala:)
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$$$anon$.hasNext(WholeStageCodegenExec.scala:)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$.apply(SparkPlan.scala:)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$.apply(SparkPlan.scala:)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$$$anonfun$apply$.apply(RDD.scala:)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$$$anonfun$apply$.apply(RDD.scala:)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:)
at org.apache.spark.scheduler.Task.run(Task.scala:)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:)
at java.lang.Thread.run(Thread.java:)
scala> val encoder = Encoders.tuple(
| Encoders.STRING,
| Encoders.kryo[Seq[String]],
| Encoders.kryo[Seq[String]],
| Encoders.kryo[Seq[String]],
| Encoders.kryo[Seq[Seq[(String, Double)]]]
| )
encoder: org.apache.spark.sql.Encoder[(String, Seq[String], Seq[String], Seq[String], Seq[Seq[(String, Double)]])] = class[_1[]: string, _2[]: binary, _3[]: binary, _4[]: binary, _5[]: binary]
scala> fpb_groupby_obj_df2.map { s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[(String, Double)]](4)) }(encoder).show
+---------+--------------------+--------------------+--------------------+--------------------+
| value| _2| _3| _4| _5|
+---------+--------------------+--------------------+--------------------+--------------------+
||[ ...|[ ...|[ ...|[ ...|
+---------+--------------------+--------------------+--------------------+--------------------+
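
The ClassCastException above is consistent with how Spark represents tuples inside a DataFrame: a (string, double) pair is stored as a struct, so each element comes back from a Row as another Row, not as a scala.Tuple2. The javaSerialization and kryo encoders avoid the exception only because they serialize every column into an opaque binary value, which is why show prints bytes instead of data. One possible workaround, sketched here on toy data, is to read the column as Seq[Seq[Row]] and convert it by hand. The `NeighbourTupleSketch` object, the `toPairs` helper and the sample values are hypothetical, and the sketch has not been run against the original Hive table.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

object NeighbourTupleSketch {
  // Convert array<array<struct<string,double>>> as read from a Row into tuples.
  def toPairs(nested: Seq[Seq[Row]]): Seq[Seq[(String, Double)]] =
    nested.map(_.map(r => (r.getString(0), r.getDouble(1))))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("neighbour-sketch").getOrCreate()
    import spark.implicits._

    // Toy rows standing in for the query result (values are made up).
    val df = Seq(
      ("c1", "g1", "0", "-100.0", "n1", 5.0),
      ("c1", "g1", "0", "-100.0", "n2", 10.0),
      ("c1", "g2", "3", "-95.0", "n1", 4.0)
    ).toDF("objectid", "gridid", "height", "rsrp", "n_objectid", "rsrp_dis")

    // Step 1: one row per (objectid, gridid, height, rsrp) with a list of structs.
    val perGrid = df.groupBy("objectid", "gridid", "height", "rsrp")
      .agg(collect_list(struct(col("n_objectid"), col("rsrp_dis"))).alias("neighbour"))

    // Step 2: one row per objectid.
    val perCell = perGrid.groupBy("objectid").agg(
      collect_list("gridid").alias("gridid"),
      collect_list("height").alias("height"),
      collect_list("rsrp").alias("rsrp"),
      collect_list("neighbour").alias("neighbour"))

    // Read the struct elements as Row, then build the tuples explicitly.
    val typed = perCell.map { s =>
      (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2),
        s.getSeq[String](3), toPairs(s.getSeq[Seq[Row]](4)))
    }
    typed.show(false)
    spark.stop()
  }
}
```

Because the conversion happens before the result is encoded, the default implicit encoder from spark.implicits._ can be used and no binary encoder is needed.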
