这两天和同事一起在想着如何把一个表的记录减少,表记录包含了:objectid(主小区信息),gridid(归属栅格),height(高度),rsrp(主小区rsrp),n_objectid(邻区),n_rsrp(邻小区rsrp)

记录中一个主小区对应有多个邻区信息,在分组合并记录时:

1)先按照objectid,gridid,height进行分组,把所有邻区信息给存储到集合中;

2)基于1)的结果之上,按照objectid分组,把gridid,height,rsrp,array(n_objectid),array(n_rsrp)作为集合存储。

实现思路一:采用array<array<string>>单维元祖存储

[my@sdd983 tommyduan_service]$ /app/my/fi_client/spark2/Spark2x/spark/bin/spark-shell
-- ::, | WARN | main | Unable to load native-hadoop library for your platform... using builtin-java classes where applicable | org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:)
-- ::, | WARN | main | In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN). | org.apache.spark.internal.Logging$class.logWarning(Logging.scala:)
-- ::, | WARN | main | Detected deprecated memory fraction settings: [spark.shuffle.memoryFraction, spark.storage.memoryFraction, spark.storage.unrollFraction]. As of Spark 1.6, execution and storage memory management are unified. All memory fractions used in the old model are now deprecated and no longer read. If you wish to use the old memory management, you may explicitly enable `spark.memory.useLegacyMode` (not recommended). | org.apache.spark.internal.Logging$class.logWarning(Logging.scala:)
Spark context Web UI available at http://192.168.143.332:23799
Spark context available as 'sc' (master = local[*], app id = local-).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.
/_/ Using Scala version 2.11. (Java HotSpot(TM) -Bit Server VM, Java 1.8.0_72)
Type in expressions to have them evaluated.
Type :help for more information. scala> import spark.sql
import spark.sql scala> import spark.implicits._
import spark.implicits._ scala> sql("use my_hive_db")
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
res0: org.apache.spark.sql.DataFrame = []
scala> var fpb_df = sql(
| s"""|select gridid,height,objectid,n_objectid,rsrp,(rsrp-n_rsrp) as rsrp_dis
| |from fpd_tabke
| |where p_city= and p_day= limit
| |""".stripMargin)
fpb_df: org.apache.spark.sql.DataFrame = [gridid: string, height: string ... more fields] scala> var fpb_groupby_obj_grid_height_df1 = fpb_df.groupBy("objectid", "gridid", "height", "rsrp").agg(
| collect_list("n_objectid").alias("n_objectid1"),
| collect_list("rsrp_dis").alias("rsrp_dis1")
| ).select(col("objectid"), col("gridid"), col("height"), col("rsrp"), col("n_objectid1").alias("n_objectid"), col("rsrp_dis1").alias("rsrp_dis"))
fpb_groupby_obj_grid_height_df1: org.apache.spark.sql.DataFrame = [objectid: string, gridid: string ... more fields] scala> var fpb_groupby_obj_df1 = fpb_groupby_obj_grid_height_df1.groupBy("objectid").agg(
| collect_list("gridid").alias("gridid1"),
| collect_list("height").alias("height1"),
| collect_list("rsrp").alias("rsrp1"),
| collect_list("n_objectid").alias("n_objectid1"),
| collect_list("rsrp_dis").alias("rsrp_dis1")
| ).select(col("objectid"), col("gridid1").alias("gridid"), col("height1").alias("height"), col("rsrp1").alias("rsrp"), col("n_objectid1").alias("n_objectid"),col("rsrp_dis1").alias("rsrp_dis"))
fpb_groupby_obj_df1: org.apache.spark.sql.DataFrame = [objectid: string, gridid: array<string> ... more fields] scala> fpb_groupby_obj_df1.map(s => (s.getAs[String]("objectid"), s.getSeq[String](), s.getSeq[String](), s.getSeq[String](), s.getSeq[Seq[String]](), s.getSeq[Seq[Double]]())).show
+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
| _1| _2| _3| _4| _5| _6|
+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
||[2676906_708106, ...|[, , , , , ...|[-130.399994, -...|[WrappedArray(...|[WrappedArray(0.0...|
+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
scala> fpb_groupby_obj_df1.map(s => (s.getAs[String]("objectid"), s.getSeq[String](), s.getSeq[String](), s.getSeq[String](), s.getSeq[Seq[String]](), s.getSeq[Seq[Double]]())).schema
res4: org.apache.spark.sql.types.StructType =
StructType(
StructField(_1,StringType,true),
StructField(_2,ArrayType(StringType,true),true),
StructField(_3,ArrayType(StringType,true),true),
StructField(_4,ArrayType(StringType,true),true),
StructField(_5,ArrayType(ArrayType(StringType,true),true),true),
StructField(_6,ArrayType(ArrayType(DoubleType,false),true),true)
)

方案二:存储格式为:array<array<(string,double)>>,读取失败。

scala> sql("use my_hive_db")
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
res0: org.apache.spark.sql.DataFrame = []
scala> var fpb_df = sql(
| s"""|select gridid,height,objectid,n_objectid,rsrp,(rsrp-n_rsrp) as rsrp_dis
| |from fpd_tabke
| |where p_city= and p_day= limit
| |""".stripMargin)
fpb_df: org.apache.spark.sql.DataFrame = [gridid: string, height: string ... more fields]
scala> var fpb_groupby_obj_grid_height_df2 = fpb_df.map(s =>
(s.getAs[String]("objectid"), s.getAs[String]("gridid"), s.getAs[String]("height"), s.getAs[String]("rsrp"), (s.getAs[String]("n_objectid"), s.getAs[Double]("rsrp_dis")))
).toDF("objectid", "gridid", "height", "rsrp", "neighbour").groupBy("objectid", "gridid", "height", "rsrp").agg(
collect_list("neighbour").alias("neighbour1")
).select(col("objectid"), col("gridid"), col("height"), col("rsrp"), col("neighbour1").alias("neighbour")) scala> var fpb_groupby_obj_df2 = fpb_groupby_obj_grid_height_df2.groupBy("objectid").agg(
collect_list("gridid").alias("gridid1"),
collect_list("height").alias("height1"),
collect_list("rsrp").alias("rsrp1"),
collect_list("neighbour").alias("neighbour1")
).select(col("objectid"), col("gridid1").alias("gridid"), col("height1").alias("height"), col("rsrp1").alias("rsrp"), col("neighbour1").alias("neighbour"))
scala> val encoder = Encoders.tuple(
| Encoders.STRING,
| Encoders.javaSerialization[Seq[String]],
| Encoders.javaSerialization[Seq[String]],
| Encoders.javaSerialization[Seq[String]],
| Encoders.javaSerialization[Seq[Seq[(String, Double)]]]
| )
encoder: org.apache.spark.sql.Encoder[(String, Seq[String], Seq[String], Seq[String], Seq[Seq[(String, Double)]])] = class[_1[]: string, _2[]: binary, _3[]: binary, _4[]: binary, _5[]: binary] scala> fpb_groupby_obj_df2.show
+---------+--------------------+--------------------+--------------------+--------------------+
| objectid| gridid| height| rsrp| neighbour|
+---------+--------------------+--------------------+--------------------+--------------------+
||[2676906_708106, ...|[, , , , , ...|[-130.399994, -...|[WrappedArray([...|
+---------+--------------------+--------------------+--------------------+--------------------+ scala> fpb_groupby_obj_df2.map { s => (s.getAs[String]("objectid"), s.getSeq[String](), s.getSeq[String](), s.getSeq[String](), s.getSeq[Seq[(String, Double)]]()) }(encoder).show
+---------+--------------------+--------------------+--------------------+--------------------+
| value| _2| _3| _4| _5|
+---------+--------------------+--------------------+--------------------+--------------------+
||[AC ED ...|[AC ED ...|[AC ED ...|[AC ED ...|
+---------+--------------------+--------------------+--------------------+--------------------+ scala> fpb_groupby_obj_df2.map(s => (s.getAs[String]("objectid"), s.getSeq[String](), s.getSeq[String](), s.getSeq[String](), s.getSeq[Seq[(String, Double)]]())).show()
[Stage :======================================================>( + ) / ]-- ::, | ERROR | Executor task launch worker for task | Exception in task 0.0 in stage 7.0 (TID ) | org.apache.spark.internal.Logging$class.logError(Logging.scala:)
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$$$anon$.hasNext(WholeStageCodegenExec.scala:)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$.apply(SparkPlan.scala:)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$.apply(SparkPlan.scala:)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$$$anonfun$apply$.apply(RDD.scala:)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$$$anonfun$apply$.apply(RDD.scala:)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:)
at org.apache.spark.scheduler.Task.run(Task.scala:)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:)
at java.lang.Thread.run(Thread.java:) scala> val encoder = Encoders.tuple(
| Encoders.STRING,
| Encoders.kryo[Seq[String]],
| Encoders.kryo[Seq[String]],
| Encoders.kryo[Seq[String]],
| Encoders.kryo[Seq[Seq[(String, Double)]]]
| )
encoder: org.apache.spark.sql.Encoder[(String, Seq[String], Seq[String], Seq[String], Seq[Seq[(String, Double)]])] = class[_1[]: string, _2[]: binary, _3[]: binary, _4[]: binary, _5[]: binary] scala> fpb_groupby_obj_df2.map { s => (s.getAs[String]("objectid"), s.getSeq[String](), s.getSeq[String](), s.getSeq[String](), s.getSeq[Seq[(String, Double)]]()) }(encoder).show
+---------+--------------------+--------------------+--------------------+--------------------+
| value| _2| _3| _4| _5|
+---------+--------------------+--------------------+--------------------+--------------------+
||[ ...|[ ...|[ ...|[ ...|
+---------+--------------------+--------------------+--------------------+--------------------+

spark2.1:读取hive中存储的多元组(string,double)失败的更多相关文章

  1. SparkSQL读取Hive中的数据

    由于我Spark采用的是Cloudera公司的CDH,并且安装的时候是在线自动安装和部署的集群.最近在学习SparkSQL,看到SparkSQL on HIVE.下面主要是介绍一下如何通过SparkS ...

  2. 通过spark-sql快速读取hive中的数据

    1 配置并启动 1.1 创建并配置hive-site.xml 在运行Spark SQL CLI中需要使用到Hive Metastore,故需要在Spark中添加其uris.具体方法是将HIVE_CON ...

  3. [转] C#实现在Sql Server中存储和读取Word文件 (Not Correct Modified)

    出处 C#实现在Sql Server中存储和读取Word文件 要实现在Sql Server中实现将文件读写Word文件,需要在要存取的表中添加Image类型的列,示例表结构为: CREATE TABL ...

  4. excel中存储的icount,赋值完之后

    最近需要实现一个功能,为了确保每次函数运行的时候count是唯一的,所以想读取excel中存储的icount,赋值完之后对其进行+1操作,并存入excel文件,确保下次读取的count是新的,没有出现 ...

  5. android读取apk中已经存在的数据库信息

    在android数据库编程方面,大家有没有遇到过,我要从指定位置的已经存在的数据库来进行操作的问题.之前我尝试了很多方法都没有成功,后来找到了解决的方法.   下面说明下这段代码的意思,第一步先判断在 ...

  6. 关于sparksql操作hive,读取本地csv文件并以parquet的形式装入hive中

    说明:spark版本:2.2.0 hive版本:1.2.1 需求: 有本地csv格式的一个文件,格式为${当天日期}visit.txt,例如20180707visit.txt,现在需要将其通过spar ...

  7. 关于mysql中存储json数据的读取问题

    在mysql中存储json数据,字段类型用text,java实体中用String接受. 返回前端时(我这里返回前端的是一个map),为了保证读取出的数据排序错乱问题,定义Map时要用LinkedHas ...

  8. 使用Hive读取ElasticSearch中的数据

    本文将介绍如何通过Hive来读取ElasticSearch中的数据,然后我们可以像操作其他正常Hive表一样,使用Hive来直接操作ElasticSearch中的数据,将极大的方便开发人员.本文使用的 ...

  9. Hive中的HiveServer2、Beeline及数据的压缩和存储

    1.使用HiveServer2及Beeline HiveServer2的作用:将hive变成一种server服务对外开放,多个客户端可以连接. 启动namenode.datanode.resource ...

随机推荐

  1. 如何写对kubernetes的模板文件

    kubernetes的模板配置文件随着版本更迭也会有相应的调整,正确配置模板关键字的方式是参考版本发布的doc,如下图 在docs\api-reference下面有不同功能的API目录,如下图 各个A ...

  2. Linux系统命令归纳

    常规操作命令: # netstat -atunpl |egrep "mysql|nginx"# vimdiff php.ini*# runlevel# rpm -e httpd - ...

  3. CSS Grid 网格布局全解析

    介绍 CSS Grid(网格) 布局使我们能够比以往任何时候都可以更灵活构建和控制自定义网格. Grid(网格) 布局使我们能够将网页分成具有简单属性的行和列.它还能使我们在不改变任何HTML的情况下 ...

  4. ReactNative环境配置的坑

    我用的是windows开发android,mac的可以绕道了. 1.android studio及Android SDK的安装 现在需要的Android版本及对应的tool 2.真机运行要配置对and ...

  5. Java NIO之套接字通道

    1.简介 前面一篇文章讲了文件通道,本文继续来说说另一种类型的通道 -- 套接字通道.在展开说明之前,咱们先来聊聊套接字的由来.套接字即 socket,最早由伯克利大学的研究人员开发,所以经常被称为B ...

  6. svn 发布脚本整合

    svn提交时出现(413 Request Entity Too Large)错误解决方法 在nginx的server配置中增加 client_max_body_size 100M; linux多实例a ...

  7. drbd(二):配置和使用

    本文目录:1.drbd配置文件2.创建metadata区并计算metadata区的大小3.启动drbd4.实现drbd主从同步5.数据同步和主从角色切换6.drbd脑裂后的解决办法7.drbd多卷组配 ...

  8. CSS服务器字体

    1,首先要下载ttf文件 推荐下载网站:  https://www.dafont.com/ 2,写css样式 3,服务器字体 font-family:自己随便取个名字就行 注意url里的ttf文件和f ...

  9. java web 初学

    我希望在本学期本堂课上学会使用java web 框架 精通mvc架构模式 学会通过框架和数据库对产品进行构造与编写. 我计划每周用16小时的时间进行学习java web 一周4学时上课时间 周一到周五 ...

  10. 201621123060《JAVA程序设计》第二周学习总结

    1.本周学习总结 本周学习了JAVA中的引用类.包装类(学习了一种语法:自动装箱)和数组(遍历数组的新方法foreach循环). 2. 书面作业 1.String-使用Eclipse关联jdk源代码 ...