Over the past couple of days a colleague and I have been working out how to reduce the number of records in a table. Each record contains: objectid (serving-cell ID), gridid (owning grid), height, rsrp (serving-cell RSRP), n_objectid (neighbour-cell ID), n_rsrp (neighbour-cell RSRP).

Each serving cell maps to many neighbour-cell rows, so we merge records in two grouping passes:

1) First group by objectid, gridid, height and collect all the neighbour info into arrays;

2) On top of 1), group by objectid and store gridid, height, rsrp, array(n_objectid) and array(n_rsrp) as arrays; a distilled sketch of both passes follows.
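Before the real session, here is a minimal self-contained sketch of the two passes. The toy data and names are invented purely for illustration; the sessions below run against the production Hive table.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().master("local[*]").appName("merge-sketch").getOrCreate()
import spark.implicits._

// Toy input: one row per (serving cell, grid, height, neighbour).
val raw = Seq(
  ("cellA", "g1", "10", "-100.0", "cellB", 10.0),
  ("cellA", "g1", "10", "-100.0", "cellC", 15.0),
  ("cellA", "g2", "10", "-105.0", "cellB", 12.0)
).toDF("objectid", "gridid", "height", "rsrp", "n_objectid", "rsrp_dis")

// Pass 1: one row per (objectid, gridid, height, rsrp); neighbours packed into arrays.
val perGrid = raw
  .groupBy("objectid", "gridid", "height", "rsrp")
  .agg(collect_list("n_objectid").alias("n_objectid"),
       collect_list("rsrp_dis").alias("rsrp_dis"))

// Pass 2: one row per objectid; everything else becomes (nested) arrays,
// i.e. n_objectid: array<array<string>>, rsrp_dis: array<array<double>>.
val perCell = perGrid
  .groupBy("objectid")
  .agg(collect_list("gridid").alias("gridid"),
       collect_list("height").alias("height"),
       collect_list("rsrp").alias("rsrp"),
       collect_list("n_objectid").alias("n_objectid"),
       collect_list("rsrp_dis").alias("rsrp_dis"))

perCell.show(false)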

Approach 1: store the flattened tuples as array<array<string>>

[my@sdd983 tommyduan_service]$ /app/my/fi_client/spark2/Spark2x/spark/bin/spark-shell
WARN | Unable to load native-hadoop library for your platform... using builtin-java classes where applicable | org.apache.hadoop.util.NativeCodeLoader
WARN | In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN). | org.apache.spark.internal.Logging
WARN | Detected deprecated memory fraction settings: [spark.shuffle.memoryFraction, spark.storage.memoryFraction, spark.storage.unrollFraction]. As of Spark 1.6, execution and storage memory management are unified. All memory fractions used in the old model are now deprecated and no longer read. If you wish to use the old memory management, you may explicitly enable `spark.memory.useLegacyMode` (not recommended). | org.apache.spark.internal.Logging
Spark context Web UI available at http://192.168.143.332:23799
Spark context available as 'sc' (master = local[*]).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1
      /_/

Using Scala version 2.11 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_72)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import spark.sql
import spark.sql

scala> import spark.implicits._
import spark.implicits._

scala> sql("use my_hive_db")
WARN | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState
res0: org.apache.spark.sql.DataFrame = []
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> var fpb_df = sql(
     |   s"""|select gridid,height,objectid,n_objectid,rsrp,(rsrp-n_rsrp) as rsrp_dis
     |       |from fpd_tabke
     |       |where p_city=... and p_day=... limit ...
     |       |""".stripMargin)
fpb_df: org.apache.spark.sql.DataFrame = [gridid: string, height: string ... 4 more fields]

scala> var fpb_groupby_obj_grid_height_df1 = fpb_df.groupBy("objectid", "gridid", "height", "rsrp").agg(
     |   collect_list("n_objectid").alias("n_objectid1"),
     |   collect_list("rsrp_dis").alias("rsrp_dis1")
     | ).select(col("objectid"), col("gridid"), col("height"), col("rsrp"), col("n_objectid1").alias("n_objectid"), col("rsrp_dis1").alias("rsrp_dis"))
fpb_groupby_obj_grid_height_df1: org.apache.spark.sql.DataFrame = [objectid: string, gridid: string ... 4 more fields]

scala> var fpb_groupby_obj_df1 = fpb_groupby_obj_grid_height_df1.groupBy("objectid").agg(
     |   collect_list("gridid").alias("gridid1"),
     |   collect_list("height").alias("height1"),
     |   collect_list("rsrp").alias("rsrp1"),
     |   collect_list("n_objectid").alias("n_objectid1"),
     |   collect_list("rsrp_dis").alias("rsrp_dis1")
     | ).select(col("objectid"), col("gridid1").alias("gridid"), col("height1").alias("height"), col("rsrp1").alias("rsrp"), col("n_objectid1").alias("n_objectid"), col("rsrp_dis1").alias("rsrp_dis"))
fpb_groupby_obj_df1: org.apache.spark.sql.DataFrame = [objectid: string, gridid: array<string> ... 4 more fields]

scala> fpb_groupby_obj_df1.map(s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[String]](4), s.getSeq[Seq[Double]](5))).show
+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
| _1| _2| _3| _4| _5| _6|
+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
||[2676906_708106, ...|[, , , , , ...|[-130.399994, -...|[WrappedArray(...|[WrappedArray(0.0...|
+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
scala> fpb_groupby_obj_df1.map(s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[String]](4), s.getSeq[Seq[Double]](5))).schema
res4: org.apache.spark.sql.types.StructType =
StructType(
StructField(_1,StringType,true),
StructField(_2,ArrayType(StringType,true),true),
StructField(_3,ArrayType(StringType,true),true),
StructField(_4,ArrayType(StringType,true),true),
StructField(_5,ArrayType(ArrayType(StringType,true),true),true),
StructField(_6,ArrayType(ArrayType(DoubleType,false),true),true)
)
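Approach 1 round-trips cleanly because nested Seqs of primitives map directly onto Spark SQL's ArrayType, so the built-in encoders cover every column. From here the merged frame can be persisted with the normal writer, for example (the table name "fpb_merged" is made up):

// Nested arrays of primitives land as array<array<string>> / array<array<double>> columns.
fpb_groupby_obj_df1.write.mode("overwrite").format("parquet").saveAsTable("fpb_merged")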

Approach 2: store the data as array<array<(string,double)>>; reading it back fails.

scala> sql("use my_hive_db")
WARN | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState
res0: org.apache.spark.sql.DataFrame = []
scala> var fpb_df = sql(
     |   s"""|select gridid,height,objectid,n_objectid,rsrp,(rsrp-n_rsrp) as rsrp_dis
     |       |from fpd_tabke
     |       |where p_city=... and p_day=... limit ...
     |       |""".stripMargin)
fpb_df: org.apache.spark.sql.DataFrame = [gridid: string, height: string ... 4 more fields]

scala> var fpb_groupby_obj_grid_height_df2 = fpb_df.map(s =>
     |   (s.getAs[String]("objectid"), s.getAs[String]("gridid"), s.getAs[String]("height"), s.getAs[String]("rsrp"), (s.getAs[String]("n_objectid"), s.getAs[Double]("rsrp_dis")))
     | ).toDF("objectid", "gridid", "height", "rsrp", "neighbour").groupBy("objectid", "gridid", "height", "rsrp").agg(
     |   collect_list("neighbour").alias("neighbour1")
     | ).select(col("objectid"), col("gridid"), col("height"), col("rsrp"), col("neighbour1").alias("neighbour"))

scala> var fpb_groupby_obj_df2 = fpb_groupby_obj_grid_height_df2.groupBy("objectid").agg(
     |   collect_list("gridid").alias("gridid1"),
     |   collect_list("height").alias("height1"),
     |   collect_list("rsrp").alias("rsrp1"),
     |   collect_list("neighbour").alias("neighbour1")
     | ).select(col("objectid"), col("gridid1").alias("gridid"), col("height1").alias("height"), col("rsrp1").alias("rsrp"), col("neighbour1").alias("neighbour"))

scala> import org.apache.spark.sql.Encoders
import org.apache.spark.sql.Encoders

scala> val encoder = Encoders.tuple(
     |   Encoders.STRING,
     |   Encoders.javaSerialization[Seq[String]],
     |   Encoders.javaSerialization[Seq[String]],
     |   Encoders.javaSerialization[Seq[String]],
     |   Encoders.javaSerialization[Seq[Seq[(String, Double)]]]
     | )
encoder: org.apache.spark.sql.Encoder[(String, Seq[String], Seq[String], Seq[String], Seq[Seq[(String, Double)]])] = class[_1[0]: string, _2[0]: binary, _3[0]: binary, _4[0]: binary, _5[0]: binary]

scala> fpb_groupby_obj_df2.show
+---------+--------------------+--------------------+--------------------+--------------------+
| objectid| gridid| height| rsrp| neighbour|
+---------+--------------------+--------------------+--------------------+--------------------+
||[2676906_708106, ...|[, , , , , ...|[-130.399994, -...|[WrappedArray([...|
+---------+--------------------+--------------------+--------------------+--------------------+

scala> fpb_groupby_obj_df2.map { s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[(String, Double)]](4)) }(encoder).show
+---------+--------------------+--------------------+--------------------+--------------------+
| value| _2| _3| _4| _5|
+---------+--------------------+--------------------+--------------------+--------------------+
||[AC ED ...|[AC ED ...|[AC ED ...|[AC ED ...|
+---------+--------------------+--------------------+--------------------+--------------------+

scala> fpb_groupby_obj_df2.map(s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[(String, Double)]](4))).show()
[Stage 7:=====================================================>]
ERROR | Exception in task 0.0 in stage 7.0 | org.apache.spark.internal.Logging
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$$$anon$.hasNext(WholeStageCodegenExec.scala)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$.apply(SparkPlan.scala)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$.apply(SparkPlan.scala)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$$$anonfun$apply$.apply(RDD.scala)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$$$anonfun$apply$.apply(RDD.scala)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala)
    at org.apache.spark.scheduler.Task.run(Task.scala)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java)
    at java.lang.Thread.run(Thread.java)

scala> val encoder = Encoders.tuple(
| Encoders.STRING,
| Encoders.kryo[Seq[String]],
| Encoders.kryo[Seq[String]],
| Encoders.kryo[Seq[String]],
| Encoders.kryo[Seq[Seq[(String, Double)]]]
| )
encoder: org.apache.spark.sql.Encoder[(String, Seq[String], Seq[String], Seq[String], Seq[Seq[(String, Double)]])] = class[_1[0]: string, _2[0]: binary, _3[0]: binary, _4[0]: binary, _5[0]: binary]

scala> fpb_groupby_obj_df2.map { s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[(String, Double)]](4)) }(encoder).show
+---------+--------------------+--------------------+--------------------+--------------------+
| value| _2| _3| _4| _5|
+---------+--------------------+--------------------+--------------------+--------------------+
||[ ...|[ ...|[ ...|[ ...|
+---------+--------------------+--------------------+--------------------+--------------------+
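The root cause: inside a DataFrame, a struct stored in an array comes back as a Row (concretely GenericRowWithSchema), never as a Scala Tuple2, so getSeq[Seq[(String, Double)]] throws the ClassCastException above as soon as an element is touched. The javaSerialization and kryo encoders only "work" because they sidestep the cast entirely by serializing whole objects into opaque binary columns, which is useless for downstream SQL. A workaround sketch (assuming the column order objectid, gridid, height, rsrp, neighbour as above, and that Spark 2.1's product encoder can derive the nested Seq-of-tuples type): read the inner elements as Row and rebuild the tuples by hand.

import org.apache.spark.sql.Row

val decoded = fpb_groupby_obj_df2.map { s =>
  // neighbour is array<array<struct<_1:string,_2:double>>>; each struct arrives as a Row.
  val neighbour: Seq[Seq[(String, Double)]] =
    s.getSeq[Seq[Row]](4).map(_.map(r => (r.getString(0), r.getDouble(1))))
  (s.getAs[String]("objectid"),
   s.getSeq[String](1),   // gridid
   s.getSeq[String](2),   // height
   s.getSeq[String](3),   // rsrp
   neighbour)
}
decoded.show()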
