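Spark 2.1: reading tuples of (string, double) stored in Hive fails
Over the past two days a colleague and I have been working out how to reduce the number of records in a table. Each record contains: objectid (serving cell), gridid (the grid the record belongs to), height, rsrp (serving-cell RSRP), n_objectid (neighbor cell), and n_rsrp (neighbor-cell RSRP).
One serving cell has many neighbor-cell entries, so the records are merged by grouping in two steps:
1) First group by objectid, gridid, height and collect all neighbor-cell information into a collection;
2) On top of the result of 1), group by objectid and store gridid, height, rsrp, array(n_objectid), array(n_rsrp) as collections.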
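The two grouping steps above can be sketched with plain Scala collections, without Spark, to make the target shape concrete. This is only an illustration: the names Rec, GridGroup and mergeRecords are hypothetical, and the sketch assumes that rsrp is constant within one (objectid, gridid, height) group.

```scala
object GroupingSketch {
  // One input record: serving cell, grid, height, serving-cell RSRP,
  // neighbor cell and neighbor-cell RSRP (field names are illustrative).
  final case class Rec(objectid: String, gridid: String, height: String,
                       rsrp: Double, nObjectid: String, nRsrp: Double)

  // Result of step 1: one entry per (objectid, gridid, height) with all neighbors.
  final case class GridGroup(gridid: String, height: String, rsrp: Double,
                             neighbours: Seq[(String, Double)])

  def mergeRecords(records: Seq[Rec]): Map[String, Seq[GridGroup]] = {
    // Step 1: group by (objectid, gridid, height) and collect the neighbors.
    // Assumption: rsrp is the same for every record in such a group.
    val perGrid = records
      .groupBy(r => (r.objectid, r.gridid, r.height))
      .toSeq
      .map { case ((obj, grid, h), rs) =>
        obj -> GridGroup(grid, h, rs.head.rsrp, rs.map(r => (r.nObjectid, r.nRsrp)))
      }
    // Step 2: group the step-1 results by objectid.
    perGrid.groupBy(_._1).map { case (obj, gs) => obj -> gs.map(_._2) }
  }
}
```

The order of the grid groups inside one objectid is not defined in this sketch, just as the order produced by collect_list after a shuffle should not be relied on.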
Approach 1: store each field in its own array<array<string>> column (plain values, no tuples)
[my@sdd983 tommyduan_service]$ /app/my/fi_client/spark2/Spark2x/spark/bin/spark-shell
-- ::, | WARN | main | Unable to load native-hadoop library for your platform... using builtin-java classes where applicable | org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:)
-- ::, | WARN | main | In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN). | org.apache.spark.internal.Logging$class.logWarning(Logging.scala:)
-- ::, | WARN | main | Detected deprecated memory fraction settings: [spark.shuffle.memoryFraction, spark.storage.memoryFraction, spark.storage.unrollFraction]. As of Spark 1.6, execution and storage memory management are unified. All memory fractions used in the old model are now deprecated and no longer read. If you wish to use the old memory management, you may explicitly enable `spark.memory.useLegacyMode` (not recommended). | org.apache.spark.internal.Logging$class.logWarning(Logging.scala:)
Spark context Web UI available at http://192.168.143.332:23799
Spark context available as 'sc' (master = local[*], app id = local-).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.
      /_/
Using Scala version 2.11. (Java HotSpot(TM) -Bit Server VM, Java 1.8.0_72)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import spark.sql
import spark.sql
scala> import spark.implicits._
import spark.implicits._
scala> sql("use my_hive_db")
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
res0: org.apache.spark.sql.DataFrame = []
scala> var fpb_df = sql(
| s"""|select gridid,height,objectid,n_objectid,rsrp,(rsrp-n_rsrp) as rsrp_dis
| |from fpd_tabke
| |where p_city= and p_day= limit
| |""".stripMargin)
fpb_df: org.apache.spark.sql.DataFrame = [gridid: string, height: string ... more fields]
scala> var fpb_groupby_obj_grid_height_df1 = fpb_df.groupBy("objectid", "gridid", "height", "rsrp").agg(
| collect_list("n_objectid").alias("n_objectid1"),
| collect_list("rsrp_dis").alias("rsrp_dis1")
| ).select(col("objectid"), col("gridid"), col("height"), col("rsrp"), col("n_objectid1").alias("n_objectid"), col("rsrp_dis1").alias("rsrp_dis"))
fpb_groupby_obj_grid_height_df1: org.apache.spark.sql.DataFrame = [objectid: string, gridid: string ... more fields]
scala> var fpb_groupby_obj_df1 = fpb_groupby_obj_grid_height_df1.groupBy("objectid").agg(
| collect_list("gridid").alias("gridid1"),
| collect_list("height").alias("height1"),
| collect_list("rsrp").alias("rsrp1"),
| collect_list("n_objectid").alias("n_objectid1"),
| collect_list("rsrp_dis").alias("rsrp_dis1")
| ).select(col("objectid"), col("gridid1").alias("gridid"), col("height1").alias("height"), col("rsrp1").alias("rsrp"), col("n_objectid1").alias("n_objectid"),col("rsrp_dis1").alias("rsrp_dis"))
fpb_groupby_obj_df1: org.apache.spark.sql.DataFrame = [objectid: string, gridid: array<string> ... more fields]
scala> fpb_groupby_obj_df1.map(s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[String]](4), s.getSeq[Seq[Double]](5))).show
+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
| _1| _2| _3| _4| _5| _6|
+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
||[2676906_708106, ...|[, , , , , ...|[-130.399994, -...|[WrappedArray(...|[WrappedArray(0.0...|
+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
scala> fpb_groupby_obj_df1.map(s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[String]](4), s.getSeq[Seq[Double]](5))).schema
res4: org.apache.spark.sql.types.StructType =
StructType(
StructField(_1,StringType,true),
StructField(_2,ArrayType(StringType,true),true),
StructField(_3,ArrayType(StringType,true),true),
StructField(_4,ArrayType(StringType,true),true),
StructField(_5,ArrayType(ArrayType(StringType,true),true),true),
StructField(_6,ArrayType(ArrayType(DoubleType,false),true),true)
)
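With this layout the neighbor IDs and the RSRP differences live in two parallel nested arrays, so a consumer has to pair them up again by position. The sketch below shows one way to do that in plain Scala; the helper name pairUp is hypothetical. It assumes the two arrays are aligned element for element. That assumption can break, because collect_list skips null values: a row with a null rsrp_dis but a non-null n_objectid would shift one array against the other. The length checks are there to catch that case early.

```scala
object ZipNeighbours {
  // Re-pair parallel nested arrays by position.
  // Assumption: ids(i)(j) and dis(i)(j) come from the same source row.
  def pairUp(ids: Seq[Seq[String]], dis: Seq[Seq[Double]]): Seq[Seq[(String, Double)]] = {
    require(ids.length == dis.length, "outer arrays differ in length")
    ids.zip(dis).map { case (idGroup, disGroup) =>
      require(idGroup.length == disGroup.length, "inner arrays differ in length")
      idGroup.zip(disGroup)
    }
  }
}
```

For example, pairUp(Seq(Seq("n1", "n2")), Seq(Seq(3.0, 5.5))) pairs "n1" with 3.0 and "n2" with 5.5.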
Approach 2: store the data as array<array<(string,double)>>. Reading it back fails.
scala> sql("use my_hive_db")
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
res0: org.apache.spark.sql.DataFrame = []
scala> var fpb_df = sql(
| s"""|select gridid,height,objectid,n_objectid,rsrp,(rsrp-n_rsrp) as rsrp_dis
| |from fpd_tabke
| |where p_city= and p_day= limit
| |""".stripMargin)
fpb_df: org.apache.spark.sql.DataFrame = [gridid: string, height: string ... more fields]
scala> var fpb_groupby_obj_grid_height_df2 = fpb_df.map(s =>
(s.getAs[String]("objectid"), s.getAs[String]("gridid"), s.getAs[String]("height"), s.getAs[String]("rsrp"), (s.getAs[String]("n_objectid"), s.getAs[Double]("rsrp_dis")))
).toDF("objectid", "gridid", "height", "rsrp", "neighbour").groupBy("objectid", "gridid", "height", "rsrp").agg(
collect_list("neighbour").alias("neighbour1")
).select(col("objectid"), col("gridid"), col("height"), col("rsrp"), col("neighbour1").alias("neighbour"))
scala> var fpb_groupby_obj_df2 = fpb_groupby_obj_grid_height_df2.groupBy("objectid").agg(
collect_list("gridid").alias("gridid1"),
collect_list("height").alias("height1"),
collect_list("rsrp").alias("rsrp1"),
collect_list("neighbour").alias("neighbour1")
).select(col("objectid"), col("gridid1").alias("gridid"), col("height1").alias("height"), col("rsrp1").alias("rsrp"), col("neighbour1").alias("neighbour"))
scala> val encoder = Encoders.tuple(
| Encoders.STRING,
| Encoders.javaSerialization[Seq[String]],
| Encoders.javaSerialization[Seq[String]],
| Encoders.javaSerialization[Seq[String]],
| Encoders.javaSerialization[Seq[Seq[(String, Double)]]]
| )
encoder: org.apache.spark.sql.Encoder[(String, Seq[String], Seq[String], Seq[String], Seq[Seq[(String, Double)]])] = class[_1[]: string, _2[]: binary, _3[]: binary, _4[]: binary, _5[]: binary]
scala> fpb_groupby_obj_df2.show
+---------+--------------------+--------------------+--------------------+--------------------+
| objectid| gridid| height| rsrp| neighbour|
+---------+--------------------+--------------------+--------------------+--------------------+
||[2676906_708106, ...|[, , , , , ...|[-130.399994, -...|[WrappedArray([...|
+---------+--------------------+--------------------+--------------------+--------------------+
scala> fpb_groupby_obj_df2.map { s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[(String, Double)]](4)) }(encoder).show
+---------+--------------------+--------------------+--------------------+--------------------+
| value| _2| _3| _4| _5|
+---------+--------------------+--------------------+--------------------+--------------------+
||[AC ED ...|[AC ED ...|[AC ED ...|[AC ED ...|
+---------+--------------------+--------------------+--------------------+--------------------+
scala> fpb_groupby_obj_df2.map(s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[(String, Double)]](4))).show()
[Stage :======================================================>( + ) / ]
-- ::, | ERROR | Executor task launch worker for task | Exception in task 0.0 in stage 7.0 (TID ) | org.apache.spark.internal.Logging$class.logError(Logging.scala:)
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$$$anon$.hasNext(WholeStageCodegenExec.scala:)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$.apply(SparkPlan.scala:)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$.apply(SparkPlan.scala:)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$$$anonfun$apply$.apply(RDD.scala:)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$$$anonfun$apply$.apply(RDD.scala:)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:)
at org.apache.spark.scheduler.Task.run(Task.scala:)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:)
at java.lang.Thread.run(Thread.java:)
scala> val encoder = Encoders.tuple(
| Encoders.STRING,
| Encoders.kryo[Seq[String]],
| Encoders.kryo[Seq[String]],
| Encoders.kryo[Seq[String]],
| Encoders.kryo[Seq[Seq[(String, Double)]]]
| )
encoder: org.apache.spark.sql.Encoder[(String, Seq[String], Seq[String], Seq[String], Seq[Seq[(String, Double)]])] = class[_1[]: string, _2[]: binary, _3[]: binary, _4[]: binary, _5[]: binary]
scala> fpb_groupby_obj_df2.map { s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[(String, Double)]](4)) }(encoder).show
+---------+--------------------+--------------------+--------------------+--------------------+
| value| _2| _3| _4| _5|
+---------+--------------------+--------------------+--------------------+--------------------+
||[ ...|[ ...|[ ...|[ ...|
+---------+--------------------+--------------------+--------------------+--------------------+
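The ClassCastException above is consistent with how Spark returns struct values: an element of an array of structs read from a Row is itself a Row (here GenericRowWithSchema), not a scala.Tuple2. Because of type erasure, getSeq[Seq[(String, Double)]] compiles and only fails later, when the default encoder treats each element as a tuple. The kryo and javaSerialization encoders avoid the cast, but every field except the first ends up as an opaque binary column. A possible workaround, sketched below under assumptions, is to read the nested column as Seq[Seq[Row]] and convert each Row to a pair explicitly. The object name NeighbourDecode, the helper rowsToPairs and the sample rows are hypothetical stand-ins for the Hive table; this has not been run against the original data.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

object NeighbourDecode {
  // Structs nested inside arrays arrive as Row objects, so each Row is
  // converted to a (String, Double) pair by position instead of being cast.
  def rowsToPairs(nested: Seq[Seq[Row]]): Seq[Seq[(String, Double)]] =
    nested.map(_.map(r => (r.getString(0), r.getDouble(1))))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("neighbour-decode")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for the Hive table.
    val df = Seq(
      ("obj1", "g1", "0", "-100.0", "n1", 3.0),
      ("obj1", "g1", "0", "-100.0", "n2", 5.5),
      ("obj1", "g2", "5", "-95.0", "n3", 1.0)
    ).toDF("objectid", "gridid", "height", "rsrp", "n_objectid", "rsrp_dis")

    // Step 1: one array of (n_objectid, rsrp_dis) structs per grid cell.
    val perGrid = df
      .groupBy("objectid", "gridid", "height", "rsrp")
      .agg(collect_list(struct(col("n_objectid"), col("rsrp_dis"))).alias("neighbour"))

    // Step 2: one row per objectid with nested arrays.
    val perObject = perGrid
      .groupBy("objectid")
      .agg(
        collect_list("gridid").alias("gridid"),
        collect_list("height").alias("height"),
        collect_list("rsrp").alias("rsrp"),
        collect_list("neighbour").alias("neighbour"))

    // Read the nested structs as Row and convert them explicitly.
    val decoded = perObject.map { s =>
      (s.getAs[String]("objectid"),
       s.getSeq[String](1),
       s.getSeq[String](2),
       s.getSeq[String](3),
       rowsToPairs(s.getSeq[Seq[Row]](4)))
    }
    decoded.show(false)
    spark.stop()
  }
}
```

The conversion keeps the default tuple encoder from spark.implicits._, so the result stays a typed Dataset with readable array columns rather than binary blobs.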