Spark 2.1: reading (string, double) tuples stored in Hive fails
A colleague and I have been looking at how to shrink the number of records in a table. Each record contains: objectid (serving cell), gridid (owning grid), height, rsrp (serving-cell RSRP), n_objectid (neighbour cell) and n_rsrp (neighbour-cell RSRP).
One serving cell has many neighbour-cell records, so the records are merged in two grouping steps:
1) First group by objectid, gridid and height, collecting all the neighbour-cell information into arrays;
2) On top of 1), group by objectid and collect gridid, height, rsrp, array(n_objectid) and array(n_rsrp) into arrays (a condensed sketch of both steps follows below).
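A condensed sketch of the two steps, assuming the flattened records have already been loaded into a DataFrame named fpb_df with the columns described above (this only illustrates the intent; the actual spark-shell sessions follow):

import org.apache.spark.sql.functions.collect_list

// Step 1: one row per (objectid, gridid, height, rsrp),
// with the neighbour columns collected into arrays.
val step1 = fpb_df
  .groupBy("objectid", "gridid", "height", "rsrp")
  .agg(collect_list("n_objectid").alias("n_objectid"),
       collect_list("n_rsrp").alias("n_rsrp"))

// Step 2: one row per objectid; everything else is collected again,
// so the neighbour columns become array<array<...>>.
val step2 = step1
  .groupBy("objectid")
  .agg(collect_list("gridid").alias("gridid"),
       collect_list("height").alias("height"),
       collect_list("rsrp").alias("rsrp"),
       collect_list("n_objectid").alias("n_objectid"),
       collect_list("n_rsrp").alias("n_rsrp"))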
Approach 1: store the values as array<array<string>> (flat single-type arrays)
[my@sdd983 tommyduan_service]$ /app/my/fi_client/spark2/Spark2x/spark/bin/spark-shell
-- ::, | WARN | main | Unable to load native-hadoop library for your platform... using builtin-java classes where applicable | org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:)
-- ::, | WARN | main | In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN). | org.apache.spark.internal.Logging$class.logWarning(Logging.scala:)
-- ::, | WARN | main | Detected deprecated memory fraction settings: [spark.shuffle.memoryFraction, spark.storage.memoryFraction, spark.storage.unrollFraction]. As of Spark 1.6, execution and storage memory management are unified. All memory fractions used in the old model are now deprecated and no longer read. If you wish to use the old memory management, you may explicitly enable `spark.memory.useLegacyMode` (not recommended). | org.apache.spark.internal.Logging$class.logWarning(Logging.scala:)
Spark context Web UI available at http://192.168.143.332:23799
Spark context available as 'sc' (master = local[*], app id = local-).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.
      /_/

Using Scala version 2.11. (Java HotSpot(TM) -Bit Server VM, Java 1.8.0_72)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import spark.sql
import spark.sql

scala> import spark.implicits._
import spark.implicits._

scala> sql("use my_hive_db")
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
res0: org.apache.spark.sql.DataFrame = []
scala> var fpb_df = sql(
| s"""|select gridid,height,objectid,n_objectid,rsrp,(rsrp-n_rsrp) as rsrp_dis
| |from fpd_tabke
| |where p_city= and p_day= limit
| |""".stripMargin)
fpb_df: org.apache.spark.sql.DataFrame = [gridid: string, height: string ... more fields]

scala> var fpb_groupby_obj_grid_height_df1 = fpb_df.groupBy("objectid", "gridid", "height", "rsrp").agg(
| collect_list("n_objectid").alias("n_objectid1"),
| collect_list("rsrp_dis").alias("rsrp_dis1")
| ).select(col("objectid"), col("gridid"), col("height"), col("rsrp"), col("n_objectid1").alias("n_objectid"), col("rsrp_dis1").alias("rsrp_dis"))
fpb_groupby_obj_grid_height_df1: org.apache.spark.sql.DataFrame = [objectid: string, gridid: string ... more fields]

scala> var fpb_groupby_obj_df1 = fpb_groupby_obj_grid_height_df1.groupBy("objectid").agg(
| collect_list("gridid").alias("gridid1"),
| collect_list("height").alias("height1"),
| collect_list("rsrp").alias("rsrp1"),
| collect_list("n_objectid").alias("n_objectid1"),
| collect_list("rsrp_dis").alias("rsrp_dis1")
| ).select(col("objectid"), col("gridid1").alias("gridid"), col("height1").alias("height"), col("rsrp1").alias("rsrp"), col("n_objectid1").alias("n_objectid"),col("rsrp_dis1").alias("rsrp_dis"))
fpb_groupby_obj_df1: org.apache.spark.sql.DataFrame = [objectid: string, gridid: array<string> ... more fields]

scala> fpb_groupby_obj_df1.map(s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[String]](4), s.getSeq[Seq[Double]](5))).show
+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
| _1| _2| _3| _4| _5| _6|
+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
||[2676906_708106, ...|[, , , , , ...|[-130.399994, -...|[WrappedArray(...|[WrappedArray(0.0...|
+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
scala> fpb_groupby_obj_df1.map(s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[String]](4), s.getSeq[Seq[Double]](5))).schema
res4: org.apache.spark.sql.types.StructType =
StructType(
StructField(_1,StringType,true),
StructField(_2,ArrayType(StringType,true),true),
StructField(_3,ArrayType(StringType,true),true),
StructField(_4,ArrayType(StringType,true),true),
StructField(_5,ArrayType(ArrayType(StringType,true),true),true),
StructField(_6,ArrayType(ArrayType(DoubleType,false),true),true)
)
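Approach 1 round-trips without trouble because every column is either a primitive or an array of primitives: Row.getSeq hands the arrays back as Scala Seq values, and the implicit product encoder can serialize the resulting tuple without a custom Encoder. A minimal sketch of persisting the result to Hive and reading it back (the table name fpb_obj_agg is hypothetical):

// Write the aggregated DataFrame as a Hive table (hypothetical table name).
fpb_groupby_obj_df1.write.mode("overwrite").saveAsTable("my_hive_db.fpb_obj_agg")

// Read it back: array<string> columns come back as Seq[String] and
// array<array<string>> as Seq[Seq[String]], so the default encoders suffice.
val back = spark.table("my_hive_db.fpb_obj_agg").map { s =>
  (s.getAs[String]("objectid"),
   s.getSeq[String](1),       // gridid
   s.getSeq[Seq[String]](4))  // n_objectid
}
back.show(false)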
Approach 2: store the neighbour pairs as array<array<(string,double)>>; reading the data back fails.
scala> sql("use my_hive_db")
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
-- ::, | WARN | main | load mapred-default.xml, HIVE_CONF_DIR env not found! | org.apache.hadoop.hive.ql.session.SessionState.loadMapredDefaultXml(SessionState.java:)
res0: org.apache.spark.sql.DataFrame = []
scala> var fpb_df = sql(
| s"""|select gridid,height,objectid,n_objectid,rsrp,(rsrp-n_rsrp) as rsrp_dis
| |from fpd_tabke
| |where p_city= and p_day= limit
| |""".stripMargin)
fpb_df: org.apache.spark.sql.DataFrame = [gridid: string, height: string ... more fields]
scala> var fpb_groupby_obj_grid_height_df2 = fpb_df.map(s =>
(s.getAs[String]("objectid"), s.getAs[String]("gridid"), s.getAs[String]("height"), s.getAs[String]("rsrp"), (s.getAs[String]("n_objectid"), s.getAs[Double]("rsrp_dis")))
).toDF("objectid", "gridid", "height", "rsrp", "neighbour").groupBy("objectid", "gridid", "height", "rsrp").agg(
collect_list("neighbour").alias("neighbour1")
).select(col("objectid"), col("gridid"), col("height"), col("rsrp"), col("neighbour1").alias("neighbour"))
scala> var fpb_groupby_obj_df2 = fpb_groupby_obj_grid_height_df2.groupBy("objectid").agg(
collect_list("gridid").alias("gridid1"),
collect_list("height").alias("height1"),
collect_list("rsrp").alias("rsrp1"),
collect_list("neighbour").alias("neighbour1")
).select(col("objectid"), col("gridid1").alias("gridid"), col("height1").alias("height"), col("rsrp1").alias("rsrp"), col("neighbour1").alias("neighbour"))
scala> val encoder = Encoders.tuple(
| Encoders.STRING,
| Encoders.javaSerialization[Seq[String]],
| Encoders.javaSerialization[Seq[String]],
| Encoders.javaSerialization[Seq[String]],
| Encoders.javaSerialization[Seq[Seq[(String, Double)]]]
| )
encoder: org.apache.spark.sql.Encoder[(String, Seq[String], Seq[String], Seq[String], Seq[Seq[(String, Double)]])] = class[_1[]: string, _2[]: binary, _3[]: binary, _4[]: binary, _5[]: binary]
scala> fpb_groupby_obj_df2.show
+---------+--------------------+--------------------+--------------------+--------------------+
| objectid| gridid| height| rsrp| neighbour|
+---------+--------------------+--------------------+--------------------+--------------------+
||[2676906_708106, ...|[, , , , , ...|[-130.399994, -...|[WrappedArray([...|
+---------+--------------------+--------------------+--------------------+--------------------+
scala> fpb_groupby_obj_df2.map { s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[(String, Double)]](4)) }(encoder).show
+---------+--------------------+--------------------+--------------------+--------------------+
| value| _2| _3| _4| _5|
+---------+--------------------+--------------------+--------------------+--------------------+
||[AC ED ...|[AC ED ...|[AC ED ...|[AC ED ...|
+---------+--------------------+--------------------+--------------------+--------------------+
scala> fpb_groupby_obj_df2.map(s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[(String, Double)]](4))).show()
[Stage :======================================================>( + ) / ]-- ::, | ERROR | Executor task launch worker for task | Exception in task 0.0 in stage 7.0 (TID ) | org.apache.spark.internal.Logging$class.logError(Logging.scala:)
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$$$anon$.hasNext(WholeStageCodegenExec.scala:)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$.apply(SparkPlan.scala:)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$.apply(SparkPlan.scala:)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$$$anonfun$apply$.apply(RDD.scala:)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$$$anonfun$apply$.apply(RDD.scala:)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:)
at org.apache.spark.scheduler.Task.run(Task.scala:)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:)
at java.lang.Thread.run(Thread.java:)
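This ClassCastException is the heart of the problem: inside a Row, a struct element is materialized as a GenericRowWithSchema, so s.getSeq[Seq[(String, Double)]](4) hands the generated serializer Row objects where it expects Tuple2 instances. One way around it (a sketch, not taken from the original session) is to read the nested column as Seq[Seq[Row]] and rebuild the tuples by hand before the product encoder runs:

import org.apache.spark.sql.Row

// Read the nested struct column as rows and convert each inner Row back
// into a (String, Double) tuple explicitly; the implicit product encoder
// can then serialize the result without hitting the cast error.
val decoded = fpb_groupby_obj_df2.map { s =>
  (s.getAs[String]("objectid"),
   s.getSeq[String](1),   // gridid
   s.getSeq[String](2),   // height
   s.getSeq[String](3),   // rsrp
   s.getSeq[Seq[Row]](4).map(_.map(r => (r.getString(0), r.getDouble(1)))))
}
decoded.show(false)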
scala> val encoder = Encoders.tuple(
| Encoders.STRING,
| Encoders.kryo[Seq[String]],
| Encoders.kryo[Seq[String]],
| Encoders.kryo[Seq[String]],
| Encoders.kryo[Seq[Seq[(String, Double)]]]
| )
encoder: org.apache.spark.sql.Encoder[(String, Seq[String], Seq[String], Seq[String], Seq[Seq[(String, Double)]])] = class[_1[]: string, _2[]: binary, _3[]: binary, _4[]: binary, _5[]: binary]
scala> fpb_groupby_obj_df2.map { s => (s.getAs[String]("objectid"), s.getSeq[String](1), s.getSeq[String](2), s.getSeq[String](3), s.getSeq[Seq[(String, Double)]](4)) }(encoder).show
+---------+--------------------+--------------------+--------------------+--------------------+
| value| _2| _3| _4| _5|
+---------+--------------------+--------------------+--------------------+--------------------+
||[ ...|[ ...|[ ...|[ ...|
+---------+--------------------+--------------------+--------------------+--------------------+
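As the two outputs above show, the javaSerialization and kryo encoders do not fail, but they only write the collections as opaque binary blobs, which defeats the purpose of keeping the table readable from Hive. A cleaner alternative (a sketch, assuming a named case class is acceptable instead of an anonymous tuple) is to model the neighbour pair as a case class; the struct fields then get stable names and Spark can decode the aggregated rows back into typed objects with as[...]:

import org.apache.spark.sql.functions.collect_list

// Hypothetical case classes replacing the anonymous tuples, so the struct
// fields are named n_objectid/rsrp_dis instead of _1/_2.
case class Neighbour(n_objectid: String, rsrp_dis: Double)
case class ObjAgg(objectid: String, gridid: Seq[String], height: Seq[String],
                  rsrp: Seq[String], neighbour: Seq[Seq[Neighbour]])

// Build the neighbour struct from the case class ...
val df2 = fpb_df.map { s =>
  (s.getAs[String]("objectid"), s.getAs[String]("gridid"),
   s.getAs[String]("height"), s.getAs[String]("rsrp"),
   Neighbour(s.getAs[String]("n_objectid"), s.getAs[Double]("rsrp_dis")))
}.toDF("objectid", "gridid", "height", "rsrp", "neighbour")

// ... run the same two grouping steps as before ...
val agg = df2
  .groupBy("objectid", "gridid", "height", "rsrp")
  .agg(collect_list("neighbour").alias("neighbour"))
  .groupBy("objectid")
  .agg(collect_list("gridid").alias("gridid"),
       collect_list("height").alias("height"),
       collect_list("rsrp").alias("rsrp"),
       collect_list("neighbour").alias("neighbour"))

// ... and decode the named structs back into case classes when reading.
val ds = agg.as[ObjAgg]
ds.show(false)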