: jdbc:hive2://master01.hadoop.dtmobile.cn:1> select * from cell_random_grid_tmp2 limit 1;
INFO : Compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5):
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:grid_row_id, type:,), comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.045 seconds
INFO : Executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5):
INFO : Completed executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.001 seconds
INFO : OK
Error: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at   in file hdfs://master01.hadoop.dtmobile.cn:8020/user/hive/warehouse/capacity.db/cell_random_grid_tmp2/part-00000-82a689a5-7c2a-48a0-ab17-8bf04c963ea6-c000.snappy.parquet (state=,code=0)
: jdbc:hive2://master01.hadoop.dtmobile.cn:1>

通过spark2.3 sparksql执行写数据到hive,saveAsTable(),sparksql写数据到hive时候,默认是保存为parquet+snappy的数据。在数据保存完成之后,通过hive beeline查询,报错如上。但是通过spark查询,执行正常。

stackoverflow上找到同样的问题:

根本原因如下:

This issue is caused because of different parquet conventions used in Hive and Spark. In Hive, the decimal datatype is represented as fixed bytes (INT 32). In Spark 1.4 or later the default convention is to use the Standard Parquet representation for decimal data type. As per the Standard Parquet representation based on the precision of the column datatype, the underlying representation changes.
eg: DECIMAL can be used to annotate the following types: int32: for 1 <= precision <= 9 int64: for 1 <= precision <= 18; precision < 10 will produce a warning

Hence this issue happens only with the usage of datatypes which have different representations in the different Parquet conventions. If the datatype is DECIMAL (10,3), both the conventions represent it as INT32, hence we won't face an issue. If you are not aware of the internal representation of the datatypes it is safe to use the same convention used for writing while reading. With Hive, you do not have the flexibility to choose the Parquet convention. But with Spark, you do.

Solution: The convention used by Spark to write Parquet data is configurable. This is determined by the property spark.sql.parquet.writeLegacyFormat The default value is false. If set to "true", Spark will use the same convention as Hive for writing the Parquet data. This will help to solve the issue.

所以尝试调整参数 spark.sql.parquet.writeLegacyFormat = true,问题解决。

到spark2.3源代码中查找该参数(spark.sql.parquet.writeLegacyFormat):

package org.apache.spark.sql.internal 中 关于sparksql的默认配置 SQLConf.scala中相关描述如下

  val PARQUET_WRITE_LEGACY_FORMAT = buildConf("spark.sql.parquet.writeLegacyFormat")
    .doc("Whether to be compatible with the legacy Parquet format adopted by Spark 1.4 and prior " +
      "versions, when converting Parquet schema to Spark SQL schema and vice versa.")
    .booleanConf
    .createWithDefault(false)

可以看到默认值为false

在 package org.apache.spark.sql.execution.datasources.parquet 的关于ParquetWriteSupport.scala 的描述如下:

/**
 * A Parquet [[WriteSupport]] implementation that writes Catalyst [[InternalRow]]s as Parquet
 * messages.  This class can write Parquet data in two modes:
 *
 *  - Standard mode: Parquet data are written in standard format defined in parquet-format spec.
 *  - Legacy mode: Parquet data are written in legacy format compatible with Spark 1.4 and prior.
 *
 * This behavior can be controlled by SQL option `spark.sql.parquet.writeLegacyFormat`.  The value
 * of this option is propagated to this class by the `init()` method and its Hadoop configuration
 * argument.
 */

ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs:...的更多相关文章

  1. python3: error while loading shared libraries: libpython3.5m.so.1.0: cannot open shared object file: No such file or directory

    安装python3遇到报错: wget https://www.python.org/ftp/python/3.5.2/Python-3.5.2.tgz ./configure --prefix=/u ...

  2. svnadmin:error while loading shared libraries: libaprutil-1.so.0:cannot open shared object file: No such file or directory

    wdcp下安装svn后一直提示 svnadmin:error while loading shared libraries: libaprutil-1.so.0:cannot open shared ...

  3. 动态链接库找不到 : error while loading shared libraries: libgsl.so.0: cannot open shared object file: No such file or directory

    问题: 运行gsl(GNU scientific Library)的函数库,用 gcc erf.c -I/usr/local/include -L/usr/local/lib64 -L/usr/loc ...

  4. 解决libpython2.6.so.1.0: cannot open shared object file

    文章解决的问题:安装nginx中需要Python2.6的支持,下面介绍如何安装Python2.6,并建立lib的连接. 问题展示:error while loading shared librarie ...

  5. ./filezilla: error while loading shared libraries: libpng12.so.0: cannot open shared object file: No such file or directory

    opensuse系统 在filezilla官网下载压缩文件解压运行后报 ./filezilla: error while loading shared libraries: libpng12.so.0 ...

  6. error while loading shared libraries: libpthread.so.0: cannot open shared object file: No such file

    安装rac10g,出现例如以下错误: [root@rac2 oracle]# /u01/product/crs/root.sh WARNING: directory '/u01/product' is ...

  7. tensorflow-gpu版本出现libcublas.so.8.0:cannot open shared object file

    文章主要参考以下博客https://www.aliyun.com/zixun/wenji/1289957.html 在利用GPU加速tensorflow时,出现了libcublas.so.8.0:ca ...

  8. ubuntu下tensorflow 报错 libcusolver.so.8.0: cannot open shared object file: No such file or directory

    解决方法1. 在终端执行: export LD_LIBRARY_PATH=”$LD_LIBRARY_PATH:/usr/local/cuda/lib64” export CUDA_HOME=/usr/ ...

  9. 〖Android〗arm-linux-androideabi-gdb报 libpython2.6.so.1.0: cannot open shared object file错误的解决方法

    执行: prebuilts/gcc/linux-x86/arm/arm-linux-androideabi-4.6/bin/arm-linux-androideabi-gdb out/target/p ...

随机推荐

  1. 如何在jsp中显示数据库的内容

    用Eclipse tomcat新建一个JSP页面(一)介绍了如何创建一个web程序和第一个jsp页面,以及Eclipse需要的一些必要配置.今天,我们重点说一下如何从数据库中查询数据,并且在JSP页面 ...

  2. IO流4

    IO流的分类 根据处理数据类型的不同分为:字符流和字节流 根据数据流向不同分为:输入流和输出流 字符流和字节流 字符流的由来: 因为数据编码的不同,而有了对字符进行高效操作的流对象.本质其实就是基于字 ...

  3. rabbitMQ_rpc(六)

    远程过程调用(RPC) 在前面我们已经学习了如何使用工作队列在多个消费者之间分配耗时的任务. 但是如果我们需要在远程计算机上运行功能并等待结果怎么办?那将会是一个不同的故事.此模式通常称为远程过程调用 ...

  4. About dycf

    SYSU  数媒在读 所有资料可能与课程相关可能与参与项目相关 欢迎交流 . . . 中之人: ↓↓↓ ↑↑↑   是他  就是他  ↑↑↑

  5. Python3数据驱动ddt

    对于同一个方法执行大量数据的程序时,我们可以采用ddt数据驱动的方式,来对数据规范化整理及输出 一.需要使用python的ddt库,ddt,data,unpack方法 1.仅使用ddt和data,代码 ...

  6. python 爬取豆瓣电影评论,并进行词云展示及出现的问题解决办法

    本文旨在提供爬取豆瓣电影<我不是药神>评论和词云展示的代码样例 1.分析URL 2.爬取前10页评论 3.进行词云展示 1.分析URL 我不是药神 短评 第一页url https://mo ...

  7. 【iOS】Error: Error Domain=PBErrorDomain Code=7 "Cannot connect to pasteboard server

    这几天在用 Swift 开发一个简单的键盘扩展,真机调试时遇到了这个问题,详细信息如下: ***[:] Could not save pasteboard named com.apple.UIKit. ...

  8. Struts完成用户新增操作

    点击新增客户出现该页面并完成前后台交互 代码逻辑分析: jsp 页面部分代码 <TABLE id=table_1 style="DISPLAY: none" cellSpac ...

  9. css3加js做一个简单的3D行星运转效果

    前几天在园子里看到一篇关于CSS3D行星运转的文章,原文在这里,感觉这个效果也太酷炫了,于是自己也就心血来潮的来尝试的做了一下.因为懒得去用什么插件了,于是就原生的JS写,效果有点粗超,还有一些地方处 ...

  10. CSS3 Flex 布局教程

    网页布局(layout)是 CSS 的一个重点应用. 布局的传统解决方案,基于盒状模型,依赖 display 属性 + position属性 + float属性.它对于那些特殊布局非常不方便,比如,垂 ...