spark读取空orc文件时报错java.lang.RuntimeException: serious problem at OrcInputFormat.generateSplitsInfo
问题复现:
G:\bigdata\spark-2.3.3-bin-hadoop2.7\bin>spark-shell
2020-12-26 10:20:48 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://DESKTOP-01KN1P4:4040
Spark context available as 'sc' (master = local[*], app id = local-1608949256544).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.3
/_/ Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information. scala> sql("create table empty_orc(a int) stored as orc location '/tmp/empty_orc'").show
++
||
++
++ (其他窗口新建一个空文件) touch /tmp/empty_orc/zero.orc scala> sql("select * from empty_orc").show java.lang.RuntimeException: serious problem
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:340)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3278)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3259)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3258)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
... 49 elided
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
... 99 more
该问题的主要原因是在读取orc表时,遇到有空文件时报错,bug记录地址:
SPARK-19809:NullPointerException on zero-size ORC file(https://issues.apache.org/jira/browse/SPARK-19809)
SPARK-29773:Unable to process empty ORC files in Hive Table using Spark SQL(https://issues.apache.org/jira/browse/SPARK-29773)
解决办法:使用参数spark.sql.hive.convertMetastoreOrc=true
G:\bigdata\spark-2.3.3-bin-hadoop2.7\bin>spark-shell --conf spark.sql.hive.convertMetastoreOrc=true
2020-12-26 10:29:06 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://DESKTOP-01KN1P4:4040
Spark context available as 'sc' (master = local[*], app id = local-1608949754291).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.3
/_/ Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information. scala> sql("select * from empty_orc").show +---+
| a|
+---+
+---+
spark的帮助文档种介绍如下:
ORC Files
Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added. The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause USING ORC) when spark.sql.orc.impl is set to native and spark.sql.orc.enableVectorizedReader is set to true. For the Hive ORC serde tables (e.g., the ones created using the clause USING HIVE OPTIONS (fileFormat 'ORC')), the vectorized reader is used when spark.sql.hive.convertMetastoreOrc is also set to true.
https://spark.apache.org/docs/2.3.3/sql-programming-guide.html#orc-files
spark读取空orc文件时报错java.lang.RuntimeException: serious problem at OrcInputFormat.generateSplitsInfo的更多相关文章
- sparksql读取hive数据报错:java.lang.RuntimeException: serious problem
问题: Caused by: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index: ...
- shiro使用redis作为缓存,出现要清除缓存时报错 java.lang.Exception: Failed to deserialize at org.crazycake.shiro.SerializeUtils.deserialize(SerializeUtils.java:41) ~[shiro-redis-2.4.2.1-RELEASE.jar:na]
shiro使用redis作为缓存,出现要清除缓存时报错 java.lang.Exception: Failed to deserialize at org.crazycake.shiro.Serial ...
- 使用RestTemplate时报错java.lang.IllegalStateException: No instances available for 127.0.0.1
我在RestTemplate的配置类里使用了 @LoadBalanced@Componentpublic class RestTemplateConfig { @Bean @LoadBalanced ...
- 云笔记项目- 上传文件报错"java.lang.IllegalStateException: File has been moved - cannot be read again"
在做文件上传时,当写入上传的文件到文件时,会报错“java.lang.IllegalStateException: File has been moved - cannot be read again ...
- hive启动时报错 java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D at org.apache.hadoop.fs.Path.initialize
错误提示信息如下 错误信息如下 [root@node1 bin]# ./hive Logging initialized -bin/lib/hive-common-.jar!/hive-log4j.p ...
- 运用反射时报错java.lang.NoSuchMethodException,以解决,记录一下
问题:想调用service类中的私有方法时, Method target=clz.getMethod("say", String.class);用Class的getMethod报错 ...
- storm supervisor启动报错java.lang.RuntimeException: java.io.EOFException
storm因机器断电或其他异常导致的supervisor意外终止,再次启动时报错: 1. 2013-09-24 09:15:44,361 INFO [main] daemon.supervisor ( ...
- Android Studio 首次安装报错 Java.lang.RuntimeException:java.lang.NullPointerException...错
下次安装报:Java.lang.RuntimeException: java.lang.NullPointerException......错 只需在文件..\Android Studio\bin\i ...
- 我的Android进阶之旅------>Android中MediaRecorder.stop()报错 java.lang.RuntimeException: stop failed.【转】
本文转载自:http://blog.csdn.net/ouyang_peng/article/details/48048975 今天在调用MediaRecorder.stop(),报错了,java.l ...
- maven报错 java.lang.RuntimeException: com.google.inject.CreationException: Unable to create injector, see the following errors
2 errors java.lang.RuntimeException: com.google.inject.CreationException: Unable to create injector, ...
随机推荐
- 阿里云视频云人脸生成领域最新研究成果入选CVPR2022
CVPR(IEEE Conference on Computer Vision and Pattern Recognition)作为计算机视觉和模式识别领域的顶级会议,在全球具有极高的权威性.目前在中 ...
- 2016年第七届 蓝桥杯A组 C/C++决赛题解
蓝桥杯历年国赛真题汇总:Here 1.随意组合 小明被绑架到X星球的巫师W那里. 其时,W正在玩弄两组数据 (2 3 5 8) 和 (1 4 6 7) 他命令小明从一组数据中分别取数与另一组中的数配对 ...
- Python | 解放双手,用Python实现自动发送邮件
解放双手,用Python实现自动发送邮件 使用Python实现自动化邮件发送,可以让你摆脱繁琐的重复性业务,节省非常多的时间. Python有两个内置库:smtplib和email,能够实现邮件功能, ...
- 运行vue项目时报错“ValidationError: Progress Plugin Invalid Options”
https://blog.csdn.net/M_Nobody/article/details/123135041?spm=1001.2101.3001.6650.1&utm_medium=di ...
- 类的MRO属性 C3算法
C3算法 class A(object): pass class B(A): pass class C(A): pass class D(B): pass class E(C): pass class ...
- KVM 管理工具:libvirt
libvirt 简介 libvirt 是目前使用最为广泛的对 KVM 虚拟机进行管理的工具和应用程序接口.
- Julia编程基础
技术背景 Julia目前来说算是一个比较冷门的编程语言,主要是因为它所针对的应用场景实在是比较有限,Julia更注重于科学计算领域的应用.而Julia最大的特点,就是官方所宣传的:拥有C的性能,且可以 ...
- VMware虚拟机部署Linux Ubuntu系统的方法
本文介绍基于VMware Workstation Pro虚拟机软件,配置Linux Ubuntu操作系统环境的方法. 首先,我们需要进行VMware Workstation Pro虚拟机软件的 ...
- [转帖]Web技术(五):HTTP/2 是如何解决HTTP/1.1 性能瓶颈的?
文章目录 一.HTTP/2 概览 二.HTTP/2 协议原理 2.1 Binary frame layer 2.1.1 DATA帧定义 2.1.2 HEADERS帧定义 2.2 Streams and ...
- [转帖]TiDB修改配置参数
https://www.jianshu.com/p/2ecdb4642579 在TiDB 中,"修改配置参数"似乎是个不精准的说法,它实际包含了以下内容: 修改 TiDB 的系统变 ...