spark读取空orc文件时报错java.lang.RuntimeException: serious problem at OrcInputFormat.generateSplitsInfo
问题复现:
G:\bigdata\spark-2.3.3-bin-hadoop2.7\bin>spark-shell
2020-12-26 10:20:48 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://DESKTOP-01KN1P4:4040
Spark context available as 'sc' (master = local[*], app id = local-1608949256544).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.3
/_/ Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information. scala> sql("create table empty_orc(a int) stored as orc location '/tmp/empty_orc'").show
++
||
++
++ (其他窗口新建一个空文件) touch /tmp/empty_orc/zero.orc scala> sql("select * from empty_orc").show java.lang.RuntimeException: serious problem
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:340)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3278)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3259)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3258)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
... 49 elided
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
... 99 more
该问题的主要原因是在读取orc表时,遇到有空文件时报错,bug记录地址:
SPARK-19809:NullPointerException on zero-size ORC file(https://issues.apache.org/jira/browse/SPARK-19809)
SPARK-29773:Unable to process empty ORC files in Hive Table using Spark SQL(https://issues.apache.org/jira/browse/SPARK-29773)
解决办法:使用参数spark.sql.hive.convertMetastoreOrc=true
G:\bigdata\spark-2.3.3-bin-hadoop2.7\bin>spark-shell --conf spark.sql.hive.convertMetastoreOrc=true
2020-12-26 10:29:06 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://DESKTOP-01KN1P4:4040
Spark context available as 'sc' (master = local[*], app id = local-1608949754291).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.3
/_/ Using Scala version 2.11.8 (Java HotSpot(TM) Client VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information. scala> sql("select * from empty_orc").show +---+
| a|
+---+
+---+
spark的帮助文档种介绍如下:
ORC Files
Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files. To do that, the following configurations are newly added. The vectorized reader is used for the native ORC tables (e.g., the ones created using the clause USING ORC) when spark.sql.orc.impl is set to native and spark.sql.orc.enableVectorizedReader is set to true. For the Hive ORC serde tables (e.g., the ones created using the clause USING HIVE OPTIONS (fileFormat 'ORC')), the vectorized reader is used when spark.sql.hive.convertMetastoreOrc is also set to true.
https://spark.apache.org/docs/2.3.3/sql-programming-guide.html#orc-files
spark读取空orc文件时报错java.lang.RuntimeException: serious problem at OrcInputFormat.generateSplitsInfo的更多相关文章
- sparksql读取hive数据报错:java.lang.RuntimeException: serious problem
问题: Caused by: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index: ...
- shiro使用redis作为缓存,出现要清除缓存时报错 java.lang.Exception: Failed to deserialize at org.crazycake.shiro.SerializeUtils.deserialize(SerializeUtils.java:41) ~[shiro-redis-2.4.2.1-RELEASE.jar:na]
shiro使用redis作为缓存,出现要清除缓存时报错 java.lang.Exception: Failed to deserialize at org.crazycake.shiro.Serial ...
- 使用RestTemplate时报错java.lang.IllegalStateException: No instances available for 127.0.0.1
我在RestTemplate的配置类里使用了 @LoadBalanced@Componentpublic class RestTemplateConfig { @Bean @LoadBalanced ...
- 云笔记项目- 上传文件报错"java.lang.IllegalStateException: File has been moved - cannot be read again"
在做文件上传时,当写入上传的文件到文件时,会报错“java.lang.IllegalStateException: File has been moved - cannot be read again ...
- hive启动时报错 java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D at org.apache.hadoop.fs.Path.initialize
错误提示信息如下 错误信息如下 [root@node1 bin]# ./hive Logging initialized -bin/lib/hive-common-.jar!/hive-log4j.p ...
- 运用反射时报错java.lang.NoSuchMethodException,以解决,记录一下
问题:想调用service类中的私有方法时, Method target=clz.getMethod("say", String.class);用Class的getMethod报错 ...
- storm supervisor启动报错java.lang.RuntimeException: java.io.EOFException
storm因机器断电或其他异常导致的supervisor意外终止,再次启动时报错: 1. 2013-09-24 09:15:44,361 INFO [main] daemon.supervisor ( ...
- Android Studio 首次安装报错 Java.lang.RuntimeException:java.lang.NullPointerException...错
下次安装报:Java.lang.RuntimeException: java.lang.NullPointerException......错 只需在文件..\Android Studio\bin\i ...
- 我的Android进阶之旅------>Android中MediaRecorder.stop()报错 java.lang.RuntimeException: stop failed.【转】
本文转载自:http://blog.csdn.net/ouyang_peng/article/details/48048975 今天在调用MediaRecorder.stop(),报错了,java.l ...
- maven报错 java.lang.RuntimeException: com.google.inject.CreationException: Unable to create injector, see the following errors
2 errors java.lang.RuntimeException: com.google.inject.CreationException: Unable to create injector, ...
随机推荐
- P1379 八数码难题 ( A* 算法 与 IDA_star 算法)
P1379 八数码难题 题目描述 在3×3的棋盘上,摆有八个棋子,每个棋子上标有1至8的某一数字.棋盘中留有一个空格,空格用0来表示.空格周围的棋子可以移到空格中.要求解的问题是:给出一种初始布局(初 ...
- AcWing 第 3 场周赛
比赛链接:Here AcWing 3660. 最短时间 比较四个方向和 \((r,c)\) 的距离 void solve() { ll n, m, r, c; cin >> n >& ...
- Android 加载图片占用内存分析
本文首发于 vivo互联网技术 微信公众号 链接:https://mp.weixin.qq.com/s/aRDzmMlkqB14Ty67GJs9vg作者:Xu Jie 不同Android版本,对一张图 ...
- 记一次 .NET某道闸收费系统 内存溢出分析
一:背景 1. 讲故事 前些天有位朋友找到我,说他的程序几天内存就要爆一次,不知道咋回事,找不出原因,让我帮忙看一下,这种问题分析dump是最简单粗暴了,拿到dump后接下来就是一顿分析. 二:Win ...
- 3 分钟创建 Serverless Job 定时获取新闻热搜
不用掏手机,不用登微博,借助 SAE 定时任务就可以实现每小时获取实时新闻热搜!SAE 场景体验火热开启中,参与还可领好礼! Job 作为一种运完即停的负载类型,在企业级开发中承载着丰富的使用场景.S ...
- linux下jdk1.7、1.8版本的安装
-----1.7------ (1)解压安装包 tar -zxvf jdk-7u80-linux-x64.tar.gz (2)移动到安装目录 ...
- python+appium使用方法
一.python环境安装 确保需安装Appium-Python-Client包
- 探讨Java死锁的现象和解决方法
死锁是多线程编程中常见的问题,它会导致线程相互等待,无法继续执行.在Java中,死锁是一个需要注意和解决的重要问题.让我们通过一系列详细的例子来深入了解Java死锁的现象和解决方法. 1. 什么是死锁 ...
- [转帖]Strong crypto defaults in RHEL 8 and deprecation of weak crypto algorithms
https://access.redhat.com/articles/3642912 TABLE OF CONTENTS What policies are provided? Removed c ...
- [转帖]PostgreSQL 10.0 preview 功能增强 - 国际化功能增强,支持ICU(International Components for Unicode)
https://developer.aliyun.com/article/72935 标签 PostgreSQL , 10.0 , International Components for Unico ...