【原创】大叔问题定位分享(17)spark查orc格式数据偶尔报错NullPointerException
spark查orc格式的数据有时会报这个错
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
... 47 more
跟进代码
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
static enum SplitStrategyKind {
HYBRID,
BI,
ETL
}
...
Context(Configuration conf) {
this.conf = conf;
minSize = conf.getLong(MIN_SPLIT_SIZE, DEFAULT_MIN_SPLIT_SIZE);
maxSize = conf.getLong(MAX_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE);
String ss = conf.get(ConfVars.HIVE_ORC_SPLIT_STRATEGY.varname);
if (ss == null || ss.equals(SplitStrategyKind.HYBRID.name())) {
splitStrategyKind = SplitStrategyKind.HYBRID;
} else {
LOG.info("Enforcing " + ss + " ORC split strategy");
splitStrategyKind = SplitStrategyKind.valueOf(ss);
}
...
switch(context.splitStrategyKind) {
case BI:
// BI strategy requested through config
splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
deltas, covered);
break;
case ETL:
// ETL strategy requested through config
splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
deltas, covered);
break;
default:
// HYBRID strategy
if (avgFileSize > context.maxSize) {
splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);
} else {
splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);
}
break;
}
org.apache.hadoop.hive.conf.HiveConf.ConfVars
HIVE_ORC_SPLIT_STRATEGY("hive.exec.orc.split.strategy", "HYBRID", new StringSet("HYBRID", "BI", "ETL"),
"This is not a user level config. BI strategy is used when the requirement is to spend less time in split generation" +
" as opposed to query execution (split generation does not read or cache file footers)." +
" ETL strategy is used when spending little more time in split generation is acceptable" +
" (split generation reads and caches file footers). HYBRID chooses between the above strategies" +
" based on heuristics."),
The HYBRID mode reads the footers for all files if there are fewer files than expected mapper count, switching over to generating 1 split per file if the average file sizes are smaller than the default HDFS blocksize. ETL strategy always reads the ORC footers before generating splits, while the BI strategy generates per-file splits fast without reading any data from HDFS.
可见hive.exec.orc.split.strategy默认是HYBRID,HYBRID时如果不满足
if (avgFileSize > context.maxSize) {
则
splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);
报错的就是BISplitStrategy,具体这个类为什么报错还没有细看,不过可以修改设置避免这个问题
set hive.exec.orc.split.strategy=ETL
问题暂时解决,未完待续;
【原创】大叔问题定位分享(17)spark查orc格式数据偶尔报错NullPointerException的更多相关文章
- 【原创】大叔问题定位分享(24)hbase standalone方式启动报错
hbase 2.0.2 hbase standalone方式启动报错: 2019-01-17 15:49:08,730 ERROR [Thread-24] master.HMaster: Failed ...
- 【原创】大叔问题定位分享(2)spark任务一定几率报错java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT
最近用yarn cluster方式提交spark任务时,有时会报错,报错几率是40%,报错如下: 18/03/15 21:50:36 116 ERROR ApplicationMaster91: Us ...
- 【原创】大叔问题定位分享(16)spark写数据到hive外部表报错ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
spark 2.1.1 spark在写数据到hive外部表(底层数据在hbase中)时会报错 Caused by: java.lang.ClassCastException: org.apache.h ...
- 【原创】大叔问题定位分享(15)spark写parquet数据报错ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead
spark 2.1.1 spark里执行sql报错 insert overwrite table test_parquet_table select * from dummy 报错如下: org.ap ...
- 【原创】大叔问题定位分享(10)提交spark任务偶尔报错 org.apache.spark.SparkException: A master URL must be set in your configuration
spark 2.1.1 一 问题重现 问题代码示例 object MethodPositionTest { val sparkConf = new SparkConf().setAppName(&qu ...
- 【原创】大叔问题定位分享(9)oozie提交spark任务报 java.lang.NoClassDefFoundError: org/apache/kafka/clients/producer/KafkaProducer
oozie中支持很多的action类型,比如spark.hive,对应的标签为: <spark xmlns="uri:oozie:spark-action:0.1"> ...
- 【原创】大叔问题定位分享(8)提交spark任务报错 Caused by: java.lang.ClassNotFoundException: org.I0Itec.zkclient.exception.ZkNoNodeException
spark 2.1.1 一 问题重现 spark-submit --master local[*] --class app.package.AppClass --jars /jarpath/zkcli ...
- 【原创】大叔问题定位分享(29)datanode启动报错:50020端口被占用
集群中有一台datanode一直启动报错如下: java.net.BindException: Problem binding to [$server1:50020] java.net.BindExc ...
- 【原创】大叔问题定位分享(13)HBase Region频繁下线
问题现象:hive执行sql报错 select count(*) from test_hive_table; 报错 Error: java.io.IOException: org.apache.had ...
随机推荐
- PS快速祛除脸上小雀斑
首先我们要把图片放到PS软件中,然后在PS左侧工具栏中找到污点修复画笔工具(J), 配合着污点修复画笔中的修补工具一起使用,注意:模式要选择正常,属性栏中类型要选择内容识别. 下一步我们需要在图层上添 ...
- Spring Security 无法登陆,报错:There is no PasswordEncoder mapped for the id “null”
编写好继承了WebSecurityConfigurerAdapter类的WebSecurityConfig类后,我们需要在configure(AuthenticationManagerBuilder ...
- C#中字符串的字面值(转义序列)
在程序开发中,经常会碰到在字符串中字面值中使用转义序列,下面表格收集了下转义序列的完整列表,以便大家查看引用: 转义序列列表 转义序列 产生的字符 字符的Unicode值 \' 单引号 0x0027 ...
- SSL 证书生成与转化
1.windows 的keytool工具 2.如何将jks文件转换为pfx格式并导入客户端 https://jingyan.baidu.com/article/a65957f4c69dfc24e67f ...
- Magento2自定义命令
命令命名准则 命名指南概述 Magento 2引入了一个新的命令行界面(CLI),使组件开发人员能够插入模块提供的命令. Command name Command name 在命令中,它紧跟在命令的名 ...
- 【CF1132F】Clear the String(动态规划)
[CF1132F]Clear the String(动态规划) 题面 CF 题解 考虑区间\(dp\). 增量考虑,每次考虑最后一个字符和谁一起删去,然后直接转移就行了. #include<io ...
- maven转gradle ,windows错误重定向
gradle init --type pom --stacktrace > g.log 2>&1
- 神经网络3_M-P模型
sklearn实战-乳腺癌细胞数据挖掘(博客主亲自录制视频教程,QQ:231469242) https://study.163.com/course/introduction.htm?courseId ...
- Ubuntu16下Hadoop安装
1. 安装Ubuntu 2. 新装Ubuntu常用软件安装和系统设置 (1) 安装vim yum install vim (2) 更改hostname为hadoop_master sudo vim / ...
- makefile $@, $^, $<, $? 表示的意义
ref:https://www.cnblogs.com/gamesun/p/3323155.html $@ 表示目标文件$^ 表示所有的依赖文件$< 表示第一个依赖文件$? 表示比目标还 ...