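[Original] Uncle's Troubleshooting Notes (17): Spark occasionally throws NullPointerException when querying ORC data
When Spark queries ORC-format data, it sometimes fails with this error: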
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
... 47 more
Tracing into the code:
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
static enum SplitStrategyKind {
  HYBRID,
  BI,
  ETL
}
...
Context(Configuration conf) {
  this.conf = conf;
  minSize = conf.getLong(MIN_SPLIT_SIZE, DEFAULT_MIN_SPLIT_SIZE);
  maxSize = conf.getLong(MAX_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE);
  String ss = conf.get(ConfVars.HIVE_ORC_SPLIT_STRATEGY.varname);
  if (ss == null || ss.equals(SplitStrategyKind.HYBRID.name())) {
    splitStrategyKind = SplitStrategyKind.HYBRID;
  } else {
    LOG.info("Enforcing " + ss + " ORC split strategy");
    splitStrategyKind = SplitStrategyKind.valueOf(ss);
  }
...
switch(context.splitStrategyKind) {
  case BI:
    // BI strategy requested through config
    splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
        deltas, covered);
    break;
  case ETL:
    // ETL strategy requested through config
    splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
        deltas, covered);
    break;
  default:
    // HYBRID strategy
    if (avgFileSize > context.maxSize) {
      splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal, deltas,
          covered);
    } else {
      splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
          covered);
    }
    break;
}
org.apache.hadoop.hive.conf.HiveConf.ConfVars
HIVE_ORC_SPLIT_STRATEGY("hive.exec.orc.split.strategy", "HYBRID", new StringSet("HYBRID", "BI", "ETL"),
    "This is not a user level config. BI strategy is used when the requirement is to spend less time in split generation" +
    " as opposed to query execution (split generation does not read or cache file footers)." +
    " ETL strategy is used when spending little more time in split generation is acceptable" +
    " (split generation reads and caches file footers). HYBRID chooses between the above strategies" +
    " based on heuristics."),
The HYBRID mode reads the footers for all files if there are fewer files than expected mapper count, switching over to generating 1 split per file if the average file sizes are smaller than the default HDFS blocksize. ETL strategy always reads the ORC footers before generating splits, while the BI strategy generates per-file splits fast without reading any data from HDFS.
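So hive.exec.orc.split.strategy defaults to HYBRID. Under HYBRID, if the condition
  if (avgFileSize > context.maxSize) {
is not satisfied, the code falls through to
  splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
      covered);
The NullPointerException is thrown from BISplitStrategy. I have not yet looked closely at why this class fails, but the problem can be avoided by changing the setting: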
set hive.exec.orc.split.strategy=ETL
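To see why forcing ETL keeps the query away from the failing class, the selection logic quoted earlier can be modeled in a few lines of plain Java. This is only a sketch with no Hive dependency: the class name SplitStrategySketch and the method names configuredKind and effectiveKind are hypothetical, and the file sizes passed in are whatever the caller chooses for illustration. The real OrcInputFormat does far more than this.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, dependency-free model of the strategy selection shown above.
// Class and method names are hypothetical; they are not Hive APIs.
final class SplitStrategySketch {

    enum SplitStrategyKind { HYBRID, BI, ETL }

    static final String STRATEGY_KEY = "hive.exec.orc.split.strategy";

    // Mirrors the Context constructor: a missing value or "HYBRID" means HYBRID,
    // any other value is parsed as an enum name.
    static SplitStrategyKind configuredKind(Map<String, String> conf) {
        String ss = conf.get(STRATEGY_KEY);
        if (ss == null || ss.equals(SplitStrategyKind.HYBRID.name())) {
            return SplitStrategyKind.HYBRID;
        }
        return SplitStrategyKind.valueOf(ss);
    }

    // Mirrors the switch statement: returns the strategy that would actually
    // generate the splits. HYBRID never generates splits itself, it only
    // picks ETL or BI based on the average file size.
    static SplitStrategyKind effectiveKind(SplitStrategyKind kind,
                                           long avgFileSize, long maxSize) {
        switch (kind) {
            case BI:
                return SplitStrategyKind.BI;
            case ETL:
                return SplitStrategyKind.ETL;
            default:
                return avgFileSize > maxSize
                        ? SplitStrategyKind.ETL
                        : SplitStrategyKind.BI;
        }
    }
}
```

With an empty configuration and an average file size at or below maxSize, effectiveKind returns BI, which is the path that hit the NullPointerException. With larger files it returns ETL, which may be why the error only shows up occasionally. Once the key is set to ETL, the result is ETL regardless of file size.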
The problem is worked around for now. To be continued.