spark查orc格式的数据有时会报这个错

Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
... 47 more

跟进代码

org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

  static enum SplitStrategyKind {
HYBRID,
BI,
ETL
}
... Context(Configuration conf) {
this.conf = conf;
minSize = conf.getLong(MIN_SPLIT_SIZE, DEFAULT_MIN_SPLIT_SIZE);
maxSize = conf.getLong(MAX_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE);
String ss = conf.get(ConfVars.HIVE_ORC_SPLIT_STRATEGY.varname);
if (ss == null || ss.equals(SplitStrategyKind.HYBRID.name())) {
splitStrategyKind = SplitStrategyKind.HYBRID;
} else {
LOG.info("Enforcing " + ss + " ORC split strategy");
splitStrategyKind = SplitStrategyKind.valueOf(ss);
} ...
switch(context.splitStrategyKind) {
case BI:
// BI strategy requested through config
splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
deltas, covered);
break;
case ETL:
// ETL strategy requested through config
splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
deltas, covered);
break;
default:
// HYBRID strategy
if (avgFileSize > context.maxSize) {
splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);
} else {
splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);
}
break;
}

org.apache.hadoop.hive.conf.HiveConf.ConfVars

    HIVE_ORC_SPLIT_STRATEGY("hive.exec.orc.split.strategy", "HYBRID", new StringSet("HYBRID", "BI", "ETL"),
"This is not a user level config. BI strategy is used when the requirement is to spend less time in split generation" +
" as opposed to query execution (split generation does not read or cache file footers)." +
" ETL strategy is used when spending little more time in split generation is acceptable" +
" (split generation reads and caches file footers). HYBRID chooses between the above strategies" +
" based on heuristics."),

The HYBRID mode reads the footers for all files if there are fewer files than expected mapper count, switching over to generating 1 split per file if the average file sizes are smaller than the default HDFS blocksize. ETL strategy always reads the ORC footers before generating splits, while the BI strategy generates per-file splits fast without reading any data from HDFS.

可见hive.exec.orc.split.strategy默认是HYBRID,HYBRID时如果不满足

if (avgFileSize > context.maxSize) {

splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);

报错的就是BISplitStrategy,具体这个类为什么报错还没有细看,不过可以修改设置避免这个问题

set hive.exec.orc.split.strategy=ETL

问题暂时解决,未完待续;

【原创】大叔问题定位分享(17)spark查orc格式数据偶尔报错NullPointerException的更多相关文章

  1. 【原创】大叔问题定位分享(24)hbase standalone方式启动报错

    hbase 2.0.2 hbase standalone方式启动报错: 2019-01-17 15:49:08,730 ERROR [Thread-24] master.HMaster: Failed ...

  2. 【原创】大叔问题定位分享(2)spark任务一定几率报错java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT

    最近用yarn cluster方式提交spark任务时,有时会报错,报错几率是40%,报错如下: 18/03/15 21:50:36 116 ERROR ApplicationMaster91: Us ...

  3. 【原创】大叔问题定位分享(16)spark写数据到hive外部表报错ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat

    spark 2.1.1 spark在写数据到hive外部表(底层数据在hbase中)时会报错 Caused by: java.lang.ClassCastException: org.apache.h ...

  4. 【原创】大叔问题定位分享(15)spark写parquet数据报错ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead

    spark 2.1.1 spark里执行sql报错 insert overwrite table test_parquet_table select * from dummy 报错如下: org.ap ...

  5. 【原创】大叔问题定位分享(10)提交spark任务偶尔报错 org.apache.spark.SparkException: A master URL must be set in your configuration

    spark 2.1.1 一 问题重现 问题代码示例 object MethodPositionTest { val sparkConf = new SparkConf().setAppName(&qu ...

  6. 【原创】大叔问题定位分享(9)oozie提交spark任务报 java.lang.NoClassDefFoundError: org/apache/kafka/clients/producer/KafkaProducer

    oozie中支持很多的action类型,比如spark.hive,对应的标签为: <spark xmlns="uri:oozie:spark-action:0.1"> ...

  7. 【原创】大叔问题定位分享(8)提交spark任务报错 Caused by: java.lang.ClassNotFoundException: org.I0Itec.zkclient.exception.ZkNoNodeException

    spark 2.1.1 一 问题重现 spark-submit --master local[*] --class app.package.AppClass --jars /jarpath/zkcli ...

  8. 【原创】大叔问题定位分享(29)datanode启动报错:50020端口被占用

    集群中有一台datanode一直启动报错如下: java.net.BindException: Problem binding to [$server1:50020] java.net.BindExc ...

  9. 【原创】大叔问题定位分享(13)HBase Region频繁下线

    问题现象:hive执行sql报错 select count(*) from test_hive_table; 报错 Error: java.io.IOException: org.apache.had ...

随机推荐

  1. 个人hp笔记本默认设置更改

    1.将F1-F12默认的多媒体键(调静音亮度控制声音大小等)改为功能键: (****笔记本型号为惠普****) ·进入BIOS方法:关机状态下,按电源键开机,立刻连续多次点击ESC,看到 F1.F2. ...

  2. SpringBoot配置devtools实现热部署

    spring为开发者提供了一个名为spring-boot-devtools的模块来使Spring Boot应用支持热部署,提高开发者的开发效率,无需手动重启Spring Boot应用. devtool ...

  3. Linux(Ubuntu)使用日记(四)------印象笔记相关使用

    在Ubuntu系统下没有印象笔记官方的客户端,但是这并不能阻拦我们使用印象笔记. 我们一般的的使用习惯: 印象笔记客户端 印象笔记剪藏 Linux下也可以使用两个工具,剪藏的话安装比较简单,印象笔记客 ...

  4. Python使用turtle库与random库绘制雪花

    记录Python使用turtle库与random库绘制雪花,代码非常容易理解,画着玩玩还是可以的. 完整代码如下:   效果图如下:  

  5. RCTF 2017 easyre-153

    die查一下发现是upx壳 直接脱掉 ELF文件 跑一下: 没看懂是什么意思 随便输一个数就结束了 ida打开 看一下: pipe是完成两个进程之间通信的函数 1是写,0是读 fork是通过系统调用创 ...

  6. DesigningFormsinAccess2010

  7. bzoj3277-串

    Code #include<cstdio> #include<iostream> #include<cmath> #include<cstring> # ...

  8. python之函数闭包、可迭代对象和迭代器

    一.函数名的应用 # 1,函数名就是函数的内存地址,而函数名()则是运行这个函数. def func(): return print(func) # 返回一个地址 # 2,函数名可以作为变量. def ...

  9. mpvue——引入echarts图表

    安装 mpvue-echarts的github地址 https://github.com/F-loat/mpvue-echarts $ cnpm install mpvue-echarts $ cnp ...

  10. Educational Codeforces Round 53 (Rated for Div. 2) E. Segment Sum (数位dp求和)

    题目链接:https://codeforces.com/contest/1073/problem/E 题目大意:给定一个区间[l,r],需要求出区间[l,r]内符合数位上的不同数字个数不超过k个的数的 ...