spark查orc格式的数据有时会报这个错

Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
... 47 more

跟进代码

org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

  static enum SplitStrategyKind {
HYBRID,
BI,
ETL
}
... Context(Configuration conf) {
this.conf = conf;
minSize = conf.getLong(MIN_SPLIT_SIZE, DEFAULT_MIN_SPLIT_SIZE);
maxSize = conf.getLong(MAX_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE);
String ss = conf.get(ConfVars.HIVE_ORC_SPLIT_STRATEGY.varname);
if (ss == null || ss.equals(SplitStrategyKind.HYBRID.name())) {
splitStrategyKind = SplitStrategyKind.HYBRID;
} else {
LOG.info("Enforcing " + ss + " ORC split strategy");
splitStrategyKind = SplitStrategyKind.valueOf(ss);
} ...
switch(context.splitStrategyKind) {
case BI:
// BI strategy requested through config
splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
deltas, covered);
break;
case ETL:
// ETL strategy requested through config
splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
deltas, covered);
break;
default:
// HYBRID strategy
if (avgFileSize > context.maxSize) {
splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);
} else {
splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);
}
break;
}

org.apache.hadoop.hive.conf.HiveConf.ConfVars

    HIVE_ORC_SPLIT_STRATEGY("hive.exec.orc.split.strategy", "HYBRID", new StringSet("HYBRID", "BI", "ETL"),
"This is not a user level config. BI strategy is used when the requirement is to spend less time in split generation" +
" as opposed to query execution (split generation does not read or cache file footers)." +
" ETL strategy is used when spending little more time in split generation is acceptable" +
" (split generation reads and caches file footers). HYBRID chooses between the above strategies" +
" based on heuristics."),

The HYBRID mode reads the footers for all files if there are fewer files than expected mapper count, switching over to generating 1 split per file if the average file sizes are smaller than the default HDFS blocksize. ETL strategy always reads the ORC footers before generating splits, while the BI strategy generates per-file splits fast without reading any data from HDFS.

可见hive.exec.orc.split.strategy默认是HYBRID,HYBRID时如果不满足

if (avgFileSize > context.maxSize) {

splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);

报错的就是BISplitStrategy,具体这个类为什么报错还没有细看,不过可以修改设置避免这个问题

set hive.exec.orc.split.strategy=ETL

问题暂时解决,未完待续;

【原创】大叔问题定位分享(17)spark查orc格式数据偶尔报错NullPointerException的更多相关文章

  1. 【原创】大叔问题定位分享(24)hbase standalone方式启动报错

    hbase 2.0.2 hbase standalone方式启动报错: 2019-01-17 15:49:08,730 ERROR [Thread-24] master.HMaster: Failed ...

  2. 【原创】大叔问题定位分享(2)spark任务一定几率报错java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT

    最近用yarn cluster方式提交spark任务时,有时会报错,报错几率是40%,报错如下: 18/03/15 21:50:36 116 ERROR ApplicationMaster91: Us ...

  3. 【原创】大叔问题定位分享(16)spark写数据到hive外部表报错ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat

    spark 2.1.1 spark在写数据到hive外部表(底层数据在hbase中)时会报错 Caused by: java.lang.ClassCastException: org.apache.h ...

  4. 【原创】大叔问题定位分享(15)spark写parquet数据报错ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead

    spark 2.1.1 spark里执行sql报错 insert overwrite table test_parquet_table select * from dummy 报错如下: org.ap ...

  5. 【原创】大叔问题定位分享(10)提交spark任务偶尔报错 org.apache.spark.SparkException: A master URL must be set in your configuration

    spark 2.1.1 一 问题重现 问题代码示例 object MethodPositionTest { val sparkConf = new SparkConf().setAppName(&qu ...

  6. 【原创】大叔问题定位分享(9)oozie提交spark任务报 java.lang.NoClassDefFoundError: org/apache/kafka/clients/producer/KafkaProducer

    oozie中支持很多的action类型,比如spark.hive,对应的标签为: <spark xmlns="uri:oozie:spark-action:0.1"> ...

  7. 【原创】大叔问题定位分享(8)提交spark任务报错 Caused by: java.lang.ClassNotFoundException: org.I0Itec.zkclient.exception.ZkNoNodeException

    spark 2.1.1 一 问题重现 spark-submit --master local[*] --class app.package.AppClass --jars /jarpath/zkcli ...

  8. 【原创】大叔问题定位分享(29)datanode启动报错:50020端口被占用

    集群中有一台datanode一直启动报错如下: java.net.BindException: Problem binding to [$server1:50020] java.net.BindExc ...

  9. 【原创】大叔问题定位分享(13)HBase Region频繁下线

    问题现象:hive执行sql报错 select count(*) from test_hive_table; 报错 Error: java.io.IOException: org.apache.had ...

随机推荐

  1. 好的LCT板子和一句话

    typedef long long ll; const int maxn = 400050; struct lct { int ch[maxn][2], fa[maxn], w[maxn]; bool ...

  2. 关于h5绘制canvas生成图片的注意点!

    1.第一个是关于移动端自适应的问题: 答:如果是最后只要一张canvas生成的图片,而不是要绘制的canvas的图形,则不需要考虑自适应,绘制canvas的时候的宽高,可以直接写成UI提供的图的大小, ...

  3. mysql-笔记-命名、索引规范

    1 命名规范 所有数据库对象名称必须使用小写字母并用下划线分割 禁止使用mysql保留关键字 ---如果表名中包含关键字查询时,需要将其有单引号括起来 见名识意,并且最后不要超过32个字符 临时库表以 ...

  4. Python——爬虫——爬虫的原理与数据抓取

    一.使用Fiddler抓取HTTPS设置 (1)菜单栏 Tools > Telerik Fiddler Options 打开“Fiddler Options”对话框 (2)HTTPS设置:选中C ...

  5. python常用的基本操作

    打开cmd,pip list 可以查看python安装的所以第三方包

  6. 通过SQL脚本来查询SQLServer 中主外键关系

    在SQLServer中主外键是什么,以及主外键如何创建,在这里就不说了,不懂的可以点击这里,这篇文章也是博客园的博友写的,我觉得总结的很好: 此篇文章主要介绍通过SQL脚本来查看Sqlserver中主 ...

  7. 【UR #7】水题走四方

    题目描述 今天是世界水日,著名的水题资源专家蝈蝈大臣发起了水题走四方活动,向全世界发放成千上万的水题. 蝈蝈大臣是家里蹲大学的教授,当然不愿意出门发水题啦!所以他委托他的助手欧姆来发. 助手欧姆最近做 ...

  8. 为 Java 程序员准备的 Go 入门 PPT

    为 Java 程序员准备的 Go 入门 PPT 这是 Google 的 Go 团队技术主管经理 Sameer Ajmani 分享的 PPT,为 Java 程序员快速入门 Go 而准备的. 视频 这个 ...

  9. 「LibreOJ NOI Round #1」验题

    麻烦的动态DP写了2天 简化题意:给树,求比给定独立集字典序大k的独立集是哪一个 主要思路: k排名都是类似二分的按位确定过程. 字典序比较本质是LCP下一位,故枚举LCP,看多出来了多少个独立集,然 ...

  10. GWAS: 曼哈顿图,QQ plot 图,膨胀系数( manhattan、Genomic Inflation Factor)

    画曼哈顿图和QQ plot 首推R包“qqman”,简约方便.下面具体介绍以下. 一.画曼哈顿图 install.packages("qqman") library(qqman) ...