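[Original] Troubleshooting Notes (17): Spark occasionally throws NullPointerException when querying ORC data

When Spark queries data stored in ORC format, it sometimes fails with this error: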

Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
... 47 more

Tracing into the code:

org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

  static enum SplitStrategyKind {
    HYBRID,
    BI,
    ETL
  }

...

    Context(Configuration conf) {
      this.conf = conf;
      minSize = conf.getLong(MIN_SPLIT_SIZE, DEFAULT_MIN_SPLIT_SIZE);
      maxSize = conf.getLong(MAX_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE);
      String ss = conf.get(ConfVars.HIVE_ORC_SPLIT_STRATEGY.varname);
      if (ss == null || ss.equals(SplitStrategyKind.HYBRID.name())) {
        splitStrategyKind = SplitStrategyKind.HYBRID;
      } else {
        LOG.info("Enforcing " + ss + " ORC split strategy");
        splitStrategyKind = SplitStrategyKind.valueOf(ss);
      }

...

        switch(context.splitStrategyKind) {
          case BI:
            // BI strategy requested through config
            splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
                deltas, covered);
            break;
          case ETL:
            // ETL strategy requested through config
            splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
                deltas, covered);
            break;
          default:
            // HYBRID strategy
            if (avgFileSize > context.maxSize) {
              splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal, deltas,
                  covered);
            } else {
              splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
                  covered);
            }
            break;
        }

org.apache.hadoop.hive.conf.HiveConf.ConfVars

    HIVE_ORC_SPLIT_STRATEGY("hive.exec.orc.split.strategy", "HYBRID", new StringSet("HYBRID", "BI", "ETL"),
        "This is not a user level config. BI strategy is used when the requirement is to spend less time in split generation" +
        " as opposed to query execution (split generation does not read or cache file footers)." +
        " ETL strategy is used when spending little more time in split generation is acceptable" +
        " (split generation reads and caches file footers). HYBRID chooses between the above strategies" +
        " based on heuristics."),

The HYBRID mode reads the footers for all files if there are fewer files than expected mapper count, switching over to generating 1 split per file if the average file sizes are smaller than the default HDFS blocksize. ETL strategy always reads the ORC footers before generating splits, while the BI strategy generates per-file splits fast without reading any data from HDFS.

So hive.exec.orc.split.strategy defaults to HYBRID. Under HYBRID, when the condition

if (avgFileSize > context.maxSize) {

is not met, the else branch runs:

splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);
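The selection logic above can be condensed into a small standalone sketch. This is only an illustration of the branch structure shown in the excerpt, not the Hive implementation: the class name SplitStrategySketch, the method choose, and the sizes used in main are hypothetical, and the strategies are represented by their class names as strings.

```java
class SplitStrategySketch {
    enum SplitStrategyKind { HYBRID, BI, ETL }

    // Mirrors the switch in the excerpt: returns the name of the strategy
    // class that would be instantiated for the given inputs.
    static String choose(SplitStrategyKind kind, long avgFileSize, long maxSize) {
        switch (kind) {
            case BI:
                return "BISplitStrategy";
            case ETL:
                return "ETLSplitStrategy";
            default:
                // HYBRID: large average file size picks ETL, otherwise BI
                return avgFileSize > maxSize ? "ETLSplitStrategy" : "BISplitStrategy";
        }
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long maxSize = 256 * mb; // hypothetical max split size, for illustration only

        // Small files under HYBRID fall through to BI, the class in the stack trace.
        System.out.println(choose(SplitStrategyKind.HYBRID, 10 * mb, maxSize));
        // Large files under HYBRID pick ETL.
        System.out.println(choose(SplitStrategyKind.HYBRID, 512 * mb, maxSize));
        // Forcing ETL bypasses the size check entirely.
        System.out.println(choose(SplitStrategyKind.ETL, 10 * mb, maxSize));
    }
}
```

Under these assumptions the error would only appear for inputs whose average file size does not exceed the max split size, which may be why it shows up only occasionally.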

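BISplitStrategy is exactly the class that throws the error. I have not yet looked closely at why it fails, but the problem can be avoided by changing this setting: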

set hive.exec.orc.split.strategy=ETL
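One detail worth noting about the value itself: the Context constructor quoted earlier passes any non-HYBRID value straight to SplitStrategyKind.valueOf, and Java's Enum.valueOf is case-sensitive. The sketch below reproduces only that parsing rule; the class name StrategyValueCheck and the method parse are hypothetical, and it assumes the configured string reaches the constructor unmodified.

```java
class StrategyValueCheck {
    enum SplitStrategyKind { HYBRID, BI, ETL }

    // Same rule as the constructor in the excerpt: null or "HYBRID" yields
    // HYBRID, anything else is handed to valueOf.
    static SplitStrategyKind parse(String ss) {
        if (ss == null || ss.equals(SplitStrategyKind.HYBRID.name())) {
            return SplitStrategyKind.HYBRID;
        }
        // valueOf throws IllegalArgumentException for names that do not
        // match an enum constant exactly, including different casing.
        return SplitStrategyKind.valueOf(ss);
    }

    public static void main(String[] args) {
        System.out.println(parse(null));  // prints HYBRID
        System.out.println(parse("ETL")); // prints ETL
        try {
            parse("etl");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: etl");
        }
    }
}
```

If that assumption holds, writing the value in upper case exactly as above is the safe choice.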

This works around the problem for now; to be continued.
