【原创】大叔问题定位分享(17)spark查orc格式数据偶尔报错NullPointerException
spark查orc格式的数据有时会报这个错
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
... 47 more
跟进代码
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
static enum SplitStrategyKind {
HYBRID,
BI,
ETL
}
...
Context(Configuration conf) {
this.conf = conf;
minSize = conf.getLong(MIN_SPLIT_SIZE, DEFAULT_MIN_SPLIT_SIZE);
maxSize = conf.getLong(MAX_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE);
String ss = conf.get(ConfVars.HIVE_ORC_SPLIT_STRATEGY.varname);
if (ss == null || ss.equals(SplitStrategyKind.HYBRID.name())) {
splitStrategyKind = SplitStrategyKind.HYBRID;
} else {
LOG.info("Enforcing " + ss + " ORC split strategy");
splitStrategyKind = SplitStrategyKind.valueOf(ss);
}
...
switch(context.splitStrategyKind) {
case BI:
// BI strategy requested through config
splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
deltas, covered);
break;
case ETL:
// ETL strategy requested through config
splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
deltas, covered);
break;
default:
// HYBRID strategy
if (avgFileSize > context.maxSize) {
splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);
} else {
splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);
}
break;
}
org.apache.hadoop.hive.conf.HiveConf.ConfVars
HIVE_ORC_SPLIT_STRATEGY("hive.exec.orc.split.strategy", "HYBRID", new StringSet("HYBRID", "BI", "ETL"),
"This is not a user level config. BI strategy is used when the requirement is to spend less time in split generation" +
" as opposed to query execution (split generation does not read or cache file footers)." +
" ETL strategy is used when spending little more time in split generation is acceptable" +
" (split generation reads and caches file footers). HYBRID chooses between the above strategies" +
" based on heuristics."),
The HYBRID mode reads the footers for all files if there are fewer files than expected mapper count, switching over to generating 1 split per file if the average file sizes are smaller than the default HDFS blocksize. ETL strategy always reads the ORC footers before generating splits, while the BI strategy generates per-file splits fast without reading any data from HDFS.
可见hive.exec.orc.split.strategy默认是HYBRID,HYBRID时如果不满足
if (avgFileSize > context.maxSize) {
则
splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);
报错的就是BISplitStrategy,具体这个类为什么报错还没有细看,不过可以修改设置避免这个问题
set hive.exec.orc.split.strategy=ETL
问题暂时解决,未完待续;
【原创】大叔问题定位分享(17)spark查orc格式数据偶尔报错NullPointerException的更多相关文章
- 【原创】大叔问题定位分享(24)hbase standalone方式启动报错
hbase 2.0.2 hbase standalone方式启动报错: 2019-01-17 15:49:08,730 ERROR [Thread-24] master.HMaster: Failed ...
- 【原创】大叔问题定位分享(2)spark任务一定几率报错java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT
最近用yarn cluster方式提交spark任务时,有时会报错,报错几率是40%,报错如下: 18/03/15 21:50:36 116 ERROR ApplicationMaster91: Us ...
- 【原创】大叔问题定位分享(16)spark写数据到hive外部表报错ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
spark 2.1.1 spark在写数据到hive外部表(底层数据在hbase中)时会报错 Caused by: java.lang.ClassCastException: org.apache.h ...
- 【原创】大叔问题定位分享(15)spark写parquet数据报错ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead
spark 2.1.1 spark里执行sql报错 insert overwrite table test_parquet_table select * from dummy 报错如下: org.ap ...
- 【原创】大叔问题定位分享(10)提交spark任务偶尔报错 org.apache.spark.SparkException: A master URL must be set in your configuration
spark 2.1.1 一 问题重现 问题代码示例 object MethodPositionTest { val sparkConf = new SparkConf().setAppName(&qu ...
- 【原创】大叔问题定位分享(9)oozie提交spark任务报 java.lang.NoClassDefFoundError: org/apache/kafka/clients/producer/KafkaProducer
oozie中支持很多的action类型,比如spark.hive,对应的标签为: <spark xmlns="uri:oozie:spark-action:0.1"> ...
- 【原创】大叔问题定位分享(8)提交spark任务报错 Caused by: java.lang.ClassNotFoundException: org.I0Itec.zkclient.exception.ZkNoNodeException
spark 2.1.1 一 问题重现 spark-submit --master local[*] --class app.package.AppClass --jars /jarpath/zkcli ...
- 【原创】大叔问题定位分享(29)datanode启动报错:50020端口被占用
集群中有一台datanode一直启动报错如下: java.net.BindException: Problem binding to [$server1:50020] java.net.BindExc ...
- 【原创】大叔问题定位分享(13)HBase Region频繁下线
问题现象:hive执行sql报错 select count(*) from test_hive_table; 报错 Error: java.io.IOException: org.apache.had ...
随机推荐
- 07-JavaScript之常用内置对象
JavaScript之常用内置对象 1.数组Array 1.1数组的创建方式 // 直接创建数组 var colors = ['red', 'blue', 'green']; console.log( ...
- 安装inotify-tools监控工具
安装inotify-tools监控工具 yum install -y inotify-tools 2:查看inotify-tools包的工具程序 [root@dns3 ~]# rpm -ql inot ...
- 软件工程(GZSD2015) 第二次作业文档模板
题目: (此处列出题目) 需求分析: 基本功能 基本功能点1 基本功能点2 ... 扩展功能(可选) 高级功能(可选) 设计 设计点1 设计点2 ... 代码实现 // code here 程序截图 ...
- [官网]Using PuTTY
Previous | Contents | Next Chapter 3: Using PuTTY Section 3.1: During your session Section 3.1.1: Co ...
- Windows 通过批处理自动执行 linux服务器上面命令的办法
1. 使用putty 下载地址 https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html 直接使用 exe版本就可以 https:/ ...
- kettle集群(转换)
1.定义子服务器 新建子服务器中有一个必须为主服务器 新建集群 在需求集群运行的步骤中右键集群进行使用
- a = a + 1, a++, ++a ,a+=1区别在哪
a = a +1; 即最普通的写法,将a的值加1再赋给a:a+=1; 相当于 a = a+1; a++; 是先将a的值赋给一个变量, 再自增: ++a:是先自增, 再把a的值给一个变量
- 修改Linux的编码集
使用SSH secure远程连接linux时,查看linux里的内容, 发现乱码.这是由于SSH secure客户端的编码是gbk ,而 linux的默认编码是utf-8 要解决乱码问题,必须将lin ...
- git 学习(1) ----- git 本地仓库操作
最近在项目中使用git了,在实战中才知道,以前学习的git 知识只是皮毛,需要重新系统的学一下,读了一本叫 Learn Git in a Month of Lunches 的书籍,这本书通俗易懂,使 ...
- Nginx 处理Http请求简单流程
L45 1:三次握手后 系统内核收到请求根据端口负载均衡的分配到某个worker 2:nginx 会分配一个512byte链接内存池 3:初始化nginx的http模块并等待用户请求,假设用户在cli ...