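Spark job fails with java.io.EOFException: Premature EOF from inputStream
We are evaluating Spark 2.4 and Spark 2.3 as replacements for the company's existing Hive setup.
A few jobs fail under Spark with the following error: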
java.io.EOFException: Premature EOF from inputStream
at com.hadoop.compression.lzo.LzopInputStream.readFully(LzopInputStream.java:74)
at com.hadoop.compression.lzo.LzopInputStream.readHeader(LzopInputStream.java:115)
at com.hadoop.compression.lzo.LzopInputStream.<init>(LzopInputStream.java:54)
at com.hadoop.compression.lzo.LzopCodec.createInputStream(LzopCodec.java:112)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:129)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:269)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:268)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:226)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:294)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:294)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:294)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:294)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:294)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:294)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:294)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
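Investigation showed that the error is thrown when Spark reads a file whose size is 0 bytes: the LZO codec tries to read a header from the file (LzopInputStream.readHeader in the trace above) and hits end-of-file immediately.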
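The failure mode can be reproduced without Hadoop or hadoop-lzo. The sketch below is not the hadoop-lzo implementation; it only mimics a "read exactly N header bytes or fail" helper like the readFully frame in the stack trace. The object name, the helper names, and the header length of 9 bytes are assumptions chosen for illustration.

```scala
import java.io.{ByteArrayInputStream, EOFException, InputStream}

object EmptyHeaderDemo {
  // Hypothetical header length, used only for this illustration.
  val HeaderLen = 9

  // Read exactly buf.length bytes, or fail with EOFException if the stream ends first.
  def readFully(in: InputStream, buf: Array[Byte]): Unit = {
    var off = 0
    while (off < buf.length) {
      val n = in.read(buf, off, buf.length - off)
      if (n < 0) throw new EOFException("Premature EOF from inputStream")
      off += n
    }
  }

  // True if a full header can be read from the given file content.
  def canReadHeader(content: Array[Byte]): Boolean =
    try {
      readFully(new ByteArrayInputStream(content), new Array[Byte](HeaderLen))
      true
    } catch {
      case _: EOFException => false
    }
}
```

A zero-byte input fails on the very first read, which matches the observation that only empty files trigger the exception.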

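I could not find an official Spark fix for this bug,
so I patched the code by hand to skip such files.
Modify the corresponding part of HadoopRDD.scala as shown below: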
    // We get our input bytes from thread-local Hadoop FileSystem statistics.
    // If we do a coalesce, however, we are likely to compute multiple partitions in the same
    // task and in the same thread, in which case we need to avoid override values written by
    // previous partitions (SPARK-13071).
    private def updateBytesRead(): Unit = {
      getBytesReadCallback.foreach { getBytesRead =>
        inputMetrics.setBytesRead(existingBytesRead + getBytesRead())
      }
    }

    private var reader: RecordReader[K, V] = null
    private val inputFormat = getInputFormat(jobConf)
    HadoopRDD.addLocalConfiguration(
      new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(createTime),
      context.stageId, theSplit.index, context.attemptNumber, jobConf)

    reader =
      try {
        if (split.inputSplit.value.getLength != 0) {
          // Added: only create a record reader when the split is not empty.
          inputFormat.getRecordReader(split.inputSplit.value, jobConf, Reporter.NULL)
        } else {
          // Added: a zero-length split has nothing to read, so mark the
          // iterator as finished and skip it.
          logWarning(s"Skipped zero-length file: ${split.inputSplit}")
          finished = true
          null
        }
      } catch {
        case e: FileNotFoundException if ignoreMissingFiles =>
          logWarning(s"Skipped missing file: ${split.inputSplit}", e)
          finished = true
          null
        // Throw FileNotFoundException even if `ignoreCorruptFiles` is true
        case e: FileNotFoundException if !ignoreMissingFiles => throw e
        case e: IOException if ignoreCorruptFiles =>
          logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
          finished = true
          null
      }
    // Register an on-task-completion callback to close the input stream.
    context.addTaskCompletionListener[Unit] { context =>
      // Update the bytes read before closing is to make sure lingering bytesRead statistics in
      // this thread get correctly added.
      updateBytesRead()
      closeIfNeeded()
    }
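If rebuilding Spark is not an option, a job that reads its input paths directly (rather than through a Hive table) could drop the empty files before handing the paths to Spark. The following is a minimal sketch of that idea, not part of the patch above; `NonEmptyInputs`, `Entry`, `nonEmptyPaths`, and `toInputSpec` are hypothetical names. The Hadoop glue is shown only as comments so that the block stays self-contained.

```scala
object NonEmptyInputs {
  // Hypothetical minimal view of one file-listing entry: path plus length in bytes.
  final case class Entry(path: String, length: Long)

  // Keep only the files that contain at least one byte.
  def nonEmptyPaths(entries: Seq[Entry]): Seq[String] =
    entries.filter(_.length > 0).map(_.path)

  // Join the surviving paths with commas, the form accepted by SparkContext.textFile.
  def toInputSpec(entries: Seq[Entry]): String =
    nonEmptyPaths(entries).mkString(",")
}

// With Hadoop and Spark on the classpath, the entries could come from a
// directory listing (dir and sc are assumed to exist in the calling code):
//   val fs = FileSystem.get(sc.hadoopConfiguration)
//   val entries = fs.listStatus(new Path(dir)).filter(_.isFile)
//     .map(s => NonEmptyInputs.Entry(s.getPath.toString, s.getLen)).toSeq
//   val rdd = sc.textFile(NonEmptyInputs.toInputSpec(entries))
```

The caller still has to handle the case where every file is empty, because an empty path string is not a valid input. It may also be worth checking whether your Spark version supports the internal setting spark.hadoopRDD.ignoreEmptySplits, which, as far as I know, was added in Spark 2.3 and makes HadoopRDD skip empty input splits without any code change.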