使用spark2.4跟spark2.3 做替代公司现有的hive选项。

跑个别任务spark有以下错误

java.io.EOFException: Premature EOF from inputStream
at com.hadoop.compression.lzo.LzopInputStream.readFully(LzopInputStream.java:74)
at com.hadoop.compression.lzo.LzopInputStream.readHeader(LzopInputStream.java:115)
at com.hadoop.compression.lzo.LzopInputStream.<init>(LzopInputStream.java:54)
at com.hadoop.compression.lzo.LzopCodec.createInputStream(LzopCodec.java:112)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:129)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:269)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:268)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:226)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:294)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:294)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:294)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:294)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:294)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:294)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:294)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

排查原因 发现是读取0 size 大小的文件时出错

并没有发现spark官方有修复该bug

手动修改代码 过滤掉这种文件

在 HadoopRDD.scala 类相应位置修改如图即可

      // We get our input bytes from thread-local Hadoop FileSystem statistics.
// If we do a coalesce, however, we are likely to compute multiple partitions in the same
// task and in the same thread, in which case we need to avoid override values written by
// previous partitions (SPARK-13071).
private def updateBytesRead(): Unit = {
getBytesReadCallback.foreach { getBytesRead =>
inputMetrics.setBytesRead(existingBytesRead + getBytesRead())
}
} private var reader: RecordReader[K, V] = null
private val inputFormat = getInputFormat(jobConf)
HadoopRDD.addLocalConfiguration(
new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(createTime),
context.stageId, theSplit.index, context.attemptNumber, jobConf) reader =
try {
if (split.inputSplit.value.getLength != 0) { //文件大小不为零 采取读取
inputFormat.getRecordReader(split.inputSplit.value, jobConf, Reporter.NULL)
} else {
logWarning(s"Skipped the file size 0 file: ${split.inputSplit}")
finished = true //大小为0 即结束 跳过
null
}
} catch {
case e: FileNotFoundException if ignoreMissingFiles =>
logWarning(s"Skipped missing file: ${split.inputSplit}", e)
finished = true
null
// Throw FileNotFoundException even if `ignoreCorruptFiles` is true
case e: FileNotFoundException if !ignoreMissingFiles => throw e
case e: IOException if ignoreCorruptFiles =>
logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
finished = true
null
}
// Register an on-task-completion callback to close the input stream.
context.addTaskCompletionListener[Unit] { context =>
// Update the bytes read before closing is to make sure lingering bytesRead statistics in
// this thread get correctly added.
updateBytesRead()
closeIfNeeded()
}

spark 执行报错 java.io.EOFException: Premature EOF from inputStream的更多相关文章

  1. 关于spark入门报错 java.io.FileNotFoundException: File file:/home/dummy/spark_log/file1.txt does not exist

    不想看废话的可以直接拉到最底看总结 废话开始: master: master主机存在文件,却报 执行spark-shell语句:  ./spark-shell  --master spark://ma ...

  2. Spark启动报错|java.io.FileNotFoundException: File does not exist: hdfs://hadoop101:9000/directory

    at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:) at org.a ...

  3. hadoop MR 任务 报错 &quot;Error: java.io.IOException: Premature EOF from inputStream at org.apache.hadoop.io&quot;

    错误原文分析 文件操作超租期,实际上就是data stream操作过程中文件被删掉了.一般是由于Mapred多个task操作同一个文件.一个task完毕后删掉文件导致. 这个错误跟dfs.datano ...

  4. hbase_异常_03_java.io.EOFException: Premature EOF: no length prefix available

    一.异常现象 更改了hadoop的配置文件:core-site.xml  和   mapred-site.xml  之后,重启hadoop 和 hbase 之后,发现hbase日志中抛出了如下异常: ...

  5. Spark报错java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

    Spark 读取 JSON 文件时运行报错 java.io.IOException: Could not locate executable null\bin\winutils.exe in the ...

  6. 关于SpringMVC项目报错:java.io.FileNotFoundException: Could not open ServletContext resource [/WEB-INF/xxxx.xml]

    关于SpringMVC项目报错:java.io.FileNotFoundException: Could not open ServletContext resource [/WEB-INF/xxxx ...

  7. Kafka 启动报错java.io.IOException: Can't resolve address.

    阿里云上 部署Kafka 启动报错java.io.IOException: Can't resolve address. 本地调试的,报错 需要在本地添加阿里云主机的 host 映射   linux ...

  8. 文件上传报错java.io.FileNotFoundException拒绝访问

    局部代码如下: File tempFile = new File("G:/tempfileDir"+"/"+fileName); if(!tempFile.ex ...

  9. 完美解决JavaIO流报错 java.io.FileNotFoundException: F:\ (系统找不到指定的路径。)

    完美解决JavaIO流报错 java.io.FileNotFoundException: F:\ (系统找不到指定的路径.) 错误原因 读出文件的路径需要有被拷贝的文件名,否则无法解析地址 源代码(用 ...

随机推荐

  1. leetcode解题报告(10):Merge Two Sorted Lists

    描述 Merge two sorted linked lists and return it as a new list. > The new list should be made by sp ...

  2. 【poj2709】Painter--贪心

    Painter Time Limit: 1000MS   Memory Limit: 65536K Total Submissions: 5621   Accepted: 3228 Descripti ...

  3. ORM SQLAlchemy 简介

    对象关系映射(Object Relational Mapping,简称ORM使用DB-API访问数据库,需要懂 SQL 语言,能够写 SQL 语句,如果不想懂 SQL,又想使用关系型数据库,可以使用 ...

  4. 跨域方案JSONP与CORS的各自优缺点以及应用场景

    转自 https://www.zhihu.com/question/41992168/answer/217903179 首先明确:JSONP与CORS的使用目的相同,并且都需要服务端和客户端同时支持, ...

  5. IdentityServer4入门一

    这几天学习IdentityServer4,感觉内容有点乱,也可能自己水平有限吧.但为了巩固学习的内容,也打算自己理一下思路. 首先IdentityServer解决什么问题? 下图是我们的一个程序的组织 ...

  6. varnish web cache服务

    varnish介绍 缓存开源解决方案: - varnish - 充分利用epoll机制(能显著提高程序在大量并发连接中只有少量活跃的情况下的系统CPU利用率),并发量大,单连接资源较轻 - squid ...

  7. 黑马vue---46、vue使用过渡类名实现动画

    黑马vue---46.vue使用过渡类名实现动画 一.总结 一句话总结: vue动画的过渡类名的时间点中没有设置样式的话就是默认的样式 使用 transition 元素,把 需要被动画控制的元素,包裹 ...

  8. Linux 时间同步 Chrony

    chrony 是网络时间协议(NTP)的通用实现. chrony 包含两个程序:chronyd 是一个可以在启动时启动的守护程序.chronyc 是一个命令行界面程序,用于监视 chronyd 的性能 ...

  9. jQuery源码解读----part 2

    分离构造器 通过new操作符构建一个对象,一般经过四步: A.创建一个新对象 B.将构造函数的作用域赋给新对象(所以this就指向了这个新对象) C.执行构造函数中的代码 D.返回这个新对象 最后一点 ...

  10. 如何丧心病狂的使用python爬虫读小说

    写在前边 其实一直想入门python很久了,慕课网啊,菜鸟教程啊python的基础的知识被我翻了很多遍了,但是一直没有什么实践.刚好,这两天被别人一直安利一本小说<我可能修的是假仙>,还在 ...