错误分析

堆栈信息中有一个错误信息:Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 264, idc-xx-xx-3-30.d.xx.com, executor 2): java.lang.OutOfMemoryError: Java heap space

根据提示信息可以得到以下几点

  • stage由一堆task组成,也就是taskset,编号1的task在stage2中失败了4次
  • executor 是实际执行task的节点,编号2的executor发生了Java heap space
  • executor 内存配置的是512M,没有配置 spark.executor.memoryOverhead,spark在计算executor最终需要分配多少内存时有以下机制
  1. 未配置spark.executor.memoryOverhead来直接控制off-heap时(堆外内存,将对象序列化后放在一大块gc不会直接管理的内存中,需要的时候再反序列化使用,就像放到磁盘上一样,此处堆外内存包含了方法区,直接内存,虚拟机栈,本地方法栈)

    realMem = executorMemory[heap] + (executorMemory * 0.10, with minimum of 384)[off-heap]

    2)配置spark.executor.memoryOverhead

    realMem = executorMemory[heap] + memoryOverhead[off-heap]

readMem表示java进程需要申请的总内存,如果超过container的内存容量,会被直接kill掉

异常种类

  • OutOfMemoryError: Java heap space,堆内存不足,溢出,需调整--executor-memory
  • OutOfMemoryError: Java permgen space,堆外内存不足,溢出,需调整spark.executor.memoryOverhead

下述异常属于Java heap space,调整--executor-memory

RDD的位置,根据MemoryMode可以选择是堆内或堆外

日志中查看到的异常信息

: org.apache.spark.SparkException: Job aborted.

at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147)

at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)

at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)

at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)

at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)

at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)

at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)

at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)

at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)

at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)

at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)

at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)

at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)

at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)

at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)

at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)

at org.apache.spark.sql.Dataset.(Dataset.scala:185)

at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)

at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)

at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)

at py4j.Gateway.invoke(Gateway.java:280)

at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)

at py4j.commands.CallCommand.execute(CallCommand.java:79)

at py4j.GatewayConnection.run(GatewayConnection.java:214)

at java.lang.Thread.run(Thread.java:745)

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 264, idc-xx-xx-3-30.d.xx.com, executor 2): java.lang.OutOfMemoryError: Java heap space

at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:778)

at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:511)

at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)

at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)

at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)

at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)

at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)

at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)

at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)

at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)

at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)

at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)

at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)

at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:132)

at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)

at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1005)

at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:996)

at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:936)

at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:996)

at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:700)

at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

at org.apache.spark.scheduler.Task.run(Task.scala:99)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

导致异常的代码

/**
* @param f file to read the chunks from
* @return the chunks
* @throws IOException
*/
public List<Chunk> readAll(FSDataInputStream f) throws IOException {
List<Chunk> result = new ArrayList<Chunk>(chunks.size());
f.seek(offset);
byte[] chunksBytes = new byte[length]; //778行,分配长为length的byte[]时没有足够的可用内存导致heap space
f.readFully(chunksBytes);
// report in a counter the data we just scanned
BenchmarkCounter.incrementBytesRead(length);
int currentChunkOffset = 0;
for (int i = 0; i < chunks.size(); i++) {
ChunkDescriptor descriptor = chunks.get(i);
if (i < chunks.size() - 1) {
result.add(new Chunk(descriptor, chunksBytes, currentChunkOffset));
} else {
// because of a bug, the last chunk might be larger than descriptor.size
result.add(new WorkaroundChunk(descriptor, chunksBytes, currentChunkOffset, f));
}
currentChunkOffset += descriptor.size;
}
return result ;
}

Spark执行失败时的一个错误分析的更多相关文章

  1. Hadoop概念学习系列之为什么hadoop/spark执行作业时,输出路径必须要不存在?(三十九)

    很多人只会,但没深入体会和想为什么要这样? 拿Hadoop来说,当然,spark也一样的道理. 输出路径由Hadoop自己创建,实际的结果文件遵守part-nnnn的约定. 如何指定一个已有目录作为H ...

  2. 程序中使用事务来管理sql语句的执行,执行失败时,可以达到回滚的要求。

    1.设置使用事务的SQL执行语句 /// <summary> /// 使用有事务的SQL语句 /// </summary> /// <param name="s ...

  3. svn cleanup 执行失败时,可以勾选 break locks,

  4. 【转】mysql触发器的实战(触发器执行失败,sql会回滚吗)

    1   引言Mysql的触发器和存储过程一样,都是嵌入到mysql的一段程序.触发器是mysql5新增的功能,目前线上凤巢系统.北斗系统以及哥伦布系统使用的数据库均是mysql5.0.45版本,很多程 ...

  5. ThreadPoolExecutor线程池任务执行失败的时候会怎样

    接上一篇 <JDK1.8中的线程池> 1.  任务执行失败时的处理逻辑 1.1.  Worker Worker相当于线程池中的线程 可以看到,Worker有几个重要的属性: thread ...

  6. pybot执行多条用例时,某一个用例执行失败,停止所有用例的执行

    问题: pybot执行多条用例时,某一个用例执行失败,停止所有用例的执行 解决办法: pybot -exitonfailure E:\robot\呼送项目\测试用例\基本流程\主流程.txt 参考文章 ...

  7. vs2015 生成项目时,提示执行失败,参数错误

    今天vs2015 生成项目时,提示执行失败,参数错误.查了很多资料未解决 后来,发现只有一个项目出现这个问题,其他项目生成正常.怀疑是该项目解决方案的问题 于是将解决项目中的项目移除,逐一生成引用,解 ...

  8. spark 在yarn执行job时一直抱0.0.0.0:8030错误

    近日新写完的spark任务放到yarn上面执行时,在yarn的slave节点中一直看到报错日志:连接不到0.0.0.0:8030 . The logs are as below: 2014-08-11 ...

  9. 大数据学习day23-----spark06--------1. Spark执行流程(知识补充:RDD的依赖关系)2. Repartition和coalesce算子的区别 3.触发多次actions时,速度不一样 4. RDD的深入理解(错误例子,RDD数据是如何获取的)5 购物的相关计算

    1. Spark执行流程 知识补充:RDD的依赖关系 RDD的依赖关系分为两类:窄依赖(Narrow Dependency)和宽依赖(Shuffle Dependency) (1)窄依赖 窄依赖指的是 ...

随机推荐

  1. Kubernetes K8S之Ingress详解与示例

    K8S之Ingress概述与说明,并详解Ingress常用示例 主机配置规划 服务器名称(hostname) 系统版本 配置 内网IP 外网IP(模拟) k8s-master CentOS7.7 2C ...

  2. JSTL1.1函数标签库(functions)

    JSTL1.1函数标签库(functions) 在jstl中的fn标签也是我们在网页设计中经常要用到的很关键的标签,在使用的时候要先加上头 <%@ taglib uri="http:/ ...

  3. eureka集群的搭建

    本次将会创建三个注册中心和一个客户端进行集群,架构图如下: 修改本机hosts文件,创建三个域名: 代码结构如图: 由于三个注册中心结构都是一样的,区别在于配置文件: #注册中心(eureka-ser ...

  4. Servlet3.x部署描述符

    简介 web.xml即部署描述符,位于WEB-INF目录下.在Servlet3以上版本有提供了注解的方式部署Servlet,因此web.xml是可选的.web.xml大概框架如下: <?xml ...

  5. MySQL触发器初试:当A表插入新记录,自动在B表中插入相同ID的记录

    今天第一次用MySQL的触发器,怕忘了,赶紧写篇博客记录一下. 废话不说,先上语法: 1 CREATE TRIGGER trigger_name 2 { BEFORE | AFTER } { INSE ...

  6. ECharts系列:玩转ECharts之常用图(折线、柱状、饼状、散点、关系、树)

    一.背景 最近产品叫我做一些集团系列的统计图,包括集团组织.协作.销售.采购等方面的.作为一名后端程序员,于是趁此机会来研究研究这个库. 如果你仅仅停留在用的层面,那还是蛮简单的. 二.介绍 ECha ...

  7. 无所不能的embedding 3. word2vec->Doc2vec[PV-DM/PV-DBOW]

    这一节我们来聊聊不定长的文本向量,这里我们暂不考虑有监督模型,也就是任务相关的句子表征,只看通用文本向量,根据文本长短有叫sentence2vec, paragraph2vec也有叫doc2vec的. ...

  8. 实验 4:Open vSwitch 实验——Mininet 中使用 OVS 命令

    一.安装目的 Mininet 安装之后,会连带安装 Open vSwitch,可以直接通过 Python 脚本调用Open vSwitch 命令,从而直接控制 Open vSwitch,通过实验了解调 ...

  9. DoModal 函数的用法

    转载:https://blog.csdn.net/mpp_king/article/details/79707728                        https://www.cnblog ...

  10. matlab find函数使用语法

    find 找到非零元素的索引和值 语法: 1. ind = find(X) 2. ind = find(X, k) 3. ind = find(X, k, 'first') 4. ind = find ...