1、执行spark-submit时出错

执行任务如下:

# ./spark-submit --class org.apache.spark.examples.SparkPi  /hadoop/spark/examples/jars/spark-examples_2.11-2.4.0.jar 100  

报错如下:

2019-02-22 09:56:26 INFO  StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/1 is now RUNNING
2019-02-22 09:56:26 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, kvm-test, 36768, None)
2019-02-22 09:56:26 INFO BlockManagerMasterEndpoint:54 - Registering block manager kvm-test:36768 with 366.3 MB RAM, BlockManagerId(driver, kvm-test, 36768, None)
2019-02-22 09:56:26 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, kvm-test, 36768, None)
2019-02-22 09:56:26 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, kvm-test, 36768, None)
2019-02-22 09:56:26 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5aae8eb5{/metrics/json,null,AVAILABLE,@Spark}
2019-02-22 09:56:27 INFO EventLoggingListener:54 - Logging events to hdfs://hadoop-cluster/spark/eventLog/app-20190222015626-0020.snappy
2019-02-22 09:56:27 INFO StandaloneSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
2019-02-22 09:56:28 INFO SparkContext:54 - Starting job: reduce at SparkPi.scala:38
2019-02-22 09:56:28 INFO DAGScheduler:54 - Got job 0 (reduce at SparkPi.scala:38) with 100 output partitions
2019-02-22 09:56:28 INFO DAGScheduler:54 - Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
2019-02-22 09:56:28 INFO DAGScheduler:54 - Parents of final stage: List()
2019-02-22 09:56:28 INFO DAGScheduler:54 - Missing parents: List()
2019-02-22 09:56:28 INFO DAGScheduler:54 - Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
2019-02-22 09:56:28 INFO StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/1 is now EXITED (Command exited with code 1)
2019-02-22 09:56:28 INFO StandaloneSchedulerBackend:54 - Executor app-20190222015626-0020/1 removed: Command exited with code 1
2019-02-22 09:56:28 INFO StandaloneAppClient$ClientEndpoint:54 - Executor added: app-20190222015626-0020/2 on worker-20190111083714-172.20.1.1-45882 (172.20.1.1:45882) with 1 core(s)
2019-02-22 09:56:28 INFO StandaloneSchedulerBackend:54 - Granted executor ID app-20190222015626-0020/2 on hostPort 172.20.1.1:45882 with 1 core(s), 512.0 MB RAM
2019-02-22 09:56:28 INFO StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/2 is now RUNNING
2019-02-22 09:56:28 INFO BlockManagerMaster:54 - Removal of executor 1 requested
2019-02-22 09:56:28 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 1
2019-02-22 09:56:28 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 1 from BlockManagerMaster.
2019-02-22 09:56:28 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 1936.0 B, free 366.3 MB)
2019-02-22 09:56:28 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 1236.0 B, free 366.3 MB)
2019-02-22 09:56:28 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on kvm-test:36768 (size: 1236.0 B, free: 366.3 MB)
2019-02-22 09:56:28 INFO SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1161
2019-02-22 09:56:28 INFO DAGScheduler:54 - Submitting 100 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
2019-02-22 09:56:28 INFO TaskSchedulerImpl:54 - Adding task set 0.0 with 100 tasks
2019-02-22 09:56:29 INFO StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/2 is now EXITED (Command exited with code 1)
2019-02-22 09:56:29 INFO StandaloneSchedulerBackend:54 - Executor app-20190222015626-0020/2 removed: Command exited with code 1
2019-02-22 09:56:29 INFO StandaloneAppClient$ClientEndpoint:54 - Executor added: app-20190222015626-0020/3 on worker-20190111083714-172.20.1.1-45882 (172.20.1.1:45882) with 1 core(s)
2019-02-22 09:56:29 INFO BlockManagerMaster:54 - Removal of executor 2 requested
2019-02-22 09:56:29 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 2
2019-02-22 09:56:29 INFO StandaloneSchedulerBackend:54 - Granted executor ID app-20190222015626-0020/3 on hostPort 172.20.1.1:45882 with 1 core(s), 512.0 MB RAM
2019-02-22 09:56:29 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 2 from BlockManagerMaster.
2019-02-22 09:56:29 INFO StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/3 is now RUNNING
2019-02-22 09:56:31 INFO StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/3 is now EXITED (Command exited with code 1)
2019-02-22 09:56:31 INFO StandaloneSchedulerBackend:54 - Executor app-20190222015626-0020/3 removed: Command exited with code 1
2019-02-22 09:56:31 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 3 from BlockManagerMaster.
2019-02-22 09:56:31 INFO BlockManagerMaster:54 - Removal of executor 3 requested
2019-02-22 09:56:31 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 3
2019-02-22 09:56:31 INFO StandaloneAppClient$ClientEndpoint:54 - Executor added: app-20190222015626-0020/4 on worker-20190111083714-172.20.1.1-45882 (172.20.1.1:45882) with 1 core(s)
2019-02-22 09:56:31 INFO StandaloneSchedulerBackend:54 - Granted executor ID app-20190222015626-0020/4 on hostPort 172.20.1.1:45882 with 1 core(s), 512.0 MB RAM
2019-02-22 09:56:31 INFO StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/4 is now RUNNING
2019-02-22 09:56:33 INFO StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20190222015626-0020/4 is now EXITED (Command exited with code 1)
2019-02-22 09:56:33 INFO StandaloneSchedulerBackend:54 - Executor app-20190222015626-0020/4 removed: Command exited with code 1
2019-02-22 09:56:33 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 4 from BlockManagerMaster.
2019-02-22 09:56:33 INFO BlockManagerMaster:54 - Removal of executor 4 requested  

从报错看出来,,任务一直在请求,但是executor莫名退出了,日志后面还有一个警告,如下:

2019-02-22 09:42:58 WARN  TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

分析:从这个信息可以看出来,,task没有获取到资源

解决:

第一种情况:资源不足(可能是CPU,也可能是内存),这种情况可以调整内存(driver或者executor)或者CPU大小

例如,按如下调整,很多情况都是executor内存设置的过大,超出了实际的内存大小

# ./spark-submit --class org.apache.spark.examples.SparkPi  --executor-memory 512M --total-executor-cores 2 --driver-memory 512M /hadoop/spark/examples/jars/spark-examples_2.11-2.4.0.jar 100   

第二种情况:也是我遇到的。我有一个spark集群+一个spark客户端,我在spark集群里面执行任务可以正常执行,但是放到spark客户端执行的时候就报错了。机器内存,cpu都足够大。导致错误的原因竟然是主机名和ip对应出错了,

由于spark集群是以前搭建的,今天做了一个spark,忘记在spark集群里面添加spark客户端的主机和ip映射了。添加上好了。

总结:

出现这类问题一般有几个可能的原因,逐一检查排除即可:
 
(1).因为提交任务的节点不能和worker节点交互,因为提交完任务后提交任务节点上会起一个进程,展示任务进度,大多端口为4044,工作节点需要反馈进度给该该端口,所以如果主机名或者IP在hosts中配置不正确。所以检查下主机名和ip是否配置正确。

(2).也有可能是内存不足造成的。内存设置可以根据情况调整下。另外,也检查下web UI看看,确保worker节点处于alive状态。 

2、错误日志如下

19/04/08 23:47:19 ERROR ContextCleaner: Error cleaning broadcast 11700946
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:148)
at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:321)
at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)
at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:238)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:194)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:185)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:185)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)
at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178)
at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)

同时,日志里面还报了java.lang.OutOfMemoryError: Java heap space

分析:从上面日志分析,是由于spark内存不够,导致gc,gc会使得executor与driver通信中断。

解决:(1)、增加硬件资源 ,修改executor内存;

(2)、增大作业并发度;

(3)、修改spark-defaults.conf ,加大executor通信超时时间spark.executor.heartbeatInterval

spark拍错与优化

spark错误记录总结的更多相关文章

  1. uploadify插件Http Error(302)错误记录(MVC)

    由于项目(asp.net MVC)需要做一个附件上传的功能,使用的是jQuery的Uploadify插件的2.1.0版本,上传文件到自己项目指定的文件夹下面.做完之后,在谷歌上测试是正确的,在火狐上报 ...

  2. 开发错误记录8:Unable to instantiate application com

    开发错误记录8:Unable to instantiate application com.android.tools.fd.runtime.BootstrapApplication 这是因为在And ...

  3. PHP 错误与异常 笔记与总结(5)配置文件中与错误日志相关的选项 && 将错误记录到指定的文件中

    [记录错误(生产环境)] php.ini: ① 开启 / 关闭 错误日志功能 log_errors = On ② 设置 log_errors 的最大字节数 log_errors_max_len = 其 ...

  4. 安装nagios出现的两个错误记录

    最近在安装nagios,出现几个错误记录: 一 检查nagios配置的时候出现错误如下: Warning: Duplicate definition found for host 'kelly' (c ...

  5. [置顶] 利用Global.asax的Application_Error实现错误记录,错误日志

    利用Global.asax的Application_Error实现错误记录 错误日志 void Application_Error(object sender, EventArgs e) { // 在 ...

  6. streamsets 错误记录处理

    我们可以在stage 级别,或者piepline 级别进行error 处理配置 pipeline的错误记录处理 discard(丢踢) send response to Origin pipeline ...

  7. php设置错误,错误记录

    //设置错误级别. error_reporting(E_ALL);  //显示所有错误 error_reporting(E_ALL&~E_NOTICE);  //显示所有错误但不显示提示级别的 ...

  8. 27:简单错误记录SimpleErrorLog

    题目描述 开发一个简单错误记录功能小模块,能够记录出错的代码所在的文件名称和行号. 处理: 1. 记录最多8条错误记录,循环记录,对相同的错误记录(净文件名称和行号完全匹配)只记录一条,错误计数增加: ...

  9. WebSphere中数据源连接池太小导致的连接超时错误记录

    WebSphere中数据源连接池太小导致的连接超时错误记录. 应用连接超时错误信息: [// ::: CST] webapp E com.ibm.ws.webcontainer.webapp.WebA ...

随机推荐

  1. 1.MVC基础-初识MVC,与WebForm比较

    1.Net WebForm的开发模式

  2. iOS - Objective-C 关联(objc_setAssociatedObject、objc_getAssociatedObject、objc_removeAssociatedObjects)

    关联是指把两个对象相互关联起来,使得其中的一个对象作为另外一个对象的一部分. 关联特性只有在Mac OS X V10.6以及以后的版本上才是可用的. 在类的定义之外为类增加额外的存储空间 使用关联,我 ...

  3. TypeScript入门一:配置TS工作环境

    配置手动编译TS文件工作环境 配置webpack自动化打包编译工作环境(后面补充) 一.TypeScript入门学习引言 进入主题之前,首先说明这个系列的博客是我刚刚接触TypeScript的学习笔记 ...

  4. 自定义过滤器和标签 & 静态文件相关

    自定义过滤器和标签 1.在settings中的INSTALLED_APPS配置当前app,不然django无法找到自定义的simple_tag. 2.在app中创建templatetags模块(模块名 ...

  5. vue锚点

    第一种: router.js中添加 mode: 'history', srcollBehavior(to,from,savedPosition){ if(to.hash){ return { sele ...

  6. 有用的vscode快捷键大全+自定义快捷键

    VS Code是前端的一个比较好用的代码编辑器,但是我们不能老是局限于鼠标操作呀,有时候很不方便,所以呢,快捷键大全来啦,有的可能会和你们电脑自带的快捷键冲突呢,这时候,你自己设置一下就好了呀 一.v ...

  7. S3C2440 gpio

    WATCHDOG TIMER 原理图 手册 举例 start.S .globl _start _start: /* 关看门狗 */ /* 往WTCON(0x53000000)写0 */ ldr r0, ...

  8. 在SAP除了使用Cordova生产移动应用外,还有这种方式

    本文和Jerry过去的文章不太一样,算不上Jerry的知识分享,只是记录一下Jerry用React-Native把应用安装到Android手机上遇到的一些问题,方便以后查看. Jerry的同事Leo用 ...

  9. linux 的GUNB修复问题

    1.意外断电,kali linux 虚拟机没有正常关机的时候 , 突然断电之后重启电脑之后,kali linux 直接黑屏了无法进入系统.如下面的界面 光标一直在闪烁. 这里可以使用 快捷键 同时按住 ...

  10. [dev][ipsec][esp] ipsec链路中断的感知问题

    ipsec如何感知到链路中断了?以下内容讲的是在没有配置DPD,且没有rekey的场 景下. 1. ESP认为,以下两个场景交由应用层来感知,应用层会发现ipsec的连接坏掉了. a,ESP承载的连接 ...