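Flink: "Heartbeat of TaskManager timed out" and "heartbeat of ResourceManager timed out"
I recently deployed a Flink job that stopped on its own after running for a while, which was frustrating. I looked up the time of the last checkpoint and went through the logs around that point: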
2019-12-13 07:25:24.566 flink [flink-akka.actor.default-dispatcher-41] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job PayOrder (88c9cc0c85875332cc5e4ed6418cd667) switched from state RUNNING to FAILING.
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id container_1566481621886_4397244_01_000004 timed out.
at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1656)
at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:339)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:154)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-12-13 07:25:24.519 flink [flink-akka.actor.default-dispatcher-20] INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection 9b7931812dbed76060b48a696d72a869: The heartbeat of ResourceManager with id 9b7931812dbed76060b48a696d72a869 timed out..
Searching the source for "Heartbeat of TaskManager with id" and "The heartbeat of ResourceManager with id" leads to the following code in JobMaster:
    private class TaskManagerHeartbeatListener implements HeartbeatListener<AccumulatorReport, Void> {
        private final JobMasterGateway jobMasterGateway;
        private TaskManagerHeartbeatListener(JobMasterGateway jobMasterGateway) {
            this.jobMasterGateway = Preconditions.checkNotNull(jobMasterGateway);
        }
        @Override
        public void notifyHeartbeatTimeout(ResourceID resourceID) {
            jobMasterGateway.disconnectTaskManager(
                resourceID,
                new TimeoutException("Heartbeat of TaskManager with id " + resourceID + " timed out."));
        }
        @Override
        public void reportPayload(ResourceID resourceID, AccumulatorReport payload) {
            for (AccumulatorSnapshot snapshot : payload.getAccumulatorSnapshots()) {
                schedulerNG.updateAccumulators(snapshot);
            }
        }
        @Override
        public CompletableFuture<Void> retrievePayload(ResourceID resourceID) {
            return CompletableFuture.completedFuture(null);
        }
    }
    private class ResourceManagerHeartbeatListener implements HeartbeatListener<Void, Void> {
        @Override
        public void notifyHeartbeatTimeout(final ResourceID resourceId) {
            runAsync(() -> {
                log.info("The heartbeat of ResourceManager with id {} timed out.", resourceId);
                if (establishedResourceManagerConnection != null && establishedResourceManagerConnection.getResourceManagerResourceID().equals(resourceId)) {
                    reconnectToResourceManager(
                        new JobMasterException(
                            String.format("The heartbeat of ResourceManager with id %s timed out.", resourceId)));
                }
            });
        }
        @Override
        public void reportPayload(ResourceID resourceID, Void payload) {
            // nothing to do since the payload is of type Void
        }
        @Override
        public CompletableFuture<Void> retrievePayload(ResourceID resourceID) {
            return CompletableFuture.completedFuture(null);
        }
    }
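To make the timeout path easier to follow, here is a minimal sketch of a heartbeat monitor. The stack trace shows HeartbeatManagerImpl$HeartbeatMonitor.run calling notifyHeartbeatTimeout. The sketch assumes the usual pattern behind such a monitor: a scheduled timeout task that is cancelled and rescheduled each time a heartbeat arrives, and that fires the listener if none arrives in time. The class HeartbeatMonitorSketch and all of its members are hypothetical names for illustration, not Flink's implementation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical, simplified heartbeat monitor; NOT Flink's HeartbeatManagerImpl. */
public class HeartbeatMonitorSketch {
    private final String targetId;
    private final long timeoutMillis;
    private final ScheduledExecutorService executor;
    private final Consumer<String> onTimeout; // plays the role of notifyHeartbeatTimeout
    private ScheduledFuture<?> pendingTimeout;

    public HeartbeatMonitorSketch(String targetId, long timeoutMillis,
                                  ScheduledExecutorService executor,
                                  Consumer<String> onTimeout) {
        this.targetId = targetId;
        this.timeoutMillis = timeoutMillis;
        this.executor = executor;
        this.onTimeout = onTimeout;
        resetTimeout(); // start the countdown as soon as the target is monitored
    }

    /** Called whenever a heartbeat from the target arrives: restart the countdown. */
    public synchronized void reportHeartbeat() {
        resetTimeout();
    }

    /** Stops monitoring without firing the timeout callback. */
    public synchronized void stop() {
        if (pendingTimeout != null) {
            pendingTimeout.cancel(false);
        }
    }

    private synchronized void resetTimeout() {
        if (pendingTimeout != null) {
            pendingTimeout.cancel(false);
        }
        // If no heartbeat arrives within timeoutMillis, the callback fires.
        pendingTimeout = executor.schedule(
            () -> onTimeout.accept(targetId), timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Demo helper: returns true if the timeout fired within waitMillis when no heartbeat was sent. */
    public static boolean timesOutWithoutHeartbeats(long timeoutMillis, long waitMillis) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch fired = new CountDownLatch(1);
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch(
            "tm-1", timeoutMillis, executor, id -> fired.countDown());
        try {
            return fired.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            monitor.stop();
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(timesOutWithoutHeartbeats(100, 2000));
    }
}
```

In this model, a TaskManager that is too busy to answer in time (a long GC pause or an overloaded host, for example) looks exactly like a dead one: the callback fires either way.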
The TaskManager listener is then instantiated here:
this.taskManagerHeartbeatManager = heartbeatServices.createHeartbeatManagerSender(
    resourceId,
    new TaskManagerHeartbeatListener(selfGateway),
    rpcService.getScheduledExecutor(),
    log);
Next, let's take a look at HeartbeatServices:
/**
 * HeartbeatServices gives access to all services needed for heartbeating. This includes the
 * creation of heartbeat receivers and heartbeat senders.
 */
public class HeartbeatServices {

    /** Heartbeat interval for the created services. */
    protected final long heartbeatInterval;

    /** Heartbeat timeout for the created services. */
    protected final long heartbeatTimeout;

    public HeartbeatServices(long heartbeatInterval, long heartbeatTimeout) {
        Preconditions.checkArgument(0L < heartbeatInterval, "The heartbeat interval must be larger than 0.");
        Preconditions.checkArgument(heartbeatInterval <= heartbeatTimeout, "The heartbeat timeout should be larger or equal than the heartbeat interval.");

        this.heartbeatInterval = heartbeatInterval;
        this.heartbeatTimeout = heartbeatTimeout;
    }

    /**
     * Creates a heartbeat manager which does not actively send heartbeats.
     *
     * @param resourceId Resource Id which identifies the owner of the heartbeat manager
     * @param heartbeatListener Listener which will be notified upon heartbeat timeouts for registered
     *                          targets
     * @param scheduledExecutor Scheduled executor to be used for scheduling heartbeat timeouts
     * @param log Logger to be used for the logging
     * @param <I> Type of the incoming payload
     * @param <O> Type of the outgoing payload
     * @return A new HeartbeatManager instance
     */
    public <I, O> HeartbeatManager<I, O> createHeartbeatManager(
            ResourceID resourceId,
            HeartbeatListener<I, O> heartbeatListener,
            ScheduledExecutor scheduledExecutor,
            Logger log) {

        return new HeartbeatManagerImpl<>(
            heartbeatTimeout,
            resourceId,
            heartbeatListener,
            scheduledExecutor,
            scheduledExecutor,
            log);
    }

    /**
     * Creates a heartbeat manager which actively sends heartbeats to monitoring targets.
     *
     * @param resourceId Resource Id which identifies the owner of the heartbeat manager
     * @param heartbeatListener Listener which will be notified upon heartbeat timeouts for registered
     *                          targets
     * @param scheduledExecutor Scheduled executor to be used for scheduling heartbeat timeouts
     * @param log Logger to be used for the logging
     * @param <I> Type of the incoming payload
     * @param <O> Type of the outgoing payload
     * @return A new HeartbeatManager instance which actively sends heartbeats
     */
    public <I, O> HeartbeatManager<I, O> createHeartbeatManagerSender(
            ResourceID resourceId,
            HeartbeatListener<I, O> heartbeatListener,
            ScheduledExecutor scheduledExecutor,
            Logger log) {

        return new HeartbeatManagerSenderImpl<>(
            heartbeatInterval,
            heartbeatTimeout,
            resourceId,
            heartbeatListener,
            scheduledExecutor,
            scheduledExecutor,
            log);
    }

    /**
     * Creates an HeartbeatServices instance from a {@link Configuration}.
     *
     * @param configuration Configuration to be used for the HeartbeatServices creation
     * @return An HeartbeatServices instance created from the given configuration
     */
    public static HeartbeatServices fromConfiguration(Configuration configuration) {
        long heartbeatInterval = configuration.getLong(HeartbeatManagerOptions.HEARTBEAT_INTERVAL);

        long heartbeatTimeout = configuration.getLong(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT);

        return new HeartbeatServices(heartbeatInterval, heartbeatTimeout);
    }
}
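The two checkArgument calls in the constructor require an interval greater than 0 and no larger than the timeout. The sketch below mirrors those checks in plain Java and also computes how many heartbeat intervals fit into one timeout window, a rough measure of how many consecutive heartbeats can be lost before the timeout fires. HeartbeatSettingsCheck and its methods are hypothetical names, and the 10000 ms interval used in main is an assumed value for illustration.

```java
/** Hypothetical helper that mirrors the argument checks in the HeartbeatServices constructor. */
public class HeartbeatSettingsCheck {

    /** Throws IllegalArgumentException if the pair violates the constructor's two rules. */
    public static void check(long intervalMs, long timeoutMs) {
        if (intervalMs <= 0L) {
            throw new IllegalArgumentException("The heartbeat interval must be larger than 0.");
        }
        if (intervalMs > timeoutMs) {
            throw new IllegalArgumentException(
                "The heartbeat timeout should be larger or equal than the heartbeat interval.");
        }
    }

    /** Number of whole heartbeat intervals that fit into one timeout window. */
    public static long intervalsPerTimeout(long intervalMs, long timeoutMs) {
        check(intervalMs, timeoutMs);
        return timeoutMs / intervalMs;
    }

    public static void main(String[] args) {
        long assumedIntervalMs = 10000L; // assumption for illustration only
        System.out.println(intervalsPerTimeout(assumedIntervalMs, 50000L));
        System.out.println(intervalsPerTimeout(assumedIntervalMs, 180000L));
    }
}
```

With these assumed numbers, a larger timeout lets more heartbeats go missing before the job fails. It also takes longer to notice a TaskManager that really has died.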
Sure enough, the timeout comes from HeartbeatManagerOptions.HEARTBEAT_TIMEOUT:
/** Timeout for requesting and receiving heartbeat for both sender and receiver sides. */
public static final ConfigOption<Long> HEARTBEAT_TIMEOUT =
key("heartbeat.timeout")
.defaultValue(50000L)
.withDescription("Timeout for requesting and receiving heartbeat for both sender and receiver sides.");
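The heartbeat timeouts may be caused by heavy load on YARN. As a temporary measure, raise this value (in milliseconds) in conf/flink-conf.yaml and keep observing.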
#Timeout for requesting and receiving heartbeat for both sender and receiver sides.
heartbeat.timeout: 180000
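
To double-check which timeout a given configuration file actually yields, the following sketch reads flat "key: value" lines of the kind shown above and falls back to the 50000 default from HEARTBEAT_TIMEOUT when the key is absent. FlinkConfSketch is a hypothetical helper, not part of Flink. It handles only simple flat entries and does not reproduce Flink's own configuration loading.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: reads flat "key: value" lines; NOT Flink's configuration loader. */
public class FlinkConfSketch {

    /** Parses flat "key: value" lines, skipping blank lines and lines starting with '#'. */
    public static Map<String, String> parse(String text) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            int separator = trimmed.indexOf(": ");
            if (separator <= 0) {
                continue; // not a simple "key: value" entry
            }
            entries.put(trimmed.substring(0, separator).trim(),
                        trimmed.substring(separator + 2).trim());
        }
        return entries;
    }

    /** Returns heartbeat.timeout in milliseconds, or the 50000 default if the key is missing. */
    public static long heartbeatTimeoutMillis(String text) {
        String value = parse(text).get("heartbeat.timeout");
        return value == null ? 50000L : Long.parseLong(value);
    }

    public static void main(String[] args) {
        String conf = "#Timeout for requesting and receiving heartbeat for both sender and receiver sides.\n"
            + "heartbeat.timeout: 180000\n";
        System.out.println(heartbeatTimeoutMillis(conf));
    }
}
```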

