http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html

Feature

  • Phase 1: Non-work-preserving RM restart

    As of Hadoop 2.4.0 release, only ResourceManager Restart Phase 1 is implemented which is described below.

    The overall concept is that RM will persist the application metadata (i.e. ApplicationSubmissionContext) in a pluggable state-store when client submits an application and also saves the final status of the application such as the completion state (failed, killed, finished) and diagnostics when the application completes. Besides, RM also saves the credentials like security keys, tokens to work in a secure environment. Any time RM shuts down, as long as the required information (i.e.application metadata and the alongside credentials if running in a secure environment) is available in the state-store, when RM restarts, it can pick up the application metadata from the state-store and re-submit the application. RM won’t re-submit the applications if they were already completed (i.e. failed, killed, finished) before RM went down.

    NodeManagers and clients during the down-time of RM will keep polling RM until RM comes up. When RM becomes alive, it will send a re-sync command to all the NodeManagers and ApplicationMasters it was talking to via heartbeats. As of Hadoop 2.4.0 release, the behaviors for NodeManagers and ApplicationMasters to handle this command are: NMs will kill all its managed containers and re-register with RM. From the RM’s perspective, these re-registered NodeManagers are similar to the newly joining NMs. AMs(e.g. MapReduce AM) are expected to shutdown when they receive the re-sync command. After RM restarts and loads all the application metadata, credentials from state-store and populates them into memory, it will create a new attempt (i.e. ApplicationMaster) for each application that was not yet completed and re-kick that application as usual. As described before, the previously running applications’ work is lost in this manner since they are essentially killed by RM via the re-sync command on restart.

  • Phase 2: Work-preserving RM restart

    As of Hadoop 2.6.0, we further enhanced RM restart feature to address the problem to not kill any applications running on YARN cluster if RM restarts.

    Beyond all the groundwork that has been done in Phase 1 to ensure the persistency of application state and reload that state on recovery, Phase 2 primarily focuses on re-constructing the entire running state of YARN cluster, the majority of which is the state of the central scheduler inside RM which keeps track of all containers’ life-cycle, applications’ headroom and resource requests, queues’ resource usage etc. In this way, RM doesn’t need to kill the AM and re-run the application from scratch as it is done in Phase 1. Applications can simply re-sync back with RM and resume from where it were left off.

    RM recovers its runing state by taking advantage of the container statuses sent from all NMs. NM will not kill the containers when it re-syncs with the restarted RM. It continues managing the containers and send the container statuses across to RM when it re-registers. RM reconstructs the container instances and the associated applications’ scheduling status by absorbing these containers’ information. In the meantime, AM needs to re-send the outstanding resource requests to RM because RM may lose the unfulfilled requests when it shuts down. Application writers using AMRMClient library to communicate with RM do not need to worry about the part of AM re-sending resource requests to RM on re-sync, as it is automatically taken care by the library itself.

yarn 单点故障 重启 ResourceManger Restart的更多相关文章

  1. OpenGL ES 3.0: 图元重启(Primitive restart)

    [TOC] 背景概述 在OpenGL绘制图形时,可能需要绘制多个并不相连的图形.这样的情况下这几个图形没法被当做一个图形来处理.也就需要多次调用 DrawArrays 或 DrawElements. ...

  2. Tomcat重启脚本restart.sh停止脚本stop.sh

    Tomcat重启脚本restart.sh停止脚本stop.sh Tomcat本身提供了 startup.sh(启动)shutdown.sh(关闭)脚本,我们在部署中经常会出现死进程形象,无法杀掉进程需 ...

  3. 四:ResourceManger Restart

    概述: RM是yarn中最重要的组件.但是只有一个RM,因此存在单点失败的问题.RM的重启有两种方式: 1.(Non-work-preserving RM restart) 不保留工作状态的重启   ...

  4. jar包重启脚本-restart.sh

    #!/bin/sh PROJECT_PATH=/var/www/ PROJECT_NAME=demo.jar PROJECT_ALL_LOG_NAME=logs/demo-all.log # stop ...

  5. YARN的重启动问题:RM Restart/RM HA/Timeline Server/NM Restart

    ResourceManger Restart ResourceManager负责资源管理和应用的调度,是YARN的核心组件,有可能存在单点失败的问题.ResourceManager Restart是使 ...

  6. Yarn NodeManager restart

    一.介绍默认Yarn NodeManager重启后会断开所有当前正在运行的container的状态,这意味着重启后需要重新启动container进程,该特性的作用就是把NM的状态临时保存到本地,重启后 ...

  7. pm2 重启策略(restart strategies)

    使用 PM2 启动应用程序 时,应用程序会在自动退出.事件循环为空 (node.js) 或应用程序崩溃时自动重新启动. 但您也可以配置额外的重启策略,例如: 使用定时任务重新启动应用程序 文件更改后重 ...

  8. Hadoop官方文档翻译—— YARN ResourceManager High Availability 2.7.3

    ResourceManager High Availability (RM高可用) Introduction(简介) Architecture(架构) RM Failover(RM 故障切换) Rec ...

  9. yarn资源管理器高可用性的实现

    资源管理器高可用性 . The ResourceManager (RM) is responsible for tracking the resources in a cluster, and sch ...

随机推荐

  1. Linux命令——expr

    前言 有时,在处理命令行时(特别是在处理shell脚本时),您可能会发现自己处于必须执行搜索字符串中的子字符串,查找其索引以及其他内容,或者执行比较和算术运算等情况.上述问题expr都能帮我们解决. ...

  2. Python笔记(30)-----logger

    转自: https://www.jb51.net/article/139080.htm logging模块介绍 Python的logging模块提供了通用的日志系统,熟练使用logging模块可以方便 ...

  3. P1462 通往奥格瑞玛的道路[最短路+二分+堆优化]

    题目来源:洛谷 题目背景 在艾泽拉斯大陆上有一位名叫歪嘴哦的神奇术士,他是部落的中坚力量 有一天他醒来后发现自己居然到了联盟的主城暴风城 在被众多联盟的士兵攻击后,他决定逃回自己的家乡奥格瑞玛 题目描 ...

  4. POI进行导出时候发现有不可读取的内容

    通过后台查询数据,然后使用poi进行导出时候,excel进行打开会出现下面的异常: 但是在WPS中就没有问题, 如果点击否,则不会显示任何内容,点击是,就会弹出来 查看修改记录为: 刚开始也进行了很多 ...

  5. springboot的入门

    SpringBoot SpringBoot是SpringMVC的升级版,简化配置,很可能成为下一代的框架 1.新建项目 怎么创建springBoot项目呢? 创建步骤复杂一点点 New Project ...

  6. python学习之正则表达式,StringIO模块,异常处理,搭建测试环境

    python正则表达式 引入一个强大的匹配功能来匹配字符串 import re 正则表达式的表示类型raw string类型(原生字符串类型) r'sa\\/sad/asd'用r转为raw strin ...

  7. Promise原理实现

    首先先看一下 promise 的调用方式: // 实例化 Promise: new MyPromise((resolve, reject) => { setTimeout(() => { ...

  8. flask中使用ajax 处理前端请求,结果展示在同一页面,不点击页面不展示

    在同一页面点击按钮,后端处理后展示在同一页面,不点击隐藏该结果:与上一篇大同小异,需要在 html.flask.js微调 效果展示: (未点击查询) (点击查询) html: <html> ...

  9. 在 windows 上安装 git 2.15

    下载 by win 下载地址:https://git-scm.com/download/win 如下图.选择对应的版本下载: 安装 by win 1.双击下载好的git安装包.弹出提示框.如下图: 2 ...

  10. 密码加密与微服务鉴权JWT详细使用

    [TOC] 1.1.了解微服务状态 微服务集群中的每个服务,对外提供的都是Rest风格的接口,而Rest风格的一个最重要的规范就是:服务的无状态性. 什么是无状态? 1.服务端不保存任何客户端请求者信 ...