ResourceManager High Availability

The ResourceManager (RM) is responsible for tracking the resources in a cluster and scheduling applications (e.g., MapReduce jobs). Prior to Hadoop 2.4, the ResourceManager was the single point of failure in a YARN cluster. The High Availability feature adds redundancy in the form of an Active/Standby ResourceManager pair to remove this otherwise single point of failure.

ResourceManager HA is realized through an Active/Standby architecture: at any point in time, one of the RMs is Active, and one or more RMs are in Standby mode, waiting to take over should anything happen to the Active. The transition to Active is triggered either by the admin (through the CLI) or by the integrated failover controller when automatic failover is enabled.
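The invariant described above (one Active RM, the others on Standby) can be illustrated with a small Python model. This is a hypothetical sketch: the names HAState and RMGroup and the ids rm1, rm2, rm3 are made up for illustration and are not part of the YARN API; real RMs are separate processes and do not share a dictionary like this.

```python
from enum import Enum


class HAState(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


class RMGroup:
    """Toy model of an RM HA group: at most one RM may be Active at a time."""

    def __init__(self, rm_ids):
        # Every RM starts in Standby until something promotes it.
        self.states = {rm_id: HAState.STANDBY for rm_id in rm_ids}

    def active(self):
        """Return the id of the Active RM, or None if all RMs are on Standby."""
        for rm_id, state in self.states.items():
            if state is HAState.ACTIVE:
                return rm_id
        return None

    def transition_to_standby(self, rm_id):
        self.states[rm_id] = HAState.STANDBY

    def transition_to_active(self, rm_id):
        current = self.active()
        if current is not None and current != rm_id:
            # Refuse to create a second Active RM (a split-brain).
            raise RuntimeError(f"{current} is already Active")
        self.states[rm_id] = HAState.ACTIVE
```

Keeping the check inside transition_to_active makes the "only one Active" rule impossible to violate from the outside, which is the property the HA design relies on.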

Manual transitions and failover

When automatic failover is not enabled, admins have to manually transition one of the RMs to Active. To fail over from one RM to another, they first transition the Active RM to Standby and then transition a Standby RM to Active. All of this can be done with the "yarn rmadmin" CLI.
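As a sketch, assuming two RMs with the hypothetical ids rm1 (currently Active) and rm2, the manual failover order can be captured as the two "yarn rmadmin" calls below. The helper functions are illustrative only; -transitionToStandby and -transitionToActive are rmadmin options, but check "yarn rmadmin -help" on your Hadoop version before relying on the exact syntax.

```python
import subprocess


def manual_failover_commands(from_rm, to_rm):
    """Return the 'yarn rmadmin' calls for a manual failover, in order.

    The current Active RM is demoted first, so two RMs are never
    Active at the same time.
    """
    if from_rm == to_rm:
        raise ValueError("source and target RM must differ")
    return [
        ["yarn", "rmadmin", "-transitionToStandby", from_rm],
        ["yarn", "rmadmin", "-transitionToActive", to_rm],
    ]


def run_failover(from_rm, to_rm, dry_run=True):
    """Print the commands, or execute them and stop at the first failure."""
    for cmd in manual_failover_commands(from_rm, to_rm):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if rmadmin exits non-zero.
            subprocess.run(cmd, check=True)
```

Calling run_failover("rm1", "rm2") only prints the two commands; with dry_run=False they are executed on a host where the yarn client is configured. Afterwards, "yarn rmadmin -getServiceState rm2" can be used to confirm that rm2 reports active.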

Automatic failover

The RMs have an option to embed the ZooKeeper-based ActiveStandbyElector to decide which RM should be the Active. When the Active goes down or becomes unresponsive, another RM is automatically elected to be the Active and takes over. Note that, unlike in HDFS, there is no need to run a separate ZKFC daemon: the ActiveStandbyElector embedded in the RMs acts as both the failure detector and the leader elector.
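The election can be pictured as a race for a single ephemeral lock node in ZooKeeper: whichever RM creates it is Active, and when that RM's session expires the node disappears and another RM wins the next round. The Python below is a toy in-memory model of that idea; FakeZooKeeper and run_election are hypothetical names, no real ZooKeeper client is used, and this is not the ActiveStandbyElector implementation.

```python
class FakeZooKeeper:
    """In-memory stand-in for one ZooKeeper ephemeral lock node (hypothetical)."""

    def __init__(self):
        self.lock_owner = None  # session that currently holds the lock node

    def try_acquire(self, session):
        # Creating the node succeeds only if nobody holds it yet;
        # the current owner "re-acquiring" simply keeps the lock.
        if self.lock_owner is None:
            self.lock_owner = session
        return self.lock_owner == session

    def expire_session(self, session):
        # Ephemeral nodes vanish when their owner's session expires,
        # which is how a dead or unresponsive Active is detected.
        if self.lock_owner == session:
            self.lock_owner = None


def run_election(zk, live_rm_ids):
    """Each live RM tries to take the lock; return the RM that holds it, if any."""
    for rm_id in live_rm_ids:
        if zk.try_acquire(rm_id):
            return rm_id
    return None
```

The same mechanism covers both roles named above: session expiry is the failure detection, and the lock race is the leader election, so no separate ZKFC process is needed in this model.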

Client, ApplicationMaster and NodeManager on RM failover

When there are multiple RMs, the configuration (yarn-site.xml) used by clients and nodes is expected to list all the RMs. Clients, ApplicationMasters (AMs) and NodeManagers (NMs) try connecting to the RMs in a round-robin fashion until they hit the Active RM. If the Active goes down, they resume round-robin polling until they hit the "new" Active. This default retry logic is implemented as org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider. You can override it by implementing org.apache.hadoop.yarn.client.RMFailoverProxyProvider and setting the value of yarn.client.failover-proxy-provider to that class name.
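The default round-robin behaviour can be sketched as follows. This is a hedged Python model, not ConfiguredRMFailoverProxyProvider itself: each RM is represented by a callable, StandbyError stands in for the error a Standby RM returns, and max_attempts is a hypothetical bound added so the sketch terminates.

```python
class StandbyError(Exception):
    """Raised by a Standby RM, which does not serve client requests."""


class RoundRobinFailover:
    """Toy round-robin failover in the spirit of the default proxy provider.

    `rms` maps an RM id to a callable that either answers the request
    (Active RM) or raises StandbyError / ConnectionError.
    """

    def __init__(self, rms, max_attempts=10):
        self.rms = rms
        self.rm_ids = list(rms)
        self.current = 0  # index of the RM we currently believe is Active
        self.max_attempts = max_attempts

    def call(self, request):
        for _ in range(self.max_attempts):
            rm_id = self.rm_ids[self.current]
            try:
                return rm_id, self.rms[rm_id](request)
            except (StandbyError, ConnectionError):
                # Not the Active RM, or unreachable: try the next one in order.
                self.current = (self.current + 1) % len(self.rm_ids)
        raise RuntimeError("no Active RM found")
```

Remembering self.current means that once the Active is found, later calls go straight to it; polling only resumes after that RM stops answering, which mirrors the "resume until they hit the new Active" behaviour described above.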

Recovering previous active-RM's state

With ResourceManager Restart enabled, the RM being promoted to Active loads the RM internal state and continues to operate from where the previous Active left off, as far as the RM restart feature allows. A new attempt is spawned for each managed application previously submitted to the RM. Applications can checkpoint periodically to avoid losing any work. The state store must be visible to both the Active and the Standby RMs. Currently, there are two RMStateStore implementations for persistence: FileSystemRMStateStore and ZKRMStateStore. The ZKRMStateStore implicitly allows write access to only a single RM at any point in time, and hence is the recommended store for an HA cluster. When using the ZKRMStateStore, there is no need for a separate fencing mechanism to address a potential split-brain situation, in which multiple RMs could assume the Active role. When using the ZKRMStateStore, it is advisable NOT to set the "zookeeper.DigestAuthenticationProvider.superDigest" property on the ZooKeeper cluster, so that the ZooKeeper admin does not have access to YARN application/user credential information.
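The single-writer idea can be illustrated with a fencing-token model. Note that this is a swapped-in technique: ZKRMStateStore gets its single-writer guarantee from ZooKeeper itself, whereas the sketch below uses a monotonically increasing epoch. FencedStateStore and recover are hypothetical names, and counting attempts per application is a simplification of the "new attempt per application" behaviour described above.

```python
class FencedStateStore:
    """Toy state store that admits writes from only one RM at a time.

    Fencing uses an increasing epoch (a fencing token); this is an
    illustrative substitute, not how ZKRMStateStore is built.
    """

    def __init__(self):
        self.epoch = 0
        self.apps = {}  # application id -> number of attempts started so far

    def claim(self):
        """Called by an RM as it becomes Active; returns its new epoch."""
        self.epoch += 1
        return self.epoch

    def store_app(self, epoch, app_id, attempts):
        if epoch != self.epoch:
            # A stale, formerly Active RM is fenced off: no split-brain writes.
            raise PermissionError(f"epoch {epoch} is fenced")
        self.apps[app_id] = attempts


def recover(store, epoch):
    """New Active RM: reload stored apps and start one new attempt for each."""
    for app_id, attempts in list(store.apps.items()):
        store.store_app(epoch, app_id, attempts + 1)
    return dict(store.apps)
```

Because every write carries the writer's epoch, an old Active that still believes it is in charge cannot overwrite state once a newer RM has claimed the store, which is the split-brain case the paragraph above says ZKRMStateStore handles without extra fencing.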

Original: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html

ResourceManager Restart: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html
