http://www.socc2013.org/home/program

http://www.ibm.com/developerworks/cn/opensource/os-cn-hadoop-yarn/

 

Problems with Hadoop V1.0

Hadoop was invented to index massive web crawls, so it fits that scenario well; but Hadoop is now used as a general-purpose compute platform, which goes well beyond its original design goals and scope.
As a general-purpose platform, Hadoop therefore has two main shortcomings: the programming model is tightly coupled to resource management, so nothing other than map/reduce can run; and centralized job control brings serious scalability problems.

1) tight coupling of a specific programming model with the resource management infrastructure, forcing developers to abuse the MapReduce programming model

2) centralized handling of jobs’ control flow, which resulted in endless scalability concerns for the scheduler

YARN decouples the programming model from resource management; MapReduce is just one of the applications running on top of YARN.
Switching to another programming model is also easy, e.g. Dryad, Giraph, Hoya, REEF, Spark, Storm and Tez.

More specifically, the problems are:

  1. The JobTracker is the single central coordinator of Map-Reduce, so it is a single point of failure.
  2. The JobTracker does too much work and consumes too many resources; when there are very many map-reduce jobs, the memory overhead becomes large and the risk of JobTracker failure grows. This is why the industry generally concluded that the old Hadoop Map-Reduce tops out at about 4,000 nodes.
  3. On the TaskTracker side, representing resources simply as a number of map/reduce tasks is too crude; it ignores CPU/memory usage, so if two memory-hungry tasks are scheduled onto the same node, an OOM is likely.
  4. On the TaskTracker side, resources are rigidly partitioned into map task slots and reduce task slots; when the system has only map tasks or only reduce tasks, resources are wasted, which is the cluster-utilization problem mentioned earlier.
  5. At the source-code level, the code is very hard to read: a single class often does too many things and runs to more than 3,000 lines, so class responsibilities are unclear, which makes bug fixing and version maintenance harder.
  6. From an operational point of view, any change to the Hadoop MapReduce framework, important or not (e.g. bug fixes, performance improvements, new features), forces a system-wide upgrade. Worse, regardless of users' preferences, it forces every user of the shared cluster to upgrade at the same time, and users then waste a lot of time verifying that their existing applications still work on the new Hadoop version.

 

YARN's three main roles

ResourceManager(RM)

The ResourceManager supports hierarchical application queues, each of which is guaranteed a share of the cluster's resources. In that sense it is a pure scheduler: it does not monitor or track application status during execution, and it does not restart tasks that fail because of application errors or hardware faults. The ResourceManager schedules purely on the applications' resource requirements; every application needs different kinds of resources and therefore asks for different containers. Resources include memory, CPU, disk, network, and so on. This is a clear break from the fixed-type (slot-based) resource model of the old MapReduce, which hurt cluster utilization. The ResourceManager exposes a pluggable scheduling policy that is responsible for dividing cluster resources among the queues and applications; the plug-in can be based on the existing capacity or fair scheduling models.
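
The scheduling-policy plug-in mentioned above is selected through configuration. A minimal sketch in Java, assuming a Hadoop 2.x classpath; in a real cluster this property normally lives in yarn-site.xml rather than in code, and the two class names shown are the stock CapacityScheduler and FairScheduler implementations:

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SchedulerPluginSketch {
    public static void main(String[] args) {
        YarnConfiguration conf = new YarnConfiguration();
        // Pluggable scheduler: the RM loads whatever class this property points to.
        // Normally set in yarn-site.xml; shown in code here purely for illustration.
        conf.set(YarnConfiguration.RM_SCHEDULER,
            "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");
        // Or the fair scheduler:
        // conf.set(YarnConfiguration.RM_SCHEDULER,
        //     "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        System.out.println("scheduler = " + conf.get(YarnConfiguration.RM_SCHEDULER));
    }
}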

The RM runs as a daemon on a dedicated machine, and acts as the central authority arbitrating resources among various competing applications in the cluster.
Given this central and global view of the cluster resources, it can enforce rich, familiar properties such as fairness, capacity, and locality across tenants.

1. Jobs are submitted to the RM via a public submission protocol and go through an admission control phase during which security credentials are validated and various
operational and administrative checks are performed.

2. Once the scheduler has enough resources, the application is moved from accepted to running state. Aside from internal bookkeeping, this involves allocating a container for the AM and spawning it on a node in the cluster.

3. A record of accepted applications is written to persistent storage and recovered in case of RM restart or failure.
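
A minimal client-side sketch of the submission flow above, using the public YarnClient API (Hadoop 2.x). The application name, queue, and the ./run_am.sh launch command are made-up placeholders; a real client would also package the AM's jars as local resources and set up security tokens:

import java.util.Collections;
import java.util.HashMap;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitAppSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the RM for a new application id; admission control (credential and
        // administrative checks) happens when the application is actually submitted.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");              // placeholder name
        ctx.setQueue("default");
        ctx.setPriority(Priority.newInstance(0));
        ctx.setResource(Resource.newInstance(1024, 1));  // container for the AM: 1 GB, 1 vcore

        // CLC telling the chosen NM how to spawn the ApplicationMaster process.
        ContainerLaunchContext amClc = ContainerLaunchContext.newInstance(
                new HashMap<String, LocalResource>(),    // dependencies (jars, files) omitted here
                new HashMap<String, String>(),           // environment variables
                Collections.singletonList("./run_am.sh 1>stdout 2>stderr"), // hypothetical command
                null, null, null);                       // service data, tokens, ACLs
        ctx.setAMContainerSpec(amClc);

        // Accepted applications are recorded persistently by the RM, then scheduled:
        // once resources are available the AM container is allocated and spawned.
        ApplicationId appId = yarnClient.submitApplication(ctx);
        System.out.println("Submitted " + appId);
    }
}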

 

NodeManager (NM)

The NodeManager is the per-machine agent of the framework. It is responsible for the containers in which applications run, monitors their resource usage (CPU, memory, disk, network), and reports it to the scheduler via heartbeats.

NMs are responsible for monitoring resource availability, reporting faults, and container lifecycle management (e.g., starting, killing).
Communications between the RM and NMs are heartbeat-based for scalability.

 

ApplicationMaster (AM)

Each application's ApplicationMaster is responsible for asking the scheduler for appropriate resource containers, running tasks, tracking the application's status, monitoring task progress, and handling task failures.
In YARN the AM is itself a special kind of container. It is not a global daemon: one AM is created per job, dedicated to managing that job's entire lifecycle. AMs are usually implemented on top of an existing high-level programming framework such as MapReduce.

The ApplicationMaster is the “head” of a job, managing all lifecycle aspects including dynamically increasing and decreasing resources consumption, managing the flow of execution (e.g., running reducers against the output of maps), handling faults and computation skew, and performing other local optimizations.

AM can run arbitrary user code, and can be written in any programming language since all communication with the RM and NM is encoded using extensible communication protocols(ex, protobuf).
In practice, though, we expect most jobs to use a higher-level programming framework (e.g., MapReduce, Dryad, Tez, REEF, etc.).

 

 

 

Resource Manager (RM)

What does the RM do?
The ResourceManager handles only live resource scheduling, which helps YARN's central components scale far beyond the Hadoop 1.0 JobTracker.

What does the RM not do?

ResourceManager is not responsible for:
1. coordinating application execution or task fault-tolerance
2. providing status or metrics for running applications (now part of the ApplicationMaster)
3. serving framework specific reports of completed jobs (now delegated to a per-framework daemon)

Communication between the RM and the AM

The RM has to communicate with clients, AMs, and NMs, so it exposes corresponding interfaces. Of these, the RM-AM communication is the most important, so it is discussed in more detail here.

The ResourceManager exposes two public interfaces:
1) to clients submitting applications, and
2) to ApplicationMaster(s) dynamically negotiating access to resources;
plus one internal interface towards NodeManagers for cluster monitoring and resource access management.

 

Communication between the AM and the RM is in terms of ResourceRequests: the AM uses an RR to tell the RM what kind of resources it needs.
The following request attributes are currently supported (the set is designed to be extensible); see the sketch after the list.
The RM can also use RRs to take resources back from an AM, for example when the cluster is under pressure and the RM has to reconsider earlier allocations.

ApplicationMasters codify their need for resources in terms of one or more ResourceRequests, each of which tracks:
1. number of containers (e.g., 200 containers),
2. resources per container (e.g., <2 GB RAM, 1 CPU>),
3. locality preferences, and
4. priority of requests within the application
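
A hedged sketch of how these four attributes look in the AMRMClient helper library. The host and rack names are invented; note that in this API the "number of containers" is expressed by adding the same request the desired number of times:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ResourceRequestSketch {
    // Queue 200 identical asks on an already started AMRMClient.
    static void askForContainers(AMRMClient<ContainerRequest> amRmClient) {
        Resource capability = Resource.newInstance(2048, 1);  // <2 GB RAM, 1 CPU> per container
        Priority priority = Priority.newInstance(1);          // priority within this application
        String[] nodes = { "node17.example.com" };            // locality preference (hypothetical host)
        String[] racks = { "/default-rack" };                 // rack preference

        ContainerRequest ask = new ContainerRequest(capability, nodes, racks, priority);
        for (int i = 0; i < 200; i++) {                       // "200 containers"
            amRmClient.addContainerRequest(ask);
        }
    }
}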

 

Application Master (AM)

The ApplicationMaster is the process that coordinates the application’s execution in the cluster, but it itself is run in the cluster just like any other container.
A component of the RM negotiates for the container to spawn this bootstrap process.

1. The AM periodically heartbeats to the RM to affirm its liveness and to update the record of its demand. AM encodes its preferences and constraints in a heartbeat message to the RM.

2. In response to subsequent heartbeats, the AM will receive a container lease on bundles of resources bound to a particular node in the cluster.

3. Based on the containers it receives from the RM, the AM may update its execution plan to accommodate perceived abundance or scarcity.

Since the RM does not interpret the container status, the AM determines the semantics of the success or failure of the container exit status reported by NMs through the RM.

Since the AM is itself a container running in a cluster of unreliable hardware, it should be resilient to failure.
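
A rough sketch of that heartbeat cycle using the synchronous AMRMClient; registration arguments, progress reporting and the exit condition are simplified placeholders:

import java.util.List;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmHeartbeatSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new YarnConfiguration());
        rm.start();
        rm.registerApplicationMaster("", 0, "");  // host/port/tracking URL left empty in this sketch

        boolean workRemaining = true;
        while (workRemaining) {
            // Each allocate() call doubles as a heartbeat: it carries the AM's current
            // asks and progress, and returns whatever containers were granted since
            // the previous call (possibly none).
            AllocateResponse response = rm.allocate(0.5f /* reported progress */);
            List<Container> granted = response.getAllocatedContainers();
            for (Container c : granted) {
                // React to the new leases: launch tasks, or revise the execution plan.
                System.out.println("lease on " + c.getNodeId() + " -> " + c.getResource());
            }
            Thread.sleep(1000);                  // heartbeat interval, placeholder
            workRemaining = decideIfMoreWork();  // application-specific; stubbed below
        }
    }

    static boolean decideIfMoreWork() { return false; }  // stub so the sketch terminates
}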

 

Node Manager (NM)

The NodeManager is the “worker” daemon in YARN.
It authenticates container leases, manages containers’ dependencies, monitors their execution, and provides a set of services to containers.

All containers in YARN, including AMs, are described by a container launch context (CLC).

This record includes a map of environment variables, dependencies stored in remotely accessible storage, security tokens, payloads for NM services, and the command necessary to create the process.

After validating the authenticity of the lease, the NM configures the environment for the container, including initializing its monitoring subsystem with the resource constraints specified in the lease.

To launch the container, the NM copies all the necessary dependencies (data files, executables, tarballs) to local storage.

The NM eventually garbage collects dependencies not in use by running containers.
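
A sketch of what an AM might do with a granted container: build a CLC (environment and launch command; dependencies omitted here) and hand it to the container's NM through the NMClient helper. The environment variable and echo command are placeholders:

import java.util.Collections;
import java.util.HashMap;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LaunchContainerSketch {
    static void launch(NMClient nmClient, Container container) throws Exception {
        HashMap<String, String> env = new HashMap<String, String>();
        env.put("TASK_INDEX", "0");  // hypothetical variable the task would read

        // Dependencies (jars, data files, tarballs) would be added to this map as
        // LocalResource entries; the NM downloads them to local storage before launch
        // and garbage-collects them once no running container uses them.
        HashMap<String, LocalResource> localResources = new HashMap<String, LocalResource>();

        ContainerLaunchContext clc = ContainerLaunchContext.newInstance(
                localResources, env,
                Collections.singletonList("echo hello from " + container.getId()),
                null, null, null);

        // The NM validates the lease (token) carried by the Container object,
        // sets up the environment and monitoring, then spawns the process.
        nmClient.startContainer(container, clc);
    }

    static NMClient newNmClient() {
        NMClient nm = NMClient.createNMClient();
        nm.init(new YarnConfiguration());
        nm.start();
        return nm;
    }
}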

 

Container Killing

The NM will also kill containers as directed by the RM or the AM.

a. Containers may be killed when the RM reports their owning application as completed, or when the scheduler decides to evict them for another tenant;
b. when the NM detects that a container has exceeded the limits of its lease;
c. AMs may also request that containers be killed when the corresponding work is no longer needed.

Whenever a container exits, the NM will clean up its working directory in local storage. When an application completes, all resources owned by its containers are discarded on all nodes, including any of its processes still running in the cluster.
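
A sketch of the two kill paths visible to an AM, using the AMRMClient/NMClient helpers: handing an allocated container back to the RM's scheduler, or asking the NM directly to stop one that is already running:

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;

public class KillContainerSketch {
    // The work is no longer needed: give the lease back to the RM's scheduler.
    static void release(AMRMClient<ContainerRequest> rm, Container c) {
        rm.releaseAssignedContainer(c.getId());
    }

    // The container is already running: ask its NM to kill the process.
    // The NM then cleans up the container's working directory in local storage.
    static void stop(NMClient nm, Container c) throws Exception {
        nm.stopContainer(c.getId(), c.getNodeId());
    }
}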

 

NM Local Monitoring

The NM also periodically monitors the health of the physical node: it watches for problems with the local disks and frequently runs an admin-configured script that can surface other hardware/software issues. When such an issue is discovered, the NM marks itself unhealthy and reports this to the RM, which then makes a scheduler-specific decision to kill the containers on that node and/or to stop future allocations to it until the health issue is resolved.

 

YARN framework/application writers

From the preceding description of the core architecture, we extract the responsibilities of a YARN application author:

1. Submitting the application by passing a CLC for the ApplicationMaster to the RM.
2. When the RM starts the AM, the AM should register with the RM and periodically advertise its liveness and requirements over the heartbeat protocol.
3. Once the RM allocates a container, the AM can construct a CLC to launch the container on the corresponding NM. It may also monitor the status of the running container and stop it when the resource should be reclaimed. Monitoring the progress of work done inside the container is strictly the AM's responsibility.
4. Once the AM is done with its work, it should unregister from the RM and exit cleanly, as in the sketch after this list.
5. Optionally, framework authors may add control flow between their own clients to report job status and expose a control plane.
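
Steps 2 and 4, expressed as AMRMClient calls; the final status and message are placeholders, and the asks and container launches that would happen between the two calls are shown in the earlier sketches:

import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmLifecycleSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
        rm.init(new YarnConfiguration());
        rm.start();

        // Step 2: register; from now on, allocate() heartbeats advertise liveness.
        rm.registerApplicationMaster("", 0, "");

        // ... request containers, launch them via NMClient, track the work ...

        // Step 4: unregister and exit cleanly so the RM marks the application finished.
        rm.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED,
                "all tasks completed",  // placeholder diagnostic message
                "");                    // tracking URL (none in this sketch)
        rm.stop();
    }
}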

 

Fault tolerance and availability

RM failover: still a single point of failure

At the time of this writing, the RM remains a single point of failure in YARN’s architecture.

The RM recovers from its own failures by restoring its state from a persistent store on initialization.
Once the recovery process is complete, it kills all the containers running in the cluster, including live ApplicationMasters. It then launches new instances of each AM.

NM Failover

When a NM fails, the RM detects it by timing out its heartbeat response, marks all the containers running on that node as killed, and reports the failure to all running AMs.
If the fault is transient, the NM will re-synchronize with the RM, clean up its local state, and continue.
In both cases, AMs are responsible for reacting to node failures, potentially redoing work done by any containers running on that node during the fault.

AM Failover

Since the AM runs in the cluster, its failure does not affect the availability of the cluster, but the probability of an application hiccup due to AM failure is higher than in Hadoop 1.x.
The RM may restart the AM if it fails, though the platform offers no support to restore the AM's state.

Container Failure

The failure handling of the containers themselves is completely left to the frameworks.
The RM collects all container exit events from the NMs and propagates those to the corresponding AMs in a heartbeat response.
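
A sketch of how an AM might interpret those exit events from a heartbeat response; which non-zero codes mean "retry" versus "give up" is entirely framework policy, so the handling below is illustrative only:

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

public class ContainerExitSketch {
    static void handleCompleted(AllocateResponse response) {
        for (ContainerStatus status : response.getCompletedContainersStatuses()) {
            int exit = status.getExitStatus();
            if (exit == ContainerExitStatus.SUCCESS) {
                // Mark the corresponding task as done.
            } else if (exit == ContainerExitStatus.ABORTED) {
                // Killed by the RM/NM (e.g. node lost or lease exceeded): reschedule the work.
            } else {
                // Any other code is an application-defined failure; this AM just logs it.
                System.err.println("container " + status.getContainerId()
                        + " failed: " + status.getDiagnostics());
            }
        }
    }
}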

 

Mesos vs. YARN

While Mesos and YARN both have schedulers at two levels, there are two very significant differences.

First, Mesos is an offer-based resource manager, whereas YARN has a request-based approach.
YARN allows the AM to ask for resources based on various criteria, including location, and lets the requester modify future requests based on what was granted and on current usage.
This approach was necessary to support location-based allocation.

Second, instead of a per-job intra-framework scheduler, Mesos leverages a pool of central schedulers (e.g., classic Hadoop or MPI).
YARN enables late binding of containers to tasks, where each individual job can perform local optimizations, and seems more amenable to rolling upgrades (since each job can run on a different version of the framework). On the other hand, a per-job ApplicationMaster might result in greater overhead than the Mesos approach.

Two differences:
Mesos is offer-based: the master collects the resource offers available on the slaves and presents them to each computing framework, whose scheduler decides whether to accept the current offer.
YARN is request-based: the AM does not know whether enough resources exist; it simply sends the RM a resource request (carrying various selection criteria).

Mesos uses central schedulers, i.e. one scheduler per computing framework.
YARN creates one AM per job, which makes job-level local optimizations easy.

 

Hadoop V1.0 vs. YARN

First, the client side is unchanged: its APIs and interfaces remain largely compatible, so the change is transparent to developers and they do not need to make big changes to existing code.

What advantages does the YARN framework have over the old MapReduce framework? We can see:

  1. The design greatly reduces the resource consumption of the JobTracker (now the ResourceManager), and the code that monitors the status of each job's tasks is now distributed, which is safer and cleaner.
  2. In the new YARN the ApplicationMaster is a replaceable component: users can write their own AppMaster for different programming models, so many more kinds of programming models can run on a Hadoop cluster; see the mapred-site.xml settings in the official Hadoop YARN configuration templates.
  3. Resources are expressed in units of memory (in the YARN version current at the time, CPU usage was not yet taken into account), which is more reasonable than counting remaining slots.
  4. In the old framework, a large part of the JobTracker's burden was monitoring the running state of the tasks under each job; that work is now handed to the ApplicationMaster, and the ResourceManager contains a module called the ApplicationsManager (note: not ApplicationMaster) that monitors each ApplicationMaster's health and restarts it on another machine if it fails.
  5. Container is the framework YARN introduced with future resource isolation in mind, apparently borrowing from Mesos. At the time it only provided isolation of Java virtual machine memory, but the Hadoop team's design clearly leaves room for scheduling and controlling more resource types later. And since resources are expressed as an amount of memory, the old awkward situation where separate map slots and reduce slots left cluster resources idle disappears.
