As part of HDP 2.0 Beta, YARN takes the resource management capabilities that were in MapReduce and packages them so they can be used by new engines. This also streamlines MapReduce to do what it does best: process data. With YARN, you can now run multiple applications in Hadoop, all sharing common resource management.

In this blog post we’ll walk through how to plan for and configure processing capacity in your enterprise HDP 2.0 cluster deployment. This will cover YARN and MapReduce 2. We’ll use an example physical cluster of slave nodes, each with 48 GB of RAM, 12 disks and 2 hex-core CPUs (12 total cores).

YARN takes into account all the available compute resources on each machine in the cluster. Based on the available resources, YARN will negotiate resource requests from applications (such as MapReduce) running in the cluster. YARN then provides processing capacity to each application by allocating Containers. A Container is the basic unit of processing capacity in YARN, and is an encapsulation of resource elements (memory, CPU, etc.).

Configuring YARN

In a Hadoop cluster, it’s vital to balance the usage of RAM, CPU and disk so that processing is not constrained by any one of these cluster resources. As a general recommendation, we’ve found that allowing for 1-2 Containers per disk and per core gives the best balance for cluster utilization. So with our example cluster node with 12 disks and 12 cores, we will allow for 20 maximum Containers to be allocated to each node.

Each machine in our cluster has 48 GB of RAM. Some of this RAM should be reserved for Operating System usage. On each node, we’ll assign 40 GB RAM for YARN to use and keep 8 GB for the Operating System. The following property sets the maximum memory YARN can utilize on the node:

In yarn-site.xml:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>40960</value>
</property>

The next step is to provide YARN guidance on how to break up the total resources available into Containers. You do this by specifying the minimum unit of RAM to allocate for a Container. We want to allow for a maximum of 20 Containers, and thus need (40 GB total RAM) / (20 Containers) = 2 GB minimum per Container:

In yarn-site.xml:

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
</property>
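The arithmetic behind these two yarn-site.xml values can be checked with a short Python sketch. It uses only the numbers from our example node; the variable names are ours, chosen for illustration, and are not Hadoop settings.

```python
# Example node from this post: 48 GB RAM, 8 GB kept for the OS,
# and a target of at most 20 Containers per node.
node_ram_gb = 48
os_reserved_gb = 8
max_containers = 20

# Memory handed to YARN, in MB (yarn.nodemanager.resource.memory-mb).
yarn_memory_mb = (node_ram_gb - os_reserved_gb) * 1024

# Smallest Container size, in MB (yarn.scheduler.minimum-allocation-mb).
min_allocation_mb = yarn_memory_mb // max_containers

print(yarn_memory_mb)
print(min_allocation_mb)
```

Changing the reserved memory or the Container target for a different node type gives the corresponding pair of values.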

YARN will allocate Containers with RAM amounts of at least yarn.scheduler.minimum-allocation-mb.
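To get a feel for what the minimum allocation means for a request, here is a rough Python sketch. It assumes the scheduler rounds each memory request up to the next multiple of the minimum allocation, which is our understanding of how the Capacity Scheduler normalizes requests; other schedulers may use a different increment. The helper normalize_request_mb is hypothetical and not part of any Hadoop API.

```python
import math

def normalize_request_mb(requested_mb, min_allocation_mb=2048):
    """Hypothetical helper: round a memory request up to a multiple of the minimum.

    Assumption: requests are normalized in steps of the minimum allocation.
    """
    units = max(1, math.ceil(requested_mb / min_allocation_mb))
    return units * min_allocation_mb

# A few sample requests against the 2 GB minimum from our example cluster.
for requested in (1024, 3000, 4096):
    print(requested, normalize_request_mb(requested))
```

Under that assumption, it pays to pick task memory sizes that are exact multiples of the minimum, so no Container memory is rounded away.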

Configuring MapReduce 2

MapReduce 2 runs on top of YARN and utilizes YARN Containers to schedule and execute its map and reduce tasks.

When configuring MapReduce 2 resource utilization on YARN, there are three aspects to consider:

  1. The physical RAM limit for each Map and Reduce task
  2. The JVM heap size limit for each task
  3. The amount of virtual memory each task will get

You can define the maximum amount of memory each Map and Reduce task will take. Since each Map and each Reduce task runs in a separate Container, these maximum memory settings should be at least equal to the YARN minimum Container allocation.

For our example cluster, the minimum RAM for a Container (yarn.scheduler.minimum-allocation-mb) is 2 GB. We’ll thus assign 4 GB for Map task Containers and 8 GB for Reduce task Containers.

In mapred-site.xml:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>8192</value>
</property>

Each Container will run a JVM for its Map or Reduce task. The JVM heap size should be set lower than the Map and Reduce memory defined above, so that the JVM stays within the bounds of the Container memory allocated by YARN.

In mapred-site.xml:

<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3072m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx6144m</value>
</property>
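In both cases the heap is 75% of the Container memory (3072/4096 and 6144/8192). A minimal Python sketch of that relationship follows. The 0.75 fraction is simply what our example values work out to, not a fixed rule, and the helper heap_opts is hypothetical.

```python
def heap_opts(container_mb, heap_fraction=0.75):
    """Hypothetical helper: build a -Xmx option sized as a fraction of the Container.

    Assumption: the rest of the Container is left for non-heap JVM memory.
    """
    heap_mb = int(container_mb * heap_fraction)
    return f"-Xmx{heap_mb}m"

print(heap_opts(4096))  # candidate value for mapreduce.map.java.opts
print(heap_opts(8192))  # candidate value for mapreduce.reduce.java.opts
```

The gap between the heap and the Container size is there because a JVM uses memory beyond its heap, so a heap equal to the Container size would risk exceeding the Container limit.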

The above settings configure the upper limit of the physical RAM that Map and Reduce tasks will use. The virtual memory (physical + paged memory) upper limit for each Map and Reduce task is determined by the virtual memory ratio each YARN Container is allowed. This is set by the following configuration, and the default value is 2.1:

In yarn-site.xml:

<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>

Thus, with the above settings on our example cluster, each Map task will get the following memory allocations:

  • Total physical RAM allocated = 4 GB
  • JVM heap space upper limit within the Map task Container = 3 GB
  • Virtual memory upper limit = 4 * 2.1 = 8.4 GB
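The virtual memory limit is just the Container's physical memory multiplied by yarn.nodemanager.vmem-pmem-ratio. The Python sketch below applies that multiplication to the Map and Reduce sizes from our example; the function name is ours and is only for illustration.

```python
def vmem_limit_gb(container_mb, vmem_pmem_ratio=2.1):
    """Virtual memory ceiling in GB for a Container of the given physical size."""
    return container_mb * vmem_pmem_ratio / 1024

map_vmem_gb = vmem_limit_gb(4096)     # Map task Container (4 GB physical)
reduce_vmem_gb = vmem_limit_gb(8192)  # Reduce task Container (8 GB physical)

print(round(map_vmem_gb, 1))
print(round(reduce_vmem_gb, 1))
```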

With YARN and MapReduce 2, there are no longer pre-configured static slots for Map and Reduce tasks. The entire cluster is available for dynamic resource allocation of Maps and Reduces as needed by the job. In our example cluster, with the above configurations, YARN will be able to allocate on each node up to 10 mappers (40/4), or 5 reducers (40/8), or any combination of the two that fits within the 40 GB.
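To make the possible combinations concrete, the Python sketch below lists, for each number of reducers on a node, the most mappers that still fit in the remaining memory. It is a simplification: it looks at memory only and ignores the Container used by the MapReduce ApplicationMaster. The function max_task_mixes is a hypothetical helper.

```python
def max_task_mixes(node_mb=40960, map_mb=4096, reduce_mb=8192):
    """Hypothetical helper: list (mappers, reducers) pairs that fill one node.

    For each possible reducer count, compute how many mappers fit
    in the memory that is left over.
    """
    mixes = []
    for reducers in range(node_mb // reduce_mb + 1):
        remaining_mb = node_mb - reducers * reduce_mb
        mixes.append((remaining_mb // map_mb, reducers))
    return mixes

for mappers, reducers in max_task_mixes():
    print(f"{mappers} mappers + {reducers} reducers")
```

Every pair stays under the 20-Container target per node, since these Containers are larger than the 2 GB minimum.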
