As part of HDP 2.0 Beta, YARN takes the resource management capabilities that were in MapReduce and packages them so they can be used by new engines.  This also streamlines MapReduce to do what it does best, process data.  With YARN, you can now run multiple applications in Hadoop, all sharing a common resource management.

In this blog post we’ll walk through how to plan for and configure processing capacity in your enterprise HDP 2.0 cluster deployment. This will cover YARN and MapReduce 2. We’ll use an example physical cluster of slave nodes each with 48 GB ram, 12 disks and 2 hex core CPUs (12 total cores).

YARN takes into account all the available compute resources on each machine in the cluster. Based on the available resources, YARN will negotiate resource requests from applications (such as MapReduce) running in the cluster. YARN then provides processing capacity to each application by allocating Containers. A Container is the basic unit of processing capacity in YARN, and is an encapsulation of resource elements (memory, cpu etc.).

Configuring YARN

In a Hadoop cluster, it’s vital to balance the usage of RAM, CPU and disk so that processing is not constrained by any one of these cluster resources. As a general recommendation, we’ve found that allowing for 1-2 Containers per disk and per core gives the best balance for cluster utilization. So with our example cluster node with 12 disks and 12 cores, we will allow for 20 maximum Containers to be allocated to each node.

Each machine in our cluster has 48 GB of RAM. Some of this RAM should be reserved for Operating System usage. On each node, we’ll assign 40 GB RAM for YARN to use and keep 8 GB for the Operating System. The following property sets the maximum memory YARN can utilize on the node:

In yarn-site.xml

<name>yarn.nodemanager.resource.memory-mb</name>
<value>40960</value>

The next step is to provide YARN guidance on how to break up the total resources available into Containers. You do this by specifying the minimum unit of RAM to allocate for a Container. We want to allow for a maximum of 20 Containers, and thus need (40 GB total RAM) / (20 # of Containers) = 2 GB minimum per container:

In yarn-site.xml

 <name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>

YARN will allocate Containers with RAM amounts greater than the yarn.scheduler.minimum-allocation-mb.

Configuring MapReduce 2

MapReduce 2 runs on top of YARN and utilizes YARN Containers to schedule and execute its map and reduce tasks.

When configuring MapReduce 2 resource utilization on YARN, there are three aspects to consider:

  1. Physical RAM limit for each Map And Reduce task
  2. The JVM heap size limit for each task
  3. The amount of virtual memory each task will get

You can define how much maximum memory each Map and Reduce task will take. Since each Map and each Reduce will run in a separate Container, these maximum memory settings should be at least equal to or more than the YARN minimum Container allocation.

For our example cluster, we have the minimum RAM for a Container (yarn.scheduler.minimum-allocation-mb) = 2 GB. We’ll thus assign 4 GB for Map task Containers, and 8 GB for Reduce tasks Containers.

In mapred-site.xml:

 <name>mapreduce.map.memory.mb</name>
<value>4096</value>
<name>mapreduce.reduce.memory.mb</name>
<value>8192</value>

Each Container will run JVMs for the Map and Reduce tasks. The JVM heap size should be set to lower than the Map and Reduce memory defined above, so that they are within the bounds of the Container memory allocated by YARN.

In mapred-site.xml:

 <name>mapreduce.map.java.opts</name>
<value>-Xmx3072m</value>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx6144m</value>

The above settings configure the upper limit of the physical RAM that Map and Reduce tasks will use. The virtual memory (physical + paged memory) upper limit for each Map and Reduce task is determined by the virtual memory ratio each YARN Container is allowed. This is set by the following configuration, and the default value is 2.1:

In yarn-site.xml:

 <name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>

Thus, with the above settings on our example cluster, each Map task will get the following memory allocations with the following:

  • Total physical RAM allocated = 4 GB
  • JVM heap space upper limit within the Map task Container = 3 GB
  • Virtual memory upper limit = 4*2.1 = 8.2 GB

With YARN and MapReduce 2, there are no longer pre-configured static slots for Map and Reduce tasks. The entire cluster is available for dynamic resource allocation of Maps and Reduces as needed by the job. In our example cluster, with the above configurations, YARN will be able to allocate on each node up to 10 mappers (40/4) or 5 reducers (40/8) or a permutation within that.

How to Plan and Configure YARN and MapReduce 2 in HDP 2.0的更多相关文章

  1. Wordcount on YARN 一个MapReduce示例

    Hadoop YARN版本:2.2.0 关于hadoop yarn的环境搭建可以参考这篇博文:Hadoop 2.0安装以及不停集群加datanode hadoop hdfs yarn伪分布式运行,有如 ...

  2. 怎样通过Java程序提交yarn的mapreduce计算任务

    因为项目需求,须要通过Java程序提交Yarn的MapReduce的计算任务.与一般的通过Jar包提交MapReduce任务不同,通过程序提交MapReduce任务须要有点小变动.详见下面代码. 下面 ...

  3. Determine YARN and MapReduce Memory Configuration Settings

    Determine YARN and MapReduce Memory Configuration Settings https://docs.hortonworks.com/HDPDocuments ...

  4. 3.Hadoop测试Yarn和MapReduce

    Hadoop测试Yarn和MapReduce 1.配置Yarn (1)配置ResourceManager 生产环境中,一般是重开一台机器作为ResourceManager,这里我们以Master机器代 ...

  5. YARN和MapReduce的内存设置参考

    如何确定Yarn中容器Container,Mapreduce相关参数的内存设置,对于初始集群,由于不知道集群的类型(如cpu密集.内存密集)我们需要根据经验提供给我们一个参考配置值,来作为基础的配置. ...

  6. YARN和MapReduce的内存设置參考

    怎样确定Yarn中容器Container,Mapreduce相关參数的内存设置,对于初始集群,由于不知道集群的类型(如cpu密集.内存密集)我们须要依据经验提供给我们一个參考配置值,来作为基础的配置. ...

  7. Hadoop2 使用 YARN 运行 MapReduce 的过程源码分析

    Hadoop 使用 YARN 运行 MapReduce 的过程如下图所示: 总共分为11步. 这里以 WordCount 为例, 我们在客户端终端提交作业: # 把本地的 /home/hadoop/t ...

  8. 经典MapReduce作业和Yarn上MapReduce作业运行机制

    一.经典MapReduce的作业运行机制 如下图是经典MapReduce作业的工作原理: 1.1 经典MapReduce作业的实体 经典MapReduce作业运行过程包含的实体: 客户端,提交MapR ...

  9. 大数据系列4:Yarn以及MapReduce 2

    系列文章: 大数据系列:一文初识Hdfs 大数据系列2:Hdfs的读写操作 大数据谢列3:Hdfs的HA实现 通过前文,我们对Hdfs的已经有了一定的了解,本文将继续之前的内容,介绍Yarn与Yarn ...

随机推荐

  1. Nutch2.1+solr3.6.1+mysql5.6问题

    1.Nutch2.1问题 1.1 问题:导入完成后,Nutch2.1里面runtime仍旧不能运行,出现jobfailed等错误. 解决:runtime里的nutch调试过程和导入Eclipse差不多 ...

  2. javascript单一复制粘贴

    <div id="copy-txt">内容内容</div> <button type="submit" onclick=" ...

  3. BZOJ 3625:小朋友和二叉树 多项式开根+多项式求逆+生成函数

    生成函数这个东西太好用了~ code: #include <bits/stdc++.h> #define ll long long #define setIO(s) freopen(s&q ...

  4. shell脚本的参数传递使用

    1.params.sh源码如下 #!/bin/bash # author:daokr # url:www.daokr.com echo "Shell 传递参数实例!"; echo ...

  5. Luogu P1903 BZOJ 2120 数颜色 带修改的莫队

    https://www.luogu.org/problemnew/show/P1903 之前切过这道题,复习莫队再切一遍,不过我之前写的是主席树和树状数组,也不知道我当时怎么想的…… 这个题卡常我没写 ...

  6. 2017.10.6 国庆清北 D6T1 排序

    题目描述 小Z 有一个数字序列a1; a2; .... ; an,长度为n,小Z 只有一个操作:选 定p(1<p<n),然后把ap 从序列中拿出,然后再插⼊到序列中任意位置. 比如a 序列 ...

  7. Pytorch在colab和kaggle中使用TensorBoard/TensorboardX可视化

    在colab和kaggle内核的Jupyter notebook中如何可视化深度学习模型的参数对于我们分析模型具有很大的意义,相比tensorflow, pytorch缺乏一些的可视化生态包,但是幸好 ...

  8. copy()函数技术推演

    /*** str_copy.c ***/ #include<stdio.h> void copy_str21(char *from, char *to) { for(; *from != ...

  9. golang orm

    package main import ( "fmt" "github.com/astaxie/beego/orm" _"github.com/go- ...

  10. vue+element 表格formatter数据格式化并且插入html标签

    前言 vue中 element框架,其中表格组件,我既要行内数据格式化,又要插入html标签 一贯思维,二者不可兼得也 一.element 表格 数据格式化 demo <el-table-col ...