1. What is the recommended value for "yarn.nodemanager.local-dirs"?

We only have one value (directory) configured for the above property, which has a size of 200GB.

Our Hive jobs' map/reduce tasks fill this folder up, and YARN then places the node on the blocklist. Moving to the Tez engine and/or increasing the quota may fix this, but we'd like to know the recommended value.

Best answer

Answer by Sourygna Luangsay · October 28, 2015, 08:04

If you use the same partitions for YARN intermediate data as for the HDFS blocks, then you might also consider setting the dfs.datanode.du.reserved property, which reserves some space on those partitions for non-HDFS use (such as intermediate YARN data).

One base recommendation I saw in my first Hadoop training a long time ago was to dedicate 25% of the "data disks" to that kind of intermediate data.
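
A minimal sketch of the reserved-space setting, assuming stock Apache Hadoop, where the property is named dfs.datanode.du.reserved, is set in hdfs-site.xml, and takes a value in bytes per volume. The 50 GB figure is only an illustration (25% of a hypothetical 200 GB data disk), not a value given in the answer.

```xml
<!-- hdfs-site.xml: illustrative value only, adjust to your own disk size -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- 50 GB per volume = 50 * 1024^3 bytes = 53687091200 -->
  <value>53687091200</value>
</property>
```

Because the reservation applies to each volume separately, the total space kept free on a node grows with the number of data disks.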

I guess the optimal answer should consider the maximum amount of intermediate data you can have at any one time (when launching a job, do you use all the data in HDFS as input?) and dedicate the space for yarn.nodemanager.local-dirs accordingly.

I would also recommend turning on the property mapreduce.map.output.compress in order to reduce the size of the intermediate data.
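
As a rough sketch, map output compression could be enabled in mapred-site.xml as shown here. The choice of the Snappy codec is an assumption for illustration (it needs the native Snappy library on the nodes); the answer itself only names mapreduce.map.output.compress.

```xml
<!-- mapred-site.xml: compress intermediate map output -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <!-- Codec choice is an assumption; any installed codec can be used -->
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```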

 

Answer by Jean-Philippe Player · October 27, 2015, 20:58

You would assign one folder to each of the datanode disks, closely mapping dfs.datanode.data.dir. On a 12 disk system you would have 12 yarn local-dir locations.
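
A sketch of that one-directory-per-disk pairing, shortened to three disks. The mount points /grid/0 to /grid/2 are hypothetical and only serve to show how the two lists line up.

```xml
<!-- hdfs-site.xml: one HDFS data directory per disk (hypothetical mount points) -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/grid/0/hdfs/data,/grid/1/hdfs/data,/grid/2/hdfs/data</value>
</property>

<!-- yarn-site.xml: one YARN local directory on each of the same disks -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/grid/0/yarn/local,/grid/1/yarn/local,/grid/2/yarn/local</value>
</property>
```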

2. Though DataFlow can be used with an out-of-the-box Hadoop installation, there are a couple of configuration properties which may improve DataFlow/Hadoop performance.

Resolution

Using the O/S file system (e.g. /tmp or /var) can be problematic, especially if any applications log a lot of information or require large local files. Two properties help overcome this bottleneck.
The first is yarn.nodemanager.local-dirs. This setting specifies the base directories for the containers that run within YARN.

For each application and container created in YARN, a set of directories will be created underneath these local directories. These are then cleaned up when the application completes.
 
Here’s the setting from the yarn-site.xml file on one of our clusters. Note we have eight data disks per node on these clusters and create a directory for YARN on each data filesystem.

<property>
<name>yarn.nodemanager.local-dirs</name>
<value>
/hadoop/hdfs/data1/hadoop/yarn/local,/hadoop/hdfs/data2/hadoop/yarn/local,/hadoop/hdfs/data3/hadoop/yarn/local,/hadoop/hdfs/data4/hadoop/yarn/local,/hadoop/hdfs/data5/hadoop/yarn/local,/hadoop/hdfs/data6/hadoop/yarn/local,/hadoop/hdfs/data7/hadoop/yarn/local,/hadoop/hdfs/data8/hadoop/yarn/local
</value>
<source>yarn-site.xml</source>
</property>

The second is yarn.nodemanager.log-dirs. Much like the local-dirs property, this setting specifies where container log files should go on the local disk. YARN spreads the load around if you specify multiple directories.
And here’s a sample setting:

<property>
<name>yarn.nodemanager.log-dirs</name>
<value>
/hadoop/hdfs/data1/hadoop/yarn/log,/hadoop/hdfs/data2/hadoop/yarn/log,/hadoop/hdfs/data3/hadoop/yarn/log,/hadoop/hdfs/data4/hadoop/yarn/log,/hadoop/hdfs/data5/hadoop/yarn/log,/hadoop/hdfs/data6/hadoop/yarn/log,/hadoop/hdfs/data7/hadoop/yarn/log,/hadoop/hdfs/data8/hadoop/yarn/log
</value>
<source>yarn-site.xml</source>
</property>

Another YARN property you want to validate is yarn.nodemanager.resource.memory-mb. This setting specifies the amount of memory YARN is allowed to allocate per worker node.

YARN will only allocate this much memory in total to containers. So it’s important to set this to some value less than the physical memory per worker node.

HDP appears to pick 75% of the physical memory for this setting automatically; our machines have 16GB of RAM each.
Here’s an example:

<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value></value>
<source>yarn-site.xml</source>
</property>
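
The value is blank in the sample above. Purely as an illustration of the arithmetic, and not the value from the original cluster, 75% of 16 GB works out to 0.75 × 16384 MB = 12288 MB:

```xml
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <!-- 0.75 * 16384 MB = 12288 MB; illustrative only -->
  <value>12288</value>
</property>
```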

Of course, you can also consider using an NFS mount; related material follows.

3. How can I change yarn.nodemanager.local-dirs to point to file:/// (a high-performance NFS mount point)?

Hi, I'm trying to change "yarn.nodemanager.local-dirs" to point to "file:///fast_nfs/yarn/local". This is a high-performance NFS mount point that all the nodes in my cluster have.

When I try to change it in Ambari, I can't, and the message "Must be a slash or drive at the start, and must not contain white spaces" is displayed.

If I manually change /etc/hadoop/conf/yarn-site.xml on all the nodes, the "file:///" is removed from that option after YARN restarts.

I want to have all the shuffle happening in my high-performance NFS array instead of in HDFS.

How can I change this behaviour in HDP?

@Raul Pingarrón

The culprit is the "file:///" prefix. You should find a way to create a mount point at /fast_nfs/yarn/local and enter it as a plain path, hence the message "Must be a slash or drive ........". The value should look like the list below:

/hadoop/yarn/local,/opt/hadoop/yarn/local,/usr/hadoop/yarn/local,/var/hadoop/yarn/local
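
Applied to the question, a minimal sketch of the yarn-site.xml entry would use the plain path without the URI scheme. This assumes the NFS export is already mounted at /fast_nfs on every node; whether several NodeManagers can safely share one directory is not addressed in the answer.

```xml
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <!-- Plain absolute path, no file:/// prefix -->
  <value>/fast_nfs/yarn/local</value>
</property>
```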

Hope that helps

4. How to set yarn.nodemanager.local-dirs on M3 cluster to write to mapr fs

We are running a four-node M3 cluster with one node running NFS. We are getting the following error:

1/1 local-dirs are bad: /mapr/clustername/tmp/host_name on the nodes that do not have NFS running.

What is the best way to set this property in yarn-site.xml so that all nodes use the MapR FS /tmp as the default location instead of the local file system /tmp?

I believe the property "yarn.nodemanager.local-dirs" is meant to be a location on the local file system. It cannot be a location of the distributed file system (HDFS or MapR FS).

This property determines the location where the node manager maintains intermediate data (for example during the shuffle phase).

You can find the gory details here: http://hortonworks.com/blog/resource-localization-in-yarn-deep-dive/

The default location as you mentioned is /tmp. If you want to improve performance, you could provide multiple directories on separate disks for better I/O throughput.

But you should first ascertain that this is indeed a bottleneck and whether a separate disk is warranted for this purpose (or whether you are better off using it as a MapR data disk).

One other thing, the NFS mounted location (/mapr/clustername/tmp/host_name) is not a part of the distributed FS.

MapR makes it seamless to work between its distributed file system and the POSIX file system. But the files of the POSIX system are not stored in any containers/chunks/blocks, etc.

Since the path you specified is really a local directory on the node running NFS, you don't get an error message on that node. But on the other nodes, the system can't find a local directory by that name, and hence it complains.
