Hadoop map-side timeout parameters
Recently one machine in the cluster hung and caused a large number of map-side task FAILs. When we traced the failures to that specific machine, we either could not ssh in at all, or the terminal stopped responding after login. The disconnect looked like this:
[hadoop@xxx ~]$ Received disconnect from xxx: Timeout, your session not responding.
AttemptID:attempt_1413206225298_24177_m_000001_0 Timed out after 1200 secs
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143
(Exit code 143 is 128 + 15, i.e. the container was terminated with SIGTERM.)
The 1200 seconds come from mapreduce.task.timeout, evidently set to 1,200,000 ms on this cluster (the stock default is 600,000 ms, i.e. 10 minutes). Its description in mapred-default.xml reads:

The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string. A value of 0 disables the timeout.
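If the cluster-wide default is too aggressive for a long-running job, the timeout can be raised per job. A minimal sketch, assuming the standard Job API; the 1,800,000 ms value and the class/job names are just illustrations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TimeoutConfigExample {
  public static Job buildJob() throws Exception {
    Configuration conf = new Configuration();
    // 30 minutes instead of the 10-minute default; 0 would disable the check.
    conf.setLong("mapreduce.task.timeout", 1800000L);
    return Job.getInstance(conf, "long-running-simulation");
  }
}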
The "Timed out after ... secs" diagnostic is produced by the heartbeat checker in the MR ApplicationMaster (TaskHeartbeatHandler in Hadoop 2.x), which periodically walks the registered attempts and raises a TA_TIMED_OUT event for any attempt whose last progress report is older than the timeout:

Map.Entry<TaskAttemptId, ReportTime> entry = iterator.next();
boolean taskTimedOut = (taskTimeOut > 0) &&
    (currentTime > (entry.getValue().getLastProgress() + taskTimeOut));

if (taskTimedOut) {
  // task is lost, remove from the list and raise lost event
  iterator.remove();
  eventHandler.handle(new TaskAttemptDiagnosticsUpdateEvent(entry
      .getKey(), "AttemptID:" + entry.getKey().toString()
      + " Timed out after " + taskTimeOut / 1000 + " secs"));
  eventHandler.handle(new TaskAttemptEvent(entry.getKey(),
      TaskAttemptEventType.TA_TIMED_OUT));
}
The clock is reset whenever a task reports progress; the same class records the last progress time:

public void progressing(TaskAttemptId attemptID) {
  // only put for the registered attempts
  // TODO throw an exception if the task isn't registered.
  ReportTime time = runningAttempts.get(attemptID);
  if (time != null) {
    time.setLastProgress(clock.getTime());
  }
}

Report progress
If your task reports no progress for 10 minutes (see the mapred.task.timeout property) then it will be killed by Hadoop. Most tasks don’t encounter this situation since they report progress implicitly by reading input and writing output. However, some jobs which don’t process records in this way may fall foul of this behavior and have their tasks killed. Simulations are a good example, since they do a lot of CPU-intensive processing in each map and typically only write the result at the end of the computation. They should be written in such a way as to report progress on a regular basis (more frequently than every 10 minutes). This may be achieved in a number of ways:
- Call setStatus() on Reporter to set a human-readable description of the task's progress
- Call incrCounter() on Reporter to increment a user counter
- Call progress() on Reporter to tell Hadoop that your task is still there (and making progress)
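As a concrete illustration, here is a minimal sketch of such a CPU-bound mapper using the newer org.apache.hadoop.mapreduce API (where Context plays the role of Reporter). SimulationMapper and runExpensiveSimulationStep are hypothetical names standing in for the real work:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SimulationMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (int step = 0; step < 1000; step++) {
      runExpensiveSimulationStep(step);   // hypothetical CPU-bound work
      if (step % 10 == 0) {
        context.progress();               // reset the AM's timeout clock
        context.setStatus("simulation step " + step);
      }
    }
    // write the result only at the end of the computation
    context.write(new Text("result"), value);
  }

  private void runExpensiveSimulationStep(int step) {
    // placeholder for the real simulation work
  }
}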
But that was not the end of it. Every so often, a task in the cluster would get stuck at some point and make no further progress. A thread dump of one stuck task showed:
"main" prio=10 tid=0x000000000293f000 nid=0x1e06 runnable [0x0000000041b20000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:228)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:81)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked <0x00000006e243c3f0> (a sun.nio.ch.Util$2)
- locked <0x00000006e243c3e0> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000006e243c1a0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
The read is blocked in SocketIOWithTimeout.doIO, which waits for the socket with a bounded select and converts a zero result into a SocketTimeoutException:

// now wait for socket to be ready.
int count = 0;
try {
  count = selector.select(channel, ops, timeout);
} catch (IOException e) { // unexpected IOException.
  closed = true;
  throw e;
}

if (count == 0) {
  throw new SocketTimeoutException(timeoutExceptionString(channel,
      timeout, ops));
}
When that timeout fires during an HDFS pipeline recovery, the task fails with an error like this (for where the 70000 ms comes from, see the sketch after the stack trace):

Error: java.net.SocketTimeoutException: 70000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=xxx remote=/xxx]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1490)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:962)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:930)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1031)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:823)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:475)
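Where does 70000 ms come from? In the 2.x client it appears to be derived from dfs.client.socket-timeout plus a fixed per-datanode extension. A hedged back-of-the-envelope sketch; the constants are assumptions based on 2.x defaults, not values read from this cluster:

public class ReadTimeoutMath {
  public static void main(String[] args) {
    int socketTimeout = 60000;        // dfs.client.socket-timeout default (ms)
    int readTimeoutExtension = 5000;  // assumed per-datanode extension (ms)
    int numNodes = 2;                 // datanodes involved in the transfer
    // 60000 + 5000 * 2 = 70000, matching the message above
    System.out.println(socketTimeout + readTimeoutExtension * numNodes);
  }
}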
Inside SelectorPool.select, Hadoop already defends against select() waking up early: it keeps selecting until a channel is ready, the remaining timeout is used up, or the thread is interrupted:

while (true) {
  long start = (timeout == 0) ? 0 : Time.now();

  key = channel.register(info.selector, ops);
  ret = info.selector.select(timeout);

  if (ret != 0) {
    return ret;
  }

  /* Sometimes select() returns 0 much before timeout for
   * unknown reasons. So select again if required.
   */
  if (timeout > 0) {
    timeout -= Time.now() - start;
    if (timeout <= 0) {
      return 0;
    }
  }

  if (Thread.currentThread().isInterrupted()) {
    throw new InterruptedIOException("Interruped while waiting for " +
                                     "IO on channel " + channel +
                                     ". " + timeout +
                                     " millis timeout left.");
  }
}
The behavior here is governed by the contract of the JDK's java.nio.channels.Selector:
public abstract int select(long timeout)
throws java.io.IOException
Selects a set of keys whose corresponding channels are ready for I/O operations.
This method performs a blocking selection operation. It returns only after at least one channel is selected, this selector's wakeup method is invoked, the current thread is interrupted, or the given timeout period expires, whichever comes first.
In other words, select(timeout) returns when whichever of the following happens first:
- at least one registered channel becomes ready; the return value is the number of selected channels;
- the selector is woken up (wakeup()) or the current thread is interrupted;
- the given timeout expires.
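The same "re-select until the deadline" pattern can be written standalone with plain java.nio, independent of the Hadoop classes. A minimal sketch; BoundedSelect and selectWithDeadline are illustrative names:

import java.io.IOException;
import java.nio.channels.Selector;

public class BoundedSelect {
  /**
   * Selects until at least one key is ready or the deadline passes,
   * re-issuing select() on spurious zero returns. Returns the number
   * of ready keys, or 0 on timeout.
   */
  static int selectWithDeadline(Selector selector, long timeoutMs)
      throws IOException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (true) {
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        return 0; // timed out
      }
      int ready = selector.select(remaining);
      if (ready != 0) {
        return ready;
      }
      // ready == 0 before the deadline: either wakeup() or a spurious
      // return; loop and select again with the remaining time.
    }
  }
}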
Still, that is not the whole story: surely a timeout gets retried? How many times, exactly?
Following the stack further down, DFSInputStream calls a readBuffer method. There you can see that after the first failure the IOException is caught and, while retryCurrentNode is still true, the same node is retried, so that transient errors do not immediately fail the read; if it fails again, the DataNode is added to the dead-node list (so it presumably will not be tried again for this read) and the read moves to another replica via seekToNewSource. Only when no further source can be found is the IOException actually thrown:
try {
  return reader.doRead(blockReader, off, len, readStatistics);
} catch ( ChecksumException ce ) {
  DFSClient.LOG.warn("Found Checksum error for "
      + getCurrentBlock() + " from " + currentNode
      + " at " + ce.getPos());
  ioe = ce;
  retryCurrentNode = false;
  // we want to remember which block replicas we have tried
  addIntoCorruptedBlockMap(getCurrentBlock(), currentNode,
      corruptedBlockMap);
} catch ( IOException e ) {
  if (!retryCurrentNode) {
    DFSClient.LOG.warn("Exception while reading from "
        + getCurrentBlock() + " of " + src + " from "
        + currentNode, e);
  }
  ioe = e;
}

boolean sourceFound = false;
if (retryCurrentNode) {
  /* possibly retry the same node so that transient errors don't
   * result in application level failures (e.g. Datanode could have
   * closed the connection because the client is idle for too long).
   */
  sourceFound = seekToBlockSource(pos);
} else {
  addToDeadNodes(currentNode);
  sourceFound = seekToNewSource(pos);
}
if (!sourceFound) {
  throw ioe;
}
retryCurrentNode = false;
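So the number of retries is bounded by how many live replicas remain rather than by a fixed count. As far as I can tell, the knobs that most directly bound how long a read keeps retrying on a 2.x cluster are the socket timeout and the block-acquire failure budget. A hedged sketch, assuming vanilla 2.x defaults; the class name is illustrative:

import org.apache.hadoop.conf.Configuration;

public class DfsClientTimeoutTuning {
  public static Configuration tuned() {
    Configuration conf = new Configuration();
    // Per-connection read timeout (ms); the 70000 ms above is derived from it.
    conf.setInt("dfs.client.socket-timeout", 60000);
    // How many times the client may fail to obtain a block's datanodes
    // before giving up (the 2.x default is 3).
    conf.setInt("dfs.client.max.block.acquire.failures", 3);
    return conf;
  }
}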