Request received to kill task 'attempt_201411191723_2827635_r_000009_0' by user
-------
Task has been KILLED_UNCLEAN by the user

There are three possible causes:
1. An impatient user (armed with "mapred job -kill-task" command)
2. JobTracker (to kill a speculative duplicate, or when a whole job fails)
3. Fair Scheduler (but diplomatically, it calls it “preemption”)

An article by a foreign author explains this in more detail:

This is one of the most bloodcurdling stories (and one of my favorites) that we have recently seen in our 190-square-meter Hadoopland. In a nutshell, some jobs were running surprisingly long, because thousands of their tasks were constantly being killed for unknown reasons by someone (or something). For example, a photo taken by our detectives shows a job that had been running for 12 hours and 20 minutes and had spawned around 13,000 tasks up to that moment. However, only 4,118 map tasks had finished successfully, while 8,708 had been killed (!) and, surprisingly, only 1 task had failed (?), obviously spreading panic in the Hadoopland. When murdering, the killer left the same message each time: "KILLED_UNCLEAN by the user" (however, even our uncle competitor Google does not know much about what exactly it means ;)). Who is “the user”? Does the killer want to impersonate someone?

More Traces Of Crime

The detectives started looking for more traces of crime. They noticed that the killed tasks belonged to ad-hoc Hive queries, which are quite resource-intensive. By comparing timestamps in log files from JobTracker, TaskTracker and the map tasks, they figured out that JobTracker had received a request to murder the tasks… They also noticed that tasks were usually killed young, shortly after starting (within 6-16 minutes), while the surviving tasks ran fine for long hours. The killer is unscrupulous!
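
The log digging described above can be partly automated. Below is a minimal sketch that counts kill requests per job and task type, assuming the JobTracker log contains lines shaped like the "Request received to kill task" message quoted at the top of this post. The function name count_kill_requests and the second sample line are hypothetical and only serve as an illustration.

```python
import re
from collections import Counter

# Matches the kill-request message quoted at the top of this post.
# Attempt IDs have the shape attempt_<jobtracker start>_<job>_<m|r>_<task>_<attempt>.
KILL_RE = re.compile(
    r"Request received to kill task '"
    r"attempt_(?P<jt>\d+)_(?P<job>\d+)_(?P<kind>[mr])_(?P<task>\d+)_(?P<attempt>\d+)'"
)


def count_kill_requests(lines):
    """Count kill requests per (job ID, task kind) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            job_id = "job_{jt}_{job}".format(**match.groupdict())
            counts[(job_id, match.group("kind"))] += 1
    return counts


sample = [
    # Line taken from this post.
    "Request received to kill task 'attempt_201411191723_2827635_r_000009_0' by user",
    # Hypothetical line, made up for the example.
    "Request received to kill task 'attempt_201411191723_2827635_m_000123_1' by user",
    "some unrelated log line",
]
kill_counts = count_kill_requests(sample)
for (job_id, kind), n in sorted(kill_counts.items()):
    print(job_id, "map" if kind == "m" else "reduce", n)
```

Grouping by job makes it easy to see whether the kills are concentrated in a few jobs, as they were in this story.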

Killer’s Identity

Who can actually send a kill request to JobTracker to murder thousands of tasks? The detectives quickly selected three main candidates:
  • An impatient user (armed with "mapred job -kill-task" command)
  • JobTracker (to kill a speculative duplicate, or when a whole job fails)
  • Fair Scheduler (but diplomatically, it calls it “preemption”)
When looking at log messages saying that a task is "KILLED_UNCLEAN by the user", one could think that some user is the prime candidate for the serial killer. However, the citizens of our Hadoopland are friendly, patient and respectful of others, so it would be unfair to assume that somebody killed, in cold blood, 8,708 tasks from a single job. JobTracker also seems to have a good alibi, because the job itself had not failed yet and speculative execution was disabled (surprisingly, Hive has its own setting, hive.mapred.reduce.tasks.speculative.execution, for disabling speculative execution for reduce tasks, which is not overwritten by Hadoop’s mapred.reduce.tasks.speculative.execution).

FairScheduler Accused

For some company-specific reasons, the ad-hoc Hive queries run as the hive user in our Hadoopland. Moreover, FairScheduler is configured with the default value of mapred.fairscheduler.poolnameproperty (which is user.name), so pools are created dynamically based on the username of the user submitting the job to the cluster (“hive” in the case of our ad-hoc Hive queries). One of the detectives then remembered, from a presentation about Hadoop he had browsed 2 years earlier, that FairScheduler usually preempts the newest tasks in an over-share pool to forcibly make room for starved pools. Eureka! ;) At this moment everything became clear, and a quick look at the FairScheduler web page confirmed it. The “Hive” pool had been running over its minimum and fair shares for a long time, while the other pools were constantly running under their minimum and fair shares. In such a case, Fair Scheduler kills Hive tasks from time to time to reassign their slots to tasks from other pools.
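
The preemption behaviour described above can be illustrated with a toy model. This is only a sketch of the idea (kill the newest tasks in pools that run above their fair share), not FairScheduler's actual code. The Task class, the pick_preemption_victims function, the pool names and all numbers are hypothetical, and the model assumes one slot per task and a single fair-share number per pool.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    pool: str
    start_time: int  # seconds since an arbitrary origin; larger means newer


def pick_preemption_victims(running, fair_share, slots_needed):
    """Pick up to slots_needed tasks to kill, newest first.

    Only pools currently above their fair share lose tasks, and no pool
    is pushed below its fair share.
    """
    usage = {}
    for task in running:
        usage[task.pool] = usage.get(task.pool, 0) + 1

    victims = []
    for task in sorted(running, key=lambda t: t.start_time, reverse=True):
        if len(victims) >= slots_needed:
            break
        if usage[task.pool] > fair_share.get(task.pool, 0):
            victims.append(task)
            usage[task.pool] -= 1
    return victims


running = [
    Task("h1", "hive", 0),
    Task("h2", "hive", 100),
    Task("h3", "hive", 200),
    Task("h4", "hive", 300),
    Task("e1", "etl", 250),
]
fair_share = {"hive": 2, "etl": 3}  # hive is over its share, etl is starved

victims = pick_preemption_victims(running, fair_share, slots_needed=2)
print([t.task_id for t in victims])  # prints ['h4', 'h3']
```

In this model the youngest Hive tasks die first, which matches the observation that killed tasks were usually only a few minutes old while older tasks survived.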

Less Violence, More Peace

Having the evidence, we could put Fair Scheduler in prison and use Capacity Scheduler instead. Maybe in the future we will do that! Today, we believe that Fair Scheduler did not commit the crimes intentionally: we feel that we educated it badly and gave it too much power. So today Fair Scheduler gets a suspended sentence. We want to give it a chance to rehabilitate and become friendlier and less aggressive…

How can we improve the personality of Fair Scheduler? Obviously, tuning settings like minSharePreemptionTimeout, fairSharePreemptionTimeout, minMaps and minReduces based on the current workload could be a good way to control how aggressively Fair Scheduler preempts. Easier said than done, because it requires a deep understanding of your workload (which may or may not change later). There is also a setting called mapred.fairscheduler.preemption that enables or disables preemption. However, disabling preemption (or rather killing, to be precise) would, in our case, solve the problem only partially, because this issue exposed another problem in the Hadoopland: ad-hoc Hive queries are overloading the cluster. In the end, we did not disable preemption, because we were a bit worried about SLAs not being enforced without any preemption. That said, the two problems to solve are:
  • stop mass killing Hive tasks
  • stop overloading the cluster with ad-hoc Hive queries
We simply limited the number of map and reduce tasks that Fair Scheduler can run in the Hive pool (by setting maxMaps and maxReduces for that pool). As a consequence, the Hive pool cannot contain too many tasks, so Fair Scheduler cannot kill too many of them ;) (because the Hive pool will not be operating (too far) above its minimum and fair share levels). Limiting the number of tasks also prevents Hive queries from overloading the cluster (additionally, one could set the maximum number of concurrent jobs running in the Hive pool using maxRunningJobs). A nice thing to say is that Fair Scheduler is eager to cooperate, because changing FairScheduler’s allocation file does not require restarting JobTracker. This file is automatically polled for changes every 10 seconds, and if it has changed, it is reloaded and the pool configurations are updated on the fly. Thanks to that, you can easily get to know the personality of Fair Scheduler and change it for the better. ;)
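
As an illustration of the caps described above, here is a minimal sketch that generates an allocation file fragment for one pool. The element names (allocations, pool, maxMaps, maxReduces, maxRunningJobs) follow the Hadoop 1.x Fair Scheduler allocation file format. The helper build_allocations and the numeric limits are hypothetical, since the article does not say which values were used.

```python
import xml.etree.ElementTree as ET


def build_allocations(pool_name, max_maps, max_reduces, max_running_jobs):
    """Return allocation-file XML that caps a single pool."""
    root = ET.Element("allocations")
    pool = ET.SubElement(root, "pool", name=pool_name)
    limits = (
        ("maxMaps", max_maps),
        ("maxReduces", max_reduces),
        ("maxRunningJobs", max_running_jobs),
    )
    for tag, value in limits:
        ET.SubElement(pool, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical limits, chosen only for the example.
allocations_xml = build_allocations("hive", max_maps=200, max_reduces=50, max_running_jobs=10)
print(allocations_xml)
```

The resulting XML would go into the file that the scheduler is configured to read as its allocation file. Because that file is re-read on the fly, the limits can be adjusted step by step while watching how the pool behaves.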