The primary purpose of delay scheduling is to improve data locality and reduce the amount of data shipped across the network. When a job's pending MapTasks have no input data local to the node offering a slot, the scheduler delays scheduling them and hands the slot to MapTasks that do have locality.

  The general idea of delay scheduling is as follows:

  If the job has a node-local MapTask for the offering node, return that task; if not, delay the job: within the nodeLocalityDelay window, keep trying on subsequent heartbeats to find a node-local MapTask and return it;

  once the wait exceeds nodeLocalityDelay, look for a rack-local MapTask and return it; if none is found, keep delaying: within the rackLocalityDelay window, keep trying to find a rack-local MapTask and return it;

  once the wait exceeds nodeLocalityDelay + rackLocalityDelay, look for an off-switch MapTask and return it.
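The three steps above amount to a small decision rule: given how long a job has waited since it last launched a local map task, return the most relaxed locality level it is allowed to use. Here is a minimal, self-contained sketch of that rule; the class, method, and enum names are mine for illustration (the real logic lives in FairScheduler.getAllowedLocalityLevel(), analyzed below):

```java
// Simplified sketch of delay scheduling's escalation rule.
// All names here are illustrative, not Hadoop API.
public class DelaySketch {
    enum Level { NODE, RACK, ANY }

    static Level allowedLevel(long timeWaitedMs,
                              long nodeDelayMs, long rackDelayMs) {
        if (timeWaitedMs >= nodeDelayMs + rackDelayMs) {
            return Level.ANY;   // waited long enough: off-switch allowed
        } else if (timeWaitedMs >= nodeDelayMs) {
            return Level.RACK;  // fall back to rack-local
        } else {
            return Level.NODE;  // keep insisting on node-local
        }
    }
}
```

The real method additionally handles jobs with no locality information and starving pools, both of which short-circuit to ANY.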

  The main delay-scheduling variables in FairScheduler.java:

 long nodeLocalityDelay;         // how long to keep waiting for a node-local task
 long rackLocalityDelay;         // how long to additionally wait for a rack-local task
 boolean skippedAtLastHeartbeat; // whether this job was skipped (delayed) at the last heartbeat
 long timeWaitedForLocalMap;     // time waited since this job's last MapTask was assigned
 LocalityLevel lastMapLocalityLevel; // locality level of the last MapTask assigned to this job

 nodeLocalityDelay = rackLocalityDelay =
     Math.min(15000L, (long) (1.5 * jobTracker.getNextHeartbeatInterval()));
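To make the delay computation concrete, assume a 3000 ms TaskTracker heartbeat interval (the number is illustrative; the real value comes from jobTracker.getNextHeartbeatInterval() and depends on cluster size). Both delays then come out to min(15000, 1.5 × 3000) = 4500 ms, and the 15-second cap kicks in only on very large clusters with long heartbeat intervals:

```java
// Worked example of the delay formula above.
// The heartbeat intervals passed in are assumed values for illustration.
public class DelayCalc {
    static long localityDelay(long heartbeatIntervalMs) {
        return Math.min(15000L, (long) (1.5 * heartbeatIntervalMs));
    }
}
```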

  

  In the Fair Scheduler, each job maintains two pieces of state to implement delay scheduling: the locality level of the last MapTask it launched (lastMapLocalityLevel) and the time it has waited while being skipped (timeWaitedForLocalMap). The workflow is as follows (implemented in FairScheduler.getAllowedLocalityLevel()):

 /**
  * Get the maximum locality level at which a given job is allowed to
  * launch tasks, based on how long it has been waiting for local tasks.
  * This is used to implement the "delay scheduling" feature of the Fair
  * Scheduler for optimizing data locality.
  * If the job has no locality information (e.g. it does not use HDFS), this
  * method returns LocalityLevel.ANY, allowing tasks at any level.
  * Otherwise, the job can only launch tasks at its current locality level
  * or lower, unless it has waited at least nodeLocalityDelay or
  * rackLocalityDelay milliseconds, depending on the current level. If it
  * has waited (nodeLocalityDelay + rackLocalityDelay) milliseconds,
  * it can go to any level.
  */
 protected LocalityLevel getAllowedLocalityLevel(JobInProgress job,
     long currentTime) {
   JobInfo info = infos.get(job);
   if (info == null) { // Job not in infos (shouldn't happen)
     LOG.error("getAllowedLocalityLevel called on job " + job
         + ", which does not have a JobInfo in infos");
     return LocalityLevel.ANY;
   }
   if (job.nonLocalMaps.size() > 0) { // Job doesn't have locality information
     return LocalityLevel.ANY;
   }
   // Don't wait for locality if the job's pool is starving for maps
   Pool pool = poolMgr.getPool(job);
   PoolSchedulable sched = pool.getMapSchedulable();
   long minShareTimeout = poolMgr.getMinSharePreemptionTimeout(pool.getName());
   long fairShareTimeout = poolMgr.getFairSharePreemptionTimeout();
   if (currentTime - sched.getLastTimeAtMinShare() > minShareTimeout ||
       currentTime - sched.getLastTimeAtHalfFairShare() > fairShareTimeout) {
     eventLog.log("INFO", "No delay scheduling for "
         + job.getJobID() + " because it is being starved");
     return LocalityLevel.ANY;
   }
   // In the common case, compute locality level based on time waited
   switch (info.lastMapLocalityLevel) {
   case NODE: // Last task launched was node-local
     if (info.timeWaitedForLocalMap >=
         nodeLocalityDelay + rackLocalityDelay)
       return LocalityLevel.ANY;
     else if (info.timeWaitedForLocalMap >= nodeLocalityDelay)
       return LocalityLevel.RACK;
     else
       return LocalityLevel.NODE;
   case RACK: // Last task launched was rack-local
     if (info.timeWaitedForLocalMap >= rackLocalityDelay)
       return LocalityLevel.ANY;
     else
       return LocalityLevel.RACK;
   default: // Last task was non-local; can launch anywhere
     return LocalityLevel.ANY;
   }
 }


1. If lastMapLocalityLevel is NODE:

1) if timeWaitedForLocalMap >= nodeLocalityDelay + rackLocalityDelay, MapTasks up to the off-switch level may be scheduled;

2) else if timeWaitedForLocalMap >= nodeLocalityDelay, MapTasks up to the rack-local level may be scheduled;

3) otherwise only node-local MapTasks may be scheduled.

2. If lastMapLocalityLevel is RACK:

1) if timeWaitedForLocalMap >= rackLocalityDelay, MapTasks up to the off-switch level may be scheduled;

2) otherwise MapTasks up to the rack-local level may be scheduled.

3. Otherwise (the last task was non-local), MapTasks at any level may be scheduled.

  The overall flow of delay scheduling is implemented in FairScheduler.assignTasks():

 @Override
 public synchronized List<Task> assignTasks(TaskTracker tracker)
     throws IOException {
   if (!initialized) // Don't try to assign tasks if we haven't yet started up
     return null;
   String trackerName = tracker.getTrackerName();
   eventLog.log("HEARTBEAT", trackerName);
   long currentTime = clock.getTime();

   // Compute total runnable maps and reduces, and currently running ones
   int runnableMaps = 0;
   int runningMaps = 0;
   int runnableReduces = 0;
   int runningReduces = 0;
   for (Pool pool: poolMgr.getPools()) {
     runnableMaps += pool.getMapSchedulable().getDemand();
     runningMaps += pool.getMapSchedulable().getRunningTasks();
     runnableReduces += pool.getReduceSchedulable().getDemand();
     runningReduces += pool.getReduceSchedulable().getRunningTasks();
   }

   ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
   // Compute total map/reduce slots
   // In the future we can precompute this if the Scheduler becomes a
   // listener of tracker join/leave events.
   int totalMapSlots = getTotalSlots(TaskType.MAP, clusterStatus);
   int totalReduceSlots = getTotalSlots(TaskType.REDUCE, clusterStatus);

   eventLog.log("RUNNABLE_TASKS",
       runnableMaps, runningMaps, runnableReduces, runningReduces);

   // Update time waited for local maps for jobs skipped on last heartbeat
   // Note 1
   updateLocalityWaitTimes(currentTime);

   // Check for JT safe-mode
   if (taskTrackerManager.isInSafeMode()) {
     LOG.info("JobTracker is in safe-mode, not scheduling any tasks.");
     return null;
   }

   TaskTrackerStatus tts = tracker.getStatus();

   int mapsAssigned = 0; // loop counter for map in the below while loop
   int reducesAssigned = 0; // loop counter for reduce in the below while
   int mapCapacity = maxTasksToAssign(TaskType.MAP, tts);
   int reduceCapacity = maxTasksToAssign(TaskType.REDUCE, tts);
   boolean mapRejected = false; // flag used for ending the loop
   boolean reduceRejected = false; // flag used for ending the loop

   // Keep track of which jobs were visited for map tasks and which had tasks
   // launched, so that we can later mark skipped jobs for delay scheduling
   Set<JobInProgress> visitedForMap = new HashSet<JobInProgress>();
   Set<JobInProgress> visitedForReduce = new HashSet<JobInProgress>();
   Set<JobInProgress> launchedMap = new HashSet<JobInProgress>();

   ArrayList<Task> tasks = new ArrayList<Task>();

   // Scan jobs to assign tasks until neither maps nor reduces can be assigned
   // Note 2
   while (true) {
     // Computing the ending conditions for the loop
     // Reject a task type if one of the following condition happens
     // 1. number of assigned task reaches per heartbeat limit
     // 2. number of running tasks reaches runnable tasks
     // 3. task is rejected by the LoadManager.canAssign
     if (!mapRejected) {
       if (mapsAssigned == mapCapacity ||
           runningMaps == runnableMaps ||
           !loadMgr.canAssignMap(tts, runnableMaps,
               totalMapSlots, mapsAssigned)) {
         eventLog.log("INFO", "Can't assign another MAP to " + trackerName);
         mapRejected = true;
       }
     }
     if (!reduceRejected) {
       if (reducesAssigned == reduceCapacity ||
           runningReduces == runnableReduces ||
           !loadMgr.canAssignReduce(tts, runnableReduces,
               totalReduceSlots, reducesAssigned)) {
         eventLog.log("INFO", "Can't assign another REDUCE to " + trackerName);
         reduceRejected = true;
       }
     }
     // Exit while (true) loop if
     // 1. neither maps nor reduces can be assigned
     // 2. assignMultiple is off and we already assigned one task
     if (mapRejected && reduceRejected ||
         !assignMultiple && tasks.size() > 0) {
       break; // This is the only exit of the while (true) loop
     }

     // Determine which task type to assign this time
     // First try choosing a task type which is not rejected
     TaskType taskType;
     if (mapRejected) {
       taskType = TaskType.REDUCE;
     } else if (reduceRejected) {
       taskType = TaskType.MAP;
     } else {
       // If both types are available, choose the task type with fewer running
       // tasks on the task tracker to prevent that task type from starving
       if (tts.countMapTasks() + mapsAssigned <=
           tts.countReduceTasks() + reducesAssigned) {
         taskType = TaskType.MAP;
       } else {
         taskType = TaskType.REDUCE;
       }
     }

     // Get the map or reduce schedulables and sort them by fair sharing
     List<PoolSchedulable> scheds = getPoolSchedulables(taskType);
     // Sort the schedulables (jobs/pools) by the fair-share comparator
     Collections.sort(scheds, new SchedulingAlgorithms.FairShareComparator());
     boolean foundTask = false;
     // Note 3
     for (Schedulable sched: scheds) { // This loop will assign only one task
       eventLog.log("INFO", "Checking for " + taskType +
           " task in " + sched.getName());
       // Note 4
       Task task = taskType == TaskType.MAP ?
           sched.assignTask(tts, currentTime, visitedForMap) :
           sched.assignTask(tts, currentTime, visitedForReduce);
       if (task != null) {
         foundTask = true;
         JobInProgress job = taskTrackerManager.getJob(task.getJobID());
         eventLog.log("ASSIGN", trackerName, taskType,
             job.getJobID(), task.getTaskID());
         // Update running task counts, and the job's locality level
         if (taskType == TaskType.MAP) {
           launchedMap.add(job);
           mapsAssigned++;
           runningMaps++;
           // Note 5
           updateLastMapLocalityLevel(job, task, tts);
         } else {
           reducesAssigned++;
           runningReduces++;
         }
         // Add task to the list of assignments
         tasks.add(task);
         break; // This break makes this loop assign only one task
       } // end if (task != null)
     } // end for (Schedulable sched: scheds)

     // Reject the task type if we cannot find a task
     if (!foundTask) {
       if (taskType == TaskType.MAP) {
         mapRejected = true;
       } else {
         reduceRejected = true;
       }
     }
   } // end while (true)

   // Mark any jobs that were visited for map tasks but did not launch a task
   // as skipped on this heartbeat
   for (JobInProgress job: visitedForMap) {
     if (!launchedMap.contains(job)) {
       infos.get(job).skippedAtLastHeartbeat = true;
     }
   }

   // If no tasks were found, return null
   return tasks.isEmpty() ? null : tasks;
 }


  Note 1: updateLocalityWaitTimes(). For every job that was skipped at the last heartbeat, the time elapsed since that heartbeat is added to its timeWaitedForLocalMap, and its skippedAtLastHeartbeat flag is reset to false:

 /**
  * Update locality wait times for jobs that were skipped at last heartbeat.
  */
 private void updateLocalityWaitTimes(long currentTime) {
   long timeSinceLastHeartbeat =
       (lastHeartbeatTime == 0 ? 0 : currentTime - lastHeartbeatTime);
   lastHeartbeatTime = currentTime;
   for (JobInfo info: infos.values()) {
     if (info.skippedAtLastHeartbeat) {
       info.timeWaitedForLocalMap += timeSinceLastHeartbeat;
       info.skippedAtLastHeartbeat = false;
     }
   }
 }

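Putting this together with getAllowedLocalityLevel(): a job that keeps getting skipped accumulates wait time in heartbeat-interval increments, and crosses the NODE-to-RACK threshold once the total reaches nodeLocalityDelay. A toy simulation of that bookkeeping (the class and field names mirror the real fields, but this is a standalone sketch, not Hadoop code; the timestamps are made up):

```java
// Toy model of how timeWaitedForLocalMap accumulates across heartbeats.
public class WaitSim {
    long timeWaited = 0;      // corresponds to timeWaitedForLocalMap
    boolean skipped = false;  // corresponds to skippedAtLastHeartbeat
    long lastHeartbeat = 0;

    // Mirrors updateLocalityWaitTimes(): only a job skipped at the
    // previous heartbeat accumulates the elapsed time.
    void heartbeat(long now) {
        long delta = (lastHeartbeat == 0) ? 0 : now - lastHeartbeat;
        lastHeartbeat = now;
        if (skipped) {
            timeWaited += delta;
            skipped = false;
        }
    }
}
```

With 3000 ms between heartbeats and nodeLocalityDelay = 4500 ms, a job skipped on two consecutive heartbeats has waited 6000 ms and would be offered rack-local tasks on the next one.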

  Note 2: the while (true) loop keeps assigning MapTasks and ReduceTasks until neither type can be assigned. On each iteration the schedulables are sorted by the fair-share comparator, and the inner for loop then performs the actual assignment. (Schedulable has two subclasses, PoolSchedulable and JobSchedulable; for this discussion a Schedulable can be thought of as a job.)
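One detail worth isolating from the loop: when both task types are still assignable, the scheduler picks the type with fewer tasks currently on the tracker (counting this heartbeat's assignments too), so neither map nor reduce slots starve. A standalone sketch of that tie-break, with illustrative names of my own:

```java
// Sketch of the map-vs-reduce tie-break inside assignTasks():
// prefer the type with fewer tasks on this tracker; maps win ties.
public class TypePick {
    static String choose(int trackerMaps, int mapsAssigned,
                         int trackerReduces, int reducesAssigned) {
        return (trackerMaps + mapsAssigned <= trackerReduces + reducesAssigned)
            ? "MAP" : "REDUCE";
    }
}
```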

  Notes 3 and 4: inside the for loop, JobSchedulable.assignTask() is called to pick a suitable MapTask or ReduceTask. When picking a MapTask, it first calls FairScheduler.getAllowedLocalityLevel() to determine the most relaxed locality level the job may use (analyzed above), then requests a MapTask at or below that level. The assignTask() code:

 @Override
 public Task assignTask(TaskTrackerStatus tts, long currentTime,
     Collection<JobInProgress> visited) throws IOException {
   if (isRunnable()) {
     visited.add(job);
     TaskTrackerManager ttm = scheduler.taskTrackerManager;
     ClusterStatus clusterStatus = ttm.getClusterStatus();
     int numTaskTrackers = clusterStatus.getTaskTrackers();

     // check with the load manager whether it is safe to
     // launch this task on this taskTracker.
     LoadManager loadMgr = scheduler.getLoadManager();
     if (!loadMgr.canLaunchTask(tts, job, taskType)) {
       return null;
     }
     if (taskType == TaskType.MAP) {
       // Determine the locality level this job is allowed to use
       LocalityLevel localityLevel = scheduler.getAllowedLocalityLevel(
           job, currentTime);
       scheduler.getEventLog().log(
           "ALLOWED_LOC_LEVEL", job.getJobID(), localityLevel);
       switch (localityLevel) {
       case NODE:
         return job.obtainNewNodeLocalMapTask(tts, numTaskTrackers,
             ttm.getNumberOfUniqueHosts());
       case RACK:
         return job.obtainNewNodeOrRackLocalMapTask(tts, numTaskTrackers,
             ttm.getNumberOfUniqueHosts());
       default:
         return job.obtainNewMapTask(tts, numTaskTrackers,
             ttm.getNumberOfUniqueHosts());
       }
     } else {
       return job.obtainNewReduceTask(tts, numTaskTrackers,
           ttm.getNumberOfUniqueHosts());
     }
   } else {
     return null;
   }
 }


  As the switch shows, this method calls the JobInProgress method matching the allowed level (obtainNewNodeLocalMapTask, obtainNewNodeOrRackLocalMapTask, or obtainNewMapTask) to obtain a MapTask at that level.

  Note 5: finally, updateLastMapLocalityLevel() updates the job's bookkeeping: lastMapLocalityLevel is set to the level of the task just launched, and timeWaitedForLocalMap is reset to 0.

 /**
  * Update a job's locality level and locality wait variables given that
  * it has just launched a map task on a given task tracker.
  */
 private void updateLastMapLocalityLevel(JobInProgress job,
     Task mapTaskLaunched, TaskTrackerStatus tracker) {
   JobInfo info = infos.get(job);
   boolean isNodeGroupAware = conf.getBoolean(
       "net.topology.nodegroup.aware", false);
   LocalityLevel localityLevel = LocalityLevel.fromTask(
       job, mapTaskLaunched, tracker, isNodeGroupAware);
   info.lastMapLocalityLevel = localityLevel;
   info.timeWaitedForLocalMap = 0;
   eventLog.log("ASSIGNED_LOC_LEVEL", job.getJobID(), localityLevel);
 }


  This article is based on Hadoop 1.2.1. Corrections are welcome.

  References: 《Hadoop技术内幕:深入理解MapReduce架构设计与实现原理》 (Hadoop Internals: In-depth Study of MapReduce Architecture and Implementation), Dong Xicheng (董西成)

    https://issues.apache.org/jira/secure/attachment/12457515/fair_scheduler_design_doc.pdf

  Originally published at: http://www.cnblogs.com/gwgyk/p/4568270.html (please credit this source when reposting)
