The primary purpose of delay scheduling is to improve data locality and reduce the amount of data shipped across the network. When a job's pending MapTasks have no input data local to the node offering a slot, the scheduler delays scheduling them and hands the slot to MapTasks that do have locality.

  The general idea of delay scheduling is as follows:

  If the job has a node-local MapTask for the offering node, return that task; if not, delay the job: within the nodeLocalityDelay window, keep trying on subsequent heartbeats to find a node-local MapTask and return it;

  once the wait exceeds nodeLocalityDelay, look for a rack-local MapTask and return it; if none is found, keep delaying: within the rackLocalityDelay window, keep trying to find a rack-local MapTask and return it;

  once the wait exceeds nodeLocalityDelay + rackLocalityDelay, look for an off-switch MapTask and return it.
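The three steps above amount to a small decision rule: given how long a job has waited since it last launched a local map task, return the most relaxed locality level it is allowed to use. Here is a minimal, self-contained sketch of that rule; the class, method, and enum names are mine for illustration (the real logic lives in FairScheduler.getAllowedLocalityLevel(), analyzed below):

```java
// Simplified sketch of delay scheduling's escalation rule.
// All names here are illustrative, not Hadoop API.
public class DelaySketch {
    enum Level { NODE, RACK, ANY }

    static Level allowedLevel(long timeWaitedMs,
                              long nodeDelayMs, long rackDelayMs) {
        if (timeWaitedMs >= nodeDelayMs + rackDelayMs) {
            return Level.ANY;   // waited long enough: off-switch allowed
        } else if (timeWaitedMs >= nodeDelayMs) {
            return Level.RACK;  // fall back to rack-local
        } else {
            return Level.NODE;  // keep insisting on node-local
        }
    }
}
```

The real method additionally handles jobs with no locality information and starving pools, both of which short-circuit to ANY.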

  The main delay-scheduling variables in FairScheduler.java:

 long nodeLocalityDelay;         // how long to keep waiting for a node-local task
 long rackLocalityDelay;         // how long to additionally wait for a rack-local task
 boolean skippedAtLastHeartbeat; // whether this job was skipped (delayed) at the last heartbeat
 long timeWaitedForLocalMap;     // time waited since this job's last MapTask was assigned
 LocalityLevel lastMapLocalityLevel; // locality level of the last MapTask assigned to this job

 nodeLocalityDelay = rackLocalityDelay =
     Math.min(15000L, (long) (1.5 * jobTracker.getNextHeartbeatInterval()));
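To make the delay computation concrete, assume a 3000 ms TaskTracker heartbeat interval (the number is illustrative; the real value comes from jobTracker.getNextHeartbeatInterval() and depends on cluster size). Both delays then come out to min(15000, 1.5 × 3000) = 4500 ms, and the 15-second cap kicks in only on very large clusters with long heartbeat intervals:

```java
// Worked example of the delay formula above.
// The heartbeat intervals passed in are assumed values for illustration.
public class DelayCalc {
    static long localityDelay(long heartbeatIntervalMs) {
        return Math.min(15000L, (long) (1.5 * heartbeatIntervalMs));
    }
}
```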

  

  In the Fair Scheduler, each job maintains two pieces of state to implement delay scheduling: the locality level of the last MapTask it launched (lastMapLocalityLevel) and the time it has waited while being skipped (timeWaitedForLocalMap). The workflow is as follows (implemented in FairScheduler.getAllowedLocalityLevel()):

 /**
  * Get the maximum locality level at which a given job is allowed to
  * launch tasks, based on how long it has been waiting for local tasks.
  * This is used to implement the "delay scheduling" feature of the Fair
  * Scheduler for optimizing data locality.
  * If the job has no locality information (e.g. it does not use HDFS), this
  * method returns LocalityLevel.ANY, allowing tasks at any level.
  * Otherwise, the job can only launch tasks at its current locality level
  * or lower, unless it has waited at least nodeLocalityDelay or
  * rackLocalityDelay milliseconds, depending on the current level. If it
  * has waited (nodeLocalityDelay + rackLocalityDelay) milliseconds,
  * it can go to any level.
  */
 protected LocalityLevel getAllowedLocalityLevel(JobInProgress job,
     long currentTime) {
   JobInfo info = infos.get(job);
   if (info == null) { // Job not in infos (shouldn't happen)
     LOG.error("getAllowedLocalityLevel called on job " + job
         + ", which does not have a JobInfo in infos");
     return LocalityLevel.ANY;
   }
   if (job.nonLocalMaps.size() > 0) { // Job doesn't have locality information
     return LocalityLevel.ANY;
   }
   // Don't wait for locality if the job's pool is starving for maps
   Pool pool = poolMgr.getPool(job);
   PoolSchedulable sched = pool.getMapSchedulable();
   long minShareTimeout = poolMgr.getMinSharePreemptionTimeout(pool.getName());
   long fairShareTimeout = poolMgr.getFairSharePreemptionTimeout();
   if (currentTime - sched.getLastTimeAtMinShare() > minShareTimeout ||
       currentTime - sched.getLastTimeAtHalfFairShare() > fairShareTimeout) {
     eventLog.log("INFO", "No delay scheduling for "
         + job.getJobID() + " because it is being starved");
     return LocalityLevel.ANY;
   }
   // In the common case, compute locality level based on time waited
   switch (info.lastMapLocalityLevel) {
   case NODE: // Last task launched was node-local
     if (info.timeWaitedForLocalMap >=
         nodeLocalityDelay + rackLocalityDelay)
       return LocalityLevel.ANY;
     else if (info.timeWaitedForLocalMap >= nodeLocalityDelay)
       return LocalityLevel.RACK;
     else
       return LocalityLevel.NODE;
   case RACK: // Last task launched was rack-local
     if (info.timeWaitedForLocalMap >= rackLocalityDelay)
       return LocalityLevel.ANY;
     else
       return LocalityLevel.RACK;
   default: // Last task was non-local; can launch anywhere
     return LocalityLevel.ANY;
   }
 }


1. If lastMapLocalityLevel is NODE:

1) if timeWaitedForLocalMap >= nodeLocalityDelay + rackLocalityDelay, MapTasks up to the off-switch level may be scheduled;

2) else if timeWaitedForLocalMap >= nodeLocalityDelay, MapTasks up to the rack-local level may be scheduled;

3) otherwise only node-local MapTasks may be scheduled.

2. If lastMapLocalityLevel is RACK:

1) if timeWaitedForLocalMap >= rackLocalityDelay, MapTasks up to the off-switch level may be scheduled;

2) otherwise MapTasks up to the rack-local level may be scheduled.

3. Otherwise (the last task was non-local), MapTasks at any level may be scheduled.

  The overall flow of delay scheduling is implemented in FairScheduler.assignTasks():

 @Override
 public synchronized List<Task> assignTasks(TaskTracker tracker)
     throws IOException {
   if (!initialized) // Don't try to assign tasks if we haven't yet started up
     return null;
   String trackerName = tracker.getTrackerName();
   eventLog.log("HEARTBEAT", trackerName);
   long currentTime = clock.getTime();

   // Compute total runnable maps and reduces, and currently running ones
   int runnableMaps = 0;
   int runningMaps = 0;
   int runnableReduces = 0;
   int runningReduces = 0;
   for (Pool pool: poolMgr.getPools()) {
     runnableMaps += pool.getMapSchedulable().getDemand();
     runningMaps += pool.getMapSchedulable().getRunningTasks();
     runnableReduces += pool.getReduceSchedulable().getDemand();
     runningReduces += pool.getReduceSchedulable().getRunningTasks();
   }

   ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
   // Compute total map/reduce slots
   // In the future we can precompute this if the Scheduler becomes a
   // listener of tracker join/leave events.
   int totalMapSlots = getTotalSlots(TaskType.MAP, clusterStatus);
   int totalReduceSlots = getTotalSlots(TaskType.REDUCE, clusterStatus);

   eventLog.log("RUNNABLE_TASKS",
       runnableMaps, runningMaps, runnableReduces, runningReduces);

   // Update time waited for local maps for jobs skipped on last heartbeat
   // Note 1
   updateLocalityWaitTimes(currentTime);

   // Check for JT safe-mode
   if (taskTrackerManager.isInSafeMode()) {
     LOG.info("JobTracker is in safe-mode, not scheduling any tasks.");
     return null;
   }

   TaskTrackerStatus tts = tracker.getStatus();

   int mapsAssigned = 0; // loop counter for map in the below while loop
   int reducesAssigned = 0; // loop counter for reduce in the below while
   int mapCapacity = maxTasksToAssign(TaskType.MAP, tts);
   int reduceCapacity = maxTasksToAssign(TaskType.REDUCE, tts);
   boolean mapRejected = false; // flag used for ending the loop
   boolean reduceRejected = false; // flag used for ending the loop

   // Keep track of which jobs were visited for map tasks and which had tasks
   // launched, so that we can later mark skipped jobs for delay scheduling
   Set<JobInProgress> visitedForMap = new HashSet<JobInProgress>();
   Set<JobInProgress> visitedForReduce = new HashSet<JobInProgress>();
   Set<JobInProgress> launchedMap = new HashSet<JobInProgress>();

   ArrayList<Task> tasks = new ArrayList<Task>();

   // Scan jobs to assign tasks until neither maps nor reduces can be assigned
   // Note 2
   while (true) {
     // Computing the ending conditions for the loop
     // Reject a task type if one of the following condition happens
     // 1. number of assigned task reaches per heartbeat limit
     // 2. number of running tasks reaches runnable tasks
     // 3. task is rejected by the LoadManager.canAssign
     if (!mapRejected) {
       if (mapsAssigned == mapCapacity ||
           runningMaps == runnableMaps ||
           !loadMgr.canAssignMap(tts, runnableMaps,
               totalMapSlots, mapsAssigned)) {
         eventLog.log("INFO", "Can't assign another MAP to " + trackerName);
         mapRejected = true;
       }
     }
     if (!reduceRejected) {
       if (reducesAssigned == reduceCapacity ||
           runningReduces == runnableReduces ||
           !loadMgr.canAssignReduce(tts, runnableReduces,
               totalReduceSlots, reducesAssigned)) {
         eventLog.log("INFO", "Can't assign another REDUCE to " + trackerName);
         reduceRejected = true;
       }
     }
     // Exit while (true) loop if
     // 1. neither maps nor reduces can be assigned
     // 2. assignMultiple is off and we already assigned one task
     if (mapRejected && reduceRejected ||
         !assignMultiple && tasks.size() > 0) {
       break; // This is the only exit of the while (true) loop
     }

     // Determine which task type to assign this time
     // First try choosing a task type which is not rejected
     TaskType taskType;
     if (mapRejected) {
       taskType = TaskType.REDUCE;
     } else if (reduceRejected) {
       taskType = TaskType.MAP;
     } else {
       // If both types are available, choose the task type with fewer running
       // tasks on the task tracker to prevent that task type from starving
       if (tts.countMapTasks() + mapsAssigned <=
           tts.countReduceTasks() + reducesAssigned) {
         taskType = TaskType.MAP;
       } else {
         taskType = TaskType.REDUCE;
       }
     }

     // Get the map or reduce schedulables and sort them by fair sharing
     List<PoolSchedulable> scheds = getPoolSchedulables(taskType);
     // Sort the schedulables (jobs/pools) by the fair-share comparator
     Collections.sort(scheds, new SchedulingAlgorithms.FairShareComparator());
     boolean foundTask = false;
     // Note 3
     for (Schedulable sched: scheds) { // This loop will assign only one task
       eventLog.log("INFO", "Checking for " + taskType +
           " task in " + sched.getName());
       // Note 4
       Task task = taskType == TaskType.MAP ?
           sched.assignTask(tts, currentTime, visitedForMap) :
           sched.assignTask(tts, currentTime, visitedForReduce);
       if (task != null) {
         foundTask = true;
         JobInProgress job = taskTrackerManager.getJob(task.getJobID());
         eventLog.log("ASSIGN", trackerName, taskType,
             job.getJobID(), task.getTaskID());
         // Update running task counts, and the job's locality level
         if (taskType == TaskType.MAP) {
           launchedMap.add(job);
           mapsAssigned++;
           runningMaps++;
           // Note 5
           updateLastMapLocalityLevel(job, task, tts);
         } else {
           reducesAssigned++;
           runningReduces++;
         }
         // Add task to the list of assignments
         tasks.add(task);
         break; // This break makes this loop assign only one task
       } // end if (task != null)
     } // end for (Schedulable sched: scheds)

     // Reject the task type if we cannot find a task
     if (!foundTask) {
       if (taskType == TaskType.MAP) {
         mapRejected = true;
       } else {
         reduceRejected = true;
       }
     }
   } // end while (true)

   // Mark any jobs that were visited for map tasks but did not launch a task
   // as skipped on this heartbeat
   for (JobInProgress job: visitedForMap) {
     if (!launchedMap.contains(job)) {
       infos.get(job).skippedAtLastHeartbeat = true;
     }
   }

   // If no tasks were found, return null
   return tasks.isEmpty() ? null : tasks;
 }


  Note 1: updateLocalityWaitTimes(). For every job that was skipped at the last heartbeat, the time elapsed since that heartbeat is added to its timeWaitedForLocalMap, and its skippedAtLastHeartbeat flag is reset to false:

 /**
  * Update locality wait times for jobs that were skipped at last heartbeat.
  */
 private void updateLocalityWaitTimes(long currentTime) {
   long timeSinceLastHeartbeat =
       (lastHeartbeatTime == 0 ? 0 : currentTime - lastHeartbeatTime);
   lastHeartbeatTime = currentTime;
   for (JobInfo info: infos.values()) {
     if (info.skippedAtLastHeartbeat) {
       info.timeWaitedForLocalMap += timeSinceLastHeartbeat;
       info.skippedAtLastHeartbeat = false;
     }
   }
 }

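Putting this together with getAllowedLocalityLevel(): a job that keeps getting skipped accumulates wait time in heartbeat-interval increments, and crosses the NODE-to-RACK threshold once the total reaches nodeLocalityDelay. A toy simulation of that bookkeeping (the class and field names mirror the real fields, but this is a standalone sketch, not Hadoop code; the timestamps are made up):

```java
// Toy model of how timeWaitedForLocalMap accumulates across heartbeats.
public class WaitSim {
    long timeWaited = 0;      // corresponds to timeWaitedForLocalMap
    boolean skipped = false;  // corresponds to skippedAtLastHeartbeat
    long lastHeartbeat = 0;

    // Mirrors updateLocalityWaitTimes(): only a job skipped at the
    // previous heartbeat accumulates the elapsed time.
    void heartbeat(long now) {
        long delta = (lastHeartbeat == 0) ? 0 : now - lastHeartbeat;
        lastHeartbeat = now;
        if (skipped) {
            timeWaited += delta;
            skipped = false;
        }
    }
}
```

With 3000 ms between heartbeats and nodeLocalityDelay = 4500 ms, a job skipped on two consecutive heartbeats has waited 6000 ms and would be offered rack-local tasks on the next one.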

  Note 2: the while (true) loop keeps assigning MapTasks and ReduceTasks until neither type can be assigned. On each iteration the schedulables are sorted by the fair-share comparator, and the inner for loop then performs the actual assignment. (Schedulable has two subclasses, PoolSchedulable and JobSchedulable; for this discussion a Schedulable can be thought of as a job.)
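One detail worth isolating from the loop: when both task types are still assignable, the scheduler picks the type with fewer tasks currently on the tracker (counting this heartbeat's assignments too), so neither map nor reduce slots starve. A standalone sketch of that tie-break, with illustrative names of my own:

```java
// Sketch of the map-vs-reduce tie-break inside assignTasks():
// prefer the type with fewer tasks on this tracker; maps win ties.
public class TypePick {
    static String choose(int trackerMaps, int mapsAssigned,
                         int trackerReduces, int reducesAssigned) {
        return (trackerMaps + mapsAssigned <= trackerReduces + reducesAssigned)
            ? "MAP" : "REDUCE";
    }
}
```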

  Notes 3 and 4: inside the for loop, JobSchedulable.assignTask() is called to pick a suitable MapTask or ReduceTask. When picking a MapTask, it first calls FairScheduler.getAllowedLocalityLevel() to determine the most relaxed locality level the job may use (analyzed above), then requests a MapTask at or below that level. The assignTask() code:

 @Override
 public Task assignTask(TaskTrackerStatus tts, long currentTime,
     Collection<JobInProgress> visited) throws IOException {
   if (isRunnable()) {
     visited.add(job);
     TaskTrackerManager ttm = scheduler.taskTrackerManager;
     ClusterStatus clusterStatus = ttm.getClusterStatus();
     int numTaskTrackers = clusterStatus.getTaskTrackers();

     // check with the load manager whether it is safe to
     // launch this task on this taskTracker.
     LoadManager loadMgr = scheduler.getLoadManager();
     if (!loadMgr.canLaunchTask(tts, job, taskType)) {
       return null;
     }
     if (taskType == TaskType.MAP) {
       // Determine the locality level this job is allowed to use
       LocalityLevel localityLevel = scheduler.getAllowedLocalityLevel(
           job, currentTime);
       scheduler.getEventLog().log(
           "ALLOWED_LOC_LEVEL", job.getJobID(), localityLevel);
       switch (localityLevel) {
       case NODE:
         return job.obtainNewNodeLocalMapTask(tts, numTaskTrackers,
             ttm.getNumberOfUniqueHosts());
       case RACK:
         return job.obtainNewNodeOrRackLocalMapTask(tts, numTaskTrackers,
             ttm.getNumberOfUniqueHosts());
       default:
         return job.obtainNewMapTask(tts, numTaskTrackers,
             ttm.getNumberOfUniqueHosts());
       }
     } else {
       return job.obtainNewReduceTask(tts, numTaskTrackers,
           ttm.getNumberOfUniqueHosts());
     }
   } else {
     return null;
   }
 }


  As the switch shows, this method calls the JobInProgress method matching the allowed level (obtainNewNodeLocalMapTask, obtainNewNodeOrRackLocalMapTask, or obtainNewMapTask) to obtain a MapTask at that level.

  Note 5: finally, updateLastMapLocalityLevel() updates the job's bookkeeping: lastMapLocalityLevel is set to the level of the task just launched, and timeWaitedForLocalMap is reset to 0.

 /**
  * Update a job's locality level and locality wait variables given that
  * it has just launched a map task on a given task tracker.
  */
 private void updateLastMapLocalityLevel(JobInProgress job,
     Task mapTaskLaunched, TaskTrackerStatus tracker) {
   JobInfo info = infos.get(job);
   boolean isNodeGroupAware = conf.getBoolean(
       "net.topology.nodegroup.aware", false);
   LocalityLevel localityLevel = LocalityLevel.fromTask(
       job, mapTaskLaunched, tracker, isNodeGroupAware);
   info.lastMapLocalityLevel = localityLevel;
   info.timeWaitedForLocalMap = 0;
   eventLog.log("ASSIGNED_LOC_LEVEL", job.getJobID(), localityLevel);
 }


  This article is based on Hadoop 1.2.1. Corrections are welcome.

  References: 《Hadoop技术内幕:深入理解MapReduce架构设计与实现原理》 (Hadoop Internals: In-depth Study of MapReduce Architecture and Implementation), Dong Xicheng (董西成)

    https://issues.apache.org/jira/secure/attachment/12457515/fair_scheduler_design_doc.pdf

  Originally published at: http://www.cnblogs.com/gwgyk/p/4568270.html (please credit this source when reposting)
