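Notes on an unexplained Hadoop YARN ResourceManager failover

One day an alert fired: the ResourceManager (RM) on the production cluster had failed over. The ResourceManager log showed the following error:

That line alone was not conclusive, so we looked at the other errors in the log.

A NullPointerException was thrown while the scheduler was handling the APP_ATTEMPT_REMOVED event.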

    break;
    case APP_ATTEMPT_REMOVED:
      if (!(event instanceof AppAttemptRemovedSchedulerEvent)) {
        throw new RuntimeException("Unexpected event type: " + event);
      }
      AppAttemptRemovedSchedulerEvent appAttemptRemovedEvent =
          (AppAttemptRemovedSchedulerEvent) event;
      removeApplicationAttempt(
          appAttemptRemovedEvent.getApplicationAttemptID(),
          appAttemptRemovedEvent.getFinalAttemptState(),
          appAttemptRemovedEvent.getKeepContainersAcrossAppAttempts());
      break;
    case CONTAINER_EXPIRED:

We traced the problem to the method below. The block marked as added is the new code.

  private void removeApplicationAttempt(
      ApplicationAttemptId applicationAttemptId,
      RMAppAttemptState rmAppAttemptFinalState, boolean keepContainers) {
    LOG.info("Application " + applicationAttemptId + " is done." +
        " finalState=" + rmAppAttemptFinalState);
    try {
      writeLock.lock();
      SchedulerApplication<FSAppAttempt> application =
        applications.get(applicationAttemptId.getApplicationId());
      FSAppAttempt attempt = getSchedulerApp(applicationAttemptId);
      if (attempt == null || application == null) {
        LOG.info("Unknown application " + applicationAttemptId + " has completed!");
        return;
      }
      // Added: an attempt that has already been stopped must not be stopped again.
      // Check if the attempt is already stopped and don't stop it twice.
      if (attempt.isStopped()) {
        LOG.info("Application " + applicationAttemptId + " has already been "
                + "stopped!");
        return;
      }
      // Release all the running containers
      for (RMContainer rmContainer : attempt.getLiveContainers()) {
        if (keepContainers
          && rmContainer.getState().equals(RMContainerState.RUNNING)) {
          // do not kill the running container in the case of work-preserving AM
          // restart.
          LOG.info("Skip killing " + rmContainer.getContainerId());
          continue;
        }
        super.completedContainer(rmContainer,
          SchedulerUtils.createAbnormalContainerStatus(
            rmContainer.getContainerId(),
            SchedulerUtils.COMPLETED_APPLICATION),
          RMContainerEventType.KILL);
      }
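The effect of the added `isStopped()` check can be shown with a small standalone sketch. It is not Hadoop code: `StopOnceSketch`, `Attempt`, `LeafQueue`, and `removeAttempt` are hypothetical stand-ins, and the sketch assumes the failure comes from the same attempt being removed twice. The first removal does the work, and the second one returns early instead of touching a queue that no longer contains the attempt.

```java
import java.util.HashSet;
import java.util.Set;

public class StopOnceSketch {

    /** Hypothetical stand-in for FSAppAttempt: it only tracks a stopped flag. */
    static final class Attempt {
        private final String id;
        private boolean stopped;

        Attempt(String id) {
            this.id = id;
        }

        boolean isStopped() {
            return stopped;
        }

        void stop() {
            stopped = true;
        }
    }

    /** Hypothetical stand-in for a leaf queue that rejects unknown attempts. */
    static final class LeafQueue {
        private final Set<String> apps = new HashSet<>();

        void add(Attempt attempt) {
            apps.add(attempt.id);
        }

        void remove(Attempt attempt) {
            if (!apps.remove(attempt.id)) {
                throw new IllegalStateException(
                    "Attempt " + attempt.id + " is not in the queue");
            }
        }
    }

    /**
     * Removes the attempt at most once.
     * Returns true if this call did the removal, false if it was skipped.
     */
    static boolean removeAttempt(LeafQueue queue, Attempt attempt) {
        if (attempt == null) {
            return false; // unknown attempt, nothing to do
        }
        if (attempt.isStopped()) {
            return false; // already stopped: do not remove it a second time
        }
        queue.remove(attempt);
        attempt.stop();
        return true;
    }

    public static void main(String[] args) {
        LeafQueue queue = new LeafQueue();
        Attempt attempt = new Attempt("attempt_1");
        queue.add(attempt);

        System.out.println(removeAttempt(queue, attempt)); // first removal
        System.out.println(removeAttempt(queue, attempt)); // duplicate event, skipped
    }
}
```

Without the `isStopped()` guard, the second call in this sketch would reach `queue.remove` and throw, which is the kind of unhandled scheduler error the guard is meant to prevent.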

Then add the following code:

  @Override
  public String moveApplication(ApplicationId appId,
      String queueName) throws YarnException {
    try {
      writeLock.lock();
      SchedulerApplication<FSAppAttempt> app = applications.get(appId);
      if (app == null) {
        throw new YarnException("App to be moved " + appId + " not found.");
      }
      FSAppAttempt attempt = (FSAppAttempt) app.getCurrentAppAttempt();
      // To serialize with FairScheduler#allocate, synchronize on app attempt
      try {
        attempt.getWriteLock().lock();
        FSLeafQueue oldQueue = (FSLeafQueue) app.getQueue();
        // Check if the attempt is already stopped: don't move stopped app
        // attempt. The attempt has already been removed from all queues.
        if (attempt.isStopped()) {
          LOG.info("Application " + appId + " is stopped and can't be moved!");
          throw new YarnException("Application " + appId
              + " is stopped and can't be moved!");
        }
        String destQueueName = handleMoveToPlanQueue(queueName);
        FSLeafQueue targetQueue = queueMgr.getLeafQueue(destQueueName, false);
        if (targetQueue == null) {
          throw new YarnException("Target queue " + queueName
            + " not found or is not a leaf queue.");
        }
        if (targetQueue == oldQueue) {
          return oldQueue.getQueueName();
        }

  private void executeMove(SchedulerApplication<FSAppAttempt> app,
      FSAppAttempt attempt, FSLeafQueue oldQueue, FSLeafQueue newQueue) {
    // Check current runs state. Do not remove the attempt from the queue until
    // after the check has been performed otherwise it could remove the app
    // from a queue without moving it to a new queue.
    boolean wasRunnable = oldQueue.isRunnableApp(attempt);
    // if app was not runnable before, it may be runnable now
    boolean nowRunnable = maxRunningEnforcer.canAppBeRunnable(newQueue,
        attempt.getUser());
    if (wasRunnable && !nowRunnable) {
      throw new IllegalStateException("Should have already verified that app "
          + attempt.getApplicationId() + " would be runnable in new queue");
    }
    // Now it is safe to remove from the queue.
    oldQueue.removeApp(attempt);
    if (wasRunnable) {
      maxRunningEnforcer.untrackRunnableApp(attempt);
    } else if (nowRunnable) {
      // App has changed from non-runnable to runnable
      maxRunningEnforcer.untrackNonRunnableApp(attempt);
    }
    attempt.move(newQueue); // This updates all the metrics
    app.setQueue(newQueue);
    newQueue.addApp(attempt, nowRunnable);
    if (nowRunnable) {
      maxRunningEnforcer.trackRunnableApp(attempt);
    }
    if (wasRunnable) {
      maxRunningEnforcer.updateRunnabilityOnAppRemoval(attempt, oldQueue);
    }
  }
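The move-side guard can be sketched the same way. This is a simplified model, not the FairScheduler implementation: `MoveGuardSketch`, `MoveException`, `Attempt`, and `move` are hypothetical names, and the attempt's own write lock stands in for `attempt.getWriteLock()`. The point it illustrates is that the stopped check and the queue change happen under the same lock, so a move can never slip in after the attempt has been stopped.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class MoveGuardSketch {

    /** Hypothetical checked exception standing in for YarnException. */
    static final class MoveException extends Exception {
        MoveException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for an app attempt with its own write lock. */
    static final class Attempt {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private boolean stopped;
        private String queueName;

        Attempt(String queueName) {
            this.queueName = queueName;
        }

        /** Marks the attempt as stopped and detaches it from its queue. */
        void stop() {
            lock.writeLock().lock();
            try {
                stopped = true;
                queueName = null;
            } finally {
                lock.writeLock().unlock();
            }
        }

        String getQueueName() {
            lock.readLock().lock();
            try {
                return queueName;
            } finally {
                lock.readLock().unlock();
            }
        }
    }

    /** Moves the attempt to targetQueue, refusing attempts that are stopped. */
    static String move(Attempt attempt, String targetQueue) throws MoveException {
        attempt.lock.writeLock().lock();
        try {
            // Check and update under the same lock so stop() cannot interleave.
            if (attempt.stopped) {
                throw new MoveException("Application is stopped and can't be moved!");
            }
            attempt.queueName = targetQueue;
            return targetQueue;
        } finally {
            attempt.lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        Attempt attempt = new Attempt("root.default");
        try {
            System.out.println("moved to " + move(attempt, "root.batch"));
            attempt.stop();
            move(attempt, "root.adhoc"); // rejected: the attempt is stopped
        } catch (MoveException e) {
            System.out.println("move rejected: " + e.getMessage());
        }
    }
}
```

Rejecting the move with an exception, rather than silently ignoring it, lets the caller see that the application had already finished.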

With these changes in place, the problem was resolved.

Reference: https://issues.apache.org/jira/secure/attachment/12841441/YARN-5136.2.patch
