HDFS Lease Recovery and Block Recovery

This post analyzes Lease Recovery and Block Recovery in HDFS.

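Once HDFS supports hflush, it must guarantee that hflushed data can be read. When a DataNode restarts, it can no longer simply discard the last block of a file; it has to preserve the data that was hflushed. In addition, to support append, a block that has already been finalized must be reopened so that data can be appended to it. Both requirements make crash recovery considerably harder. Before hflush/append were supported, HDFS only needed to delete all replicas of the last block of any unclosed file.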

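In the HDFS design, a lease ensures that a file can be written by only one client at any given time. Before writing or appending to a file, a client must request the lease for that file from the NameNode. While the client is writing, a background thread keeps renewing the lease, extending the period of exclusive write access. A lease actually has two limits: a soft limit, 60 seconds by default, and a hard limit, 1 hour by default. They differ as follows: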

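Before the lease soft limit expires, the client has exclusive access to the file, and no other client can take away its right to write the file exclusively.

After the lease soft limit expires, any client may reclaim the lease, obtain the lease on the file for itself, and thereby gain exclusive access to the file.

After the lease hard limit expires, the NameNode forcibly closes the file and revokes the lease.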
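The three cases above can be sketched as a small classification on the time elapsed since the last renewal. This is a toy model for illustration only, not the HDFS `Lease` class; the class name `LeaseLimitSketch`, the enum, and the method are hypothetical, and the limits are simply the defaults quoted above.

```java
import java.time.Duration;

/** Toy model of the two lease limits; NOT the real HDFS Lease class. */
final class LeaseLimitSketch {
    // Defaults quoted in the text: 60 s soft limit, 1 h hard limit.
    static final Duration SOFT_LIMIT = Duration.ofSeconds(60);
    static final Duration HARD_LIMIT = Duration.ofHours(1);

    /** Hypothetical names for the three situations described above. */
    enum LeaseState { EXCLUSIVE, RECOVERABLE_BY_OTHERS, FORCE_CLOSED_BY_NAMENODE }

    /** Classifies a lease by the time elapsed since its last renewal. */
    static LeaseState classify(Duration sinceLastRenewal) {
        if (sinceLastRenewal.compareTo(HARD_LIMIT) > 0) {
            return LeaseState.FORCE_CLOSED_BY_NAMENODE;
        }
        if (sinceLastRenewal.compareTo(SOFT_LIMIT) > 0) {
            return LeaseState.RECOVERABLE_BY_OTHERS;
        }
        return LeaseState.EXCLUSIVE;
    }

    public static void main(String[] args) {
        System.out.println(classify(Duration.ofSeconds(10))); // within the soft limit
        System.out.println(classify(Duration.ofMinutes(5)));  // soft limit expired
        System.out.println(classify(Duration.ofHours(2)));    // hard limit expired
    }
}
```

A client that renews regularly always stays in the first state, which is why a crash, where renewal stops, is what eventually lets another client take over.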

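Consider a client that crashes while writing a file. Until the lease soft limit expires, no other client can write the file. After the soft limit expires, another client may write it. Before that write proceeds, there is first a check of whether the file is still unclosed; if it is, lease recovery and block recovery begin. The goal of this phase is to bring all replicas of the file's last block to a consistent state: the client writes the replicas of a block through a pipeline, so it is normal for the replicas in the pipeline to differ.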

This post considers the scenario in which a client crashes while writing, and another client subsequently performs an append on the same file.

A client appends to a file with code like the following:

    FileSystem fs = FileSystem.get(configuration);
    FSDataOutputStream out = fs.append(path);
    out.write(data);  // data is a byte[] holding the bytes to append
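When the previous writer has crashed, such an append may be rejected while recovery is still running; the NameNode code quoted later in this post throws a `RecoveryInProgressException` whose message says "Try again later". A minimal retry sketch follows. It deliberately avoids the Hadoop classes so that it stays self-contained: `AppendAttempt`, `RecoveryInProgress`, and `appendWithRetry` are hypothetical stand-ins, not Hadoop APIs.

```java
/** Sketch of a client-side retry loop; all names here are hypothetical. */
final class AppendRetrySketch {
    /** Stand-in for the "recovery in progress" failure reported by the NameNode. */
    static final class RecoveryInProgress extends Exception {
        RecoveryInProgress(String message) { super(message); }
    }

    /** Stand-in for "try to open the file for append". */
    interface AppendAttempt<T> {
        T open() throws RecoveryInProgress;
    }

    /** Retries the attempt up to maxAttempts times, sleeping between failures. */
    static <T> T appendWithRetry(AppendAttempt<T> attempt, int maxAttempts, long sleepMillis)
            throws RecoveryInProgress, InterruptedException {
        if (maxAttempts <= 0) {
            throw new IllegalArgumentException("maxAttempts must be positive");
        }
        RecoveryInProgress last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return attempt.open();
            } catch (RecoveryInProgress e) {
                last = e;
                if (i + 1 < maxAttempts) {
                    Thread.sleep(sleepMillis); // give recovery time to finish
                }
            }
        }
        throw last; // every attempt failed
    }
}
```

In a real client the attempt would wrap the `fs.append(path)` call shown above; how long to wait between attempts is a tuning choice that this sketch leaves to the caller.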

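On the NameNode side, the main logic of append lives in the appendFileInternal function of FSNamesystem, which internally calls

    // Opening an existing file for write - may need to recover lease.
    recoverLeaseInternal(myFile, src, holder, clientMachine, false);

to check whether lease recovery must first be performed on the file. Let's look closely at this function.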

    private void recoverLeaseInternal(INodeFile fileInode,
        String src, String holder, String clientMachine, boolean force)
        throws IOException {
      // holder is the name of the client performing this append
      assert hasWriteLock();
      if (fileInode != null && fileInode.isUnderConstruction()) {
        //
        // If the file is under construction , then it must be in our
        // leases. Find the appropriate lease record.
        //
        Lease lease = leaseManager.getLease(holder);
        //
        // We found the lease for this file. And surprisingly the original
        // holder is trying to recreate this file. This should never occur.
        //
        if (!force && lease != null) {
          Lease leaseFile = leaseManager.getLeaseByPath(src);
          if ((leaseFile != null && leaseFile.equals(lease)) ||
              lease.getHolder().equals(holder)) {
            throw new AlreadyBeingCreatedException(
              "failed to create file " + src + " for " + holder +
              " for client " + clientMachine +
              " because current leaseholder is trying to recreate file.");
          }
        }
        //
        // Find the original holder.
        //
        FileUnderConstructionFeature uc = fileInode.getFileUnderConstructionFeature();
        String clientName = uc.getClientName();
        lease = leaseManager.getLease(clientName);
        if (lease == null) {
          throw new AlreadyBeingCreatedException(
            "failed to create file " + src + " for " + holder +
            " for client " + clientMachine +
            " because pendingCreates is non-null but no leases found.");
        }
        if (force) {
          // close now: no need to wait for soft lease expiration and
          // close only the file src
          LOG.info("recoverLease: " + lease + ", src=" + src +
            " from client " + clientName);
          internalReleaseLease(lease, src, holder);
        } else {
          assert lease.getHolder().equals(clientName) :
            "Current lease holder " + lease.getHolder() +
            " does not match file creator " + clientName;
          //
          // If the original holder has not renewed in the last SOFTLIMIT
          // period, then start lease recovery.
          //
          if (lease.expiredSoftLimit()) {
            LOG.info("startFile: recover " + lease + ", src=" + src + " client "
                + clientName);
            boolean isClosed = internalReleaseLease(lease, src, null);
            if(!isClosed)
              throw new RecoveryInProgressException(
                  "Failed to close file " + src +
                  ". Lease recovery is in progress. Try again later.");
          } else {
            final BlockInfo lastBlock = fileInode.getLastBlock();
            if (lastBlock != null
                && lastBlock.getBlockUCState() == BlockUCState.UNDER_RECOVERY) {
              throw new RecoveryInProgressException("Recovery in progress, file ["
                  + src + "], " + "lease owner [" + lease.getHolder() + "]");
            } else {
              throw new AlreadyBeingCreatedException("Failed to create file ["
                  + src + "] for [" + holder + "] for client [" + clientMachine
                  + "], because this file is already being created by ["
                  + clientName + "] on ["
                  + uc.getClientMachine() + "]");
            }
          }
        }
      }
    }

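1. Inspect the file's INode to determine the file's state. If it is under construction, the file has not been closed, and it most likely needs to go through lease recovery and block recovery so that the replicas of its last block become consistent.
2. Look up in the lease manager the lease held by the client name (holder is the name of the client invoking this append). If it is non-null, that client still holds a lease, so the next check is whether this lease covers the file being appended. If it does, the current client still holds the lease on this file and the append fails, because append requires the file to be closed. If the lease does not cover this file, the client does not currently hold the file's lease and execution continues. (Note that the quoted condition also contains lease.getHolder().equals(holder), which appears to hold for any lease looked up by holder, so in this version the exception seems to be thrown whenever holder already owns any lease and force is false.)
3. From the INode, find the previous lease holder of this file, which in our scenario is the crashed client, and then look up that client's lease in the lease manager. Check whether this lease has passed its soft limit; if it has, call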

boolean isClosed = internalReleaseLease(lease, src, null);

This function checks whether block recovery, which requires the DataNodes to participate, is really needed. Its main logic is as follows.

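3.1. If all blocks of the file are in the completed state, no block recovery is needed and the file is closed: the file's lease is removed from the lease manager, the INode is marked complete, and finally a close-file record is written to the edit log.

3.2. If the last block is in the committed state, look at the last two blocks of the file. If both the second-to-last block and the last block satisfy the minimum replication requirement (1 by default), the file is closed. Otherwise, an exception is thrown to the client.

3.3. If the last block is under construction or under recovery and no DataNode has reported a replica of it, most likely the client crashed before the pipeline was set up. In this case it is enough to remove the last block from the INode and close the file.

3.4. Otherwise, enter the block recovery phase.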
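The four cases 3.1 to 3.4 can be condensed into a single decision function. The sketch below only mirrors the description in this post; it is not the body of internalReleaseLease, and the class `ReleaseLeaseSketch`, its enums, and its parameters are hypothetical names chosen for illustration.

```java
/** Decision sketch for cases 3.1-3.4; NOT the real internalReleaseLease. */
final class ReleaseLeaseSketch {
    enum BlockState { COMPLETE, COMMITTED, UNDER_CONSTRUCTION, UNDER_RECOVERY }

    enum Action {
        CLOSE_FILE,                  // 3.1, and 3.2 when replication is sufficient
        FAIL_NOT_MIN_REPLICATED,     // 3.2 otherwise: exception back to the client
        REMOVE_LAST_BLOCK_AND_CLOSE, // 3.3
        START_BLOCK_RECOVERY         // 3.4
    }

    static Action decide(boolean allBlocksComplete,
                         BlockState lastBlock,
                         boolean lastTwoBlocksMinReplicated,
                         int reportedReplicasOfLastBlock) {
        if (allBlocksComplete) {
            return Action.CLOSE_FILE;
        }
        switch (lastBlock) {
            case COMMITTED:
                return lastTwoBlocksMinReplicated
                        ? Action.CLOSE_FILE
                        : Action.FAIL_NOT_MIN_REPLICATED;
            case UNDER_CONSTRUCTION:
            case UNDER_RECOVERY:
                return reportedReplicasOfLastBlock == 0
                        ? Action.REMOVE_LAST_BLOCK_AND_CLOSE
                        : Action.START_BLOCK_RECOVERY;
            default:
                // A COMPLETE last block with some earlier block incomplete
                // is not covered by the description above.
                throw new IllegalStateException("unexpected last block state: " + lastBlock);
        }
    }

    public static void main(String[] args) {
        // Crashed writer whose pipeline had already reported replicas.
        System.out.println(decide(false, BlockState.UNDER_CONSTRUCTION, false, 3));
    }
}
```

Only the last outcome involves the DataNodes; the other three are resolved entirely inside the NameNode.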

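A block recovery ID is allocated for this block recovery to identify it. The block recovery ID is in fact a newly allocated generation stamp.

The block's state is set to under recovery, and the DataNode holding one of the block's replicas is chosen as the primary DataNode. The block is then put into that DataNode's recoverBlocks list. Later, when the NameNode handles that DataNode's periodic heartbeat, it sends all of the DataNode's recoverBlocks back in the heartbeat response, in the form of a BlockRecoveryCommand. The code: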

DatanodeManager::handleHeartbeat

    //check lease recovery
    BlockInfoUnderConstruction[] blocks = nodeinfo
        .getLeaseRecoveryCommand(Integer.MAX_VALUE);
    if (blocks != null) {
      BlockRecoveryCommand brCommand = new BlockRecoveryCommand(
          blocks.length);
      for (BlockInfoUnderConstruction b : blocks) {
        final DatanodeStorageInfo[] storages = b.getExpectedStorageLocations();
        // Skip stale nodes during recovery - not heart beated for some time (30s by default).
        final List<DatanodeStorageInfo> recoveryLocations =
            new ArrayList<DatanodeStorageInfo>(storages.length);
        for (int i = 0; i < storages.length; i++) {
          if (!storages[i].getDatanodeDescriptor().isStale(staleInterval)) {
            recoveryLocations.add(storages[i]);
          }
        }
        // If we only get 1 replica after eliminating stale nodes, then choose all
        // replicas for recovery and let the primary data node handle failures.
        if (recoveryLocations.size() > 1) {
          if (recoveryLocations.size() != storages.length) {
            LOG.info("Skipped stale nodes for recovery : " +
                (storages.length - recoveryLocations.size()));
          }
          brCommand.add(new RecoveringBlock(
              new ExtendedBlock(blockPoolId, b),
              DatanodeStorageInfo.toDatanodeInfos(recoveryLocations),
              b.getBlockRecoveryId()));
        } else {
          // If too many replicas are stale, then choose all replicas to participate
          // in block recovery.
          brCommand.add(new RecoveringBlock(
              new ExtendedBlock(blockPoolId, b),
              DatanodeStorageInfo.toDatanodeInfos(storages),
              b.getBlockRecoveryId()));
        }
      }
      return new DatanodeCommand[] { brCommand };
    }
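The selection rule in the code above (skip stale nodes, but fall back to all expected locations when at most one fresh node remains) can be isolated in a few lines. This is a standalone sketch with hypothetical types; `Node` and `chooseRecoveryLocations` are illustrative names rather than Hadoop classes, and staleness is modeled simply as milliseconds since the last heartbeat.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the stale-node filtering rule; all types here are hypothetical. */
final class RecoveryLocationSketch {
    /** A replica holder: a name plus the time since its last heartbeat. */
    static final class Node {
        final String name;
        final long millisSinceHeartbeat;

        Node(String name, long millisSinceHeartbeat) {
            this.name = name;
            this.millisSinceHeartbeat = millisSinceHeartbeat;
        }
    }

    /**
     * Returns the non-stale nodes if more than one remains;
     * otherwise returns every expected node.
     */
    static List<Node> chooseRecoveryLocations(List<Node> expected, long staleIntervalMillis) {
        List<Node> fresh = new ArrayList<>(expected.size());
        for (Node n : expected) {
            if (n.millisSinceHeartbeat <= staleIntervalMillis) {
                fresh.add(n);
            }
        }
        return fresh.size() > 1 ? fresh : new ArrayList<>(expected);
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("dn1", 3_000L));
        nodes.add(new Node("dn2", 5_000L));
        nodes.add(new Node("dn3", 45_000L)); // stale under a 30 s interval
        System.out.println(chooseRecoveryLocations(nodes, 30_000L).size());
    }
}
```

The fallback trades precision for progress: with too few fresh nodes, the NameNode hands over the full list and leaves failure handling to the primary DataNode, as the comment in the original code says.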

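Now let's look at the DataNode side.

On the DataNode side, BPServiceActor processes heartbeat responses: in offerService(), it takes every DatanodeCommand out of the response and processes it. processCommandFromActive checks the command type; DNA_RECOVERBLOCK means a block recovery command, which is handled by calling DataNode's recoverBlocks.

    case DatanodeProtocol.DNA_RECOVERBLOCK:
      String who = "NameNode at " + actor.getNNSocketAddress();
      dn.recoverBlocks(who, ((BlockRecoveryCommand)cmd).getRecoveringBlocks());
      break;

dn.recoverBlocks starts a background thread dedicated to this work. For each block to be recovered:

It takes the DataNodes holding the block's replicas from the block and connects to the DataNodes holding the other two replicas (assuming three replicas); the inter-DataNode interface is defined in InterDatanodeProtocol. It then calls initReplicaRecovery(rBlock) on every DataNode, including itself. Each DataNode eventually calls FsDatasetImpl's initReplicaRecovery method to initialize the replica to be recovered on that DataNode. Let's look at this function: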

    static ReplicaRecoveryInfo initReplicaRecovery(String bpid, ReplicaMap map,
        Block block, long recoveryId, long xceiverStopTimeout) throws IOException {
      final ReplicaInfo replica = map.get(bpid, block.getBlockId());
      LOG.info("initReplicaRecovery: " + block + ", recoveryId=" + recoveryId
          + ", replica=" + replica);

      //check replica
      if (replica == null) {
        return null;
      }

      //stop writer if there is any
      if (replica instanceof ReplicaInPipeline) {
        final ReplicaInPipeline rip = (ReplicaInPipeline)replica;
        rip.stopWriter(xceiverStopTimeout);

        //check replica bytes on disk.
        if (rip.getBytesOnDisk() < rip.getVisibleLength()) {
          throw new IOException("THIS IS NOT SUPPOSED TO HAPPEN:"
              + " getBytesOnDisk() < getVisibleLength(), rip=" + rip);
        }

        //check the replica's files
        checkReplicaFiles(rip);
      }

      //check generation stamp
      if (replica.getGenerationStamp() < block.getGenerationStamp()) {
        throw new IOException(
            "replica.getGenerationStamp() < block.getGenerationStamp(), block="
            + block + ", replica=" + replica);
      }

      //check recovery id
      if (replica.getGenerationStamp() >= recoveryId) {
        throw new IOException("THIS IS NOT SUPPOSED TO HAPPEN:"
            + " replica.getGenerationStamp() >= recoveryId = " + recoveryId
            + ", block=" + block + ", replica=" + replica);
      }

      //check RUR
      final ReplicaUnderRecovery rur;
      if (replica.getState() == ReplicaState.RUR) {
        rur = (ReplicaUnderRecovery)replica;
        if (rur.getRecoveryID() >= recoveryId) {
          throw new RecoveryInProgressException(
              "rur.getRecoveryID() >= recoveryId = " + recoveryId
              + ", block=" + block + ", rur=" + rur);
        }
        final long oldRecoveryID = rur.getRecoveryID();
        rur.setRecoveryID(recoveryId);
        LOG.info("initReplicaRecovery: update recovery id for " + block
            + " from " + oldRecoveryID + " to " + recoveryId);
      }
      else {
        rur = new ReplicaUnderRecovery(replica, recoveryId);
        map.add(bpid, rur);
        LOG.info("initReplicaRecovery: changing replica state for "
            + block + " from " + replica.getState()
            + " to " + rur.getState());
      }
      return rur.createInfo();
    }

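First, the function checks the replica's state. If the replica is currently being written, it calls the replica's stopWriter to stop the writer thread. The thread is stopped by interrupting it (when a DataNode creates a replica for a pipeline write, it stores a handle to the current writer thread in the replica). This shows that block recovery has very high priority. Next come some sanity checks, such as whether the replica's block file and meta file exist on disk. Then the generation stamps are checked: the generation stamp recorded by the NameNode must not be greater than the replica's actual one, and the recovery ID must be strictly greater than the replica's generation stamp. Finally, a ReplicaUnderRecovery is created and put into the replica map. There is one more check here: if the replica is already under recovery, the recovery ID of the current block recovery is compared with the existing one. If the new ID is larger, the new recovery preempts the old one; otherwise a RecoveryInProgressException is thrown.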
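The ordering checks described above can be isolated from the I/O and state handling. The following is a minimal sketch of just the comparisons, under the assumption that generation stamps and recovery IDs are plain `long` values; `RecoveryIdCheckSketch` and `admitRecovery` are hypothetical names, and the unchecked exceptions stand in for the IOException and RecoveryInProgressException used by the real code.

```java
import java.util.OptionalLong;

/** Sketch of the ordering checks in initReplicaRecovery; names are hypothetical. */
final class RecoveryIdCheckSketch {
    /**
     * Returns the recovery id the replica is now registered under,
     * or throws if one of the ordering checks fails.
     *
     * @param currentRecoveryId present only if the replica is already under recovery
     */
    static long admitRecovery(long replicaGenStamp, long blockGenStamp,
                              OptionalLong currentRecoveryId, long newRecoveryId) {
        // The NameNode's generation stamp must not be ahead of the replica's.
        if (replicaGenStamp < blockGenStamp) {
            throw new IllegalStateException("replica generation stamp is older than the block's");
        }
        // The recovery id is a fresh generation stamp, so it must be strictly larger.
        if (replicaGenStamp >= newRecoveryId) {
            throw new IllegalStateException("recovery id must exceed the replica's generation stamp");
        }
        // An ongoing recovery is only preempted by a strictly newer recovery id.
        if (currentRecoveryId.isPresent() && currentRecoveryId.getAsLong() >= newRecoveryId) {
            throw new IllegalStateException("a recovery with an equal or newer id is in progress");
        }
        return newRecoveryId;
    }

    public static void main(String[] args) {
        // A replica already under recovery 1003 is preempted by recovery 1005.
        System.out.println(admitRecovery(1001L, 1001L, OptionalLong.of(1003L), 1005L));
    }
}
```

Because recovery IDs come from the generation stamp sequence, "larger" means "started later", so a later recovery attempt always wins over an earlier one that stalled.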

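Next, the information about the three replicas (including each replica's state before recovery) is added to a list, and sync begins. Sync makes choices based on the original states of the replicas. The rules are as follows; this is the case for two replicas: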

References

hadoop-hdfs-2.4.1.jar

Append/Hflush/Read Design
