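Studying the akka cluster sharding source code (2/5): hand off

Once the shard coordinator (roughly the ZooKeeper of this distributed system) has started, it starts a timer that periodically tries to balance the load across the cluster nodes. It does so by moving actors from heavily loaded nodes to lightly loaded ones. I had misunderstood this point before: I thought the shardRegion was the smallest unit of movement, whereas what actually gets moved is the shard.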

val rebalanceTask = context.system.scheduler.schedule(rebalanceInterval, rebalanceInterval, self, RebalanceTick)

When the coordinator receives RebalanceTick, it starts trying to balance the system load:

case RebalanceTick ⇒
  if (persistentState.regions.nonEmpty) {
    val shardsFuture = allocationStrategy.rebalance(persistentState.regions, rebalanceInProgress)
    shardsFuture.value match {
      case Some(Success(shards)) ⇒
        continueRebalance(shards)
      case _ ⇒
        // continue when future is completed
        shardsFuture.map { shards ⇒ RebalanceResult(shards)
        }.recover {
          case _ ⇒ RebalanceResult(Set.empty)
        }.pipeTo(self)
    }
  }

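I follow the logic above, but the way the Future is used puzzled me. My first thought was that once shardsFuture has returned a Failure, the code should simply send RebalanceResult(Set.empty) to self, so I did not see why it still waits on the Future after a failure. The explanation is that shardsFuture.value is an Option[Try[...]], so `case _` matches not only Some(Failure(...)) but also None, i.e. a Future that has not completed yet. The first branch is a shortcut for a Future that has already succeeded; everything else goes through map/recover/pipeTo, and for a Future that has already failed, recover produces RebalanceResult(Set.empty) without any real waiting.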
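To make the three possible shapes of `Future.value` concrete, here is a minimal sketch using only the Scala standard library. The object name `FutureValueSketch`, the `handle` helper and the use of `String` as a shard id are assumptions made for illustration; they are not part of akka.

```scala
import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Success

object FutureValueSketch {
  // Mirrors the shape of the RebalanceTick handler (hypothetical helper):
  // Left  = result taken synchronously from an already successful future,
  // Right = a future that falls back to an empty set on failure.
  def handle(shardsFuture: Future[Set[String]]): Either[Set[String], Future[Set[String]]] =
    shardsFuture.value match {
      case Some(Success(shards)) => Left(shards) // already completed successfully
      case _                     =>              // failed, or not completed yet (None)
        Right(shardsFuture.recover { case _ => Set.empty[String] })
    }

  val done: Future[Set[String]]     = Future.successful(Set("shard-1"))          // value is Some(Success(...))
  val failed: Future[Set[String]]   = Future.failed(new RuntimeException("boom")) // value is Some(Failure(...))
  val pending: Promise[Set[String]] = Promise[Set[String]]()                     // future.value is None
}
```

Only the `done` case takes the synchronous branch; both `failed` and `pending.future` end up in the `case _` branch, which is why that branch needs `recover`.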

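allocationStrategy has a default implementation, and you can also plug in your own load-balancing strategy. The rebalance function returns a Future[Set[ShardId]], i.e. the shards that should be moved.

When the coordinator receives RebalanceResult, it starts the balancing logic: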
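As a rough idea of what a custom strategy might decide, here is a simplified, synchronous sketch. It is not akka's default strategy: the real rebalance takes ActorRef keys and returns a Future, while this version uses `String` stand-ins, and the `threshold` parameter and the object name `RebalanceSketch` are assumptions.

```scala
object RebalanceSketch {
  type Region  = String // stand-in for the ActorRef of a ShardRegion
  type ShardId = String

  // Hypothetical rule: if the busiest region holds at least `threshold` more
  // movable shards than the idlest one, pick one shard from the busiest region.
  // Shards that are already being moved are not considered.
  def rebalance(
      allocations: Map[Region, Vector[ShardId]],
      inProgress: Set[ShardId],
      threshold: Int
  ): Set[ShardId] =
    if (allocations.isEmpty) Set.empty
    else {
      val movable   = allocations.map { case (r, shards) => r -> shards.filterNot(inProgress) }
      val (_, most) = movable.maxBy(_._2.size)
      val least     = movable.values.map(_.size).min
      if (most.size - least >= threshold) most.headOption.toSet else Set.empty
    }
}
```

For example, with three shards on one region, one shard on another and a threshold of 2, this sketch proposes moving a single shard away from the busier region.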

def continueRebalance(shards: Set[ShardId]): Unit =
  shards.foreach { shard ⇒
    if (!rebalanceInProgress(shard)) {
      persistentState.shards.get(shard) match {
        case Some(rebalanceFromRegion) ⇒
          rebalanceInProgress += shard
          log.debug("Rebalance shard [{}] from [{}]", shard, rebalanceFromRegion)
          context.actorOf(rebalanceWorkerProps(shard, rebalanceFromRegion, handOffTimeout,
            persistentState.regions.keySet ++ persistentState.regionProxies)
            .withDispatcher(context.props.dispatcher))
        case None ⇒
          log.debug("Rebalance of non-existing shard [{}] is ignored", shard)
      }
    }
  }

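rebalanceInProgress is a Set that records the shards currently being moved. As far as I can tell, rebalanceInProgress can be non-empty at the start of a new balancing round only if the previous round has not finished yet. I am not sure whether it would be better to report an error or to keep balancing in that case. The code keeps going and simply skips any shard that is still in progress. My first guess was that the allocation strategy would not consider an unfinished previous round at all, but rebalanceInProgress is passed to allocationStrategy.rebalance, so a strategy at least has the information it needs to take that into account.

Next, the coordinator starts a rebalanceWorker, which is the stand-in actor mentioned in the previous post.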
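The guard in continueRebalance can be modelled as a pure function. This is only a sketch: `String` stands in for both ShardId and the region's ActorRef, and instead of spawning a worker it returns the (shard, region) pairs for which a hand off would be started. The object name is hypothetical.

```scala
object ContinueRebalanceSketch {
  type ShardId = String
  type Region  = String // stand-in for the ActorRef of the hosting region

  // Returns the hand offs that would be started and the updated in-progress set.
  def continueRebalance(
      candidates: Set[ShardId],
      shardHome: Map[ShardId, Region],
      inProgress: Set[ShardId]
  ): (Set[(ShardId, Region)], Set[ShardId]) = {
    val started = for {
      shard <- candidates
      if !inProgress(shard)         // skip shards that are still being moved
      from  <- shardHome.get(shard) // unknown shards are ignored
    } yield shard -> from
    (started, inProgress ++ started.map(_._1))
  }
}
```

A shard left over from an unfinished round is therefore neither an error nor restarted; it is just skipped until its worker reports back.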

private[akka] class RebalanceWorker(shard: String, from: ActorRef, handOffTimeout: FiniteDuration,
                                    regions: Set[ActorRef]) extends Actor {
  import Internal._
  regions.foreach(_ ! BeginHandOff(shard))
  var remaining = regions

  import context.dispatcher
  context.system.scheduler.scheduleOnce(handOffTimeout, self, ReceiveTimeout)

  def receive = {
    case BeginHandOffAck(`shard`) ⇒
      remaining -= sender()
      if (remaining.isEmpty) {
        from ! HandOff(shard)
        context.become(stoppingShard, discardOld = true)
      }
    case ReceiveTimeout ⇒ done(ok = false)
  }

  def stoppingShard: Receive = {
    case ShardStopped(shard) ⇒ done(ok = true)
    case ReceiveTimeout      ⇒ done(ok = false)
  }

  def done(ok: Boolean): Unit = {
    context.parent ! RebalanceDone(shard, ok)
    context.stop(self)
  }
}

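akka's logic is built on message passing, which makes this kind of code quite hard to read. While a rebalanceWorker is running, many actors are involved: first the coordinator; then the shardRegion that hosts the shard actor to be migrated; then the shard actor itself; and finally every shardRegion in the system, all of which have to take part. At this point I could not help rotating my monitor to portrait.

1. The RebalanceWorker first sends BeginHandOff to every ShardRegion to announce that the hand off is starting, and then waits for their replies.

2. When a ShardRegion receives BeginHandOff, it updates its own knowledge base, erasing what it remembers about the HostShardRegion and the shard actor: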
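To keep the message flow straight, the RebalanceWorker can be reduced to a small state machine. The sketch below is a plain Scala model with no actors: region names are `String`s, the state and event types are my own, and the side effects (sending HandOff, reporting RebalanceDone) are only noted in comments.

```scala
object RebalanceWorkerSketch {
  sealed trait Event
  final case class BeginHandOffAck(from: String) extends Event
  case object ShardStopped extends Event
  case object Timeout extends Event

  sealed trait State
  final case class WaitingForAcks(remaining: Set[String]) extends State
  case object StoppingShard extends State
  final case class Done(ok: Boolean) extends State

  def step(state: State, event: Event): State = (state, event) match {
    case (WaitingForAcks(remaining), BeginHandOffAck(from)) =>
      val left = remaining - from
      // Once every region has acked, the real worker sends HandOff to the host region.
      if (left.isEmpty) StoppingShard else WaitingForAcks(left)
    case (StoppingShard, ShardStopped) => Done(ok = true) // reported as RebalanceDone(shard, ok = true)
    case (d: Done, _)                  => d               // the real worker has stopped by now
    case (_, Timeout)                  => Done(ok = false)
    case (s, _)                        => s               // anything else is ignored in this model
  }

  def run(regions: Set[String], events: List[Event]): State =
    events.foldLeft[State](WaitingForAcks(regions))(step)
}
```

For instance, `run(Set("a", "b"), List(BeginHandOffAck("a"), Timeout))` ends in `Done(false)`, because one region never acknowledged before the timeout fired.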

case BeginHandOff(shard) ⇒
  log.debug("BeginHandOff shard [{}]", shard)
  if (regionByShard.contains(shard)) {
    val regionRef = regionByShard(shard)
    val updatedShards = regions(regionRef) - shard
    if (updatedShards.isEmpty) regions -= regionRef
    else regions = regions.updated(regionRef, updatedShards)
    regionByShard -= shard
  }
  sender() ! BeginHandOffAck(shard)

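Finally, it sends BeginHandOffAck to tell the rebalanceWorker that it is ready (after that, these shardRegions have nothing more to do).

3. Back in the rebalanceWorker: it sends HandOff to the ShardRegion that hosts the shard actor, telling it that it can now do its own cleanup. The worker then switches its behaviour to stoppingShard and waits for a ShardStopped message. That message can come from two places: the HostShardRegion, or the shard actor.

4. When the HostShardRegion receives the HandOff message: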
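The bookkeeping a ShardRegion does on BeginHandOff amounts to removing one shard from two maps. Here is a sketch with immutable maps, assuming `String` stand-ins for ActorRef and ShardId; the `Known` case class and `forget` function are names I made up for the example.

```scala
object BeginHandOffSketch {
  // What a region knows: which shards live on which region, and the reverse lookup.
  final case class Known(
      regions: Map[String, Set[String]],
      regionByShard: Map[String, String])

  // Erase everything known about `shard`; a region left without shards is dropped too.
  def forget(known: Known, shard: String): Known =
    known.regionByShard.get(shard) match {
      case Some(region) =>
        val updated = known.regions(region) - shard
        val regions =
          if (updated.isEmpty) known.regions - region
          else known.regions.updated(region, updated)
        Known(regions, known.regionByShard - shard)
      case None => known // nothing known about this shard, still acked in the real code
    }
}
```

After this, the region no longer knows where the shard lives and has to ask the coordinator again when a message for that shard arrives.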

case msg @ HandOff(shard) ⇒
  log.debug("HandOff shard [{}]", shard)

  // must drop requests that came in between the BeginHandOff and now,
  // because they might be forwarded from other regions and there
  // is a risk of message re-ordering otherwise
  if (shardBuffers.contains(shard)) {
    shardBuffers -= shard
    loggedFullBufferWarning = false
  }

  if (shards.contains(shard)) {
    handingOff += shards(shard)
    shards(shard) forward msg
  } else
    sender() ! ShardStopped(shard)

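If the HostShardRegion no longer holds the shard actor, it replies with ShardStopped directly. Otherwise it adds the shard actor to the handingOff set and forwards the HandOff message to the shard actor.

5. Reading the code once more, I realised that shard actors and entity actors are two different things: the shard actor sits between the entity actors and the shard region.

I do not yet know how entity actors and the shard region relate to each other.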
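The decision the host region makes on HandOff can be written as a function from state to (new state, action). This is a sketch only: the `RegionState` and `Action` types are hypothetical, shards are identified by `String`, and the handingOff set holds shard ids here rather than the shard ActorRefs used in the real code.

```scala
object HostRegionHandOffSketch {
  sealed trait Action
  final case class ForwardToShard(shard: String) extends Action    // shards(shard) forward msg
  final case class ReplyShardStopped(shard: String) extends Action // sender() ! ShardStopped(shard)

  final case class RegionState(
      shardBuffers: Map[String, Vector[String]], // messages buffered per shard
      shards: Set[String],                       // shards hosted by this region
      handingOff: Set[String])

  def onHandOff(state: RegionState, shard: String): (RegionState, Action) = {
    // Buffered requests for the shard are dropped first, as in the real handler.
    val dropped = state.copy(shardBuffers = state.shardBuffers - shard)
    if (dropped.shards.contains(shard))
      (dropped.copy(handingOff = dropped.handingOff + shard), ForwardToShard(shard))
    else
      (dropped, ReplyShardStopped(shard))
  }
}
```

This also shows where the two possible senders of ShardStopped come from: the region answers on its own only when it has no such shard; otherwise the reply is left to the shard actor.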

def getEntity(id: EntityId): ActorRef = {
  val name = URLEncoder.encode(id, "utf-8")
  context.child(name).getOrElse {
    log.debug("Starting entity [{}] in shard [{}]", id, shardId)
    val a = context.watch(context.actorOf(entityProps, name))
    idByRef = idByRef.updated(a, id)
    refById = refById.updated(id, a)
    state = state.copy(state.entities + id)
    a
  }
}

Judging from this code, one shard actor has many entity actors (a one-to-many relationship): each entity is created as a child of the shard actor, named after its URL-encoded id.
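The one-to-many relationship can be seen in a small model of getEntity that keeps children in a map instead of creating actors. The `Shard` class below is a hypothetical stand-in: it returns the child's encoded name where the real code returns an ActorRef.

```scala
import java.net.URLEncoder

object ShardEntitiesSketch {
  final class Shard(val shardId: String) {
    // encoded child name -> entity id; plays the role of context.children
    private var children = Map.empty[String, String]

    // Look the entity up by its encoded name and "start" it only if it is missing.
    def getEntity(id: String): String = {
      val name = URLEncoder.encode(id, "utf-8")
      if (!children.contains(name)) children = children.updated(name, id)
      name
    }

    def entityCount: Int = children.size
  }
}
```

Asking the same shard for several different ids grows the child map, while asking twice for the same id returns the existing child.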

def receiveCoordinatorMessage(msg: CoordinatorMessage): Unit = msg match {
  case HandOff(`shardId`) ⇒ handOff(sender())
  case HandOff(shard)     ⇒ log.warning("Shard [{}] can not hand off for another Shard [{}]", shardId, shard)
  case _                  ⇒ unhandled(msg)
}

def handOff(replyTo: ActorRef): Unit = handOffStopper match {
  case Some(_) ⇒ log.warning("HandOff shard [{}] received during existing handOff", shardId)
  case None ⇒
    log.debug("HandOff shard [{}]", shardId)

    if (state.entities.nonEmpty) {
      handOffStopper = Some(context.watch(context.actorOf(
        handOffStopperProps(shardId, replyTo, idByRef.keySet, handOffStopMessage))))

      //During hand off we only care about watching for termination of the hand off stopper
      context become {
        case Terminated(ref) ⇒ receiveTerminated(ref)
      }
    } else {
      replyTo ! ShardStopped(shardId)
      context stop self
    }
}

def receiveTerminated(ref: ActorRef): Unit = {
  if (handOffStopper.exists(_ == ref))
    context stop self
  else if (idByRef.contains(ref) && handOffStopper.isEmpty)
    entityTerminated(ref)
}

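At first I read this code as saying that the shard actor and the entity actor are one-to-one, because the shard actor stops itself after receiving a Terminated message. Looking more closely, the ref that triggers `context stop self` is the handOffStopper, not an entity. The stopper is created with all the entity refs (idByRef.keySet) and the handOffStopMessage, so the shard actor stops itself only after the stopper has terminated; an ordinary entity terminating outside a hand off goes to entityTerminated(ref) instead. The relationship is therefore still one-to-many. This reminded me of the last assignment in the Coursera reactive programming course, which also had something like one shard-like actor per entity.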
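The branching in receiveTerminated can be checked with a small pure model. The names below (`Outcome`, `StopShard`, and so on) are hypothetical, and refs are plain `String`s; the point is only to show which terminated ref makes the shard stop itself.

```scala
object ReceiveTerminatedSketch {
  sealed trait Outcome
  case object StopShard extends Outcome                          // context stop self
  final case class EntityTerminated(ref: String) extends Outcome // entityTerminated(ref)
  case object Ignored extends Outcome

  def receiveTerminated(
      ref: String,
      handOffStopper: Option[String],
      entities: Set[String]): Outcome =
    if (handOffStopper.contains(ref)) StopShard
    else if (entities.contains(ref) && handOffStopper.isEmpty) EntityTerminated(ref)
    else Ignored // e.g. an entity terminating while a hand off is in progress
}
```

In this model a terminating entity never stops the shard; only the termination of the hand off stopper does, which fits one shard actor supervising many entities.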
