startup

Invoked from onControllerFailover.

initializePartitionState

private def initializePartitionState() {
  for ((topicPartition, replicaAssignment) <- controllerContext.partitionReplicaAssignment) { // iterate over all partitions
    // check if leader and isr path exists for partition. If not, then it is in NEW state
    controllerContext.partitionLeadershipInfo.get(topicPartition) match {
      case Some(currentLeaderIsrAndEpoch) =>
        // else, check if the leader for partition is alive. If yes, it is in Online state, else it is in Offline state
        controllerContext.liveBrokerIds.contains(currentLeaderIsrAndEpoch.leaderAndIsr.leader) match {
          case true => // leader is alive
            partitionState.put(topicPartition, OnlinePartition)
          case false =>
            partitionState.put(topicPartition, OfflinePartition)
        }
      case None =>
        partitionState.put(topicPartition, NewPartition)
    }
  }
}

Note the distinction between OfflinePartition and NewPartition here:

If controllerContext.partitionLeadershipInfo has no leader information for a partition, it is a NewPartition.

If there is a leader, but the broker hosting the leader is not alive, it is an OfflinePartition.

And of course, if the broker hosting the leader is alive, it is an OnlinePartition.

 

triggerOnlinePartitionStateChange

Attempts to move all partitions in the Offline or New state to Online.

def triggerOnlinePartitionStateChange() {
  try {
    brokerRequestBatch.newBatch()
    // try to move all partitions in NewPartition or OfflinePartition state to OnlinePartition state except partitions
    // that belong to topics to be deleted
    for ((topicAndPartition, partitionState) <- partitionState
         if (!controller.deleteTopicManager.isTopicQueuedUpForDeletion(topicAndPartition.topic))) {
      if (partitionState.equals(OfflinePartition) || partitionState.equals(NewPartition))
        handleStateChange(topicAndPartition.topic, topicAndPartition.partition, OnlinePartition, controller.offlinePartitionSelector,
          (new CallbackBuilder).build)
    }
    brokerRequestBatch.sendRequestsToBrokers(controller.epoch, controllerContext.correlationId.getAndIncrement)
  } catch {
    case e: Throwable => error("Error while moving some partitions to the online state", e)
    // TODO: It is not enough to bail out and log an error, it is important to trigger leader election for those partitions
  }
}

Note brokerRequestBatch here, which shows up all the time: it is a ControllerBrokerRequestBatch.

This class wraps leaderAndIsrRequestMap, stopReplicaRequestMap, and updateMetadataRequestMap,

which record and cache the requests produced inside handleStateChange.

Finally, sendRequestsToBrokers sends all of those requests out in one batch.
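
To make the batching concrete, here is a minimal, self-contained sketch of the accumulate-then-send pattern. It is hypothetical and simplified: the real ControllerBrokerRequestBatch keys each of its maps by broker id and carries full request payloads, and the class and payload names below are stand-ins, not the real API.

import scala.collection.mutable

class RequestBatchSketch {
  // hypothetical payload: a string stands in for a real LeaderAndIsrRequest entry
  private val leaderAndIsrRequests = mutable.Map.empty[Int, mutable.Buffer[String]]

  def newBatch(): Unit = {
    // guard against interleaved state changes: a half-built batch must be sent first
    if (leaderAndIsrRequests.nonEmpty)
      throw new IllegalStateException("previous batch was not sent")
  }

  def addLeaderAndIsrRequestForBrokers(brokerIds: Seq[Int], payload: String): Unit =
    brokerIds.foreach { id =>
      leaderAndIsrRequests.getOrElseUpdate(id, mutable.Buffer.empty[String]) += payload
    }

  def sendRequestsToBrokers(): Unit = {
    // one request per broker, carrying everything accumulated for it
    leaderAndIsrRequests.foreach { case (brokerId, payloads) =>
      println(s"LeaderAndIsrRequest to broker $brokerId: ${payloads.mkString("; ")}")
    }
    leaderAndIsrRequests.clear()
  }
}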

The logic of handleStateChange is examined separately below; for now, let's look at controller.offlinePartitionSelector, the selector that picks a leader for a NewPartition or OfflinePartition.

The code is fairly long and its comments explain it well, so it is not pasted here.

First, if the ISR contains a live broker, there is nothing to debate: that broker simply becomes the new leader.

If not, we must check whether unclean leader election is tolerated, which really means whether losing data is acceptable. If it is,

we look for a live broker in the AR; if one exists it becomes the leader, but since it was not in the ISR this replica is out of sync, so there is bound to be data loss.

If the AR has no live broker either, the election simply fails.

/**
 * Select the new leader, new isr and receiving replicas (for the LeaderAndIsrRequest):
 * 1. If at least one broker from the isr is alive, it picks a broker from the live isr as the new leader and the live
 *    isr as the new isr.
 * 2. Else, if unclean leader election for the topic is disabled, it throws a NoReplicaOnlineException.
 * 3. Else, it picks some alive broker from the assigned replica list as the new leader and the new isr.
 * 4. If no broker in the assigned replica list is alive, it throws a NoReplicaOnlineException.
 * Replicas to receive LeaderAndIsr request = live assigned replicas
 * Once the leader is successfully registered in zookeeper, it updates the allLeaders cache
 */
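
As a condensed sketch of that four-step rule, here is a hypothetical standalone function with simplified types; the real OfflinePartitionLeaderSelector reads the per-topic unclean leader election config and returns a full LeaderAndIsr with updated epochs:

def selectLeaderSketch(assignedReplicas: Seq[Int],
                       isr: Seq[Int],
                       liveBrokers: Set[Int],
                       uncleanElectionEnabled: Boolean): (Int, Seq[Int]) = {
  val liveIsr = isr.filter(liveBrokers.contains)
  if (liveIsr.nonEmpty)
    (liveIsr.head, liveIsr)                       // 1. clean election: pick from the live ISR
  else if (!uncleanElectionEnabled)
    throw new RuntimeException("NoReplicaOnlineException: live ISR empty, unclean election disabled") // 2.
  else {
    val liveAssigned = assignedReplicas.filter(liveBrokers.contains)
    liveAssigned.headOption match {
      case Some(leader) => (leader, List(leader)) // 3. unclean election from the AR: data loss is possible
      case None => throw new RuntimeException("NoReplicaOnlineException: no assigned replica is alive") // 4.
    }
  }
}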

registerListeners

Invoked from onControllerFailover.

This is responsible for registering the listeners with ZooKeeper; we will leave deleteTopicListener aside for now.

First, let's look at TopicChangeListener: what do we do when the set of topics changes?

 

registerTopicChangeListener

private def registerTopicChangeListener() = {
  zkClient.subscribeChildChanges(ZkUtils.BrokerTopicsPath, topicChangeListener) // "/brokers/topics"
}

This watches the /brokers/topics path; whenever its children change, topicChangeListener is triggered.

 

TopicChangeListener

/**
 * This is the zookeeper listener that triggers all the state transitions for a partition
 */
class TopicChangeListener extends IZkChildListener with Logging {
  this.logIdent = "[TopicChangeListener on Controller " + controller.config.brokerId + "]: "

  @throws(classOf[Exception])
  def handleChildChange(parentPath : String, children : java.util.List[String]) {
    inLock(controllerContext.controllerLock) {
      if (hasStarted.get) {
        try {
          val currentChildren = {
            import JavaConversions._
            debug("Topic change listener fired for path %s with children %s".format(parentPath, children.mkString(",")))
            (children: Buffer[String]).toSet
          }
          val newTopics = currentChildren -- controllerContext.allTopics // in zk but not in the context: new topics
          val deletedTopics = controllerContext.allTopics -- currentChildren // the reverse: deleted topics
          controllerContext.allTopics = currentChildren // refresh the context
          val addedPartitionReplicaAssignment = ZkUtils.getReplicaAssignmentForTopics(zkClient, newTopics.toSeq) // read the new topics' assignments from zk
          controllerContext.partitionReplicaAssignment = controllerContext.partitionReplicaAssignment.filter(p => // drop the deleted topics' assignments from the context
            !deletedTopics.contains(p._1.topic))
          controllerContext.partitionReplicaAssignment.++=(addedPartitionReplicaAssignment) // add the new topics' assignments to the context
          info("New topics: [%s], deleted topics: [%s], new partition replica assignment [%s]".format(newTopics,
            deletedTopics, addedPartitionReplicaAssignment))
          if(newTopics.size > 0)
            controller.onNewTopicCreation(newTopics, addedPartitionReplicaAssignment.keySet.toSet) // eventually calls KafkaController.onNewTopicCreation
        } catch {
          case e: Throwable => error("Error while handling new topic", e)
        }
      }
    }
  }
}

 

onNewTopicCreation

def onNewTopicCreation(topics: Set[String], newPartitions: Set[TopicAndPartition]) {
  info("New topic creation callback for %s".format(newPartitions.mkString(",")))
  // subscribe to partition changes
  topics.foreach(topic => partitionStateMachine.registerPartitionChangeListener(topic)) // register a partition-change listener for each topic
  onNewPartitionCreation(newPartitions) // drive the partition and replica state changes
}

def registerPartitionChangeListener(topic: String) = {
  addPartitionsListener.put(topic, new AddPartitionsListener(topic))
  zkClient.subscribeDataChanges(ZkUtils.getTopicPath(topic), addPartitionsListener(topic)) // /brokers/topics/topic-name
}

 

AddPartitionsListener

Very much like the topic listener: it reads the partition layout from zk, compares it with the current context, finds the partitions to be added, and calls

controller.onNewPartitionCreation(partitionsToBeAdded.keySet.toSet)

So whether the trigger is TopicChangeListener or AddPartitionsListener, everything eventually funnels into onNewPartitionCreation; after all, a topic is only a logical concept.
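
Since the listener's body is not pasted, here is a hypothetical standalone reduction of its diffing step (the function name and parameters are illustrative, not the real fields; the real listener also has to deal with topics queued for deletion):

case class TopicAndPartition(topic: String, partition: Int)

// whatever zk reports for the topic but the controller context does not yet
// know about is a newly added partition
def newlyAddedPartitions(assignmentFromZk: Map[TopicAndPartition, Seq[Int]],
                         contextAssignment: Map[TopicAndPartition, Seq[Int]]): Set[TopicAndPartition] =
  assignmentFromZk.keySet.filterNot(contextAssignment.contains)

// the real listener then invokes:
//   controller.onNewPartitionCreation(partitionsToBeAdded.keySet.toSet)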

onNewPartitionCreation

def onNewPartitionCreation(newPartitions: Set[TopicAndPartition]) {
  info("New partition creation callback for %s".format(newPartitions.mkString(",")))
  partitionStateMachine.handleStateChanges(newPartitions, NewPartition)
  replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions), NewReplica)
  partitionStateMachine.handleStateChanges(newPartitions, OnlinePartition, offlinePartitionSelector)
  replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions), OnlineReplica)
}

Quite simple: all the new partitions and their replicas are first set to the New state, and then moved to Online.

 

handleStateChange

This is the main logic of the state machine.

private def handleStateChange(topic: String, partition: Int, targetState: PartitionState,
                              leaderSelector: PartitionLeaderSelector,
                              callbacks: Callbacks) {
  val topicAndPartition = TopicAndPartition(topic, partition)
  val currState = partitionState.getOrElseUpdate(topicAndPartition, NonExistentPartition) // fetch the current state
  try {
    targetState match {
      case NewPartition =>
        // pre: partition did not exist before this
        assertValidPreviousStates(topicAndPartition, List(NonExistentPartition), NewPartition)
        assignReplicasToPartitions(topic, partition) // read the AR from zk and update controllerContext.partitionReplicaAssignment
        partitionState.put(topicAndPartition, NewPartition)
        // post: partition has been assigned replicas
      case OnlinePartition =>
        assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OnlinePartition)
        partitionState(topicAndPartition) match {
          case NewPartition =>
            // initialize leader and isr path for new partition
            initializeLeaderAndIsrForPartition(topicAndPartition)
          case OfflinePartition =>
            electLeaderForPartition(topic, partition, leaderSelector)
          case OnlinePartition => // invoked when the leader needs to be re-elected
            electLeaderForPartition(topic, partition, leaderSelector)
          case _ => // should never come here since illegal previous states are checked above
        }
        partitionState.put(topicAndPartition, OnlinePartition)
        val leader = controllerContext.partitionLeadershipInfo(topicAndPartition).leaderAndIsr.leader
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s from %s to %s with leader %d"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState, leader))
        // post: partition has a leader
      case OfflinePartition =>
        // pre: partition should be in New or Online state
        assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OfflinePartition)
        // should be called when the leader for a partition is no longer alive
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
        partitionState.put(topicAndPartition, OfflinePartition)
        // post: partition has no alive leader
      case NonExistentPartition =>
        // pre: partition should be in Offline state
        assertValidPreviousStates(topicAndPartition, List(OfflinePartition), NonExistentPartition)
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
        partitionState.put(topicAndPartition, NonExistentPartition)
        // post: partition state is deleted from all brokers and zookeeper
    }
  } catch {
    case t: Throwable =>
      stateChangeLogger.error("Controller %d epoch %d initiated state change for partition %s from %s to %s failed"
        .format(controllerId, controller.epoch, topicAndPartition, currState, targetState), t)
  }
}

As you can see, transitions to OfflinePartition and NonExistentPartition merely set the state.

A transition to NewPartition, besides setting the state, only adds the step of initializing the AR.

Only the transition to OnlinePartition is more involved:

going from NewPartition to OnlinePartition requires some initialization work, hence the call to initializeLeaderAndIsrForPartition.
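
The pre-conditions enforced by assertValidPreviousStates can be summed up in a small table; the following is a hypothetical standalone encoding that simply mirrors the List(...) arguments in the code above:

// valid previous states per target state, as checked by assertValidPreviousStates
val validPreviousStates: Map[String, Set[String]] = Map(
  "NewPartition"         -> Set("NonExistentPartition"),
  "OnlinePartition"      -> Set("NewPartition", "OnlinePartition", "OfflinePartition"),
  "OfflinePartition"     -> Set("NewPartition", "OnlinePartition", "OfflinePartition"),
  "NonExistentPartition" -> Set("OfflinePartition")
)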

initializeLeaderAndIsrForPartition

A NewPartition has no leaderAndIsr path in zk, so initialization must create that path. Once created, the partition can never return to the New state; it can only go to Offline.

Apart from creating the zk path, the logic is the leader election itself, and here the election rule is hard-coded: at initialization it is always the preferred choice, i.e. take the head of the live AR.

/**
 * Invoked on the NewPartition->OnlinePartition state change. When a partition is in the New state, it does not have
 * a leader and isr path in zookeeper. Once the partition moves to the OnlinePartition state, its leader and isr
 * path gets initialized and it never goes back to the NewPartition state. From here, it can only go to the
 * OfflinePartition state.
 * @param topicAndPartition The topic/partition whose leader and isr path is to be initialized
 */
private def initializeLeaderAndIsrForPartition(topicAndPartition: TopicAndPartition) {
  val replicaAssignment = controllerContext.partitionReplicaAssignment(topicAndPartition)
  val liveAssignedReplicas = replicaAssignment.filter(r => controllerContext.liveBrokerIds.contains(r)) // the live replicas in the AR
  liveAssignedReplicas.size match {
    case 0 => // none alive, so the partition cannot go online
      val failMsg = ("encountered error during state change of partition %s from New to Online, assigned replicas are [%s], " +
                     "live brokers are [%s]. No assigned replica is alive.")
                       .format(topicAndPartition, replicaAssignment.mkString(","), controllerContext.liveBrokerIds)
      stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
      throw new StateChangeFailedException(failMsg)
    case _ =>
      debug("Live assigned replicas for partition %s are: [%s]".format(topicAndPartition, liveAssignedReplicas))
      // make the first replica in the list of assigned replicas, the leader
      val leader = liveAssignedReplicas.head // the first live replica becomes the leader
      val leaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(new LeaderAndIsr(leader, liveAssignedReplicas.toList), // wrap into a LeaderIsrAndControllerEpoch
        controller.epoch)
      debug("Initializing leader and isr for partition %s to %s".format(topicAndPartition, leaderIsrAndControllerEpoch))
      try {
        ZkUtils.createPersistentPath(controllerContext.zkClient, // create the LeaderAndIsr path in zk, the key initialization step
          ZkUtils.getTopicPartitionLeaderAndIsrPath(topicAndPartition.topic, topicAndPartition.partition),
          ZkUtils.leaderAndIsrZkData(leaderIsrAndControllerEpoch.leaderAndIsr, controller.epoch))
        // NOTE: the above write can fail only if the current controller lost its zk session and the new controller
        // took over and initialized this partition. This can happen if the current controller went into a long
        // GC pause
        controllerContext.partitionLeadershipInfo.put(topicAndPartition, leaderIsrAndControllerEpoch) // update partitionLeadershipInfo in the context
        brokerRequestBatch.addLeaderAndIsrRequestForBrokers(liveAssignedReplicas, topicAndPartition.topic, // add a LeaderAndIsrRequest to the request batch
          topicAndPartition.partition, leaderIsrAndControllerEpoch, replicaAssignment)
      } catch {
        case e: ZkNodeExistsException =>
          // read the controller epoch
          val leaderIsrAndEpoch = ReplicationUtils.getLeaderIsrAndEpochForPartition(zkClient, topicAndPartition.topic,
            topicAndPartition.partition).get
          val failMsg = ("encountered error while changing partition %s's state from New to Online since LeaderAndIsr path already " +
                         "exists with value %s and controller epoch %d")
                           .format(topicAndPartition, leaderIsrAndEpoch.leaderAndIsr.toString(), leaderIsrAndEpoch.controllerEpoch)
          stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
          throw new StateChangeFailedException(failMsg)
      }
  }
}

 

OfflinePartition or OnlinePartition -> OnlinePartition

This case is relatively simple: it only needs to re-elect the leader.

electLeaderForPartition

def electLeaderForPartition(topic: String, partition: Int, leaderSelector: PartitionLeaderSelector) {
  val topicAndPartition = TopicAndPartition(topic, partition)
  try {
    var zookeeperPathUpdateSucceeded: Boolean = false
    var newLeaderAndIsr: LeaderAndIsr = null
    var replicasForThisPartition: Seq[Int] = Seq.empty[Int]
    while(!zookeeperPathUpdateSucceeded) { // loops until the zk update succeeds or an exception escapes; isn't writing it this way a little risky?
      val currentLeaderIsrAndEpoch = getLeaderIsrAndEpochOrThrowException(topic, partition) // fetch leaderAndIsr from zk; throws if absent, since both offline and online partitions should have data in zk
      val currentLeaderAndIsr = currentLeaderIsrAndEpoch.leaderAndIsr
      val controllerEpoch = currentLeaderIsrAndEpoch.controllerEpoch
      if (controllerEpoch > controller.epoch) { // if leaderAndIsr was already modified by a controller with a newer epoch, the current controller is stale, so throw
        val failMsg = ("aborted leader election for partition [%s,%d] since the LeaderAndIsr path was " +
                       "already written by another controller. This probably means that the current controller %d went through " +
                       "a soft failure and another controller was elected with epoch %d.")
                         .format(topic, partition, controllerId, controllerEpoch)
        stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
        throw new StateChangeFailedException(failMsg)
      }
      // elect new leader or throw exception
      val (leaderAndIsr, replicas) = leaderSelector.selectLeader(topicAndPartition, currentLeaderAndIsr) // delegate to the selector; different selectors implement different strategies
      val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkClient, topic, partition, // update leaderAndIsr in zk
        leaderAndIsr, controller.epoch, currentLeaderAndIsr.zkVersion)
      newLeaderAndIsr = leaderAndIsr
      newLeaderAndIsr.zkVersion = newVersion
      zookeeperPathUpdateSucceeded = updateSucceeded
      replicasForThisPartition = replicas
    }
    val newLeaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(newLeaderAndIsr, controller.epoch)
    // update the leader cache
    controllerContext.partitionLeadershipInfo.put(TopicAndPartition(topic, partition), newLeaderIsrAndControllerEpoch)
    stateChangeLogger.trace("Controller %d epoch %d elected leader %d for Offline partition %s"
      .format(controllerId, controller.epoch, newLeaderAndIsr.leader, topicAndPartition))
    val replicas = controllerContext.partitionReplicaAssignment(TopicAndPartition(topic, partition))
    // store new leader and isr info in cache
    brokerRequestBatch.addLeaderAndIsrRequestForBrokers(replicasForThisPartition, topic, partition,
      newLeaderIsrAndControllerEpoch, replicas)
  } catch {
    case lenne: LeaderElectionNotNeededException => // swallow
    case nroe: NoReplicaOnlineException => throw nroe
    case sce: Throwable =>
      val failMsg = "encountered error while electing leader for partition %s due to: %s.".format(topicAndPartition, sce.getMessage)
      stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
      throw new StateChangeFailedException(failMsg, sce)
  }
  debug("After leader election, leader cache is updated to %s".format(controllerContext.partitionLeadershipInfo.map(l => (l._1, l._2))))
}
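
The while loop works because ReplicationUtils.updateLeaderAndIsr is a conditional write: it only succeeds if the path's zkVersion still matches the version that was just read, i.e. a compare-and-set. Below is a minimal sketch of that retry pattern, with a hypothetical in-memory store standing in for the zk path (CasStore and its methods are illustrative, not Kafka API):

final case class Versioned(value: String, version: Int)

class CasStore(initial: String) {
  private var state = Versioned(initial, 0)
  def read(): Versioned = synchronized { state }
  // mimics a conditional zk setData(path, data, expectedVersion)
  def conditionalUpdate(newValue: String, expectedVersion: Int): (Boolean, Int) = synchronized {
    if (state.version == expectedVersion) {
      state = Versioned(newValue, expectedVersion + 1)
      (true, state.version)
    } else (false, -1)
  }
}

// same shape as the loop above: re-read, recompute, and retry until the
// conditional write lands on the version that was read
def updateUntilSuccess(store: CasStore, newValueFor: String => String): Int = {
  var succeeded = false
  var version = -1
  while (!succeeded) {
    val current = store.read()
    val (ok, v) = store.conditionalUpdate(newValueFor(current.value), current.version)
    succeeded = ok
    version = v
  }
  version
}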
