startup

Invoked from onControllerFailover.

initializePartitionState

private def initializePartitionState() {
  for ((topicPartition, replicaAssignment) <- controllerContext.partitionReplicaAssignment) { // iterate over all partitions
    // check if leader and isr path exists for partition. If not, then it is in NEW state
    controllerContext.partitionLeadershipInfo.get(topicPartition) match {
      case Some(currentLeaderIsrAndEpoch) =>
        // else, check if the leader for partition is alive. If yes, it is in Online state, else it is in Offline state
        controllerContext.liveBrokerIds.contains(currentLeaderIsrAndEpoch.leaderAndIsr.leader) match {
          case true => // leader is alive
            partitionState.put(topicPartition, OnlinePartition)
          case false =>
            partitionState.put(topicPartition, OfflinePartition)
        }
      case None =>
        partitionState.put(topicPartition, NewPartition)
    }
  }
}

Note the distinction between OfflinePartition and NewPartition here:

If controllerContext.partitionLeadershipInfo has no leader information for a partition, it is a NewPartition.

If there is a leader, but the broker hosting the leader is not alive, it is an OfflinePartition.

And of course, if the broker hosting the leader is alive, it is an OnlinePartition.

 

triggerOnlinePartitionStateChange

Attempts to move all partitions in the Offline or New state to Online.

def triggerOnlinePartitionStateChange() {
  try {
    brokerRequestBatch.newBatch()
    // try to move all partitions in NewPartition or OfflinePartition state to OnlinePartition state except partitions
    // that belong to topics to be deleted
    for ((topicAndPartition, partitionState) <- partitionState
         if (!controller.deleteTopicManager.isTopicQueuedUpForDeletion(topicAndPartition.topic))) {
      if (partitionState.equals(OfflinePartition) || partitionState.equals(NewPartition))
        handleStateChange(topicAndPartition.topic, topicAndPartition.partition, OnlinePartition, controller.offlinePartitionSelector,
          (new CallbackBuilder).build)
    }
    brokerRequestBatch.sendRequestsToBrokers(controller.epoch, controllerContext.correlationId.getAndIncrement)
  } catch {
    case e: Throwable => error("Error while moving some partitions to the online state", e)
    // TODO: It is not enough to bail out and log an error, it is important to trigger leader election for those partitions
  }
}

Note brokerRequestBatch here, which shows up all the time: it is a ControllerBrokerRequestBatch.

This class wraps leaderAndIsrRequestMap, stopReplicaRequestMap, and updateMetadataRequestMap,

which record and cache the requests produced inside handleStateChange.

Finally, sendRequestsToBrokers sends all of those requests out in one batch.
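
To make the batching concrete, here is a minimal, self-contained sketch of the accumulate-then-send pattern. It is hypothetical and simplified: the real ControllerBrokerRequestBatch keys each of its maps by broker id and carries full request payloads, and the class and payload names below are stand-ins, not the real API.

import scala.collection.mutable

class RequestBatchSketch {
  // hypothetical payload: a string stands in for a real LeaderAndIsrRequest entry
  private val leaderAndIsrRequests = mutable.Map.empty[Int, mutable.Buffer[String]]

  def newBatch(): Unit = {
    // guard against interleaved state changes: a half-built batch must be sent first
    if (leaderAndIsrRequests.nonEmpty)
      throw new IllegalStateException("previous batch was not sent")
  }

  def addLeaderAndIsrRequestForBrokers(brokerIds: Seq[Int], payload: String): Unit =
    brokerIds.foreach { id =>
      leaderAndIsrRequests.getOrElseUpdate(id, mutable.Buffer.empty[String]) += payload
    }

  def sendRequestsToBrokers(): Unit = {
    // one request per broker, carrying everything accumulated for it
    leaderAndIsrRequests.foreach { case (brokerId, payloads) =>
      println(s"LeaderAndIsrRequest to broker $brokerId: ${payloads.mkString("; ")}")
    }
    leaderAndIsrRequests.clear()
  }
}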

The logic of handleStateChange is examined separately below; for now, let's look at controller.offlinePartitionSelector, the selector that picks a leader for a NewPartition or OfflinePartition.

The code is fairly long and its comments explain it well, so it is not pasted here.

First, if the ISR contains a live broker, there is nothing to debate: that broker simply becomes the new leader.

If not, we must check whether unclean leader election is tolerated, which really means whether losing data is acceptable. If it is,

we look for a live broker in the AR; if one exists it becomes the leader, but since it was not in the ISR this replica is out of sync, so there is bound to be data loss.

If the AR has no live broker either, the election simply fails.

/**
 * Select the new leader, new isr and receiving replicas (for the LeaderAndIsrRequest):
 * 1. If at least one broker from the isr is alive, it picks a broker from the live isr as the new leader and the live
 *    isr as the new isr.
 * 2. Else, if unclean leader election for the topic is disabled, it throws a NoReplicaOnlineException.
 * 3. Else, it picks some alive broker from the assigned replica list as the new leader and the new isr.
 * 4. If no broker in the assigned replica list is alive, it throws a NoReplicaOnlineException.
 * Replicas to receive LeaderAndIsr request = live assigned replicas
 * Once the leader is successfully registered in zookeeper, it updates the allLeaders cache
 */
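
As a condensed sketch of that four-step rule, here is a hypothetical standalone function with simplified types; the real OfflinePartitionLeaderSelector reads the per-topic unclean leader election config and returns a full LeaderAndIsr with updated epochs:

def selectLeaderSketch(assignedReplicas: Seq[Int],
                       isr: Seq[Int],
                       liveBrokers: Set[Int],
                       uncleanElectionEnabled: Boolean): (Int, Seq[Int]) = {
  val liveIsr = isr.filter(liveBrokers.contains)
  if (liveIsr.nonEmpty)
    (liveIsr.head, liveIsr)                       // 1. clean election: pick from the live ISR
  else if (!uncleanElectionEnabled)
    throw new RuntimeException("NoReplicaOnlineException: live ISR empty, unclean election disabled") // 2.
  else {
    val liveAssigned = assignedReplicas.filter(liveBrokers.contains)
    liveAssigned.headOption match {
      case Some(leader) => (leader, List(leader)) // 3. unclean election from the AR: data loss is possible
      case None => throw new RuntimeException("NoReplicaOnlineException: no assigned replica is alive") // 4.
    }
  }
}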

registerListeners

Invoked from onControllerFailover.

This is responsible for registering the listeners with ZooKeeper; we will leave deleteTopicListener aside for now.

First, let's look at TopicChangeListener: what do we do when the set of topics changes?

 

registerTopicChangeListener

private def registerTopicChangeListener() = {
  zkClient.subscribeChildChanges(ZkUtils.BrokerTopicsPath, topicChangeListener) // "/brokers/topics"
}

This watches the /brokers/topics path; whenever its children change, topicChangeListener is triggered.

 

TopicChangeListener

/**
 * This is the zookeeper listener that triggers all the state transitions for a partition
 */
class TopicChangeListener extends IZkChildListener with Logging {
  this.logIdent = "[TopicChangeListener on Controller " + controller.config.brokerId + "]: "

  @throws(classOf[Exception])
  def handleChildChange(parentPath : String, children : java.util.List[String]) {
    inLock(controllerContext.controllerLock) {
      if (hasStarted.get) {
        try {
          val currentChildren = {
            import JavaConversions._
            debug("Topic change listener fired for path %s with children %s".format(parentPath, children.mkString(",")))
            (children: Buffer[String]).toSet
          }
          val newTopics = currentChildren -- controllerContext.allTopics // in zk but not in the context: new topics
          val deletedTopics = controllerContext.allTopics -- currentChildren // the reverse: deleted topics
          controllerContext.allTopics = currentChildren // refresh the context
          val addedPartitionReplicaAssignment = ZkUtils.getReplicaAssignmentForTopics(zkClient, newTopics.toSeq) // read the new topics' assignments from zk
          controllerContext.partitionReplicaAssignment = controllerContext.partitionReplicaAssignment.filter(p => // drop the deleted topics' assignments from the context
            !deletedTopics.contains(p._1.topic))
          controllerContext.partitionReplicaAssignment.++=(addedPartitionReplicaAssignment) // add the new topics' assignments to the context
          info("New topics: [%s], deleted topics: [%s], new partition replica assignment [%s]".format(newTopics,
            deletedTopics, addedPartitionReplicaAssignment))
          if(newTopics.size > 0)
            controller.onNewTopicCreation(newTopics, addedPartitionReplicaAssignment.keySet.toSet) // eventually calls KafkaController.onNewTopicCreation
        } catch {
          case e: Throwable => error("Error while handling new topic", e)
        }
      }
    }
  }
}

 

onNewTopicCreation

def onNewTopicCreation(topics: Set[String], newPartitions: Set[TopicAndPartition]) {
  info("New topic creation callback for %s".format(newPartitions.mkString(",")))
  // subscribe to partition changes
  topics.foreach(topic => partitionStateMachine.registerPartitionChangeListener(topic)) // register a partition-change listener for each topic
  onNewPartitionCreation(newPartitions) // drive the partition and replica state changes
}

def registerPartitionChangeListener(topic: String) = {
  addPartitionsListener.put(topic, new AddPartitionsListener(topic))
  zkClient.subscribeDataChanges(ZkUtils.getTopicPath(topic), addPartitionsListener(topic)) // /brokers/topics/topic-name
}

 

AddPartitionsListener

Very much like the topic listener: it reads the partition layout from zk, compares it with the current context, finds the partitions to be added, and calls

controller.onNewPartitionCreation(partitionsToBeAdded.keySet.toSet)

So whether the trigger is TopicChangeListener or AddPartitionsListener, everything eventually funnels into onNewPartitionCreation; after all, a topic is only a logical concept.
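
Since the listener's body is not pasted, here is a hypothetical standalone reduction of its diffing step (the function name and parameters are illustrative, not the real fields; the real listener also has to deal with topics queued for deletion):

case class TopicAndPartition(topic: String, partition: Int)

// whatever zk reports for the topic but the controller context does not yet
// know about is a newly added partition
def newlyAddedPartitions(assignmentFromZk: Map[TopicAndPartition, Seq[Int]],
                         contextAssignment: Map[TopicAndPartition, Seq[Int]]): Set[TopicAndPartition] =
  assignmentFromZk.keySet.filterNot(contextAssignment.contains)

// the real listener then invokes:
//   controller.onNewPartitionCreation(partitionsToBeAdded.keySet.toSet)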

onNewPartitionCreation

def onNewPartitionCreation(newPartitions: Set[TopicAndPartition]) {
  info("New partition creation callback for %s".format(newPartitions.mkString(",")))
  partitionStateMachine.handleStateChanges(newPartitions, NewPartition)
  replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions), NewReplica)
  partitionStateMachine.handleStateChanges(newPartitions, OnlinePartition, offlinePartitionSelector)
  replicaStateMachine.handleStateChanges(controllerContext.replicasForPartition(newPartitions), OnlineReplica)
}

Quite simple: all the new partitions and their replicas are first set to the New state, and then moved to Online.

 

handleStateChange

This is the main logic of the state machine.

private def handleStateChange(topic: String, partition: Int, targetState: PartitionState,
                              leaderSelector: PartitionLeaderSelector,
                              callbacks: Callbacks) {
  val topicAndPartition = TopicAndPartition(topic, partition)
  val currState = partitionState.getOrElseUpdate(topicAndPartition, NonExistentPartition) // fetch the current state
  try {
    targetState match {
      case NewPartition =>
        // pre: partition did not exist before this
        assertValidPreviousStates(topicAndPartition, List(NonExistentPartition), NewPartition)
        assignReplicasToPartitions(topic, partition) // read the AR from zk and update controllerContext.partitionReplicaAssignment
        partitionState.put(topicAndPartition, NewPartition)
        // post: partition has been assigned replicas
      case OnlinePartition =>
        assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OnlinePartition)
        partitionState(topicAndPartition) match {
          case NewPartition =>
            // initialize leader and isr path for new partition
            initializeLeaderAndIsrForPartition(topicAndPartition)
          case OfflinePartition =>
            electLeaderForPartition(topic, partition, leaderSelector)
          case OnlinePartition => // invoked when the leader needs to be re-elected
            electLeaderForPartition(topic, partition, leaderSelector)
          case _ => // should never come here since illegal previous states are checked above
        }
        partitionState.put(topicAndPartition, OnlinePartition)
        val leader = controllerContext.partitionLeadershipInfo(topicAndPartition).leaderAndIsr.leader
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s from %s to %s with leader %d"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState, leader))
        // post: partition has a leader
      case OfflinePartition =>
        // pre: partition should be in New or Online state
        assertValidPreviousStates(topicAndPartition, List(NewPartition, OnlinePartition, OfflinePartition), OfflinePartition)
        // should be called when the leader for a partition is no longer alive
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
        partitionState.put(topicAndPartition, OfflinePartition)
        // post: partition has no alive leader
      case NonExistentPartition =>
        // pre: partition should be in Offline state
        assertValidPreviousStates(topicAndPartition, List(OfflinePartition), NonExistentPartition)
        stateChangeLogger.trace("Controller %d epoch %d changed partition %s state from %s to %s"
          .format(controllerId, controller.epoch, topicAndPartition, currState, targetState))
        partitionState.put(topicAndPartition, NonExistentPartition)
        // post: partition state is deleted from all brokers and zookeeper
    }
  } catch {
    case t: Throwable =>
      stateChangeLogger.error("Controller %d epoch %d initiated state change for partition %s from %s to %s failed"
        .format(controllerId, controller.epoch, topicAndPartition, currState, targetState), t)
  }
}

As you can see, transitions to OfflinePartition and NonExistentPartition merely set the state.

A transition to NewPartition, besides setting the state, only adds the step of initializing the AR.

Only the transition to OnlinePartition is more involved:

going from NewPartition to OnlinePartition requires some initialization work, hence the call to initializeLeaderAndIsrForPartition.
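
The pre-conditions enforced by assertValidPreviousStates can be summed up in a small table; the following is a hypothetical standalone encoding that simply mirrors the List(...) arguments in the code above:

// valid previous states per target state, as checked by assertValidPreviousStates
val validPreviousStates: Map[String, Set[String]] = Map(
  "NewPartition"         -> Set("NonExistentPartition"),
  "OnlinePartition"      -> Set("NewPartition", "OnlinePartition", "OfflinePartition"),
  "OfflinePartition"     -> Set("NewPartition", "OnlinePartition", "OfflinePartition"),
  "NonExistentPartition" -> Set("OfflinePartition")
)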

initializeLeaderAndIsrForPartition

A NewPartition has no leaderAndIsr path in zk, so initialization must create that path. Once created, the partition can never return to the New state; it can only go to Offline.

Apart from creating the zk path, the logic is the leader election itself, and here the election rule is hard-coded: at initialization it is always the preferred choice, i.e. take the head of the live AR.

/**
 * Invoked on the NewPartition->OnlinePartition state change. When a partition is in the New state, it does not have
 * a leader and isr path in zookeeper. Once the partition moves to the OnlinePartition state, its leader and isr
 * path gets initialized and it never goes back to the NewPartition state. From here, it can only go to the
 * OfflinePartition state.
 * @param topicAndPartition The topic/partition whose leader and isr path is to be initialized
 */
private def initializeLeaderAndIsrForPartition(topicAndPartition: TopicAndPartition) {
  val replicaAssignment = controllerContext.partitionReplicaAssignment(topicAndPartition)
  val liveAssignedReplicas = replicaAssignment.filter(r => controllerContext.liveBrokerIds.contains(r)) // the live replicas in the AR
  liveAssignedReplicas.size match {
    case 0 => // none alive, so the partition cannot go online
      val failMsg = ("encountered error during state change of partition %s from New to Online, assigned replicas are [%s], " +
                     "live brokers are [%s]. No assigned replica is alive.")
                       .format(topicAndPartition, replicaAssignment.mkString(","), controllerContext.liveBrokerIds)
      stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
      throw new StateChangeFailedException(failMsg)
    case _ =>
      debug("Live assigned replicas for partition %s are: [%s]".format(topicAndPartition, liveAssignedReplicas))
      // make the first replica in the list of assigned replicas, the leader
      val leader = liveAssignedReplicas.head // the first live replica becomes the leader
      val leaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(new LeaderAndIsr(leader, liveAssignedReplicas.toList), // wrap into a LeaderIsrAndControllerEpoch
        controller.epoch)
      debug("Initializing leader and isr for partition %s to %s".format(topicAndPartition, leaderIsrAndControllerEpoch))
      try {
        ZkUtils.createPersistentPath(controllerContext.zkClient, // create the LeaderAndIsr path in zk, the key initialization step
          ZkUtils.getTopicPartitionLeaderAndIsrPath(topicAndPartition.topic, topicAndPartition.partition),
          ZkUtils.leaderAndIsrZkData(leaderIsrAndControllerEpoch.leaderAndIsr, controller.epoch))
        // NOTE: the above write can fail only if the current controller lost its zk session and the new controller
        // took over and initialized this partition. This can happen if the current controller went into a long
        // GC pause
        controllerContext.partitionLeadershipInfo.put(topicAndPartition, leaderIsrAndControllerEpoch) // update partitionLeadershipInfo in the context
        brokerRequestBatch.addLeaderAndIsrRequestForBrokers(liveAssignedReplicas, topicAndPartition.topic, // add a LeaderAndIsrRequest to the request batch
          topicAndPartition.partition, leaderIsrAndControllerEpoch, replicaAssignment)
      } catch {
        case e: ZkNodeExistsException =>
          // read the controller epoch
          val leaderIsrAndEpoch = ReplicationUtils.getLeaderIsrAndEpochForPartition(zkClient, topicAndPartition.topic,
            topicAndPartition.partition).get
          val failMsg = ("encountered error while changing partition %s's state from New to Online since LeaderAndIsr path already " +
                         "exists with value %s and controller epoch %d")
                           .format(topicAndPartition, leaderIsrAndEpoch.leaderAndIsr.toString(), leaderIsrAndEpoch.controllerEpoch)
          stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
          throw new StateChangeFailedException(failMsg)
      }
  }
}

 

OfflinePartition or OnlinePartition -> OnlinePartition

This case is relatively simple: it only needs to re-elect the leader.

electLeaderForPartition

def electLeaderForPartition(topic: String, partition: Int, leaderSelector: PartitionLeaderSelector) {
  val topicAndPartition = TopicAndPartition(topic, partition)
  try {
    var zookeeperPathUpdateSucceeded: Boolean = false
    var newLeaderAndIsr: LeaderAndIsr = null
    var replicasForThisPartition: Seq[Int] = Seq.empty[Int]
    while(!zookeeperPathUpdateSucceeded) { // loops until the zk update succeeds or an exception escapes; isn't writing it this way a little risky?
      val currentLeaderIsrAndEpoch = getLeaderIsrAndEpochOrThrowException(topic, partition) // fetch leaderAndIsr from zk; throws if absent, since both offline and online partitions should have data in zk
      val currentLeaderAndIsr = currentLeaderIsrAndEpoch.leaderAndIsr
      val controllerEpoch = currentLeaderIsrAndEpoch.controllerEpoch
      if (controllerEpoch > controller.epoch) { // if leaderAndIsr was already modified by a controller with a newer epoch, the current controller is stale, so throw
        val failMsg = ("aborted leader election for partition [%s,%d] since the LeaderAndIsr path was " +
                       "already written by another controller. This probably means that the current controller %d went through " +
                       "a soft failure and another controller was elected with epoch %d.")
                         .format(topic, partition, controllerId, controllerEpoch)
        stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
        throw new StateChangeFailedException(failMsg)
      }
      // elect new leader or throw exception
      val (leaderAndIsr, replicas) = leaderSelector.selectLeader(topicAndPartition, currentLeaderAndIsr) // delegate to the selector; different selectors implement different strategies
      val (updateSucceeded, newVersion) = ReplicationUtils.updateLeaderAndIsr(zkClient, topic, partition, // update leaderAndIsr in zk
        leaderAndIsr, controller.epoch, currentLeaderAndIsr.zkVersion)
      newLeaderAndIsr = leaderAndIsr
      newLeaderAndIsr.zkVersion = newVersion
      zookeeperPathUpdateSucceeded = updateSucceeded
      replicasForThisPartition = replicas
    }
    val newLeaderIsrAndControllerEpoch = new LeaderIsrAndControllerEpoch(newLeaderAndIsr, controller.epoch)
    // update the leader cache
    controllerContext.partitionLeadershipInfo.put(TopicAndPartition(topic, partition), newLeaderIsrAndControllerEpoch)
    stateChangeLogger.trace("Controller %d epoch %d elected leader %d for Offline partition %s"
      .format(controllerId, controller.epoch, newLeaderAndIsr.leader, topicAndPartition))
    val replicas = controllerContext.partitionReplicaAssignment(TopicAndPartition(topic, partition))
    // store new leader and isr info in cache
    brokerRequestBatch.addLeaderAndIsrRequestForBrokers(replicasForThisPartition, topic, partition,
      newLeaderIsrAndControllerEpoch, replicas)
  } catch {
    case lenne: LeaderElectionNotNeededException => // swallow
    case nroe: NoReplicaOnlineException => throw nroe
    case sce: Throwable =>
      val failMsg = "encountered error while electing leader for partition %s due to: %s.".format(topicAndPartition, sce.getMessage)
      stateChangeLogger.error("Controller %d epoch %d ".format(controllerId, controller.epoch) + failMsg)
      throw new StateChangeFailedException(failMsg, sce)
  }
  debug("After leader election, leader cache is updated to %s".format(controllerContext.partitionLeadershipInfo.map(l => (l._1, l._2))))
}
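
The while loop works because ReplicationUtils.updateLeaderAndIsr is a conditional write: it only succeeds if the path's zkVersion still matches the version that was just read, i.e. a compare-and-set. Below is a minimal sketch of that retry pattern, with a hypothetical in-memory store standing in for the zk path (CasStore and its methods are illustrative, not Kafka API):

final case class Versioned(value: String, version: Int)

class CasStore(initial: String) {
  private var state = Versioned(initial, 0)
  def read(): Versioned = synchronized { state }
  // mimics a conditional zk setData(path, data, expectedVersion)
  def conditionalUpdate(newValue: String, expectedVersion: Int): (Boolean, Int) = synchronized {
    if (state.version == expectedVersion) {
      state = Versioned(newValue, expectedVersion + 1)
      (true, state.version)
    } else (false, -1)
  }
}

// same shape as the loop above: re-read, recompute, and retry until the
// conditional write lands on the version that was read
def updateUntilSuccess(store: CasStore, newValueFor: String => String): Int = {
  var succeeded = false
  var version = -1
  while (!succeeded) {
    val current = store.read()
    val (ok, v) = store.conditionalUpdate(newValueFor(current.value), current.version)
    succeeded = ok
    version = v
  }
  version
}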
