etcd raft library: design principles and usage

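Back in November 2013, when the raft paper was still only available online as a draft, I wrote a blog post briefly analyzing it. Four years on, explanations of the raft protocol are everywhere, and raft has indeed been widely adopted. The best-known user is probably etcd. etcd implements the raft protocol itself as a library, located at https://github.com/coreos/etcd/tree/master/raft, and then uses that library like any other application would.

This article does not cover the core of the raft protocol. Instead, it takes the point of view of a user of the etcd raft library and explains what you need to know in order to put the library to use.

The library is still somewhat cumbersome to use. There is an official usage example at https://github.com/coreos/etcd/tree/master/contrib/raftexample. Overall, the library implements the core of the raft protocol: log appending, leader election, snapshots, membership changes and so on. One thing must be clear: the library does not implement sending or receiving messages over the network. It only keeps the outbound messages in memory. A user-defined network transport layer takes those messages out and sends them, and on the receiving side the application must call a library function to hand each received message to the library (described in detail below). The library also defines a Storage interface that the user of the library has to implement.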
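The division of labour described above can be illustrated with a small toy model that uses only the standard library. Nothing in it is the library's API: `toyNode`, `queue`, `takeMessages` and `localTransport` are hypothetical names, and the in-process map merely stands in for a real network. The sketch only shows the direction of the calls: the application drains the in-memory outbound messages, and the receiving side feeds each one back in through a `Step`-like method.

```go
package main

import "fmt"

// Message is a simplified stand-in for pb.Message (assumption: only the
// fields needed for this illustration).
type Message struct {
	From, To uint64
	Body     string
}

// toyNode models only the two touch points with the transport:
// an in-memory outbox and a Step method for inbound messages.
type toyNode struct {
	id     uint64
	outbox []Message
	inbox  []Message
}

// queue records an outbound message in memory; nothing is sent yet.
func (n *toyNode) queue(to uint64, body string) {
	n.outbox = append(n.outbox, Message{From: n.id, To: to, Body: body})
}

// takeMessages hands the pending messages to the caller and clears the outbox.
func (n *toyNode) takeMessages() []Message {
	msgs := n.outbox
	n.outbox = nil
	return msgs
}

// Step is what the receiving side calls for every message it gets.
func (n *toyNode) Step(m Message) { n.inbox = append(n.inbox, m) }

// localTransport is a hypothetical in-process "network" keyed by node ID.
type localTransport map[uint64]*toyNode

// Send delivers each message to its recipient and reports how many were
// dropped because the recipient is unknown.
func (t localTransport) Send(msgs []Message) (dropped int) {
	for _, m := range msgs {
		peer, ok := t[m.To]
		if !ok {
			dropped++
			continue
		}
		peer.Step(m)
	}
	return dropped
}

func main() {
	a, b := &toyNode{id: 1}, &toyNode{id: 2}
	network := localTransport{1: a, 2: b}

	a.queue(2, "append")
	a.queue(3, "append") // node 3 does not exist in this toy network
	dropped := network.Send(a.takeMessages())

	fmt.Println("delivered to b:", len(b.inbox), "dropped:", dropped)
}
```

A real transport would serialize the messages and send them over the wire, but the shape stays the same: take messages out on one side, call `Step` on the other.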

The Storage interface is as follows:

// Storage is an interface that may be implemented by the application
// to retrieve log entries from storage.
//
// If any Storage method returns an error, the raft instance will
// become inoperable and refuse to participate in elections; the
// application is responsible for cleanup and recovery in this case.
type Storage interface {
	// InitialState returns the saved HardState and ConfState information.
	InitialState() (pb.HardState, pb.ConfState, error)
	// Entries returns a slice of log entries in the range [lo,hi).
	// MaxSize limits the total size of the log entries returned, but
	// Entries returns at least one entry if any.
	Entries(lo, hi, maxSize uint64) ([]pb.Entry, error)
	// Term returns the term of entry i, which must be in the range
	// [FirstIndex()-1, LastIndex()]. The term of the entry before
	// FirstIndex is retained for matching purposes even though the
	// rest of that entry may not be available.
	Term(i uint64) (uint64, error)
	// LastIndex returns the index of the last entry in the log.
	LastIndex() (uint64, error)
	// FirstIndex returns the index of the first log entry that is
	// possibly available via Entries (older entries have been incorporated
	// into the latest Snapshot; if storage only contains the dummy entry the
	// first log entry is not available).
	FirstIndex() (uint64, error)
	// Snapshot returns the most recent snapshot.
	// If snapshot is temporarily unavailable, it should return ErrSnapshotTemporarilyUnavailable,
	// so raft state machine could know that Storage needs some time to prepare
	// snapshot and call Snapshot later.
	Snapshot() (pb.Snapshot, error)
}

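These methods are called from inside the library, and anyone familiar with the raft protocol will find them easy to understand. The official example mentioned above (https://github.com/coreos/etcd/tree/master/contrib/raftexample) uses the MemoryStorage that ships with the library, together with etcd's wal and snap packages for persistence. On restart it reads the log back from wal and snap to rebuild the MemoryStorage.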

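For something this IO- and network-intensive, the best way to raise throughput is batching, and that is exactly what the etcd raft library does.

To that end, here is the core abstraction etcd provides, the Ready struct: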

// Ready encapsulates the entries and messages that are ready to read,
// be saved to stable storage, committed or sent to other peers.
// All fields in Ready are read-only.
type Ready struct {
	// The current volatile state of a Node.
	// SoftState will be nil if there is no update.
	// It is not required to consume or store SoftState.
	*SoftState

	// The current state of a Node to be saved to stable storage BEFORE
	// Messages are sent.
	// HardState will be equal to empty state if there is no update.
	pb.HardState

	// ReadStates can be used for node to serve linearizable read requests locally
	// when its applied index is greater than the index in ReadState.
	// Note that the readState will be returned when raft receives msgReadIndex.
	// The returned is only valid for the request that requested to read.
	ReadStates []ReadState

	// Entries specifies entries to be saved to stable storage BEFORE
	// Messages are sent.
	Entries []pb.Entry

	// Snapshot specifies the snapshot to be saved to stable storage.
	Snapshot pb.Snapshot

	// CommittedEntries specifies entries to be committed to a
	// store/state-machine. These have previously been committed to stable
	// store.
	CommittedEntries []pb.Entry

	// Messages specifies outbound messages to be sent AFTER Entries are
	// committed to stable storage.
	// If it contains a MsgSnap message, the application MUST report back to raft
	// when the snapshot has been received or has failed by calling ReportSnapshot.
	Messages []pb.Message

	// MustSync indicates whether the HardState and Entries must be synchronously
	// written to disk or if an asynchronous write is permissible.
	MustSync bool
}

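In short, a Ready struct packages one batch of updates, including:

pb.HardState: the highest term the current node has seen, whom it voted for in that term, and the commit index the current node knows of
Entries: log entries that need to be persisted
Messages: outbound messages to be sent to other peers
CommittedEntries: log entries that have been committed but not yet applied to the state machine
Snapshot: a snapshot that needs to be persisted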

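The user of the library keeps popping Ready values off a ready channel provided by the node struct and handles them one by one. The channel is obtained through the following method: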

func (n *node) Ready() <-chan Ready { return n.readyc }

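For each Ready, the application needs to:

Persist HardState, Entries and Snapshot to storage.
Send Messages (the pending outbound messages) to the other peers without blocking.
Apply CommittedEntries (committed but not yet applied) to the state machine.
If CommittedEntries contains a membership-change entry, call the node's ApplyConfChange() method to let the node know (this differs from the raft paper, where a node applies a membership change as soon as it receives the log entry).
Call Node.Advance() to tell the raft node that this batch of updates has been handled, the state has moved forward, and the next Ready can be handed over.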
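The ordering of these steps matters, so here is a stdlib-only sketch that makes it visible. It is a toy, not code from the library or from raftexample: `Ready`, `Entry` and `Message` are cut-down stand-ins for the real types, `app` and `handleReady` are hypothetical names, Snapshot handling is omitted, and every step just records a line in a trace instead of touching disk or the network.

```go
package main

import "fmt"

// Simplified stand-ins for the library's types (assumption: only the fields
// needed to show the order of operations).
type EntryType int

const (
	EntryNormal EntryType = iota
	EntryConfChange
)

type Entry struct {
	Index uint64
	Type  EntryType
}

type Message struct{ To uint64 }

type Ready struct {
	HardState        string // placeholder for term, vote and commit
	Entries          []Entry
	CommittedEntries []Entry
	Messages         []Message
}

// app records what it was asked to do, so the ordering can be inspected.
type app struct{ trace []string }

func (a *app) note(format string, args ...interface{}) {
	a.trace = append(a.trace, fmt.Sprintf(format, args...))
}

// handleReady walks through one batch in the order described above.
func handleReady(a *app, rd Ready) {
	// 1. Persist HardState and Entries before any message leaves the node.
	a.note("persist hardstate=%s entries=%d", rd.HardState, len(rd.Entries))

	// 2. Hand the outbound messages to the transport.
	for _, m := range rd.Messages {
		a.note("send to=%d", m.To)
	}

	// 3. Apply committed entries; membership changes take the
	//    ApplyConfChange path instead of the state machine.
	for _, e := range rd.CommittedEntries {
		if e.Type == EntryConfChange {
			a.note("applyConfChange index=%d", e.Index)
			continue
		}
		a.note("apply index=%d", e.Index)
	}

	// 4. Signal that the batch is done and the next Ready may be produced.
	a.note("advance")
}

func main() {
	a := &app{}
	handleReady(a, Ready{
		HardState:        "term2",
		Entries:          []Entry{{Index: 7}},
		CommittedEntries: []Entry{{Index: 5}, {Index: 6, Type: EntryConfChange}},
		Messages:         []Message{{To: 2}, {To: 3}},
	})
	for _, line := range a.trace {
		fmt.Println(line)
	}
}
```

In a real application the trace calls would be replaced by writes to the WAL and Storage, the transport's send, the state machine, and the calls to ApplyConfChange() and Advance() on the node.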

The application starts a raft replica with raft.StartNode(), which internally starts a goroutine running

func (n *node) run(r *raft)

to start serving.

The application calls

func (n *node) Propose(ctx context.Context, data []byte) error

to propose a request to raft. The call returns once raft has started processing the request.

Members are added or removed by calling

func (n *node) ProposeConfChange(ctx context.Context, cc pb.ConfChange) error

The node struct contains several important channels:

// node is the canonical implementation of the Node interface
type node struct {
	propc      chan pb.Message
	recvc      chan pb.Message
	confc      chan pb.ConfChange
	confstatec chan pb.ConfState
	readyc     chan Ready
	advancec   chan struct{}
	tickc      chan struct{}
	done       chan struct{}
	stop       chan struct{}
	status     chan chan Status

	logger Logger
}

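propc: propc is an unbuffered channel. Requests that the application writes through the Propose interface are wrapped in a Message and pushed into propc. The node's run method pops the Message from propc, appends it to its own raft log, and puts the resulting messages into the mailbox (msgs []pb.Message in the raft struct). These msgs are packaged into a Ready, taken by the application from readyc, and then sent out through the application-defined transport.
recvc: when the application-defined transport receives a Message, it must call
```
func (n *node) Step(ctx context.Context, m pb.Message) error
```
to put the Message into recvc. After some processing, any Messages that need to be sent are likewise put into the mailbox for the corresponding peers and later sent out through the custom transport.
readyc/advancec: readyc and advancec are both unbuffered channels. Inside node.run(), the relevant state updates are packed into a Ready struct (the msgs mentioned above are one such piece of state) and put into readyc. The application pops the Ready from readyc, handles the state it carries, and when it is done calls
```
rc.node.Advance()
```
which pushes an empty struct into advancec to tell raft that the state contained in this Ready has been handled. Once node.run() is notified through advancec, it updates some of its internal state, for example removing from memory (type unstable struct) the entries that have already been persisted to storage.
tickc: the application periodically pushes an empty struct into tickc, and node.run() then calls tick(). For a leader, tick() sends heartbeats to the other peers. For a follower, it checks whether an election needs to be started.
confc/confstatec: the application takes CommittedEntries out of a Ready and, if they contain membership-change entries, must call
```
func (n *node) ApplyConfChange(cc pb.ConfChange) *pb.ConfState
```
This function pushes the ConfChange into confc, which is also an unbuffered channel. node.run() takes the ConfChange out of confc, performs the actual addition or removal of peers, and then pushes the latest membership into confstatec. ApplyConfChange pops that latest membership from confstatec and returns it to the application.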

All in all, there is quite a lot to understand before you can put etcd's raft library to use.
