PAXOS Implementation: libevent_paxos

This article is part of the project. It mainly describes how the PAXOS algorithm is implemented.



 Qiushan, Shandong University

Part I   Start

Part II  PAXOS

Part III  Architecture

Part IV  Performance

Part V  Skills


Part I   Generated on November 15, 2014 (Start)

libevent_paxos ensures that all replicas of an application (e.g., Apache) receive a consistent total order of requests (i.e., socket operations) even if a small number of replicas fail. That is why we use Paxos.

To begin with, let's read Paxos Made Practical carefully. The numbered steps discussed below refer to a figure in that paper.

1)    In libevent_paxos, a client connects to a PROXY via a socket and then sends a message 'clientx:send' to the proxy. Establishing the connection triggers a listener callback called proxy_on_accept. When the message arrives at the proxy, it triggers a second callback, client_side_on_read, which handles the message by calling client_process_data. That function invokes build_req_sub_msg, which builds the formal message (REQUEST_SUBMIT). The message is then sent to &proxy->sys_addr.c_addr, in other words to the corresponding consensus component; here that means 127.0.0.1:8000. The consensus component handles the incoming message with handle_request_submit, which invokes consensus_submit_request. If the current node is the leader, consensus_submit_request uses leader_handle_submit_req to combine the client's request with a view_stamp generated by get_next_view_stamp. Inside leader_handle_submit_req, build_accept_req builds a new message of type ACCEPT_REQ, which is then broadcast to the other replicas via the function uc (i.e., send_for_consensus_comp). This process corresponds to steps 1-2 of the figure in Paxos Made Practical.
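The leader-side part of this flow (stamp the request, then wrap it in an ACCEPT_REQ) can be sketched roughly as below. This is an illustration only, not the project's code: the struct layouts, every name ending in _sketch, and the rule that a view stamp is a (view_id, req_id) pair whose req_id grows by one per request are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layouts for illustration; the real definitions may differ. */
typedef struct {
    uint32_t view_id; /* assumed: identifies the current view */
    uint32_t req_id;  /* assumed: position of the request inside that view */
} view_stamp_sketch;

typedef struct {
    view_stamp_sketch msg_vs; /* where the request sits in the total order */
    uint32_t node_id;         /* id of the sending leader */
    size_t data_size;         /* payload length in bytes */
    char data[];              /* payload copied from the client request */
} accept_req_sketch;

/* Assumed behavior: hand out consecutive request ids within one view. */
static view_stamp_sketch next_view_stamp_sketch(view_stamp_sketch *highest) {
    assert(highest != NULL);
    highest->req_id += 1;
    return *highest;
}

/* Wrap a client payload and its view stamp into one heap-allocated message.
 * The caller owns the result and must free() it. */
static accept_req_sketch *build_accept_req_sketch(uint32_t leader_id,
                                                  view_stamp_sketch vs,
                                                  const void *data,
                                                  size_t data_size) {
    accept_req_sketch *msg = malloc(sizeof *msg + data_size);
    if (msg == NULL) {
        return NULL;
    }
    msg->msg_vs = vs;
    msg->node_id = leader_id;
    msg->data_size = data_size;
    if (data_size > 0) {
        memcpy(msg->data, data, data_size);
    }
    return msg;
}
```

Stamping each request before it is broadcast is what allows every replica to apply the requests in the same order.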

2)   In the figure, step 3 is replicate_res, the acknowledgement that the backups send after receiving the request information from the primary. In libevent_paxos, it corresponds to

handle_accept_req (src/consensus/consensus.c)

// build the reply to the leader
accept_ack* reply = build_accept_ack(comp,&msg->msg_vs);

Looking at build_accept_ack, we find that what the backups deliver to the primary is:

msg->node_id = comp->node_id;

msg->msg_vs = *vs;

msg->header.msg_type = ACCEPT_ACK;

which is similar to replicate_res in the paper.

So libevent_paxos also follows what Paxos Made Practical describes.

Next, libevent_paxos should execute the request once it has received acknowledgements from a majority of replicas. The execution path should be

handle_accept_ack

try_to_execute

leader_try_to_execute 

As for how the ack is sent to the primary, that is the job of the function uc (a.k.a. send_for_consensus_comp). Let's look at it.

// consensus part
static void send_for_consensus_comp(node* my_node,size_t data_size,void* data,int target){
    consensus_msg* msg = build_consensus_msg(data_size,data);
    if(NULL==msg){
        goto send_for_consensus_comp_exit;
    }
    // means send to every node except me
    if(target<0){
        for(uint32_t i=0;i<my_node->group_size;i++){
            if(i!=my_node->node_id && my_node->peer_pool[i].active){
                struct bufferevent* buff = my_node->peer_pool[i].my_buff_event;
                bufferevent_write(buff,msg,CONSENSUS_MSG_SIZE(msg));
                SYS_LOG(my_node,
                        "Send Consensus Msg To Node %u\n",i);
            }
        }
    }else{
        if(target!=(int)my_node->node_id&&my_node->peer_pool[target].active){
            struct bufferevent* buff = my_node->peer_pool[target].my_buff_event;
            bufferevent_write(buff,msg,CONSENSUS_MSG_SIZE(msg));
            SYS_LOG(my_node,
                    "Send Consensus Msg To Node %u.\n",target);
        }
    }
send_for_consensus_comp_exit:
    if(msg!=NULL){
        free(msg);
    }
    return;
}

So uc takes care of delivering messages between the primary and the backups. In the same way, when a client request arrives at the primary, the primary also uses uc (a.k.a. send_for_consensus_comp) to broadcast the request information to the backups.
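The recipient selection inside send_for_consensus_comp can be isolated from libevent and looked at on its own. The sketch below is a simplified stand-in under stated assumptions: peer_sketch and pick_recipients are hypothetical names, and the upper bound check on target is an addition that the function quoted above does not have.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a peer entry; only the active flag matters here. */
typedef struct {
    int active;
} peer_sketch;

/* Writes the ids that would receive the message into out and returns how
 * many there are. target < 0 means "every active peer except me"; any other
 * value names a single peer. out must have room for group_size entries. */
static size_t pick_recipients(const peer_sketch *peers, uint32_t group_size,
                              uint32_t my_id, int target, uint32_t *out) {
    size_t n = 0;
    assert(peers != NULL && out != NULL);
    if (target < 0) {
        for (uint32_t i = 0; i < group_size; i++) {
            if (i != my_id && peers[i].active) {
                out[n++] = i;
            }
        }
    } else if ((uint32_t)target < group_size /* added bounds check */
               && (uint32_t)target != my_id
               && peers[target].active) {
        out[n++] = (uint32_t)target;
    }
    return n;
}
```

With three nodes where node 2 is inactive, node 0 broadcasting (target < 0) reaches only node 1, and a message addressed to itself or to an inactive peer is silently dropped, which mirrors the behavior of the quoted function.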

3)  That leaves step 4 to be explained.

group_size = 3;

leader_try_to_execute
{
    if(reached_quorum(record_data,comp->group_size))
    {
        SYS_LOG(comp,"Node %d : View Stamp%u : %u Has Reached Quorum.\n",…)
        SYS_LOG(comp,"Before Node %d IncExecute %u : %u.\n",…)
        SYS_LOG(comp,"After Node %d Inc Execute %u : %u.\n",…)
    }
}

static int reached_quorum(request_record*record,int group_size){
    // this may be compatibility issue
    if(__builtin_popcountl(record->bit_map)>=((group_size/2)+1)){
        return 1;
    }else{
        return 0;
    }
}

__builtin_popcountl is a GCC builtin that returns the number of 1 bits in its unsigned long argument.
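The quorum test can be tried in isolation. The sketch below assumes GCC or Clang and uses hypothetical helper names; it swaps in __builtin_popcountll so that all 64 bits are counted even on platforms where unsigned long is only 32 bits wide, which is one possible reading of the "compatibility issue" comment above.

```c
#include <assert.h>
#include <stdint.h>

/* Majority of a group: 2 of 3, 3 of 4, 3 of 5. */
static int quorum_size(int group_size) {
    return group_size / 2 + 1;
}

/* Same comparison as reached_quorum, applied directly to a bitmap.
 * __builtin_popcountll takes an unsigned long long, which is at least
 * 64 bits wide, so no bit of the map is dropped. */
static int reached_quorum_sketch(uint64_t bit_map, int group_size) {
    assert(group_size > 0 && group_size <= 64);
    return __builtin_popcountll(bit_map) >= quorum_size(group_size);
}
```

For group_size = 3 the quorum is 2, so a bitmap with a single bit set is not enough, while any bitmap with two bits set is.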

So next we should pay attention to the data structure record->bit_map.

In consensus.c

typedef struct request_record_t{
    struct timeval created_time; // data created timestamp
    char is_closed;
    uint64_t bit_map; // now we assume the maximal replica group size is 64;
    size_t data_size; // data size
    char data[0]; // real data
}__attribute__((packed))request_record;

Before exploring further, let us suppose that the leader uses bit_map to record how many replicas have accepted the request it proposed, and that the leader can execute the request if and only if that number reaches the quorum (i.e., a majority, (group_size/2)+1). If this assumption holds, there must be a place where the ACCEPT_ACK messages are recorded, and it must come after the leader has received the ACCEPT_ACK messages from the other replicas. Let's find that place and verify our hypothesis.

consensus_handle_msg
{
    case ACCEPT_ACK:
        handle_accept_ack(comp,data)
        {
            update_record(record_data,msg->node_id)
            {
                record->bit_map= (record->bit_map | (1<<node_id));
                //debug_log("the record bit map isupdated to %x\n",record->bit_map);
            }
        }
}

So our assumption is correct. Moreover, from the expression

record->bit_map | (1<<node_id)

we can see that the number of 1 bits in record->bit_map is the number of replicas that have sent ACCEPT_ACK.
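One detail of that expression is worth a note. In C the literal 1 has type int, so with a 32-bit int the shift 1<<node_id is only well defined for node ids below 31, even though bit_map itself has 64 bits. A minimal sketch of the same update with a 64-bit constant follows; mark_ack is a hypothetical name, not a function from the project.

```c
#include <assert.h>
#include <stdint.h>

/* Record an ACCEPT_ACK from node_id. Setting a bit that is already set
 * changes nothing, so a duplicate ack from the same node is counted once. */
static uint64_t mark_ack(uint64_t bit_map, uint32_t node_id) {
    assert(node_id < 64); /* one bit per replica, 64 replicas at most */
    return bit_map | (UINT64_C(1) << node_id);
}
```

Starting from an empty map, acks from nodes 1, 1 and 2 leave exactly two bits set (value 0x6), so a resent ack cannot push a request over the quorum on its own.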


Evaluation Framework

  • Python script

./eval.py apache_ab.cfg

  • DEBUG Logs: proxy-req.log | proxy-sys.log | consensus-sys.log

node-0-proxy-req.log:

1418002395 : 1418002395.415361,1418002395.415362,1418002395.416476,1418002395.416476
Operation: Connects.

1418002395 : 1418002395.415617,1418002395.415619,1418002395.416576,1418002395.416576
Operation: Sends data: (START):client0:send:(END)

1418002395 : 1418002395.416275,1418002395.416276,1418002395.417113,1418002395.417113
Operation: Closes.

In proxy-req.log, what does a line such as 1418002395 : 1418002395.415361,1418002395.415362,1418002395.416476,1418002395.416476 mean? The format comes from this call:

fprintf(output,"%lu : %lu.%06lu,%lu.%06lu,%lu.%06lu,%lu.%06lu\n",header->connection_id,
            header->received_time.tv_sec,header->received_time.tv_usec,
            header->created_time.tv_sec,header->created_time.tv_usec,
            endtime.tv_sec,endtime.tv_usec,endtime.tv_sec,endtime.tv_usec);

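The first field is the connection id. It is followed by the received time, the created time and the end time of the operation, each as seconds.microseconds; the end time is printed twice.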
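To turn such a line into a latency figure, it can be parsed back with sscanf using the same layout as the fprintf call. The sketch below is an illustration with hypothetical names (req_log_entry, parse_req_log_line); it is not part of the evaluation scripts.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed line of proxy-req.log; times are microseconds since the epoch. */
typedef struct {
    unsigned long connection_id;
    long long received_us;
    long long created_us;
    long long end_us;
} req_log_entry;

/* Parse "id : received,created,end,end".
 * Returns 1 on success and 0 if the line does not match the layout. */
static int parse_req_log_line(const char *line, req_log_entry *e) {
    unsigned long sec[4], usec[4];
    int n;
    assert(line != NULL && e != NULL);
    n = sscanf(line, "%lu : %lu.%6lu,%lu.%6lu,%lu.%6lu,%lu.%6lu",
               &e->connection_id,
               &sec[0], &usec[0], &sec[1], &usec[1],
               &sec[2], &usec[2], &sec[3], &usec[3]);
    if (n != 9) {
        return 0;
    }
    /* The fractional part is always printed with six digits (%06lu),
     * so it can be read directly as microseconds. */
    e->received_us = (long long)sec[0] * 1000000LL + (long long)usec[0];
    e->created_us = (long long)sec[1] * 1000000LL + (long long)usec[1];
    e->end_us = (long long)sec[2] * 1000000LL + (long long)usec[2];
    return 1;
}
```

For the "Connects" line shown above, end_us - received_us is 1115 microseconds, and the "Operation: ..." lines are rejected because they do not start with a number.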

  • Google Spreadsheets

suites | benchmark | workload per client | Server # | Concurrency # | Client # | Requests # | w/ proxy consensus mean (us) | w/ proxy consensus s.t.d. | w/ proxy response mean (us) | w/ proxy response s.t.d. | w/ proxy throughput (Req/s) | w/ proxy server mean (us) | w/ proxy server throughput (Req/s) | w/o proxy server mean (us) | w/o proxy server throughput (Req/s) | overhead mean (us) | notes
Mongoose | Ab | 10 | 3 | 10 | 100 | 5434 | 11784.4 | 12637 | 12137.4 | 12671 | 1125.57 | 6024 | 1660.03 | 1865 | 5361.93 | 4159 |
Mongoose | Ab | 100 | 3 | 10 | 100 | 58384 | 9650.152 | 15138.11 | 9992.337 | 15145.30 | 2485.936 | 16216 | 616.69 | 1315 | 7605.72 | 14901 |
Mongoose | Ab | 100 | 3 | 50 | 100 | 57812 | 17456.41 | 17694.93 | 18459.16 | 17518.62 | 5127.989 | 70611 | 708.1 | 5744 | 8704.74 | 64867 |
Apache | Ab | 10 | 3 | 10 | 100 | | | | | | | | | | | |
Posted on 2017-08-03 15:26 by lxjshuju