一主多从+Binlog Server，主库故障无法访问，如何在从库中选举一个新主库

一、基本环境

VMware10.0+CentOS6.9+MySQL5.7.19

ROLE	HOSTNAME	BASEDIR	DATADIR	IP	PORT
M	ZST1	/usr/local/mysql	/data/mysql/mysql3306/data	192.168.85.132	3306
S1	ZST2	/usr/local/mysql	/data/mysql/mysql3306/data	192.168.85.133	3306
S2	ZST1	/usr/local/mysql	/data/mysql/mysql3307/data	192.168.85.132	3307

基于Row+Gtid搭建的一主两从异步复制结构：M->{S1、S2}。Binlog Server运行于192.168.85.133，日志保存目录/data/binlogserver

二、故障模拟

M不断地写入数据，分别暂停S1、S2的IO_Thread，然后shutdown主库，再启动S1、S2的IO_Thread。在完成上述操作后，我们来查看主、从数据情况

1、M不断地写入数据，分别暂停S1、S2的IO_Thread，然后shutdown主库

# 主库M在shutdown前的binlog及测试表中的数据

mydba@192.168.85.132,3306 [replcrash]> show master status;

+------------------+----------+--------------+------------------+--------------------------------------------+

| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                          |

+------------------+----------+--------------+------------------+--------------------------------------------+

| mysql-bin.000064 |     1679 |              |                  | 8ab82362-9c37-11e7-a858-000c29c1025c:1-493 |

+------------------+----------+--------------+------------------+--------------------------------------------+

1 row in set (0.00 sec)

mydba@192.168.85.132,3306 [replcrash]> select * from repl;

+----+----------+----------+

| id | name1    | name2    |

+----+----------+----------+

|  1 | 16:12:59 | 16:12:59 |

|  2 | 16:13:00 | 16:13:00 |

|  3 | 16:13:03 | 16:13:03 |

|  4 | 16:13:04 | 16:13:04 |

|  5 | 16:13:06 | 16:13:06 |

|  6 | 16:13:09 | 16:13:09 |

|  7 | 16:13:10 | 16:13:10 |

|  8 | 16:55:04 | 16:55:04 |

|  9 | 17:01:05 | 17:01:05 |

| 10 | 17:01:15 | 17:01:15 |

| 11 | 17:02:15 | 17:02:15 |

| 12 | 17:02:17 | 17:02:17 |

| 13 | 17:04:01 | 17:04:01 |

+----+----------+----------+

13 rows in set (0.00 sec)

mydba@192.168.85.132,3306 [replcrash]> shutdown;

Query OK, 0 rows affected (0.01 sec)

2、再次启动S1、S2的IO_Thread

# 从库S1复制状态

mydba@192.168.85.133,3306 [replcrash]> show slave status\G

*************************** 1. row ***************************

               Slave_IO_State: Connecting to master

                  Master_Host: 192.168.85.132

                  Master_User: repl

                  Master_Port: 3306

                Connect_Retry: 60

              Master_Log_File: mysql-bin.000064

          Read_Master_Log_Pos: 788

        Relay_Master_Log_File: mysql-bin.000064

             Slave_IO_Running: Connecting

            Slave_SQL_Running: Yes

          Exec_Master_Log_Pos: 788

              Until_Condition: None

                Last_IO_Errno: 2003

                Last_IO_Error: error connecting to master 'repl@192.168.85.132:3306' - retry-time: 60  retries: 1

             Master_Server_Id: 1323306

                  Master_UUID: 8ab82362-9c37-11e7-a858-000c29c1025c

                    SQL_Delay: 0

          SQL_Remaining_Delay: NULL

      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates

           Retrieved_Gtid_Set: 8ab82362-9c37-11e7-a858-000c29c1025c:166-490

            Executed_Gtid_Set: 8ab82362-9c37-11e7-a858-000c29c1025c:1-490

                Auto_Position: 1

# 从库S2复制状态

mydba@192.168.85.132,3307 [(none)]> show slave status\G

*************************** 1. row ***************************

               Slave_IO_State: Connecting to master

                  Master_Host: 192.168.85.132

                  Master_User: repl

                  Master_Port: 3306

                Connect_Retry: 60

              Master_Log_File: mysql-bin.000064

          Read_Master_Log_Pos: 1382

        Relay_Master_Log_File: mysql-bin.000064

             Slave_IO_Running: Connecting

            Slave_SQL_Running: Yes

          Exec_Master_Log_Pos: 1382

              Until_Condition: None

                Last_IO_Errno: 2003

                Last_IO_Error: error connecting to master 'repl@192.168.85.132:3306' - retry-time: 60  retries: 1

             Master_Server_Id: 1323306

                  Master_UUID: 8ab82362-9c37-11e7-a858-000c29c1025c

             Master_Info_File: /data/mysql/mysql3307/data/master.info

                    SQL_Delay: 60

          SQL_Remaining_Delay: NULL

      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates

           Retrieved_Gtid_Set: 8ab82362-9c37-11e7-a858-000c29c1025c:484-492

            Executed_Gtid_Set: 8ab82362-9c37-11e7-a858-000c29c1025c:1-492,90b30799-9215-11e7-8645-000c29c1025c:1-6

                Auto_Position: 1

# 从库S1的binlog及测试表中的数据

mydba@192.168.85.133,3306 [replcrash]> show master status;

+------------------+----------+--------------+------------------+--------------------------------------------+

| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                          |

+------------------+----------+--------------+------------------+--------------------------------------------+

| mysql-bin.000001 |  5462247 |              |                  | 8ab82362-9c37-11e7-a858-000c29c1025c:1-490 |

+------------------+----------+--------------+------------------+--------------------------------------------+

1 row in set (0.00 sec)

mydba@192.168.85.133,3306 [replcrash]> select * from repl;

+----+----------+----------+

| id | name1    | name2    |

+----+----------+----------+

|  1 | 16:12:59 | 16:12:59 |

|  2 | 16:13:00 | 16:13:00 |

|  3 | 16:13:03 | 16:13:03 |

|  4 | 16:13:04 | 16:13:04 |

|  5 | 16:13:06 | 16:13:06 |

|  6 | 16:13:09 | 16:13:09 |

|  7 | 16:13:10 | 16:13:10 |

|  8 | 16:55:04 | 16:55:04 |

|  9 | 17:01:05 | 17:01:05 |

| 10 | 17:01:15 | 17:01:15 |

+----+----------+----------+

10 rows in set (0.00 sec)

# 从库S2的binlog及测试表中的数据

mydba@192.168.85.132,3307 [replcrash]> show master status;

+------------------+----------+--------------+------------------+--------------------------------------------------------------------------------------+

| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                    |

+------------------+----------+--------------+------------------+--------------------------------------------------------------------------------------+

| mysql-bin.000008 |     2703 |              |                  | 8ab82362-9c37-11e7-a858-000c29c1025c:1-492,

90b30799-9215-11e7-8645-000c29c1025c:1-6 |

+------------------+----------+--------------+------------------+--------------------------------------------------------------------------------------+

1 row in set (0.01 sec)

mydba@192.168.85.132,3307 [replcrash]> select * from repl;

+----+----------+----------+

| id | name1    | name2    |

+----+----------+----------+

|  1 | 16:12:59 | 16:12:59 |

|  2 | 16:13:00 | 16:13:00 |

|  3 | 16:13:03 | 16:13:03 |

|  4 | 16:13:04 | 16:13:04 |

|  5 | 16:13:06 | 16:13:06 |

|  6 | 16:13:09 | 16:13:09 |

|  7 | 16:13:10 | 16:13:10 |

|  8 | 16:55:04 | 16:55:04 |

|  9 | 17:01:05 | 17:01:05 |

| 10 | 17:01:15 | 17:01:15 |

| 11 | 17:02:15 | 17:02:15 |

| 12 | 17:02:17 | 17:02:17 |

+----+----------+----------+

12 rows in set (0.00 sec)

汇总执行状态

ROLE	File/Position		Executed_Gtid_Set
ROLE	Master_Log_File/Read_Master_Log_Pos	Relay_Master_Log_File/Exec_Master_Log_Pos	Executed_Gtid_Set
S1	mysql-bin.000064/788	mysql-bin.000064/788	8ab82362-9c37-11e7-a858-000c29c1025c:1-490
S2	mysql-bin.000064/1382	mysql-bin.000064/1382	8ab82362-9c37-11e7-a858-000c29c1025c:1-492
M	mysql-bin.000064/1679		8ab82362-9c37-11e7-a858-000c29c1025c:1-493

从复制状态及测试表中的数据可知S1滞后于S2，S2滞后于M。实际环境中M可能已经完全不可访问，我们只能通过对比S2和Binlog Server的binlog来确定是否还有未传递到从库的日志

三、复制结构调整

3.1、多从选主原则

1、各从库的复制是否执行完成(SQL_Thread已执行完有效的relay-log)
2、复制完成的前提下，各从库读取到主库的位置(Master_Log_File、Read_Master_Log_Pos)，选择一个同步最靠前的
如果从库个数很多，肯定不适合肉眼对比各从库的Master_Log_File、Read_Master_Log_Pos，使用冒泡算法，得到最大的{Sx.Master_Log_File , Sx.Read_Master_Log_Pos}
如果所有从库读取到主库的同一位置，那么提前在配置表中设置谁优先成为主库(根据实际需求权值衡量)
3、将其他从库change到新主库

3.2、结构调整

本例中只有S1、S2两个从库，直接对比复制状态。各从库的SQL_Thread已执行完有效的relay-log；S1执行到mysql-bin.000064/788，S2执行到mysql-bin.000064/1382，因此S2作为我们的目标新主库。
S1应该change到S2的哪个位置呢？之前的复制结构中S1、S2并没有直接的关系，M上的同一个Event应用到各从库的位置并不一定相同。
3.2.1、传统复制
前提：S1、S2上开启了log_slave_updates(建议所有的从库都开启)，通过解析S1上执行完的binlog last event(已commit)到S2上找对应的event

# 解析S1的binlog

[root@ZST2 ~]# cd /data/mysql/mysql3306/logs

[root@ZST2 logs]# ll

total

-rw-r-----.  mysql mysql  Oct  : mysql-bin.

-rw-r-----.  mysql mysql       Oct  : mysql-bin.index

[root@ZST2 logs]# mysqlbinlog -v --base64-output=decode-rows mysql-bin.

...

COMMIT/*!*/;

# at

# :: server id   end_log_pos  CRC32 0x7645e302        GTID    last_committed=      sequence_number=     rbr_only=yes

/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;

SET @@SESSION.GTID_NEXT= '8ab82362-9c37-11e7-a858-000c29c1025c:490'/*!*/;

# at

# :: server id   end_log_pos  CRC32 0x39d04b3a        Query   thread_id=    exec_time=     error_code=

SET TIMESTAMP=/*!*/;

BEGIN

/*!*/;

# at

# :: server id   end_log_pos  CRC32 0xb3a9a2bf        Table_map: `replcrash`.`repl` mapped to number

# at

# :: server id   end_log_pos  CRC32 0x2648ea90        Write_rows: table id  flags: STMT_END_F

### INSERT INTO `replcrash`.`repl`

### SET

###   @=

###   @='17:01:15'

###   @='17:01:15'

# at

# :: server id   end_log_pos  CRC32 0x524c761a        Xid =

COMMIT/*!*/;

SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;

DELIMITER ;

# End of log file

/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

[root@ZST2 logs]#

重点关注最后一个commit事务中的server_id(master传过来)、TIMESTAMP(master传过来)
server_id+TIMESTAMP可以定位原始实例上的具体binlog，然后根据TIMESTAMP的上下文去确定pos

# S1上最后一个事务的时间戳

mysql> select from_unixtime(),unix_timestamp(now());

+---------------------------+-----------------------+

| from_unixtime() | unix_timestamp(now()) |

+---------------------------+-----------------------+

| -- ::       |             |

+---------------------------+-----------------------+

# 解析S2的binlog，根据前面的时间2017-- ::15存在于mysql-bin.000008文件中

# 上面的那个时间，准确来说只适合在Master上使用，例如我们将从库的SQL_Thread暂停几小时，再启用，它在binlog中的记录还是1508144475，但是我们不能确保它就写到了mysql-bin.000008文件，因此线上环境我们可能要多解析几个binlog去查找TIMESTAMP

[root@ZST1 ~]# cd /data/mysql/mysql3307/logs

[root@ZST1 logs]# ll

total

-rw-r-----.  mysql mysql    Oct  : mysql-bin.

-rw-r-----.  mysql mysql     Oct  : mysql-bin.

-rw-r-----.  mysql mysql  Oct  : mysql-bin.

-rw-r-----.  mysql mysql     Oct  : mysql-bin.

-rw-r-----.  mysql mysql      Oct  : mysql-bin.

-rw-r-----.  mysql mysql      Oct  : mysql-bin.

-rw-r-----.  mysql mysql    Oct  : mysql-bin.

-rw-r-----.  mysql mysql     Oct  : mysql-bin.

-rw-r-----.  mysql mysql      Oct  : mysql-bin.index

[root@ZST1 logs]# mysqlbinlog -v --base64-output=decode-rows mysql-bin.

...

COMMIT/*!*/;

# at

# :: server id   end_log_pos  CRC32 0xdae59343   GTID    last_committed=        sequence_number=       rbr_only=yes

/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;

SET @@SESSION.GTID_NEXT= '8ab82362-9c37-11e7-a858-000c29c1025c:490'/*!*/;

# at

# :: server id   end_log_pos  CRC32 0x021a70da   Query   thread_id=    exec_time=    error_code=

SET TIMESTAMP=/*!*/;

BEGIN

/*!*/;

# at

# :: server id   end_log_pos  CRC32 0x117d1f49   Table_map: `replcrash`.`repl` mapped to number

# at

# :: server id   end_log_pos  CRC32 0xe5a9bd74   Write_rows: table id  flags: STMT_END_F

### INSERT INTO `replcrash`.`repl`

### SET

###   @=

###   @='17:01:15'

###   @='17:01:15'

# at

# :: server id   end_log_pos  CRC32 0x7ee611d4   Xid =

COMMIT/*!*/;

# at

# :: server id   end_log_pos  CRC32 0x1b5a4b41   GTID    last_committed=        sequence_number=       rbr_only=yes

/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;

SET @@SESSION.GTID_NEXT= '8ab82362-9c37-11e7-a858-000c29c1025c:491'/*!*/;

# at

# :: server id   end_log_pos  CRC32 0x85053196   Query   thread_id=    exec_time=    error_code=

SET TIMESTAMP=/*!*/;

...

很幸运，我们在S2的mysql-bin.000008文件找到对应的TIMESTAMP
我们只需在S2找到对应事务，然后取commit后的第一个at，作为change的起始位置

#传统复制

S1：

mysql> change master to

master_host='192.168.85.132',

master_port=3307,

master_user='repl',

master_password='repl',

master_log_file='mysql-bin.000008',

master_log_pos=2153,

master_auto_position=0;

3.2.2、GTID
直接执行，S1：stop slave;change master to S2...,master_auto_position=1;start slave;

3.3、应用Binlog Server中的binlog

前面复制已搭建S2->S1，只需对S2应用Binlog Server中的binlog，S2之前的复制状态Master_Log_File/Read_Master_Log_Pos：mysql-bin.000064/1382

# 解析binlogserver

[root@ZST2 ~]# cd /data/binlogserver

[root@ZST2 binlogserver]# mysqlbinlog -v --base64-output=decode-rows mysql-bin. --start-position=

/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;

/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;

DELIMITER /*!*/;

# at

# :: server id   end_log_pos  CRC32 0xfd581af7   GTID    last_committed=        sequence_number=       rbr_only=yes

/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;

SET @@SESSION.GTID_NEXT= '8ab82362-9c37-11e7-a858-000c29c1025c:493'/*!*/;

# at

# :: server id   end_log_pos  CRC32 0xb9b3df4a   Query   thread_id=    exec_time=     error_code=

SET TIMESTAMP=/*!*/;

SET @@session.pseudo_thread_id=/*!*/;

SET @@session.foreign_key_checks=, @@session.sql_auto_is_null=, @@session.unique_checks=, @@session.autocommit=/*!*/;

SET @@session.sql_mode=/*!*/;

SET @@session.auto_increment_increment=, @@session.auto_increment_offset=/*!*/;

/*!\C utf8 *//*!*/;

SET @@session.character_set_client=,@@session.collation_connection=,@@session.collation_server=/*!*/;

SET @@session.time_zone='SYSTEM'/*!*/;

SET @@session.lc_time_names=/*!*/;

SET @@session.collation_database=DEFAULT/*!*/;

BEGIN

/*!*/;

# at

# :: server id   end_log_pos  CRC32 0xa14412d2   Table_map: `replcrash`.`repl` mapped to number

# at

# :: server id   end_log_pos  CRC32 0xad9ead7e   Write_rows: table id  flags: STMT_END_F

### INSERT INTO `replcrash`.`repl`

### SET

###   @=

###   @='17:04:01'

###   @='17:04:01'

# at

# :: server id   end_log_pos  CRC32 0xd2670a1c   Xid =

COMMIT/*!*/;

SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;

DELIMITER ;

# End of log file

/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

[root@ZST2 binlogserver]#

Binlog Server的binlog还有未传递到从库的日志，将后续日志恢复到S2中

# 导入binlogserver到S2

[root@ZST2 ~]# mysqlbinlog /data/binlogserver/mysql-bin. --start-position= |mysql -h192.168.85. -P3307 -umydba -p

导入后S2、S1数据就和M shutdown前保持一致了，当然我们还可以把M开启，然后就变成M->S2->S1这样的级联复制~

3.4、所有从库同步位置一致

这里的一致实际包括：当前S1、S2已处于一致或者S1、S2在应用Binlog Server的binlog后处于一致
以S1作为新主库，传统模式
S1：show master status;
S2：stop salve;change master to S1...,master_log_file=S1.log_file,master_log_pos=S1.log_pos,master_auto_position=0;start slave;
GTID：直接执行，S2：stop slave;change master to S1...,master_auto_position=1;start slave;

四、总结

问题的难点是找准从库之间的"关联"，如果复制是基于Row+Gtid，Change根本就不是事。至于Binlog Server，只是用于将未传递到从库的日志恢复到从库中。