MySQL master-master replication failure

Fault description

Cause	An aging PDU in the rack failed and cut power to the entire rack.
Time of failure	20160923-10:09
Time of discovery	20160929-13:56

Architecture

Tomcat	Memcache	Keepalive	MySQL master-master replication

Node information

No.	Node name	IP address	Error message
1	aipprd1	10.66.1.52	Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'mysql-bin.000084' at 91941417, the last event read from '/aip/mysql/data/log/mysql-bin.000084' at 91941783, the last byte read from '/aip/mysql/data/log/mysql-bin.000084' at 91942912.'
2	aipprd2	10.66.1.51	Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'mysql-bin.000082' at 6369026, the last event read from '/aip/mysql/data/log/mysql-bin.000082' at 6369026, the last byte read from '/aip/mysql/data/log/mysql-bin.000082' at 6369280.'
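The error text in the table carries the binlog file name and three byte offsets, which are the values the later steps work with. A minimal sketch of pulling them out is shown here; the function name and the regular expression are illustrative assumptions written against the message quoted above, not part of any MySQL tool.

```python
import re

# Pattern for the "binlog truncated" variant of error 1236 quoted in the table.
_TRUNCATED_RE = re.compile(
    r"the first event '(?P<file>[^']+)' at (?P<first>\d+), "
    r"the last event read from '(?P<path>[^']+)' at (?P<last_event>\d+), "
    r"the last byte read from '[^']+' at (?P<last_byte>\d+)"
)


def parse_truncated_binlog_error(message):
    """Return the binlog file and offsets named in the message, or None."""
    # Normalize typographic quotes so text copied from a web page still matches.
    normalized = message.replace("\u2018", "'").replace("\u2019", "'")
    match = _TRUNCATED_RE.search(normalized)
    if match is None:
        return None
    return {
        "file": match.group("file"),
        "first_event": int(match.group("first")),
        "last_event": int(match.group("last_event")),
        "last_byte": int(match.group("last_byte")),
    }


msg = (
    "Got fatal error 1236 from master when reading data from binary log: "
    "'binlog truncated in the middle of event; consider out of disk space on master; "
    "the first event 'mysql-bin.000082' at 6369026, the last event read from "
    "'/aip/mysql/data/log/mysql-bin.000082' at 6369026, the last byte read from "
    "'/aip/mysql/data/log/mysql-bin.000082' at 6369280.'"
)
info = parse_truncated_binlog_error(msg)
print(info)
```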

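Fault analysis

Because of a problem in the Zabbix MySQL monitoring script, no event was triggered, so the fault was only discovered on 20160929 while checking the Zabbix logs. At that time the Keepalive VIP was on aipprd1, so the database on that node was the one serving clients.
Therefore aipprd1 is treated as the master first, and the slave on aipprd2 is brought back in sync;
once the slave on aipprd2 has caught up, the slave on aipprd1 is resynchronized.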
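The fault went unnoticed for six days because the monitoring check never fired. As a rough sketch of what such a check could look at, the snippet here parses vertical-format SHOW SLAVE STATUS text and reports any replication thread that is not running. It is not the Zabbix script from the incident; the function names and the sample text are assumptions for illustration.

```python
def parse_slave_status(text):
    """Turn vertical-format SHOW SLAVE STATUS text into a dict of field -> value."""
    status = {}
    for line in text.splitlines():
        # Skip the row separator and any line that has no "field: value" shape.
        if ":" not in line or line.lstrip().startswith("*"):
            continue
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status


def replication_problems(status):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
        if status.get(thread) != "Yes":
            problems.append(f"{thread} is {status.get(thread, 'missing')}")
    return problems


# Hypothetical excerpt in the shape of the status output on a broken slave.
sample = """
*************************** 1. row ***************************
              Master_Host: 10.66.1.52
         Slave_IO_Running: No
        Slave_SQL_Running: No
"""
status = parse_slave_status(sample)
print(replication_problems(status))
```

A monitoring item built this way would alert on any value other than Yes, including a missing field, rather than only on specific error strings.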

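Synchronizing the slave on `AIPPRD2`

Check the slave status on aipprd2
According to the output that follows, Last_Errno is 1062 (duplicate entry), so the replication position has to be changed manually.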

[root@aipprd2 ~]# mysql -uzabbixmoniter -ppassw0rd -hlocalhost -e "show slave status\G;"

*************************** . row ***************************

           Slave_IO_State:

              Master_Host: 10.66.1.52

              Master_User: root

              Master_Port:

            Connect_Retry:

          Master_Log_File: mysql-bin.

      Read_Master_Log_Pos:

           Relay_Log_File: mysql-relay-bin.

            Relay_Log_Pos:

    Relay_Master_Log_File: mysql-bin.

         Slave_IO_Running: No

        Slave_SQL_Running: No

          Replicate_Do_DB:

      Replicate_Ignore_DB:

       Replicate_Do_Table:

   Replicate_Ignore_Table:

  Replicate_Wild_Do_Table:

Replicate_Wild_Ignore_Table:

               Last_Errno:

               Last_Error: Error 'Duplicate entry '93FF91EF92866D23E80E4A57D55ED538-n1.tomcat604' for key 'PRIMARY'' on query. Default database: 'aipprd'. Query: 'INSERT INTO eahttpsession (     sessionid, username, account,      createtime, loginip,userid,explorer,userDomain,computerName,computerUserName)   VALUES ('93FF91EF92866D23E80E4A57D55ED538-n1.tomcat604', '李花', 'XS003_4200',      '-- ::', , '','MSIE 7.0','','','')'

             Skip_Counter:

      Exec_Master_Log_Pos:

          Relay_Log_Space:

          Until_Condition: None

           Until_Log_File:

            Until_Log_Pos:

       Master_SSL_Allowed: No

       Master_SSL_CA_File:

       Master_SSL_CA_Path:

          Master_SSL_Cert:

        Master_SSL_Cipher:

           Master_SSL_Key:

    Seconds_Behind_Master: NULL

Master_SSL_Verify_Server_Cert: No

            Last_IO_Errno:

            Last_IO_Error: Got fatal error  from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'mysql-bin.' at 6369026, the last event read from '/aip/mysql/data/log/mysql-bin.' at 6369026, the last byte read from '/aip/mysql/data/log/mysql-bin.' at 6369280.'

           Last_SQL_Errno:

           Last_SQL_Error: Error 'Duplicate entry '93FF91EF92866D23E80E4A57D55ED538-n1.tomcat604' for key 'PRIMARY'' on query. Default database: 'aipprd'. Query: 'INSERT INTO eahttpsession (     sessionid, username, account,      createtime, loginip,userid,explorer,userDomain,computerName,computerUserName)   VALUES ('93FF91EF92866D23E80E4A57D55ED538-n1.tomcat604', '李花', 'XS003_4200',      '-- ::', , '','MSIE 7.0','','','')'

Replicate_Ignore_Server_Ids:

         Master_Server_Id:

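On aipprd2, reset the binlog file and position according to Last_IO_Error
After pointing the slave at the binlog file and position named in the error message, the error persisted.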

mysql> slave stop;

mysql> CHANGE MASTER TO master_host='10.66.1.52', master_port=, master_user='root',     master_password='passw0rd', master_log_file='mysql-bin.000082', master_log_pos=;

mysql> slave start;

Check the MySQL log on aipprd2
The MySQL log records when crash recovery started and gives a suggestion: Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000082' position 6367510

 :: [Note] Starting crash recovery...

 :: [Note] Crash recovery finished.

 :: [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000082' position

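On aipprd2, reset the binlog file and position as suggested in the log
After pointing the slave at the binlog file and position suggested in the log, the error persisted.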

mysql> slave stop;

mysql> CHANGE MASTER TO master_host='10.66.1.52', master_port=, master_user='root', master_password='passw0rd', master_log_file='mysql-bin.000082', master_log_pos=;

mysql> slave start;

Check the binlog file on aipprd1
First check the position reported by show slave status\G; it turns out not to exist in the binlog at all.

[root@aipprd1 log]# mysqlbinlog --no-defaults --start-position= mysql-bin.

/*!40019 SET @@session.max_insert_delayed_threads=0*/;

/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;

DELIMITER /*!*/;

# at 

# :: server id  end_log_pos  Start: binlog v , server v 5.5.-log created  :: at startup

ROLLBACK/*!*/;

BINLOG '

47HkVw8BAAAAZwAAAGsAAAAAAAQANS41LjI0LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

AAAAAAAAAAAAAAAAAADjseRXEzgNAAgAEgAEBAQEEgAAVAAEGggAAAAICAgCAA==

'/*!*/;

ERROR: Error in Log_event::read_log_event(): 'read error', data_len: , event_type:

ERROR: Could not read entry at offset : Error in log format or read error.

DELIMITER

# End of log file

ROLLBACK /* added by mysqlbinlog */;

/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

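Next, check the position suggested in the MySQL log. That position does exist in the binlog file, but the last position in the file is 6368660, whereas the position reported by show slave status\G; is 6369026, so the latter clearly cannot be in the file.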

[root@aipprd1 log]# mysqlbinlog --no-defaults  --start-position= mysql-bin.

/*!40019 SET @@session.max_insert_delayed_threads=0*/;

/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;

DELIMITER /*!*/;

# at 

# :: server id   end_log_pos   Start: binlog v , server v 5.5.-log created  :: at startup

# Warning: this binlog is either in use or was not closed properly.

ROLLBACK/*!*/;

BINLOG '

05XkVw8BAAAAZwAAAGsAAAABAAQANS41LjI0LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

AAAAAAAAAAAAAAAAAADTleRXEzgNAAgAEgAEBAQEEgAAVAAEGggAAAAICAgCAA==

'/*!*/;

    # at

    # at

    # at

    # at

    # at

    # at

    # at

    # at

    # at

    # at

    # at

    # End of log file

ROLLBACK /* added by mysqlbinlog */;

/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
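The comparison made here (requested position 6369026 against a last event offset of 6368660) can be sketched as a small check over the "# at N" lines that mysqlbinlog prints. The helper names are hypothetical, and the sample text only reuses the two offsets mentioned in the article rather than the full dump.

```python
import re


def event_offsets(mysqlbinlog_text):
    """Collect the offsets from the '# at N' lines printed by mysqlbinlog."""
    pattern = re.compile(r"^\s*# at (\d+)\s*$", re.MULTILINE)
    return [int(m.group(1)) for m in pattern.finditer(mysqlbinlog_text)]


def is_event_start(offsets, position):
    """True if the position is the start of an event that mysqlbinlog listed."""
    return position in set(offsets)


def beyond_last_event(offsets, position):
    """True if the position lies past the last event start in the file."""
    return bool(offsets) and position > max(offsets)


# Hypothetical excerpt; only the two offsets named in the article are used.
sample = """
# at 6367510
# at 6368660
# End of log file
"""
offsets = event_offsets(sample)
print(is_event_start(offsets, 6367510), beyond_last_event(offsets, 6369026))
```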

Finally, look in mysql-bin.000083 for the position 6369026 reported by show slave status\G; it does not exist there either.

[root@aipprd1 log]# mysqlbinlog --no-defaults  --start-position= mysql-bin.

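Reset the binlog file and position on aipprd2 once more
Looking at the binlogs on aipprd1: since position 6369026 does not exist at the end of mysql-bin.000082, and mysql-bin.000083 is the next binlog, reset the binlog file and position again.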

[root@aipprd1 log]# ll

total

-rw-rw----  mysql mysql  Sep  : mysql-bin.

-rw-rw----  mysql mysql  Sep  : mysql-bin.

-rw-rw----  mysql mysql   Sep  : mysql-bin.

-rw-rw----  mysql mysql    Sep  : mysql-bin.

-rw-rw----  mysql mysql  Sep  : mysql-bin.

-rw-rw----  mysql mysql   Sep  : mysql-bin.

-rw-rw----  mysql mysql  Sep  : mysql-bin.

-rw-rw----  mysql mysql     Sep  : mysql-bin.

-rw-rw----  mysql mysql        Sep  : mysql-bin.log.index

-rw-rw----  mysql mysql        Sep  : mysql-relay-bin.

-rw-rw----  mysql mysql         Sep  : mysql-relay-bin.index
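The step from mysql-bin.000082 to mysql-bin.000083 relies on binlog files being numbered consecutively with a zero-padded suffix. A minimal sketch of deriving the next name and confirming it is present on the master follows; the function name and the short list of available files are assumptions, since the file names in the listing above did not survive.

```python
def next_binlog_name(name):
    """Return the binlog name that follows `name`, keeping the zero padding."""
    base, dot, index = name.rpartition(".")
    if not dot or not index.isdigit():
        raise ValueError(f"not a numbered binlog name: {name!r}")
    return f"{base}.{int(index) + 1:0{len(index)}d}"


# Hypothetical excerpt of the master's binlog files; the real listing is longer.
available = ["mysql-bin.000082", "mysql-bin.000083", "mysql-bin.000084"]
candidate = next_binlog_name("mysql-bin.000082")
print(candidate, candidate in available)
```

Restarting from the beginning of the next file skips whatever was lost at the truncated end of the previous one, so the data on both nodes still needs to be compared afterwards.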

Set the binlog file to mysql-bin.000083 and the position to 0, then start the slave; replication now works.

mysql> slave stop;

mysql> CHANGE MASTER TO master_host='10.66.1.52', master_port=, master_user='root', master_password='passw0rd', master_log_file='mysql-bin.000083', master_log_pos=;

mysql> slave start;

mysql> show slave status\G;

*************************** . row ***************************

           Slave_IO_State: Waiting for master to send event

              Master_Host: 10.66.1.52

              Master_User: root

              Master_Port:

            Connect_Retry:

          Master_Log_File: mysql-bin.

      Read_Master_Log_Pos:

           Relay_Log_File: mysql-relay-bin.

            Relay_Log_Pos:

    Relay_Master_Log_File: mysql-bin.

         Slave_IO_Running: Yes

        Slave_SQL_Running: Yes

          Replicate_Do_DB:

      Replicate_Ignore_DB:

       Replicate_Do_Table:

   Replicate_Ignore_Table:

  Replicate_Wild_Do_Table:

Replicate_Wild_Ignore_Table:

               Last_Errno:

               Last_Error:

             Skip_Counter:

      Exec_Master_Log_Pos:

          Relay_Log_Space:

          Until_Condition: None

           Until_Log_File:

            Until_Log_Pos:

       Master_SSL_Allowed: No

       Master_SSL_CA_File:

       Master_SSL_CA_Path:

          Master_SSL_Cert:

        Master_SSL_Cipher:

           Master_SSL_Key:

    Seconds_Behind_Master:

Master_SSL_Verify_Server_Cert: No

            Last_IO_Errno:

            Last_IO_Error:

           Last_SQL_Errno:

           Last_SQL_Error:

Replicate_Ignore_Server_Ids:

         Master_Server_Id:

 row in set (0.00 sec)

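During synchronization, several SQL errors of the form Last_SQL_Error: Error 'Duplicate entry ...' (error 1062) appeared. A duplicate primary key causes the slave to stop; the following resolves it (if there are several duplicate keys, it has to be run several times):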
```
mysql> slave stop;

mysql> set GLOBAL SQL_SLAVE_SKIP_COUNTER=;

mysql> slave start;
```

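Another option is to add the line slave_skip_errors = 1062 under [mysqld] in the MySQL configuration file /etc/my.cnf, save it, and restart MySQL; the slave then replicates normally. Note that with this setting the slave silently skips every duplicate-key error, not just the ones caused by this incident.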
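Since skipping an event discards a write, it seems safer to skip only when the reported error really is the duplicate-key error 1062. The sketch here returns the statements to run for one skip and nothing for any other error. The function name is hypothetical, and the counter value 1 is an assumption: the value in the transcript above did not survive, and 1 matches the remark that the procedure is repeated once per duplicate.

```python
DUPLICATE_KEY_ERRNO = 1062


def skip_statements(last_sql_errno):
    """Return the statements for skipping one event, only for duplicate-key errors."""
    if last_sql_errno != DUPLICATE_KEY_ERRNO:
        # Any other error needs investigation rather than a blind skip.
        return []
    return [
        "STOP SLAVE;",
        # Assumption: skip exactly one event per run.
        "SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;",
        "START SLAVE;",
    ]


for statement in skip_statements(1062):
    print(statement)
```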

Synchronizing the slave on `AIPPRD1`

Check the slave status on aipprd1

mysql> show slave status\G;

*************************** . row ***************************

           Slave_IO_State:

              Master_Host: 10.66.1.51

              Master_User: root

              Master_Port:

            Connect_Retry:

          Master_Log_File: mysql-bin.

      Read_Master_Log_Pos:

           Relay_Log_File: mysql-relay-bin.

            Relay_Log_Pos:

    Relay_Master_Log_File: mysql-bin.

         Slave_IO_Running: No

        Slave_SQL_Running: Yes

          Replicate_Do_DB:

      Replicate_Ignore_DB:

       Replicate_Do_Table:

   Replicate_Ignore_Table:

  Replicate_Wild_Do_Table:

Replicate_Wild_Ignore_Table:

               Last_Errno:

               Last_Error:

             Skip_Counter:

      Exec_Master_Log_Pos:

          Relay_Log_Space:

          Until_Condition: None

           Until_Log_File:

            Until_Log_Pos:

       Master_SSL_Allowed: No

       Master_SSL_CA_File:

       Master_SSL_CA_Path:

          Master_SSL_Cert:

        Master_SSL_Cipher:

           Master_SSL_Key:

    Seconds_Behind_Master: NULL

Master_SSL_Verify_Server_Cert: No

            Last_IO_Errno:

            Last_IO_Error: Got fatal error  from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'mysql-bin.' at 91941417, the last event read from '/aip/mysql/data/log/mysql-bin.' at 91941783, the last byte read from '/aip/mysql/data/log/mysql-bin.' at 91942912.'

           Last_SQL_Errno:

           Last_SQL_Error:

Replicate_Ignore_Server_Ids:

         Master_Server_Id:

 row in set (0.00 sec)

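Check the log files on aipprd2
The show slave status\G; output on aipprd1 suggests changing the binlog file to mysql-bin.000084 and the position to 91942912. However, once aipprd2 has finished synchronizing, the data it replicated came from aipprd1, so that data already exists on aipprd1.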

[root@aipprd2 log]# ll

total

-rw-rw----  mysql mysql  Sep  : mysql-bin.

-rw-rw----  mysql mysql   Sep  : mysql-bin.

-rw-rw----  mysql mysql    Sep  : mysql-bin.

-rw-rw----  mysql mysql       Sep  : mysql-bin.

-rw-rw----  mysql mysql  Sep  : mysql-bin.

-rw-rw----  mysql mysql  Sep  : mysql-bin.

-rw-rw----  mysql mysql   Sep  : mysql-bin.

-rw-rw----  mysql mysql        Sep  : mysql-bin.log.index

-rw-rw----  mysql mysql   Sep  : mysql-relay-bin.

-rw-rw----  mysql mysql  Sep  : mysql-relay-bin.

-rw-rw----  mysql mysql        Sep  : mysql-relay-bin.

-rw-rw----  mysql mysql        Sep  : mysql-relay-bin.

-rw-rw----  mysql mysql   Sep  : mysql-relay-bin.

-rw-rw----  mysql mysql        Sep  : mysql-relay-bin.index

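After running show master status; on aipprd2, note that the binlog file is mysql-bin.000089. Since aipprd1 holds the latest data and aipprd2 has already finished synchronizing from aipprd1 (check Seconds_Behind_Master in show slave status\G;: if the value is small, synchronization is probably complete), the data on the two nodes should be nearly identical.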

mysql> show master status;

+------------------+-----------+--------------+------------------+

| File             | Position  | Binlog_Do_DB | Binlog_Ignore_DB |

+------------------+-----------+--------------+------------------+

| mysql-bin. |  |              |                  |

+------------------+-----------+--------------+------------------+

 row in set (0.00 sec)
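The "Seconds_Behind_Master is small" test used in this step can be written down as a simple predicate. This is only a sketch: the function name and the 5-second threshold are assumptions, and the input is a dict of field names to the string values shown by show slave status\G.

```python
def is_caught_up(status, max_lag_seconds=5):
    """Judge whether a slave has caught up, based on SHOW SLAVE STATUS fields."""
    # Both replication threads must be running for the lag value to mean anything.
    if status.get("Slave_IO_Running") != "Yes":
        return False
    if status.get("Slave_SQL_Running") != "Yes":
        return False
    lag = status.get("Seconds_Behind_Master")
    # NULL (or a missing value) means the lag is unknown, so do not trust it.
    if lag in (None, "", "NULL"):
        return False
    return int(lag) <= max_lag_seconds


healthy = {
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "Yes",
    "Seconds_Behind_Master": "0",
}
broken = {
    "Slave_IO_Running": "No",
    "Slave_SQL_Running": "Yes",
    "Seconds_Behind_Master": "NULL",
}
print(is_caught_up(healthy), is_caught_up(broken))
```

A small lag only shows that the slave has applied what it received; it does not prove that the two data sets are identical.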

Reset the binlog file and position on aipprd1
So here the binlog file mysql-bin.000089 and position 0 are used; after the slave is started, replication begins.

mysql> slave stop;

Query OK,  rows affected (0.11 sec)

mysql> CHANGE MASTER TO master_host='10.66.1.51', master_port=, master_user='root', master_password='passw0rd', master_log_file='mysql-bin.000089', master_log_pos=;

Query OK,  rows affected (0.06 sec)

mysql> slave start;

Query OK,  rows affected (0.00 sec)

mysql> show slave status\G;

*************************** . row ***************************

            Slave_IO_State: Waiting for master to send event

              Master_Host: 10.66.1.51

              Master_User: root

              Master_Port:

            Connect_Retry:

          Master_Log_File: mysql-bin.

      Read_Master_Log_Pos:

           Relay_Log_File: mysql-relay-bin.

            Relay_Log_Pos:

    Relay_Master_Log_File: mysql-bin.

         Slave_IO_Running: Yes

        Slave_SQL_Running: Yes

          Replicate_Do_DB:

      Replicate_Ignore_DB:

       Replicate_Do_Table:

   Replicate_Ignore_Table:

  Replicate_Wild_Do_Table:

Replicate_Wild_Ignore_Table:

               Last_Errno:

               Last_Error:

             Skip_Counter:

      Exec_Master_Log_Pos:

          Relay_Log_Space:

          Until_Condition: None

           Until_Log_File:

            Until_Log_Pos:

       Master_SSL_Allowed: No

       Master_SSL_CA_File:

       Master_SSL_CA_Path:

          Master_SSL_Cert:

        Master_SSL_Cipher:

           Master_SSL_Key:

    Seconds_Behind_Master:

Master_SSL_Verify_Server_Cert: No

            Last_IO_Errno:

            Last_IO_Error:

           Last_SQL_Errno:

           Last_SQL_Error:

Replicate_Ignore_Server_Ids:

         Master_Server_Id:

 row in set (0.00 sec)

Reposted from

Mysql master-master replication failure - bluetom520's blog - CSDN Blog
http://blog.csdn.net/bluetom520/article/details/54893183