Mysql 5.7 CentOS 7 安装MHA

开始之前，我只想说，这是一个磨碎了我的小心灵的过程，虽然最终还是被我抛弃了，但是给有兴趣的人留个念想吧~

1 MHA简介

官方介绍：https://code.google.com/archive/p/mysql-master-ha/

源码地址：https://github.com/yoshinorim (在Popular repositories 可以直接看到 Manager 与 node 的连接)

MHA(Master High Availability)目前在MySQL高可用方面是一个相对成熟的解决方案,它由youshimaton(现就职于 Facebook公司)开发,是一套优秀的作为MySQL高可用性环境下故障切换和主从提升的高可用软件.在MySQL故障切换过程中,MHA能做到在0~30秒之内自动完成数据库的故障切换操作,并且在进行故障切换的过程中,MHA能在最大程度上保证数据的一致性,以达到真正意义上的高可用. 并且代码一直在维护中。

1.1 功能

该软件由两部分组成：MHA Manager(管理节点)和MHA Node(数据节点). MHA Manager可以单独部署在一台独立的机器上管理多个master-slave集群,也可以部署在一台slave节点上.MHA Node运行在每台MySQL服务器上,MHA Manager会定时探测集群中的master节点,当master出现故障时,它可以自动将最新数据的slave提升为新的master,然后将所有其他的 slave重新指向新的master.整个故障转移过程对应用程序完全透明.

在MHA自动故障切换过程中,MHA试图从宕机的主服务器上保存二进制日志,最大程度的保证数据的不丢失,但这并不总是可行的.例如,如果主服务器硬件故障或无法通过ssh访问,MHA没法保存二进制日志,只进行故障转移而丢失了最新的数据,使用MySQL 5.5开始提供的半同步复制功能，可以大大降低数据丢失的风险。MHA可以与半同步复制结合起来.如果只有一个 slave已经收到了最新的二进制日志,MHA可以将最新的二进制日志应用于其他所有的slave服务器上,因此可以保证所有节点的数据一致性。

特点:

一个manager 可以管理一组或者多组主从结构
通过VIP 实现对应用的透明

1.2 MHA切换逻辑

从宕机的master中保存二进制文件
检测含有最新日至更新的slave
应用差异的中继日至（relay log）到其他的slave
应用从master中保存的二进制日至事件到其他的slave中
提升一个slave为master
使其他的slave指向最新的master进行复制。

1.3 工具

Manager工具
- masterha_check_ssh 检查MHA的SSH配置状况
- masterha_check_repl 检查MySQL复制状况
- masterha_manger 启动MHA
- masterha_check_status 检测当前MHA运行状态
- masterha_master_monitor 检测master是否宕机
- masterha_master_switch 控制故障转移（自动或者手动）
- masterha_conf_host 添加或删除配置的server信息
failover 相关脚本
- master_ip_failover //自动切换时vip管理的脚本，不是必须，如果我们使用keepalived的，我们可以自己编写脚本完成对vip的管理，比如监控mysql，如果mysql异常,我们停止keepalived vip就会自动漂移
- master_ip_online_change //在线切换时vip的管理
- power_manager //故障发生后关闭主机的脚本
- send_report //因故障切换后发送报警的脚

Node 工具
- save_binary_logs 保存和复制master的二进制日志
- apply_diff_relay_logs 识别差异的中继日志事件并将其差异的事件应用于其他的slave
- filter_mysqlbinlog 去除不必要的ROLLBACK事件（MHA已不再使用这个工具）
- purge_relay_logs 清除中继日志（不会阻塞SQL线程）

2 环境

2.1 软件

软件	版本
MAH	最新（从12年以后就没提供过版本）
MYSQL	MYSQL 5.7.21
per-DBD-MySQL	4.023
操作系统ISO	7.2.1511

2.2 环境

IP	功用	server_id
10.10.100.181	MHA master	2
	mysql slave
10.10.100.184	MHA NODE	3
	mysql slave
10.10.100.175	MHA NODE	1
	mysql master

3 Mysql 主从复制

使用Mysql MHA 需要使用到 Mysql 5.5以后出现的半同步复制。 Oracle公司的Mysql https://dev.mysql.com/downloads/mysql/ 下载需要的版本。小红帽默认使用的Mariadb 至10.3.5 版本都没有提供super_read_only 参数。而在安装MHA时需要对此参数进行检验。因此只能弃用RHEL默认提供的Mariadb。

3.1 Mysql数据同步方式

mysql 主从数据同步分为三种类型：异步，实时全同步、半同步

3.1.1 异步复制（Asynchronous replication）

MySQL默认的复制即是异步的，主库在执行完客户端提交的事务后会立即将结果返给给客户端，并不关心从库是否已经接收并处理，这样就会有一个问题，主如果crash掉了，此时主上已经提交的事务可能并没有传到从上，如果此时，强行将从提升为主，可能导致新主上的数据不完整

3.1.2 全同步复制(Fully synchronous replication)

指当主库执行完一个事务，所有的从库都执行了该事务才返回给客户端。因为需要等待所有从库执行完该事务才能返回，所以全同步复制的性能必然会收到严重的影响。

3.1.3 半同步复制(Semisynchronous replication)

介于异步复制和全同步复制之间，主库在执行完客户端提交的事务后不是立刻返回给客户端，而是等待至少一个从库接收到并写到relay log中才返回给客户端。相对于异步复制，半同步复制提高了数据的安全性，同时它也造成了一定程度的延迟，这个延迟最少是一个TCP/IP往返的时间。所以，半同步复制最好在低延时的网络中使用.

半同步复制在5.7版本中发生了一次处理逻辑顺序的调整。先来看下5.5中的处理逻辑:

mysql SQL parse –> Storage Involve –> Write Binary Log –> Storage Commit –> Waiting Slave Dump –> Return to Client

在5.7中，将 Waiting Slave Dump 调整至 Storage Commit 之前进行.这样调整后,保证了主从数据的一致性, 并避免了重复提交.

5.5中半同步逻辑隐患

客户端事务在存储引擎层提交后，在得到从库确认的过程中，主库宕机了，此时，可能的情况有两种

事务还没发送到从库上

此时，客户端会收到事务提交失败的信息，客户端会重新提交该事务到新的主上，当宕机的主库重新启动后，以从库的身份重新加入到该主从结构中，会发现，该事务在从库中被提交了两次，一次是之前作为主的时候，一次是被新主同步过来的。

事务已经发送到从库上

此时，从库已经收到并应用了该事务，但是客户端仍然会收到事务提交失败的信息，重新提交该事务到新的主上。
无数据丢失的半同步复制
针对上述潜在问题，MySQL 5.7引入了一种新的半同步方案：Loss-Less半同步复制。

“Waiting Slave dump”被调整到“Storage Commit”之前。

之前的半同步方案同样支持，MySQL 5.7.2引入了一个新的参数进行控制: rpl_semi_sync_master_wait_point

rpl_semi_sync_master_wait_point有两种取值

AFTER_SYNC

Waiting Slave dump在Storage Commit之前。

AFTER_COMMIT Waiting Slave dump在Storage Commit之后。

启用半同步复制

在安装完插件后，半同步复制默认是关闭的，这时需设置参数来开启半同步

主：

mysql> SET GLOBAL rpl_semi_sync_master_enabled = 1;

从：

mysql> SET GLOBAL rpl_semi_sync_slave_enabled = 1;

以上的启动方式是在命令行操作，也可写在配置文件中。

主：

plugin-load=rpl_semi_sync_master=semisync_master.so

rpl_semi_sync_master_enabled=1

从：

plugin-load=rpl_semi_sync_slave=semisync_slave.so

rpl_semi_sync_slave_enabled=1

在有的高可用架构下，master和slave需同时启动，以便在切换后能继续使用半同步复制

plugin-load = "rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so"

rpl-semi-sync-master-enabled = 1

rpl-semi-sync-slave-enabled = 1

3.2 搭建Mysql 主从架构

3.2.1 Mysql 数据库安装

此部分仅供参照。此部分记录是之前在其他环境中搭建主从架构的文档。操作步骤是一样的。

PLAN
1. SYSTEM and DATABASE
  
  system mariadb
  
  CentOS 7.2 10.2.6
2. relationship between master and slave
  
  互为主从的关系。
  
  master slave
  
  10.100.3.61 10.100.3.62
  
  10.100.3.62 10.100.3.61

上传并解压

本示例中，将文件上传至/root

tar -xzvf ./mariadb-10.2.6.tar.gz

cd mariadb-10.2.6

安装

下载

mysql 被收购后，不再提供源码，因此不可能再做源码安装。想进行源码安装，可以选择Mariadb。

Oracle mysql: https://dev.mysql.com/downloads/mysql/

mariadb: https://downloads.mariadb.org/mariadb/+releases/

下载后，上传至服务器并解压。这是一般的步骤。对于不熟悉此操作的，请先熟悉 Linux基本操作.

安装

源码安装

安装依赖包

rpm -e --nodeps mariadb-libs

yum install -y cmake ncurses-devel gcc gcc-c++ zlib-devel readline-devel openssl-devel libaio-devel mariadb-libs

创建组、用户、相关目录

groupadd mysql

useradd -g mysql -G mysql -d /home/mysql mysql

mkdir -p /usr/local/mysql

mkdir -p /home/mysql/data

mkdir -p /var/log/mysql

mkdir -p /var/run/mysql

chown -R mysql:mysql /usr/local/mysql

chown -R mysql:mysql /home/mysql/data

chown -R mysql:mysql /var/log/mysql

chown -R mysql:mysql /var/run/mysql

chcon -R system_u:object_r:mysqld_db_t:s0 /home/mysql/data

编译安装

生成makeFile

cmake -DCMAKE_INSTALL_PREFIX=/usr/local/mysql/ -DMYSQL_DATADIR=/home/mysql/data -DMYSQL_UNIX_ADDR=/var/lib/mysql/mysql.sock -DSYSCONFDIR=/etc \

-DMYSQL_TCP_PORT=3306 -DMYSQL_USER=mysql -DDEFAULT_CHARSET=utf8 -DDEFAULT_COLLATION=utf8_general_ci -DEXTRA_CHARSETS=all -DWITH_READLINE=1 \

-DENABLED_LOCAL_INFILE=1 -DWITH_INNOBASE_STORAGE_ENGINE=1 -DWITH_XTRADB_STORAGE_ENGINE=1 -DWITH_FEDERATEDX_STORAGE_ENGINE=1 \

-DWITH_MYISAM_STORAGE_ENGINE=1 -DWITH_ARCHIVE_STPRAGE_ENGINE=1 -DWITHOUT_TOKUDB=1 -DENABLE_DOWNLOADS=1

错误1

-- Performing Test HAVE_STDCXX11

-- Performing Test HAVE_STDCXX11 - Failed

-- Performing Test HAVE_STDCXX0X

-- Performing Test HAVE_STDCXX0X - Failed

CMake Error at storage/tokudb/PerconaFT/cmake_modules/TokuSetupCompiler.cmake:177 (message):

  /usr/bin/c++ doesn't support -std=c++11 or -std=c++0x, you need one that

  does.

Call Stack (most recent call first):

  storage/tokudb/PerconaFT/CMakeLists.txt:38 (include)

出现以上错误，是由于默认安装了tobudb. 此时，我们加上参数 -DWITHOUT_TOKUDB=1，表示不安装tokudb引擎. tokudb是MySQL中一款开源的存储引擎，可以管理大量数据并且有一些新的特性，这些是Innodb所不具备的，这里之所以不安装，是因为一般计算机默认是没有Percona Server的，并且加载tokudb还要依赖jemalloc内存优化，一般开发中也是不用tokudb的，所以暂时屏蔽掉，否则在系统中找不到依赖会出现如上错误。 – notes :: 编译失败后，需要执行 rm -f CMakeCache.txt 以清除编译缓存，才可以继续重新编译。

install(软件安装)

make && make install

NOTES

注意事项：

重新时，需要清除旧的对象文件和缓存信息。 # make clean

# rm -f CMakeCache.txt

# rm -rf /etc/my.cnf

复制服务文件

cd /usr/local/mysql

cp support-files/mysql.server /etc/init.d/mysqld

安装数据库

cd /usr/local/mysql

scripts/mysql_install_db --user=mysql --datadir=/home/mysql/data

准备配置文件内容

配置文件的内容，除server_id,auto_increment* 需要根据节点数来调整，relay_log_purge需要根据架构来调整外, 其他无需调整。

[mysqld]

datadir=/data

socket=/var/lib/mysql/mysql.sock

init-connect='SET NAMES utf8'

character-set-server=utf8

lower_case_table_names = 1

symbolic-links=0

# Tunning settings

#innodb storage configuration

default-storage-engine=innodb

innodb_file_per_table=1

innodb_flush_log_at_trx_commit=2

innodb_flush_method=O_DIRECT

innodb_log_buffer_size=256M

innodb_log_files_in_group=8

transaction_isolation=READ-COMMITTED

# Connection configurations

max_connections=1000

max_connect_errors=10000

# query related

query_cache_type=1

query_cache_size=128M

long_query_time=1

#Memory Configuration

## For temp table

max_heap_table_size=96M

tmp_table_size=96M

## For sort / join / read

sort_buffer_size=5M

join_buffer_size=5M

read_buffer_size=5M

read_rnd_buffer_size=5M

# For Myiasm if be used

key_buffer_size=52M

# For function

log_bin_trust_function_creators=1

# master — slave configurations

server-id= 48

log-bin=mysql-bin

log-bin-index=mysql-bin.index

binlog_format = MIXED

sync-binlog=1

expire-logs-days = 14

auto_increment_offset=1        # —> 多主结构使用

auto_increment_increment=2 # —> 多主结构使用

skip-slave-start

# 配置半同步复制

plugin-load = "rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so"

rpl-semi-sync-master-enabled = 1

rpl-semi-sync-slave-enabled = 1

# 配置relay log . MHA结构中需要将relay_log_purge 设置为0. 其他主从结构配置为1

relay_log_purge=1

relay_log_recover=1

[mysqld_safe]

log-error=/var/log/mysql/mysql.log

pid-file=/var/run/mysql/mysql.pid

修改环境变量

vi /etc/profile

# Path manipulation

if [ "$EUID" = "0" ]; then

    pathmunge /usr/sbin

    pathmunge /usr/local/sbin

    pathmunge /usr/local/mysql/bin    --> 添加此行

else

    pathmunge /usr/local/sbin after

    pathmunge /usr/sbin after

fi

source /etc/profile

启动
```
/etc/init.d/mysqld start
```

Mysql 5.7 安装
1. RPM包安装
  RPM 建议下载RPM bundle 版本的。否则在安装过程中出现依赖关系而安装失败，会很头疼。 RPM bundle版本中包含了所有的安装需要的rpm包. 根据需要可以单独下载：
```
mysql-community-server-5.7.21-1.el7.x86_64.rpm

mysql-community-common-5.7.21-1.el7.x86_64.rpm

mysql-community-client-5.7.21-1.el7.x86_64.rpm

mysql-community-libs-5.7.21-1.el7.x86_64.rpm
```
  安装
```
rpm -ivh mysql-community-server-5.7.21-1.el7.x86_64.rpm mysql-community-common-5.7.21-1.el7.x86_64.rpm mysql-community-client-5.7.21-1.el7.x86_64.rpm mysql-community-libs-5.7.21-1.el7.x86_64.rpm
```
2. 编辑配置文件
  
  配置文件内容详情请参阅配置文件
3. 配置mysqld.service
  执行以下几个命令：
```
systemctl enable mysqld  # 设置开机启动

systemctl stop mysqld    # 停止mysql服务

systemctl start mysqld   # 启动mysql服务
```
  需要注意的是尽量先调整好配置文件再启动mysql数据库。rpm 包安装的mysql，会在第一次启动的时候进行数据库初始化。生成最原始的mysql数据库文件及日志文件等。
4. 修改root用户密码
  1. 在没有调整my.cnf时
```
grep "temporary password" /var/log/mysqld.log    #命令，返回结果最后冒号后面的字符串就是root的默认密码

mysql -uroot -p                                  # 回车，输入上面找到的密码

set password=password('密码');

flush privileges;
```
  2. 调整过my.cnf 调整过my.cnf后，可能日志已经不在/var/log/mysqld.log 中。这时我们的日志不是固定的。有的人找不到日志的位置。因此可以通过直接修改mysql.user 内部表来实现修改密码。
    1. 调整my.cnf 在my.cnf 的[mysqld] 下面添加一行：skip-grant-tables
      
      [mysqld]
      skip-grant-tables
      .........
    2. 重启mysqld
      systemctl restart mysqld
    3. 登录mysql修改密码
      mysql -e "update mysql.user set authentication_string=password('密码') where user='root';"
    4. 恢复my.cnf 将skip-grant-tables 前面加上# 以注释掉此参数。如：
      [mysqld]
      # skip-grant-tables
    5. 重启mysql
      
      systemctl restart mysqld

Mysql 5.6 安装

解压安装

tar -xvf MySQL-5.6.39-1.el7.x86_64.rpm-bundle.tar

rpm -ivh ./MySQL*.rpm

配置参数文件

配置文件内容详情请参阅配置文件

初始化数据库

/usr/bin/mysql_install_db --user=mysql --datadir=/data --keep-my-cnf

启动数据库
```
service mysql start
```

ERRORS
1. Can't open and lock privilege tables: Table 'mysql.XXX' doesn't exist
  
  这个问题一般是因为数据库初始化未完成。
  
  默认初始化命令： scripts/mysql_install_db –basedir=/usr/local/MySQL –datadir=/usr/local/mysql/data –user=mysql

创建同步用户

-- 10.100.3.62

 在10.100.3.61上执行show master status; 查看主库日志文件及当前日志位置

GRANT replication slave ON *.* TO 'rep1'@'10.100.3.61' IDENTIFIED BY 'repl123';

GRANT replication slave ON *.* TO 'rep1'@'10.254.19.63' IDENTIFIED BY 'repl123';

GRANT replication slave ON *.* TO 'rep1'@'10.254.19.62' IDENTIFIED BY 'repl123';

change master to

master_host = '10.100.3.61',

master_user = 'rep1',

master_password='repl123',

master_log_file='mysql-bin.000004',

master_log_pos= 342;

-- 10.100.3.61

 在10.100.3.62上执行show master status; 查看主库日志文件及当前日志位置

GRANT replication slave ON *.* TO 'rep1'@'10.100.3.62' IDENTIFIED BY 'repl123';

change master to

master_host = '10.100.3.62',

master_user = 'rep1',

master_password='repl123',

master_log_file='mysql-bin.000002',

master_log_pos= 328;

重启数据库并启动复制进制
1. 重启数据库
```
/etc/init.d/mysqld stop

/etc/init.d/mysqld start
```
2. 启动slave 进程
```
start slave;
```
3. 查看slave进程状态
```
show slave status\G

特别关注其中的:

Slave_IO_Running: yes

Slave_SQL_Running: yes

这两行，后面的值要都是yes. 如果出现no的情况，查看

 Last_IO_Errno:

 Last_IO_Error:

Last_SQL_Errno:

Last_SQL_Error:

这四行会提示相关错误。根据实际情况再解决。
```
  NOTES
  binlog-do-db 和 replicate-ignore-db 设置必须相同。 MHA 在启动时候会检测过滤规则，如果过滤规则不同，MHA 不启动监控和故障转移
  
  配置半同步。详情参见启用半同步复制
  
  注意配置relay_log_purge 设置为0，即关闭自动清除relay log，因为这些日志可能被其他从库需要，自动清除后，否则无法保证其他从库恢复数据时的数据一致性

填坑之旅

依赖包之libmysqlcilent.so.18

相关信息 下面列出目前较新的两个Mysql版本中提供该库文件的rpm包及mysql版本：

mysql版本	rpm 包	libmysqlclient 版本
5.6	MySQL-shared-5.6.39-1.el7.x86_64.rpm	18
5.7	mysql-community-libs-5.7.21-1.el7.x86_64.rpm	20

因为libmysqlclient.so 在不同的数据库版本中也相应的有不同的版本，而CentOS 7.2.1511 的YUM源对应的是18.

请先参照MHA依赖包安装。使用yum源安装perl-DBD-MySQL 对应的Mysql libs 是libmysqlclient.so.18.0.0(libmysqlclient.so.18 是该文件的一个软连接)。报错信息类似如下：

..........

--> Running transaction check

---> Package perl-DBD-MySQL.x86_64 0:4.023-5.el7 will be installed

--> Processing Dependency: libmysqlclient.so.18(libmysqlclient_18)(64bit) for package: perl-DBD-MySQL-4.023-5.el7.x86_64

--> Processing Dependency: libmysqlclient.so.18()(64bit) for package: perl-DBD-MySQL-4.023-5.el7.x86_64

--> Running transaction check

---> Package mariadb-libs.x86_64 1:5.5.56-2.el7 will be installed

Removing mariadb-libs.x86_64 1:5.5.56-2.el7 - u due to obsoletes from installed MySQL-shared-compat-5.6.39-1.el7.x86_64

--> Restarting Dependency Resolution with new changes.

--> Running transaction check

---> Package mariadb-libs.x86_64 1:5.5.56-2.el7 will be installed

--> Processing Dependency: libmysqlclient.so.18(libmysqlclient_18)(64bit) for package: perl-DBD-MySQL-4.023-5.el7.x86_64

--> Processing Dependency: libmysqlclient.so.18()(64bit) for package: perl-DBD-MySQL-4.023-5.el7.x86_64

--> Finished Dependency Resolution

Error: Package: perl-DBD-MySQL-4.023-5.el7.x86_64 (base)

           Requires: libmysqlclient.so.18(libmysqlclient_18)(64bit)

Error: Package: perl-DBD-MySQL-4.023-5.el7.x86_64 (base)

           Requires: libmysqlclient.so.18()(64bit)

 You could try using --skip-broken to work around the problem

 You could try running: rpm -Va --nofiles --nodigest

解决方法

安装Mysql 5.7 配套的rpm包后，还需要另外安装MySQL-shared-5.6.39-1.el7.x86_64.rpm. 下载地址：MySQL-shared-5.6.39-1.el7.x86_64.rpm下载
```
rpm -ivh --nodeps --force MySQL-shared-5.6.39-1.el7.x86_64.rpm
```

安装perl-DBD-MySQL 的时候，使用yum源解决libmysqlclient.so.18的依赖问题。安装Mysql 5.7 时,使用强制安装。

yum install -y perl-DBD-MySQL

rpm -ivh --nodeps --force mysql-community-client-5.7.21-1.el7.x86_64.rpm mysql-community-common-5.7.21-1.el7.x86_64.rpm \

mysql-community-server-5.7.21-1.el7.x86_64.rpm mysql-community-libs-5.7.21-1.el7.x86_64.rpm

3.2.2 调整Mysql数据库

在这里调整mysql 数据库以应用 MHA。

设置所有slave 节点为read only 每次从库重启后，都要做这个操作，因为从库有可能被提升为主库，所以不能设置在配置文件中。
```
mysql -uroot -p"

set global read_only=1;

set global super_read_only=1;
```

创建监控用户在master 上执行。

grant all privileges on *.* to 'root'@'10.10.100.%' identified  by '^NJI0okm';

flush privileges;

relay_log_purge 设置为0 在MHA架构中，必须设置为0. 这点在上面Mysql 主从架构搭建中的参数配置中有明确配置。 MHA在发生切换的过程中，从库的恢复过程中依赖于relay log的相关信息，所以这里要将relay log的自动清除设置为OFF，采用手动清除relay log的方式。在默认情况下，从服务器上的中继日志会在SQL线程执行完毕后被自动删除。但是在MHA环境中，这些中继日志在恢复其他从服务器时可能会被用到，因此需要禁用中继日志的自动删除功能。定期清除中继日志需要考虑到复制延时的问题。在ext3的文件系统下，删除大的文件需要一定的时间，会导致严重的复制延时。为了避免复制延时，需要暂时为中继日志创建硬链接，因为在linux系统中通过硬链接删除大文件速度会很快。（在mysql数据库中，删除大表时，通常也采用建立硬链接的方式）
设置允许备库主机可以连主库这里有点需要注意，候选主库，在主从切换的过程中，要连接主库，因此主库上应该允许相应的连接权限，相对应的mysql.user.host 应该是候选主库的主机名（IP无效）
```
grant all on *.* to 'root'@'test-1' identified by "^NJI0okm";

flush privileges;
```

3.2.3 定时清理relay

编写清理日志脚本 本次示例,将清理脚本放至mysql 环境变量$datadir中。

cat purge_relay_log.sh

#!/bin/bash

user=root

#passwd=123456

#port=3306

log_dir='/data/masterha/log'            # 存储脚本执行日志的路径

work_dir='/home/mysql/data'             # 存放relay bin log 硬链接的地方。

purge='/usr/local/bin/purge_relay_logs' # purge 命令

if [ ! -d $log_dir ]                    # 如果没有变量Log_dir 指定的路径，则创建一个。

then

   mkdir $log_dir -p

fi

$purge --user=$user --password=$passwd --disable_relay_log_purge --port=$port --workdir=$work_dir >> $log_dir/purge_relay_logs.log 2>&1

					# 执行清除命令

添加至定时任务

crontab -e

# 执行完成上面命令后，输入下一行内容

1 2 * * * sh /data/purge_relay_log.sh

# 输入完上面一行内容后，按ESC,然后直接输入:w 回车，即保存成功。

4 MHA安装配置

首先，我们需要在github 上将mha4mysql 的manager与node的源码下载下来，然后才可以进行编译安装。下载地址：

mha4mysql-node https://github.com/yoshinorim/mha4mysql-node
mha4mysql-manager https://github.com/yoshinorim/mha4mysql-manager

4.1 安装依赖包

所有节点上都需要操作,因为MHA manager 与MHA node 的工具运行都依赖于perl模块

4.1.1 依赖包的依赖包

由于perl-DBD-MySQL 安装需要调用libmysqlclient.so.18 这处库文件。而我们安装好Mysql 5.7 后,提供的是libmysqlclient.so.20。这是两个不同的版本,安装老版本时，会提示版本过期的问题。因此我们需要强制安装。

rpm -ivh --nodeps --force MySQL-shared-5.6.39-1.el7.x86_64.rpm

安装好后查看该rpm 提供了哪此文件:

#rpm -ql MySQL-shared-5.6.39-1.el7.x86_64

/usr/lib64/libmysqlclient.so

/usr/lib64/libmysqlclient.so.18               # 此包就是程序提供的

/usr/lib64/libmysqlclient.so.18.1.0

/usr/lib64/libmysqlclient_r.so

/usr/lib64/libmysqlclient_r.so.18

/usr/lib64/libmysqlclient_r.so.18.1.0

/usr/share/doc/MySQL-shared-5.6.39

/usr/share/doc/MySQL-shared-5.6.39/COPYING

/usr/share/doc/MySQL-shared-5.6.39/README

4.1.2 依赖包

YUM 源安装

# 安装epel源

rpm -Uvh http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

yum -y install perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes perl-devel perl-Module-Install.noarch

源码安装
下载 perl-DBD-MySQL :http://www.cpan.org/authors/id/C/CA/CAPTTOFU/DBD-mysql-4.046.tar.gz 安装方法如下：
```
perl Makefile.PL [options]

make

make test

make install
```

CPANM 方式下载安装

如果想要宁为蛋碎，不为瓦全，还可以使用以下脚本来安装:

[root@0 ~]# cat install.sh

#!/bin/bash

wget http://xrl.us/cpanm --no-check-certificate

mv cpanm /usr/bin

chmod 755 /usr/bin/cpanm

cat > /root/list << EOF

install DBD::mysql

EOF

for package in `cat /root/list`

do

cpanm $package

done

4.1.3 其他包

yum install -y daemonize

4.2 下载manager与node

NOTE : mha4mysql-node 需要在所有节点安装，包括主从所有节点。 mha4mysql-manager 只需要安装在计划的mha4mysql-manager节点，在本示例中就是10.10.100.181

官网上提供的下载地址还是老地址：https://code.google.com/archive/p/mysql-master-ha/downloads?page=1

在这个地址上只有11年与12年发布的几个包。因此不在此链接下载。直接从github上下载，然后编译安装。

# 下载manager

git clone https://github.com/yoshinorim/mha4mysql-manager.git

# 下载node

git clone https://github.com/yoshinorim/mha4mysql-node.git

4.3 安装

MHA管理节点 在管理节点执行

# cd ~/mha4mysql-manager/

# perl Makefile.PL && make && make install

MHA NODE节点
注意在所有节点都要操作,因为所有节点都有mha4mysql-node
```
# cd ~/mha4mysql-node/

# perl Makefile.PL && make && make install
```

4.4 配置MHA

4.4.1 MHA Manager

从样例中复制配置文件

cd ~/mha4mysql-manager/

mkdir -p /etc/masterha/scripts

# 复制

cp samples/conf/app1.cnf /etc/masterha/

cp samples/conf/masterha_default.cnf /etc/masterha_default.cnf

# 复制failover相关脚本

cp samples/scripts/* /usr/local/bin/

scripts路径下有几个脚本文件，功能说明下.

脚本	功能
send_report	因故障切换后发送报警的脚本，不是必须
power_manager	故障发生后关闭主机的脚本
master_ip_online_change	在线切换时vip的管理
master_ip_failover	自动切换时vip管理的脚本

编译配置文件内容

#cat /etc/masterha/app1/app1.cnf

[server default]

manager_workdir=/var/log/masterha/app1.log                             #设置manager的工作目录

manager_log=/var/log/masterha/app1/manager.log                         #设置manager的日志

master_binlog_dir=/data                                                #设置master 保存binlog的位置，以便MHA可以找到master的日志，我这里的也就是mysql的数据目录

master_ip_failover_script= /etc/masterha/scripts/master_ip_failover    #设置自动failover时候的切换脚本

master_ip_online_change_script=/etc/masterha/scripts/master_ip_online_change

                                                                       #设置手动切换时候的切换脚本

password=123456                                                        #设置mysql中root用户的密码，这个密码是前文中创建监控用户的那个密码

user=root                                                              #设置监控用户root

ping_interval=1                                                        #设置监控主库，发送ping包的时间间隔，默认是3秒,尝试三次没有回应的时候自动进行railover

remote_workdir=/data                                                   #设置远端mysql在发生切换时binlog的保存位置

repl_password=rep123                                                   #设置复制用户的密码

repl_user=rep1                                                         #设置复制环境中的复制用户名

report_script=/usr/local/bin/send_report                               #设置发生切换后发送的报警的脚本

secondary_check_script= /usr/local/bin/masterha_secondary_check -s server03 -s server02 --user=root --master_host=server02 --master_ip=192.168.2.128 --master_port=3306

                                                                       #一旦MHA到server02的监控之间出现问题，MHA Manager将会尝试从server03登录到server02

shutdown_script=""                                                     #设置故障发生后关闭故障主机脚本（该脚本的主要作用是关闭主机放在发生脑裂,这里没有使用）

ssh_user=root                                                          #设置ssh的登录用户名

[server1]

hostname=10.10.100.175

port=3306

candidate_master=1

[server2]

hostname=10.10.100.181

port=3306

candidate_master=1                                                     #设置为候选master，如果设置该参数以后，发生主从切换以后将会将此从库提升为主库，即使这个主库不是集群中事件最新的slave

check_repl_delay=0                                                     #默认情况下如果一个slave落后master 100M的relay logs的话，MHA将不会选择该slave作为一个新的master，因为对于这个slave的恢复需要花费很长时间，通过设置check_repl_delay=0,MHA触发切换在选择一个新的master的时候将会忽略复制延时，这个参数对于设置了candidate_master=1的主机非常有用，因为这个候选主在切换的过程中一定是新的master

[server3]

hostname=10.10.100.184

port=3306

4.5 添加主机互信

在各个主机上通过ssh-keygen生成密钥。然后通过ssh-copy-id 将密钥发送至本机及其他主机上。三台主机要做9次。并分别通过ssh ip date 来验证是否可以进行免密登录

# ssh-keygen  #后面一路回车

# ssh-copy-id -i ./.ssh/id_rsa.pub root@10.10.100.175  # 将ssh key 发送至10.10.100.175 , 免密用户为root

# ssh 10.10.100.175 date                               # 验证是否可以进行免密登录。

4.6 测试SSH连接

使用mha自带工具:

masterha_check_ssh --conf=/etc/masterha/app1.cnf

Thu Mar  8 09:08:20 2018 - [info] Reading default configuration from /etc/masterha_default.cnf..

Thu Mar  8 09:08:20 2018 - [info] Reading application default configuration from /etc/masterha/app1/app1.cnf..

Thu Mar  8 09:08:20 2018 - [info] Reading server configuration from /etc/masterha/app1/app1.cnf..

Thu Mar  8 09:08:20 2018 - [info] Starting SSH connection tests..

Thu Mar  8 09:08:21 2018 - [debug]

Thu Mar  8 09:08:20 2018 - [debug]  Connecting via SSH from root@10.10.100.175(10.10.100.175:22) to root@10.10.100.181(10.10.100.181:22)..

Thu Mar  8 09:08:20 2018 - [debug]   ok.

Thu Mar  8 09:08:20 2018 - [debug]  Connecting via SSH from root@10.10.100.175(10.10.100.175:22) to root@10.10.100.184(10.10.100.184:22)..

Thu Mar  8 09:08:21 2018 - [debug]   ok.

Thu Mar  8 09:08:21 2018 - [debug]

Thu Mar  8 09:08:20 2018 - [debug]  Connecting via SSH from root@10.10.100.181(10.10.100.181:22) to root@10.10.100.175(10.10.100.175:22)..

Thu Mar  8 09:08:20 2018 - [debug]   ok.

Thu Mar  8 09:08:20 2018 - [debug]  Connecting via SSH from root@10.10.100.181(10.10.100.181:22) to root@10.10.100.184(10.10.100.184:22)..

Thu Mar  8 09:08:21 2018 - [debug]   ok.

Thu Mar  8 09:08:22 2018 - [debug]

Thu Mar  8 09:08:21 2018 - [debug]  Connecting via SSH from root@10.10.100.184(10.10.100.184:22) to root@10.10.100.175(10.10.100.175:22)..

Thu Mar  8 09:08:21 2018 - [debug]   ok.

Thu Mar  8 09:08:21 2018 - [debug]  Connecting via SSH from root@10.10.100.184(10.10.100.184:22) to root@10.10.100.181(10.10.100.181:22)..

Thu Mar  8 09:08:21 2018 - [debug]   ok.

Thu Mar  8 09:08:22 2018 - [info] All SSH connection tests passed successfully.

查看测试结果，最后提示: "ALL SSH connection tests passwd successfully." 说明所有节点间测试都成功的。

4.7 配置VIP

vip配置可以采用两种方式，一种通过keepalived的方式管理虚拟ip的浮动；另外一种通过脚本方式启动虚拟ip的方式 (即不需要keepalived或者heartbeat类似的软件).

为了防止脑裂发生,推荐生产环境采用脚本的方式来管理虚拟ip,而不是使用keepalived来完成.到此为止,基本MHA集群已经配置完毕.接下来就是实际的测试环节了.

下面分别了解不同的配置方式。

4.7.1 Keepalive

Keepalive 支持主备与双备两种模式。这两种模式有很大的区别。

master->backup 一旦主库宕机,虚拟ip会自动漂移到从库,当主库修复后,keepalived启动后,还会把虚拟ip抢占过来,即使设置了非抢占模式（nopreempt）抢占ip的动作也会发生.
backup->backup 当主库宕机后虚拟ip会自动漂移到从库上，当原主库恢复和keepalived服务启动后，并不会抢占新主的虚拟ip，即使是优先级高于从库的优先级别，也不会发生抢占。

为了减少ip漂移次数,及避免数据库未准备好而IP已经漂移过来的情况,我们应该选择双备模式。

下载地址：http://www.keepalived.org/software/

安装示例:

tar xf keepalived-1.2.12.tar.gz

cd keepalived-1.2.12

./configure --prefix=/usr/local/keepalived

make &&  make install

cp /usr/local/keepalived/etc/rc.d/init.d/keepalived /etc/init.d/

cp /usr/local/keepalived/etc/sysconfig/keepalived /etc/sysconfig/

mkdir /etc/keepalived

cp /usr/local/keepalived/etc/keepalived/keepalived.conf /etc/keepalived/

cp /usr/local/keepalived/sbin/keepalived /usr/sbin/

配置文件示例如下:

! Configuration File for keepalived

global_defs {

     notification_email {

     "yourmail_address such as halberd.lee@gmail.com"

   }

   notification_email_from dba@dbserver.com

   smtp_server 127.0.0.1

   smtp_connect_timeout 30

   router_id MySQL-HA

}

vrrp_instance VI_1 {

    state BACKUP

    interface eth0

    virtual_router_id 51

    priority 150

    advert_int 1

    nopreempt

    authentication {

    auth_type PASS

    auth_pass 1111

    }

    virtual_ipaddress {

        10.10.100.191

    }

}

note: priority 两个节点不可一致，state 在不同节点上都是BACKUP。

MHA引入keepalived

通过修改脚本master_ip_failover，将master 发生宕机时的处理方式写入该脚本。

#!/usr/bin/env perl

use strict;

use warnings FATAL => 'all';

use Getopt::Long;

my (

    $command,          $ssh_user,        $orig_master_host, $orig_master_ip,

    $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port

);

my $vip = '192.168.0.88';

my $ssh_start_vip = "/etc/init.d/keepalived start";

my $ssh_stop_vip = "/etc/init.d/keepalived stop";

GetOptions(

    'command=s'          => \$command,

    'ssh_user=s'         => \$ssh_user,

    'orig_master_host=s' => \$orig_master_host,

    'orig_master_ip=s'   => \$orig_master_ip,

    'orig_master_port=i' => \$orig_master_port,

    'new_master_host=s'  => \$new_master_host,

    'new_master_ip=s'    => \$new_master_ip,

    'new_master_port=i'  => \$new_master_port,

);

exit &main();

sub main {

    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

    if ( $command eq "stop" || $command eq "stopssh" ) {

        my $exit_code = 1;

        eval {

            print "Disabling the VIP on old master: $orig_master_host \n";

            &stop_vip();

            $exit_code = 0;

        };

        if ($@) {

            warn "Got Error: $@\n";

            exit $exit_code;

        }

        exit $exit_code;

    }

    elsif ( $command eq "start" ) {

        my $exit_code = 10;

        eval {

            print "Enabling the VIP - $vip on the new master - $new_master_host \n";

            &start_vip();

            $exit_code = 0;

        };

        if ($@) {

            warn $@;

            exit $exit_code;

        }

        exit $exit_code;

    }

    elsif ( $command eq "status" ) {

        print "Checking the Status of the script.. OK \n";

        #`ssh $ssh_user\@cluster1 \" $ssh_start_vip \"`;

        exit 0;

    }

    else {

        &usage();

        exit 1;

    }

}

# A simple system call that enable the VIP on the new master

sub start_vip() {

    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;

}

# A simple system call that disable the VIP on the old_master

sub stop_vip() {

     return 0  unless  ($ssh_user);

    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;

}

sub usage {

    print

    "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";

}

4.7.2 脚本方式

修改脚本

我保存了两个版本的failover 脚本。因此脚本名字上稍做改动，加以区别。脚本方式的failover脚本命名为:

master_ip_failover_. 脚本文件内容如下:

#!/usr/bin/env perl

use strict;

use warnings FATAL => 'all';

use Getopt::Long;

my (

    $command,          $ssh_user,        $orig_master_host, $orig_master_ip,

    $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port

);

my $vip = '192.168.0.88/24';

my $key = '1';

my $ssh_start_vip = "/sbin/ifconfig eth1:$key $vip";

my $ssh_stop_vip = "/sbin/ifconfig eth1:$key down";

GetOptions(

    'command=s'          => \$command,

    'ssh_user=s'         => \$ssh_user,

    'orig_master_host=s' => \$orig_master_host,

    'orig_master_ip=s'   => \$orig_master_ip,

    'orig_master_port=i' => \$orig_master_port,

    'new_master_host=s'  => \$new_master_host,

    'new_master_ip=s'    => \$new_master_ip,

    'new_master_port=i'  => \$new_master_port,

);

exit &main();

sub main {

    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";

    if ( $command eq "stop" || $command eq "stopssh" ) {

        my $exit_code = 1;

        eval {

            print "Disabling the VIP on old master: $orig_master_host \n";

            &stop_vip();

            $exit_code = 0;

        };

        if ($@) {

            warn "Got Error: $@\n";

            exit $exit_code;

        }

        exit $exit_code;

    }

    elsif ( $command eq "start" ) {

        my $exit_code = 10;

        eval {

            print "Enabling the VIP - $vip on the new master - $new_master_host \n";

            &start_vip();

            $exit_code = 0;

        };

        if ($@) {

            warn $@;

            exit $exit_code;

        }

        exit $exit_code;

    }

    elsif ( $command eq "status" ) {

        print "Checking the Status of the script.. OK \n";

        exit 0;

    }

    else {

        &usage();

        exit 1;

    }

}

sub start_vip() {

    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;

}

sub stop_vip() {

     return 0  unless  ($ssh_user);

    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;

}

sub usage {

    print

    "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";

}

启动VIP

启动虚拟IP的命令如下:

ifconfig eth0:1 10.10.100.191 netmask 255.255.255.0

执行完没有任何反馈，说明执行成功。查询下VIP启动后的相关信息:

# ifconfig -a |grep -A 2 "eth0:1"

eth0:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500

        inet 10.10.100.191  netmask 255.255.255.0  broadcast 10.10.100.255

        ether fa:16:3e:a3:bd:10  txqueuelen 1000  (Ethernet)

4.8 检查整个复制环境状况

4.8.1 命令

masterha_check_repl --conf=/etc/masterha/app1.cnf

4.8.2 结果

Thu Mar 15 14:08:39 2018 - [info] Reading default configuration from /etc/masterha_default.cnf..

Thu Mar 15 14:08:39 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..

Thu Mar 15 14:08:39 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..

Thu Mar 15 14:08:39 2018 - [info] MHA::MasterMonitor version 0.57.

Thu Mar 15 14:08:40 2018 - [info] GTID failover mode = 0

Thu Mar 15 14:08:40 2018 - [info] Dead Servers:

Thu Mar 15 14:08:40 2018 - [info] Alive Servers:

Thu Mar 15 14:08:40 2018 - [info]   10.10.100.175(10.10.100.175:3306)

Thu Mar 15 14:08:40 2018 - [info]   10.10.100.181(10.10.100.181:3306)

Thu Mar 15 14:08:40 2018 - [info]   10.10.100.184(10.10.100.184:3306)

Thu Mar 15 14:08:40 2018 - [info] Alive Slaves:

Thu Mar 15 14:08:40 2018 - [info]   10.10.100.181(10.10.100.181:3306)  Version=5.7.21-log (oldest major version between slaves) log-bin:enabled

Thu Mar 15 14:08:40 2018 - [info]     Replicating from 10.10.100.175(10.10.100.175:3306)

Thu Mar 15 14:08:40 2018 - [info]     Primary candidate for the new Master (candidate_master is set)

Thu Mar 15 14:08:40 2018 - [info]   10.10.100.184(10.10.100.184:3306)  Version=5.7.21-log (oldest major version between slaves) log-bin:enabled

Thu Mar 15 14:08:40 2018 - [info]     Replicating from 10.10.100.175(10.10.100.175:3306)

Thu Mar 15 14:08:40 2018 - [info] Current Alive Master: 10.10.100.175(10.10.100.175:3306)

Thu Mar 15 14:08:40 2018 - [info] Checking slave configurations..

Thu Mar 15 14:08:40 2018 - [info]  read_only=1 is not set on slave 10.10.100.181(10.10.100.181:3306).

Thu Mar 15 14:08:40 2018 - [info]  read_only=1 is not set on slave 10.10.100.184(10.10.100.184:3306).

Thu Mar 15 14:08:40 2018 - [info] Checking replication filtering settings..

Thu Mar 15 14:08:40 2018 - [info]  binlog_do_db= , binlog_ignore_db=

Thu Mar 15 14:08:40 2018 - [info]  Replication filtering check ok.

Thu Mar 15 14:08:40 2018 - [info] GTID (with auto-pos) is not supported

Thu Mar 15 14:08:40 2018 - [info] Starting SSH connection tests..

Thu Mar 15 14:08:43 2018 - [info] All SSH connection tests passed successfully.

Thu Mar 15 14:08:43 2018 - [info] Checking MHA Node version..

Thu Mar 15 14:08:43 2018 - [info]  Version check ok.

Thu Mar 15 14:08:43 2018 - [info] Checking SSH publickey authentication settings on the current master..

Thu Mar 15 14:08:43 2018 - [info] HealthCheck: SSH to 10.10.100.175 is reachable.

Thu Mar 15 14:08:43 2018 - [info] Checking recovery script configurations on 10.10.100.175(10.10.100.175:3306)..

Thu Mar 15 14:08:43 2018 - [info]   Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data --output_file=/data/save_binary_logs_test --manager_version=0.57 --start_file=mysql-bin.000009

Thu Mar 15 14:08:43 2018 - [info]   Connecting to root@10.10.100.175(10.10.100.175:22)..

  Creating /data if not exists..    ok.

  Checking output directory is accessible or not..

   ok.

  Binlog found at /data, up to mysql-bin.000009

Thu Mar 15 14:08:44 2018 - [info] Binlog setting check done.

Thu Mar 15 14:08:44 2018 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..

Thu Mar 15 14:08:44 2018 - [info]   Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=10.10.100.181 --slave_ip=10.10.100.181 --slave_port=3306 --workdir=/data --target_version=5.7.21-log --manager_version=0.57 --relay_log_info=/data/relay-log.info  --relay_dir=/data/  --slave_pass=xxx

Thu Mar 15 14:08:44 2018 - [info]   Connecting to root@10.10.100.181(10.10.100.181:22)..

  Checking slave recovery environment settings..

    Opening /data/relay-log.info ... ok.

    Relay log found at /data, up to test-1-relay-bin.000014

    Temporary relay log file is /data/test-1-relay-bin.000014

    Checking if super_read_only is defined and turned on.. not present or turned off, ignoring.

    Testing mysql connection and privileges..

mysql: [Warning] Using a password on the command line interface can be insecure.

 done.

    Testing mysqlbinlog output.. done.

    Cleaning up test file(s).. done.

Thu Mar 15 14:08:44 2018 - [info]   Executing command : apply_diff_relay_logs --command=test --slave_user='root' --slave_host=10.10.100.184 --slave_ip=10.10.100.184 --slave_port=3306 --workdir=/data --target_version=5.7.21-log --manager_version=0.57 --relay_log_info=/data/relay-log.info  --relay_dir=/data/  --slave_pass=xxx

Thu Mar 15 14:08:44 2018 - [info]   Connecting to root@10.10.100.184(10.10.100.184:22)..

  Checking slave recovery environment settings..

    Opening /data/relay-log.info ... ok.

    Relay log found at /data, up to test-2-relay-bin.000014

    Temporary relay log file is /data/test-2-relay-bin.000014

    Checking if super_read_only is defined and turned on.. not present or turned off, ignoring.

    Testing mysql connection and privileges..

mysql: [Warning] Using a password on the command line interface can be insecure.

 done.

    Testing mysqlbinlog output.. done.

    Cleaning up test file(s).. done.

Thu Mar 15 14:08:45 2018 - [info] Slaves settings check done.

Thu Mar 15 14:08:45 2018 - [info]

10.10.100.175(10.10.100.175:3306) (current master)

 +--10.10.100.181(10.10.100.181:3306)

 +--10.10.100.184(10.10.100.184:3306)

Thu Mar 15 14:08:45 2018 - [info] Checking replication health on 10.10.100.181..

Thu Mar 15 14:08:45 2018 - [info]  ok.

Thu Mar 15 14:08:45 2018 - [info] Checking replication health on 10.10.100.184..

Thu Mar 15 14:08:45 2018 - [info]  ok.

Thu Mar 15 14:08:45 2018 - [info] Checking master_ip_failover_script status:

Thu Mar 15 14:08:45 2018 - [info]   /usr/local/bin/master_ip_failover_ --command=status --ssh_user=root --orig_master_host=10.10.100.175 --orig_master_ip=10.10.100.175 --orig_master_port=3306 

IN SCRIPT TEST====/sbin/ifconfig eth0:1 down==/sbin/ifconfig eth0:1 10.10.100.191/24===

Checking the Status of the script.. OK

Thu Mar 15 14:08:45 2018 - [info]  OK.

Thu Mar 15 14:08:45 2018 - [warning] shutdown_script is not defined.

Thu Mar 15 14:08:45 2018 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

坑

mysqlbinlog

Can't exec "mysqlbinlog": No such file or directory at /usr/local/share/perl5/MHA/BinlogManager.pm line 99.

mysqlbinlog version not found!

Testing mysql connection and privileges..sh: mysql: command not found

解决方法如下，添加软连接（所有节点）

ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog

ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql

super_read_only
新版的mha4mysql 需要检测super_read_only 变量。但是mysql 5.5 中是没有这个参数的只能选最新版的mysql与之相对应。 mysql 5.7 中才有,所以需要安装mysql 5.7

libmysqlclient.so.18

在执行过程中有可能会提示如下错误：

Sat Mar 10 05:33:43 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln301] install_driver(mysql) failed: Can't locate DBD/mysql.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at (eval 37) line 3.

Perhaps the DBD::mysql perl module hasn't been fully installed,

or perhaps the capitalisation of 'mysql' isn't right.

Available drivers: DBM, ExampleP, File, Gofer, Proxy, SQLite, Sponge.

at /usr/local/share/perl5/MHA/DBHelper.pm line 208.

at /usr/local/share/perl5/MHA/Server.pm line 166.

Sat Mar 10 05:33:43 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln301] install_driver(mysql) failed: Can't locate DBD/mysql.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at (eval 37) line 3.

Perhaps the DBD::mysql perl module hasn't been fully installed,

or perhaps the capitalisation of 'mysql' isn't right.

Available drivers: DBM, ExampleP, File, Gofer, Proxy, SQLite, Sponge.

at /usr/local/share/perl5/MHA/DBHelper.pm line 208.

at /usr/local/share/perl5/MHA/Server.pm line 166.

Sat Mar 10 05:33:43 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln301] install_driver(mysql) failed: Can't locate DBD/mysql.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at (eval 37) line 3.

Perhaps the DBD::mysql perl module hasn't been fully installed,

or perhaps the capitalisation of 'mysql' isn't right.

Available drivers: DBM, ExampleP, File, Gofer, Proxy, SQLite, Sponge.

at /usr/local/share/perl5/MHA/DBHelper.pm line 208.

at /usr/local/share/perl5/MHA/Server.pm line 166.

Sat Mar 10 05:33:44 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln309] Got fatal error, stopping operations

Sat Mar 10 05:33:44 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 329.

Sat Mar 10 05:33:44 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.

Sat Mar 10 05:33:44 2018 - [info] Got exit code 1 (Not master dead).

其中，”Can't locate DBD/mysql.pm" 原因是缺少相应的库文件。可是我们明明在安装的时候已经安装了Mysql-community-libs-5.7.21-1.el7.x86_64 。那么有可能就是版本不对了。常规解决方案为:

yum install -y perl-DBD-mysql

可是在安装的过程中发现，无法安装，提示冲突：

base                                                                                                                | 3.6 kB  00:00:00

dockerrepo                                                                                                          | 2.9 kB  00:00:00

epel/x86_64/metalink                                                                                                | 7.8 kB  00:00:00

extras                                                                                                              | 3.4 kB  00:00:00

updates                                                                                                             | 3.4 kB  00:00:00

(1/2): extras/7/x86_64/primary_db                                                                                   | 181 kB  00:00:00

(2/2): updates/7/x86_64/primary_db                                                                                  | 6.9 MB  00:09:32

Loading mirror speeds from cached hostfile

* base: mirrors.163.com

* epel: mirror01.idc.hinet.net

* extras: mirrors.sohu.com

* updates: mirrors.sohu.com

Resolving Dependencies

--> Running transaction check

---> Package perl-DBD-MySQL.x86_64 0:4.023-5.el7 will be installed

--> Processing Dependency: libmysqlclient.so.18(libmysqlclient_18)(64bit) for package: perl-DBD-MySQL-4.023-5.el7.x86_64

--> Processing Dependency: libmysqlclient.so.18()(64bit) for package: perl-DBD-MySQL-4.023-5.el7.x86_64

--> Running transaction check

---> Package mariadb-libs.x86_64 1:5.5.56-2.el7 will be installed

Removing mariadb-libs.x86_64 1:5.5.56-2.el7 - u due to obsoletes from installed mysql-community-libs-5.7.21-1.el7.x86_64

--> Restarting Dependency Resolution with new changes.

--> Running transaction check

---> Package mariadb-libs.x86_64 1:5.5.56-2.el7 will be installed

--> Processing Dependency: libmysqlclient.so.18(libmysqlclient_18)(64bit) for package: perl-DBD-MySQL-4.023-5.el7.x86_64

--> Processing Dependency: libmysqlclient.so.18()(64bit) for package: perl-DBD-MySQL-4.023-5.el7.x86_64

--> Finished Dependency Resolution

Error: Package: perl-DBD-MySQL-4.023-5.el7.x86_64 (base)

      Requires: libmysqlclient.so.18(libmysqlclient_18)(64bit)

Error: Package: perl-DBD-MySQL-4.023-5.el7.x86_64 (base)

      Requires: libmysqlclient.so.18()(64bit)

You could try using --skip-broken to work around the problem

You could try running: rpm -Va --nofiles --nodigest

有两个信息：

RPM包冲突mysql-community-libs-5.7.21-1.el7.x86_64与 mariadb-libs.x86_64 1:5.5.56-2.el7 冲突。
缺少libmysqlclient.so.18(libmysqlclient_18)

先把缺少的库安装上：这里我们可以通过安装两者（MySQL-shared-5.6.39-1.el7.x86_64.rpm 、mariadb-libs.x86_64 1:5.5.56-2.el7.rpm 两者之一来实现。在安装时，如果已经安装了Mysql 5.7 数据库，必定会遇到rpm包冲突的问题。此时需要使用 –force –nodeps 参数来实现强制安装。如：

rpm -ivh --force --nodeps MySQL-shared-5.6.39-1.el7.x86_64.rpm

FIXME_xxx

错误信息

Thu Mar 15 09:13:58 2018 - [info]  ok.

Thu Mar 15 09:13:58 2018 - [info] Checking master_ip_failover_script status:

Thu Mar 15 09:13:58 2018 - [info]   /etc/masterha/scripts/master_ip_failover --command=status --ssh_user=root --orig_master_host=10.10.100.175 --orig_master_ip=10.10.100.175 --orig_master_port=3306

Bareword "FIXME_xxx" not allowed while "strict subs" in use at /etc/masterha/scripts/master_ip_failover line 93.

Execution of /etc/masterha/scripts/master_ip_failover aborted due to compilation errors.

Thu Mar 15 09:13:58 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln229]  Failed to get master_ip_failover_script status with return code 255:0.

Thu Mar 15 09:13:58 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations.  at /usr/local/bin/masterha_check_repl line 48.

Thu Mar 15 09:13:58 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.

Thu Mar 15 09:13:58 2018 - [info] Got exit code 1 (Not master dead).

MySQL Replication Health is NOT OK!

原因 master_ip_failover 没有选择VIP 方式。MHA 支持脚本管理VIP 和 keepalive 方式。出现这个错误是因为没有选择VIP管理方式，也没有做相应的配置。
解决方法 选择好配置VIP方式,并修改master_ip_failover脚本。脚本内容参见配置VIP

4.9 启停MHA

masterha_manager 要以守护进程的形式启动,否则,当完成一次failover（非主动切换)后，manager 会自动停止日后管理也方便。

4.9.1 通过daemonize 启动

最初的目的是想在发生 failover 后，manager 仍可继续工作。可是发现不管怎么配置,都会停掉。尝试方法包含：nohup,daemonize, 配置系统service服务。

daemonize

daemonize -p /var/log/masterha/app1/3307.pid -l /var/log/masterha/app1/3307.lock \

/usr/local/bin/masterha_manager --global_conf=/etc/masterha/app_default.cnf --conf=/etc/\

masterha/app1.cnf --ignore-last_failover>> /var/log/masterha/app1/app1_3307.log 2>&1

service 脚本如下:

#!/bin/env bash

STARTEXEC="/usr/local/bin/masterha_manager --conf"

STOPEXEC="/usr/local/bin/masterha_stop --conf"

CONF="/etc/masterha/app1.cnf"

process_count=`ps -ef |grep masterha_manager|grep -v grep|wc -l`

PARAMS="--ignore_last_failover"

case "$1" in

  start)

      if [ $process_count -gt 1 ]

      then

              echo "masterha_manager exists, process is already running"

      else

              echo "Starting Masterha Manager"

              $STARTEXEC $CONF $PARAMS

      fi

      ;;

  stop)

      if [ $process_count -eq 0 ]

      then

              echo "Masterha Manager does not exist, process is not running"

      else

              echo "Stopping ..."

              $STOPEXEC $CONF

              while(true)

              do

                  process_count=`ps -ef |grep masterha_manager|grep -v grep|wc -l`

                  if [ $process_count -gt 0 ]

                      then

                           sleep 1

                      else

                           break;

                  fi

              done;

              echo "Master Manager stopped"

      fi

      ;;

  *)

      echo "Please use start or stop as first argument"

      ;;

esac

启动参数介绍：
- remove_dead_master_conf
  
  该参数代表当发生主从切换后，老的主库的ip将会从配置文件中移除。
- manger_log
  日志存放位置
- ignore_last_failover
  在缺省情况下，如果MHA检测到连续发生宕机,且两次宕机间隔不足8小时的话,则不会进行Failover,之所以这样限制是为了避免ping-pong效应.该参数代表忽略上次MHA触发切换产生的文件. 默认情况下,MHA发生切换后会在日志目录,也就是上面我设置的app1.cnf 参数文件中的manager_workdir中产生app1.failover.complete文件. 下次再次切换的时候如果发现该目录下存在该文件将不允许触发切换,除非在第一次切换后收到删除该文件,为了方便进行切换,启动时添加参数ignore_last_failover.
- last_failover_minute=(minutes)
  当最近的一个failover 切换发生在last_failover_minute(默认为8小时) 之内，MHA manager 将不会在切换。因为它会认为有些问题没有得到解决。如果设置了 –ignore_last_failover 参数，参数(–last_failover_minute) 将会失效

4.9.2 检查MHA状态

masterha_check_status --conf=/etc/masterha/app1.cnf       ##查看状态

正常情况返回:

app1 (pid:4931) is running(0:PING_OK), master:10.10.100.175

4.9.3 停止MHA

masterha_stop --conf=/etc/masterha/app1.cnf

4.9.4 坑

在启动MHA时，遇到了错误信息:

Tue Mar 20 08:47:33 2018 - [info]   Connecting to root@10.10.100.175(10.10.100.184:22)..

Checking slave recovery environment settings..

  Opening /data/relay-log.info ... ok.

  Relay log found at /data, up to test-3-relay-bin.000002

  Temporary relay log file is /data/test-3-relay-bin.000002

  Checking if super_read_only is defined and turned on..install_driver(mysql) failed: Can't locate DBD/mysql.pm in @INC (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at (eval 11) line 3.

这个错误，请注意是连到 10.10.100.184 后进行的检查项。我在本机上不停的分析，头发都薅光了，后来才发现是检查的另外一台机器. 然后执行 yum install -y perl-DBD-MySQL 即可。

4.10 主从切换

4.10.1 主动切换

主动切换时，要将manager关闭。

masterha_master_switch --conf=/etc/masterha/app1.cnf -master_state=alive --new_master_host=10.10.100.181 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000

日志如下。

Mon Mar 19 10:23:32 2018 - [info] MHA::MasterRotate version 0.57.

Mon Mar 19 10:23:32 2018 - [info] Starting online master switch..

Mon Mar 19 10:23:32 2018 - [info]

Mon Mar 19 10:23:32 2018 - [info] * Phase 1: Configuration Check Phase..

Mon Mar 19 10:23:32 2018 - [info]

Mon Mar 19 10:23:32 2018 - [info] Reading default configuration from /etc/masterha_default.cnf..

Mon Mar 19 10:23:32 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..

Mon Mar 19 10:23:32 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..

Mon Mar 19 10:23:33 2018 - [info] GTID failover mode = 0

Mon Mar 19 10:23:33 2018 - [info] Current Alive Master: 10.10.100.175(10.10.100.175:3306)

Mon Mar 19 10:23:33 2018 - [info] Alive Slaves:

Mon Mar 19 10:23:33 2018 - [info]   10.10.100.181(10.10.100.181:3306)  Version=5.7.21-log (oldest major version between slaves) log-bin:enabled

Mon Mar 19 10:23:33 2018 - [info]     Replicating from 10.10.100.175(10.10.100.175:3306)

Mon Mar 19 10:23:33 2018 - [info]     Primary candidate for the new Master (candidate_master is set)

Mon Mar 19 10:23:33 2018 - [info]   10.10.100.184(10.10.100.184:3306)  Version=5.7.21-log (oldest major version between slaves) log-bin:enabled

Mon Mar 19 10:23:33 2018 - [info]     Replicating from 10.10.100.175(10.10.100.175:3306)

It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 10.10.100.175(10.10.100.175:3306)? (YES/no): YES

Mon Mar 19 10:23:40 2018 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..

Mon Mar 19 10:23:40 2018 - [info]  ok.

Mon Mar 19 10:23:40 2018 - [info] Checking MHA is not monitoring or doing failover..

Mon Mar 19 10:23:40 2018 - [info] Checking replication health on 10.10.100.181..

Mon Mar 19 10:23:40 2018 - [info]  ok.

Mon Mar 19 10:23:40 2018 - [info] Checking replication health on 10.10.100.184..

Mon Mar 19 10:23:40 2018 - [info]  ok.

Mon Mar 19 10:23:40 2018 - [info] 10.10.100.181 can be new master.

Mon Mar 19 10:23:40 2018 - [info]

From:

10.10.100.175(10.10.100.175:3306) (current master)

 +--10.10.100.181(10.10.100.181:3306)

 +--10.10.100.184(10.10.100.184:3306)

To:

10.10.100.181(10.10.100.181:3306) (new master)

 +--10.10.100.184(10.10.100.184:3306)

 +--10.10.100.175(10.10.100.175:3306)

Starting master switch from 10.10.100.175(10.10.100.175:3306) to 10.10.100.181(10.10.100.181:3306)? (yes/NO): yes

Mon Mar 19 10:23:44 2018 - [info] Checking whether 10.10.100.181(10.10.100.181:3306) is ok for the new master..

Mon Mar 19 10:23:44 2018 - [info]  ok.

Mon Mar 19 10:23:44 2018 - [info] 10.10.100.175(10.10.100.175:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.

Mon Mar 19 10:23:44 2018 - [info] 10.10.100.175(10.10.100.175:3306): Resetting slave pointing to the dummy host.

Mon Mar 19 10:23:44 2018 - [info] ** Phase 1: Configuration Check Phase completed.

Mon Mar 19 10:23:44 2018 - [info]

Mon Mar 19 10:23:44 2018 - [info] * Phase 2: Rejecting updates Phase..

Mon Mar 19 10:23:44 2018 - [info]

Mon Mar 19 10:23:44 2018 - [info] Executing master ip online change script to disable write on the current master:

Mon Mar 19 10:23:44 2018 - [info]   /usr/local/bin/master_ip_online_change --command=stop --orig_master_host=10.10.100.175 --orig_master_ip=10.10.100.175 --orig_master_port=3306 --orig_master_user='root' --new_master_host=10.10.100.181 --new_master_ip=10.10.100.181 --new_master_port=3306 --new_master_user='root' --orig_master_ssh_user=root --new_master_ssh_user=root   --orig_master_is_new_slave --orig_master_password=xxx --new_master_password=xxx

Mon Mar 19 10:23:44 2018 719895 Set read_only on the new master.. ok.

Mon Mar 19 10:23:44 2018 724609 Drpping app user on the orig master..

Mon Mar 19 10:23:44 2018 725404 Set read_only=1 on the orig master.. ok.

Mon Mar 19 10:23:44 2018 727975 Killing all application threads..

Mon Mar 19 10:23:44 2018 728012 done.

Mon Mar 19 10:23:44 2018 - [info]  ok.

Mon Mar 19 10:23:44 2018 - [info] Locking all tables on the orig master to reject updates from everybody (including root):

Mon Mar 19 10:23:44 2018 - [info] Executing FLUSH TABLES WITH READ LOCK..

Mon Mar 19 10:23:44 2018 - [info]  ok.

Mon Mar 19 10:23:44 2018 - [info] Orig master binlog:pos is mysql-bin.000012:194.

Mon Mar 19 10:23:44 2018 - [info]  Waiting to execute all relay logs on 10.10.100.181(10.10.100.181:3306)..

Mon Mar 19 10:23:44 2018 - [info]  master_pos_wait(mysql-bin.000012:194) completed on 10.10.100.181(10.10.100.181:3306). Executed 0 events.

Mon Mar 19 10:23:44 2018 - [info]   done.

Mon Mar 19 10:23:44 2018 - [info] Getting new master's binlog name and position..

Mon Mar 19 10:23:44 2018 - [info]  mysql-bin.000015:234

Mon Mar 19 10:23:44 2018 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.10.100.181', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000015', MASTER_LOG_POS=234, MASTER_USER='rep1', MASTER_PASSWORD='xxx';

Mon Mar 19 10:23:44 2018 - [info] Executing master ip online change script to allow write on the new master:

Mon Mar 19 10:23:44 2018 - [info]   /usr/local/bin/master_ip_online_change --command=start --orig_master_host=10.10.100.175 --orig_master_ip=10.10.100.175 --orig_master_port=3306 --orig_master_user='root' --new_master_host=10.10.100.181 --new_master_ip=10.10.100.181 --new_master_port=3306 --new_master_user='root' --orig_master_ssh_user=root --new_master_ssh_user=root   --orig_master_is_new_slave --orig_master_password=xxx --new_master_password=xxx

Mon Mar 19 10:23:44 2018 944521 Set read_only=0 on the new master.

Mon Mar 19 10:23:44 2018 945378 Creating app user on the new master..

Mon Mar 19 10:23:44 2018 - [info]  ok.

Mon Mar 19 10:23:44 2018 - [info]

Mon Mar 19 10:23:44 2018 - [info] * Switching slaves in parallel..

Mon Mar 19 10:23:44 2018 - [info]

Mon Mar 19 10:23:44 2018 - [info] -- Slave switch on host 10.10.100.184(10.10.100.184:3306) started, pid: 30724

Mon Mar 19 10:23:44 2018 - [info]

Mon Mar 19 10:23:45 2018 - [info] Log messages from 10.10.100.184 ...

Mon Mar 19 10:23:45 2018 - [info]

Mon Mar 19 10:23:44 2018 - [info]  Waiting to execute all relay logs on 10.10.100.184(10.10.100.184:3306)..

Mon Mar 19 10:23:44 2018 - [info]  master_pos_wait(mysql-bin.000012:194) completed on 10.10.100.184(10.10.100.184:3306). Executed 0 events.

Mon Mar 19 10:23:44 2018 - [info]   done.

Mon Mar 19 10:23:44 2018 - [info]  Resetting slave 10.10.100.184(10.10.100.184:3306) and starting replication from the new master 10.10.100.181(10.10.100.181:3306)..

Mon Mar 19 10:23:45 2018 - [info]  Executed CHANGE MASTER.

Mon Mar 19 10:23:45 2018 - [info]  Slave started.

Mon Mar 19 10:23:45 2018 - [info] End of log messages from 10.10.100.184 ...

Mon Mar 19 10:23:45 2018 - [info]

Mon Mar 19 10:23:45 2018 - [info] -- Slave switch on host 10.10.100.184(10.10.100.184:3306) succeeded.

Mon Mar 19 10:23:45 2018 - [info] Unlocking all tables on the orig master:

Mon Mar 19 10:23:45 2018 - [info] Executing UNLOCK TABLES..

Mon Mar 19 10:23:45 2018 - [info]  ok.

Mon Mar 19 10:23:45 2018 - [info] Starting orig master as a new slave..

Mon Mar 19 10:23:45 2018 - [info]  Resetting slave 10.10.100.175(10.10.100.175:3306) and starting replication from the new master 10.10.100.181(10.10.100.181:3306)..

Mon Mar 19 10:23:45 2018 - [info]  Executed CHANGE MASTER.

Mon Mar 19 10:23:45 2018 - [info]  Slave started.

Mon Mar 19 10:23:45 2018 - [info] All new slave servers switched successfully.

Mon Mar 19 10:23:45 2018 - [info]

Mon Mar 19 10:23:45 2018 - [info] * Phase 5: New master cleanup phase..

Mon Mar 19 10:23:45 2018 - [info]

Mon Mar 19 10:23:46 2018 - [info]  10.10.100.181: Resetting slave info succeeded.

Mon Mar 19 10:23:46 2018 - [info] Switching master to 10.10.100.181(10.10.100.181:3306) completed successfully.

切换完成后，原来的主库175，已变为从库，并且以181 为主库，进行数据同步。

4.10.2 被动切换

被动切换时，mha manger 应已启动。

A. 以上两种vip切换方式，建议采用第一种方法； B. 发生主备切换后，manager服务会自动停掉，且在/var/log/masterha/app1下面生成 app1.failover.complete，若再次发生切换需要删除app1.failover.complete文件； C. 测试过程发现一主两从的架构(两从都设置可以担任主角色candidate_master=1)，当旧主故障迁移到备主后，删除app1.failover.complete，再次启动manager，停掉新主后，发现无法正常切换(解决方式：删除/etc/mha/app1/app1.cnf里面的旧主node1的信息后，重新切换正常)； D. arp缓存导致切换VIP后，无法使用问题；

4.10.3 修复故障数据库

通常情况下自动切换以后，原master可能已经废弃掉，待原master主机修复后，如果数据完整的情况下，可能想把原来master重新作为新主库的slave，这时我们可以借助当时自动切换时刻的MHA日志来完成对原master的修复。下面是提取相关日志的命令：

grep -i "All other slaves should start" manager.log

结果如下:

Tue Mar 20 14:40:26 2018 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='10.10.100.181', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000020', MASTER_LOG_POS=234, MASTER_USER='rep1', MASTER_PASSWORD='xxx';

4.10.4 坑

权限问题报错信息：

Got Error: DBI connect(';host=10.10.100.181;port=3306;mysql_connect_timeout=4','root',...) failed: Access denied for user 'root'@'test-1' (using password: YES) at /usr/local/share/perl5/MHA/DBHelper.pm line 208.

at /usr/local/bin/master_ip_online_change line 132.

Mon Mar 19 06:19:48 2018 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/local/bin/masterha_master_switch line 53.

该信息有明确提示 Access denied for user 'root'@'test-1' (using password: YES 说明是权限问题。需要相应授权：

mysql> grant all on *.* to 'root'@'test-1' identified by "^NJI0okm";

Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> flush privileges;

Query OK, 0 rows affected (0.00 sec)

这里有个坑，root密码包含了特殊字符 "^",在perl 中识别错误，导致提供无权访问。密码设置稍微简单一些就好。比如大小写和数字的混合。

脚本问题错误信息:

Mon Mar 19 09:39:05 2018 255101 Drpping app user on the orig master..

Got Error: Undefined subroutine &main::FIXME_xxx_drop_app_user called at /usr/local/bin/master_ip_online_change line 152.