KingbaseES V8R3 集群运维系列 -- failover切换后集群自动恢复
案例说明:
KingbaseES V8R3集群默认在触发failover切换后,为保证数据安全,原主库需要通过人工介入后,恢复为新的备库加入到集群。在无人值守的现场环境,需要在触发failover切换后,主库可以自动恢复为新备考加入集群,提升架构的高可用性。
适用版本:
KingbaseES V8R3
集群架构:
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replicatio
n_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-----------
--------
0 | 192.168.1.101 | 54321 | up | 0.500000 | standby | 0 | true | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | primary | 0 | false | 0
(2 rows)
一、配置AUTO_PRIMARY_RECOVERY参数
Tips:
AUTO_PRIMARY_RECOVERY参数配置在HAmodule.conf文件中,需要修改db和kingbasecluster目录下相关配置文件。
[kingbase@node102 bin]$ cat ../etc/HAmodule.conf |grep -i auto
#automatic recovery log path.example:RECOVERY_LOG_DIR="./log/recovery.log"
#whether to turn on automatic recovery,0->off,1->on.example:AUTO_PRIMARY_RECOVERY="1"
AUTO_PRIMARY_RECOVERY=0
---如上所示,默认AUTO_PRIMARY_RECOVERY=0不支持主库在failover切换后,自动降为备库加入到集群。
如下图所示:配置主库自动恢复

二、failover切换测试
1、模拟主库数据库服务down
[kingbase@node102 bin]$ ./sys_ctl stop -D ../data
waiting for server to shut down.... done
server stopped
2、切换后集群节点状态
TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replicatio
n_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-----------
--------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | standby | 0 | false | 0
(2 rows)
---如上所示,failover切换后,集群恢复正常,原主库(102)作为备库加入到集群。
3、主备流复制状态
TEST=# select * from sys_stat_replication;
PID | USESYSID | USENAME | APPLICATION_NAME | CLIENT_ADDR | CLIENT_HOSTNAME | CLIENT_PORT | BACK
END_START | BACKEND_XMIN | STATE | SENT_LOCATION | WRITE_LOCATION | FLUSH_LOCATION | REPLAY_LOCAT
ION | SYNC_PRIORITY | SYNC_STATE
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------
------------------+--------------+-----------+---------------+----------------+----------------+-------------
----+---------------+------------
16942 | 10 | SYSTEM | node2 | 192.168.1.102 | | 16773 | 2023-02-22 1
4:29:08.870998+08 | | streaming | 0/D001FDF0 | 0/D001FDF0 | 0/D001FDF0 | 0/D001FDF0
| 2 | sync
(1 row)
三、查看failover切换日志
如下所示,执行failover_stream.sh触发failover切换。
1、新主库failover.log
-----------------2023-02-22 14:28:13 failover beging---------------------------------------
----failover-stats is %H = hostname of the new master node [192.168.1.101], %P = old primary node id [1], %d = node id[1], %h = host name [192.168.1.102], %O = old primary host[192.168.1.102] %m = new master node id [0], %M = old master node id [0], %D = database cluster path [/home/kingbase/cluster/HAR3/db/data].
----ping trust ip
ping trust ip 192.168.1.1 success ping times :[3], success times:[2]
----determine whether the faulty db is master or standby
master down, let 192.168.1.101 become new primary.....
2023-02-22 14:28:15 del old primary VIP on 192.168.1.102
es_client connect host:192.168.1.102 success, will stop old primary db and del the vip
stop the old primary db
DEL VIP NOW AT 2023-02-22 14:28:15 ON enp0s3
sys_ctl: PID file "/home/kingbase/cluster/HAR3/db/data/kingbase.pid" does not exist
Is server running?
execute: [/sbin/ip addr del 192.168.1.204/24 dev enp0s3]
Oprate del ip cmd end.
2023-02-22 14:28:15 add VIP on 192.168.1.101
ADD VIP NOW AT 2023-02-22 14:28:15 ON enp0s3
execute: [/sbin/ip addr add 192.168.1.204/24 dev enp0s3 label enp0s3:2]
execute: /home/kingbase/cluster/HAR3/db/bin//arping -U 192.168.1.204 -I enp0s3 -w 1
Success to send 1 packets
2023-02-22 14:28:15 promote begin...let 192.168.1.101 become master
check db if is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-02-22 14:28:16 kingbase is ok , to prepare execute promote
execute promote
server promoting
check db if is alive after promote
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-02-22 14:28:16 after execute promote , kingbase status is ok.
after execute promote, kingbase is ok.
2023-02-22 14:28:16 sync to async
ALTER SYSTEM
SYS_RELOAD_CONF
-----------------
t
(1 row)
2023-02-22 14:28:16 make checkpoint
check the db to see if it is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-02-22 14:28:16 kingbase is ok , to prepare execute checkpoint
execute checkpoint
CHECKPOINT
check the db to see if it is alive after execute checkpoint
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-02-22 14:28:16 after execute checkpoint, kingbase is ok.
after execute checkpoint, kingbase is ok.
-----------------2023-02-22 14:28:16 failover end---------------------------------------
2、原主库recovery.log
如下所示,在failover切换后,通过sys_rewind将原主库恢复为备库,并加入到集群。
---------------------------------------------------------------------
2023-02-22 14:29:01 recover beging...
my pid is 21729,officially began to perform recovery
2023-02-22 14:29:01 check read/write on mount point
2023-02-22 14:29:01 check read/write on mount point (1 / 6).
2023-02-22 14:29:01 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...
2023-02-22 14:29:01 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ... OK
2023-02-22 14:29:01 create/write the file "/home/kingbase/cluster/HAR3/db/data/rw_status_file_625758242" ...
........
2023-02-22 14:29:01 success to check read/write on mount point (1 / 6).
2023-02-22 14:29:01 check read/write on mount point ... ok
2023-02-22 14:29:01 check if the network is ok
ping trust ip 192.168.1.1 success ping times :[3], success times:[2]
determine if i am master or standby
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | down | 0.500000 | standby | 0 | false | 0
(2 rows)
i am standby in cluster,determine if recovery is needed
2023-02-22 14:29:03 now will del vip [192.168.1.204/24]
now, there is no 192.168.1.204/24 on my DEV
sys_ctl: PID file "/home/kingbase/cluster/HAR3/db/data/kingbase.pid" does not exist
Is server running?
primary node/Im node status is changed, primary ip[192.168.1.101], recovery.conf NEED_CHANGE [1] (0 is need ), I,m status is [2] (1 is down), I will be in recovery.
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | down | 0.500000 | standby | 0 | false | 0
(2 rows)
if recover node up, let it down , for rewind
2023-02-22 14:29:03 sys_rewind...
sys_rewind --target-data=/home/kingbase/cluster/HAR3/db/data --source-server="host=192.168.1.101 port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST"
datadir_source = /home/kingbase/cluster/HAR3/db/data
rewinding from last common checkpoint at 0/CF000028 on timeline 4
find last common checkpoint start time from 2023-02-22 14:29:03.926782 CST to 2023-02-22 14:29:03.985859 CST, in "0.059077" seconds.
reading source file list
reading target file list
reading WAL in target
Rewind datadir file from source
Get archive xlog list from source
Rewind archive log from source
update the control file: minRecoveryPoint is '0/D001F0B0', minRecoveryPointTLI is '5', and database state is 'in archive recovery'
rewind start wal location 0/CF000028 (file 0000000400000000000000CF), end wal location 0/D001F0B0 (file 0000000500000000000000D0). time from 2023-02-22 14:29:05.926782 CST to 2023-02-22 14:29:06.184927 CST, in "2.258145" seconds.
Done!
sed conf change #synchronous_standby_names
2023-02-22 14:29:08 file operate
cp recovery.conf...
change recovery.conf ip -> primary.ip
2023-02-22 14:29:08 no need change recovery.conf, primary node is 192.168.1.101
delete pid file if exist
del the replication_slots if exist
drop the slot [slot_node1].
drop the slot [slot_node2].
2023-02-22 14:29:08 start up the kingbase...
waiting for server to start....LOG: redirecting log output to logging collector process
HINT: Future log output will appear in directory "/home/kingbase/cluster/HAR3/db/data/sys_log".
done
server started
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
(slot_node1,)
(1 row)
2023-02-22 14:29:10 create the slot [slot_node1] success.
SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
(slot_node2,)
(1 row)
2023-02-22 14:29:10 create the slot [slot_node2] success.
2023-02-22 14:29:10 start up standby successful!
cluster is sync cluster.
SYNC RECOVER MODE ...
2023-02-22 14:29:10 remote primary node change sync
ALTER SYSTEM
SYS_RELOAD_CONF
-----------------
t
(1 row)
SYNC RECOVER MODE DONE
2023-02-22 14:29:13 attach pool...
IM Node is 1, will try [pcp_attach_node -U kingbase -W MTIzNDU2 -h 192.168.1.205 -n 1]
pcp_attach_node -- Command Successful
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | standby | 0 | false | 0
(2 rows)
2023-02-22 14:29:14 attach end..
recovery success,exit script with success
---------------------------------------------------------------------
---如上所示,原主库在failover切换后,触发auto-recovery,被恢复为新的备库加入到集群。
KingbaseES V8R3 集群运维系列 -- failover切换后集群自动恢复的更多相关文章
- KingbaseES V8R3集群运维案例之---主库系统down failover切换过程分析
案例说明: KingbaseES V8R3集群failover时两个cluster都会触发,但只有一个cluster会调用脚本去执行真正的切换流程,另一个有对应的打印,但不会调用脚本,只是走相关的 ...
- KingbaseES V8R3集群管理和维护案例之---failover切换wal日志变化分析
案例说明: 本案例通过对KingbaseES V8R3集群failover切换过程进行观察,分析了主备库切换后wal日志的变化,对应用者了解KingbaseES V8R3(R6) failover ...
- KingbaseES V8R3集群运维案例之---kingbase_monitor.sh启动”two master“案例
案例说明: KingbaseES V8R3集群,执行kingbase_monitor.sh启动集群,出现"two master"节点的故障,启动集群失败:通过手工sys_ctl启动 ...
- KingbaseES V8R3集群运维案例之---cluster.log ERROR: md5 authentication failed
案例说明: 在KingbaseES V8R3集群的cluster.log日志中,经常会出现"ERROR: md5 authentication failed:DETAIL: password ...
- KingbaseES V8R3集群维护案例之---pcp_node_refresh应用
案例说明: 在一次KingbaseES V8R3集群切换分析中,运维人员执行了pcp_node_refresh,导致集群发生了failover的切换.此文档对pcp_node_refresh工具做了应 ...
- KingbaseES V8R3集群维护案例之---在线添加备库管理节点
案例说明: 在KingbaseES V8R3主备流复制的集群中 ,一般有两个节点是集群的管理节点,分为master和standby:如对于一主二备的架构,其中有两个节点是管理节点,三个数据节点:管理节 ...
- KingbaseES V8R3集群管理维护案例之---集群迁移单实例架构
案例说明: 在生产中,需要将KingbaseES V8R3集群转换为单实例架构,可以采用以下方式快速完成集群架构的迁移. 适用版本: KingbaseES V8R3 当前数据库版本: TEST=# s ...
- SQL Server自动化运维系列——关于邮件通知那点事(.Net开发人员的福利)
需求描述 在我们的生产环境中,大部分情况下需要有自己的运维体制,包括自己健康状态的检测等.如果发生异常,需要提前预警的,通知形式一般为发邮件告知. 邮件作为一种非常便利的预警实现方式,在及时性和易用性 ...
- SQL Server自动化运维系列——监控跑批Job运行状态(Power Shell)
需求描述 在我们的生产环境中,大部分情况下需要有自己的运维体制,包括自己健康状态的检测等.如果发生异常,需要提前预警的,通知形式一般为发邮件告知. 在上一篇文章中已经分析了SQL SERVER中关于邮 ...
- saltstack自动化运维系列⑤之saltstack的配置管理详解
saltstack自动化运维系列⑤之saltstack的配置管理详解 配置管理初始化: a.服务端配置vim /etc/salt/master file_roots: base: - /srv/sal ...
随机推荐
- Spring Cloud Gateway微服务网关快速入门
介绍 Spring Cloud Gateway 是 Spring 官方基于 Spring 5.0,Spring Boot 2.0 和 Project Reactor 等技术开发的网关,Spring C ...
- Java并发编程实例--18.修改锁的公平性
ReentrantLock和ReentrantReadWriteLock类的构造函数可接受一个布尔类型参数fair,表示你可以控制这2个类的行为. 其默认值为false,代表non-fair(不公平) ...
- SpringBoot中Redis的基础使用
基础使用 首先引入依赖 <!-- redis依赖--> <dependency> <groupId>org.springframework.boot</gro ...
- Redis原理再学习02:数据结构-动态字符串sds
Redis原理再学习:动态字符串sds 字符 字符就是英文里的一个一个英文字母,比如:a.中文里的单个汉字,比如:好. 字符串就是多个字母或多个汉字组成,比如字符串:redis,中文字符串:你好吗. ...
- 简化Simulink的建模与模型重构
简化Simulink的建模与模型重构 模型重构 Simulink作为汽车和自动化领域中经典的模型工程必备工具,不管是专业的汽车控制器的开发还是自动化控制的专业应用编程,都会使用到Simulink进行图 ...
- 03、Etcd 客户端常用命令
上一讲我们安装 etcd 服务端,这一讲我们来一起学学如何使用 etcd 客户端常见的命令.文章内容来源于参考资料,如若侵权,请联系删除,谢谢. etcd可通过客户端命令行工具 etcdctl 对et ...
- maven配置全局私服地址和阿里云仓库
直接上配置代码 <?xml version="1.0" encoding="UTF-8"?> <!-- Licensed to the Apa ...
- 在 Spring Boot 3.x 中使用 SpringDoc 2 / Swagger V3
SpringDoc V1 只支持到 Spring Boot 2.x springdoc-openapi v1.7.0 is the latest Open Source release support ...
- 在vmware里安装ubuntu的简单过程(具体的见网址)
在官网选择vmware版本为16,安装后,在vmware里升级到最新版.(这个可以解决蓝屏) 在下面的这个文章里下载ubuntu的镜像文件iso,我下载的是16年的,内存为1.6GB,下载的时间用的少 ...
- 简单封装 Flurl
FlurlHttpClient类 public class FlurlHttpClient { private readonly FlurlClient client; public FlurlHtt ...