KingbaseES V8R3 集群运维系列 -- failover切换后集群自动恢复
案例说明:
KingbaseES V8R3集群默认在触发failover切换后,为保证数据安全,原主库需要通过人工介入后,恢复为新的备库加入到集群。在无人值守的现场环境,需要在触发failover切换后,主库可以自动恢复为新备考加入集群,提升架构的高可用性。
适用版本:
KingbaseES V8R3
集群架构:
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replicatio
n_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-----------
--------
0 | 192.168.1.101 | 54321 | up | 0.500000 | standby | 0 | true | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | primary | 0 | false | 0
(2 rows)
一、配置AUTO_PRIMARY_RECOVERY参数
Tips:
AUTO_PRIMARY_RECOVERY参数配置在HAmodule.conf文件中,需要修改db和kingbasecluster目录下相关配置文件。
[kingbase@node102 bin]$ cat ../etc/HAmodule.conf |grep -i auto
#automatic recovery log path.example:RECOVERY_LOG_DIR="./log/recovery.log"
#whether to turn on automatic recovery,0->off,1->on.example:AUTO_PRIMARY_RECOVERY="1"
AUTO_PRIMARY_RECOVERY=0
---如上所示,默认AUTO_PRIMARY_RECOVERY=0不支持主库在failover切换后,自动降为备库加入到集群。
如下图所示:配置主库自动恢复

二、failover切换测试
1、模拟主库数据库服务down
[kingbase@node102 bin]$ ./sys_ctl stop -D ../data
waiting for server to shut down.... done
server stopped
2、切换后集群节点状态
TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replicatio
n_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-----------
--------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | standby | 0 | false | 0
(2 rows)
---如上所示,failover切换后,集群恢复正常,原主库(102)作为备库加入到集群。
3、主备流复制状态
TEST=# select * from sys_stat_replication;
PID | USESYSID | USENAME | APPLICATION_NAME | CLIENT_ADDR | CLIENT_HOSTNAME | CLIENT_PORT | BACK
END_START | BACKEND_XMIN | STATE | SENT_LOCATION | WRITE_LOCATION | FLUSH_LOCATION | REPLAY_LOCAT
ION | SYNC_PRIORITY | SYNC_STATE
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------
------------------+--------------+-----------+---------------+----------------+----------------+-------------
----+---------------+------------
16942 | 10 | SYSTEM | node2 | 192.168.1.102 | | 16773 | 2023-02-22 1
4:29:08.870998+08 | | streaming | 0/D001FDF0 | 0/D001FDF0 | 0/D001FDF0 | 0/D001FDF0
| 2 | sync
(1 row)
三、查看failover切换日志
如下所示,执行failover_stream.sh触发failover切换。
1、新主库failover.log
-----------------2023-02-22 14:28:13 failover beging---------------------------------------
----failover-stats is %H = hostname of the new master node [192.168.1.101], %P = old primary node id [1], %d = node id[1], %h = host name [192.168.1.102], %O = old primary host[192.168.1.102] %m = new master node id [0], %M = old master node id [0], %D = database cluster path [/home/kingbase/cluster/HAR3/db/data].
----ping trust ip
ping trust ip 192.168.1.1 success ping times :[3], success times:[2]
----determine whether the faulty db is master or standby
master down, let 192.168.1.101 become new primary.....
2023-02-22 14:28:15 del old primary VIP on 192.168.1.102
es_client connect host:192.168.1.102 success, will stop old primary db and del the vip
stop the old primary db
DEL VIP NOW AT 2023-02-22 14:28:15 ON enp0s3
sys_ctl: PID file "/home/kingbase/cluster/HAR3/db/data/kingbase.pid" does not exist
Is server running?
execute: [/sbin/ip addr del 192.168.1.204/24 dev enp0s3]
Oprate del ip cmd end.
2023-02-22 14:28:15 add VIP on 192.168.1.101
ADD VIP NOW AT 2023-02-22 14:28:15 ON enp0s3
execute: [/sbin/ip addr add 192.168.1.204/24 dev enp0s3 label enp0s3:2]
execute: /home/kingbase/cluster/HAR3/db/bin//arping -U 192.168.1.204 -I enp0s3 -w 1
Success to send 1 packets
2023-02-22 14:28:15 promote begin...let 192.168.1.101 become master
check db if is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-02-22 14:28:16 kingbase is ok , to prepare execute promote
execute promote
server promoting
check db if is alive after promote
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-02-22 14:28:16 after execute promote , kingbase status is ok.
after execute promote, kingbase is ok.
2023-02-22 14:28:16 sync to async
ALTER SYSTEM
SYS_RELOAD_CONF
-----------------
t
(1 row)
2023-02-22 14:28:16 make checkpoint
check the db to see if it is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-02-22 14:28:16 kingbase is ok , to prepare execute checkpoint
execute checkpoint
CHECKPOINT
check the db to see if it is alive after execute checkpoint
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-02-22 14:28:16 after execute checkpoint, kingbase is ok.
after execute checkpoint, kingbase is ok.
-----------------2023-02-22 14:28:16 failover end---------------------------------------
2、原主库recovery.log
如下所示,在failover切换后,通过sys_rewind将原主库恢复为备库,并加入到集群。
---------------------------------------------------------------------
2023-02-22 14:29:01 recover beging...
my pid is 21729,officially began to perform recovery
2023-02-22 14:29:01 check read/write on mount point
2023-02-22 14:29:01 check read/write on mount point (1 / 6).
2023-02-22 14:29:01 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...
2023-02-22 14:29:01 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ... OK
2023-02-22 14:29:01 create/write the file "/home/kingbase/cluster/HAR3/db/data/rw_status_file_625758242" ...
........
2023-02-22 14:29:01 success to check read/write on mount point (1 / 6).
2023-02-22 14:29:01 check read/write on mount point ... ok
2023-02-22 14:29:01 check if the network is ok
ping trust ip 192.168.1.1 success ping times :[3], success times:[2]
determine if i am master or standby
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | down | 0.500000 | standby | 0 | false | 0
(2 rows)
i am standby in cluster,determine if recovery is needed
2023-02-22 14:29:03 now will del vip [192.168.1.204/24]
now, there is no 192.168.1.204/24 on my DEV
sys_ctl: PID file "/home/kingbase/cluster/HAR3/db/data/kingbase.pid" does not exist
Is server running?
primary node/Im node status is changed, primary ip[192.168.1.101], recovery.conf NEED_CHANGE [1] (0 is need ), I,m status is [2] (1 is down), I will be in recovery.
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | down | 0.500000 | standby | 0 | false | 0
(2 rows)
if recover node up, let it down , for rewind
2023-02-22 14:29:03 sys_rewind...
sys_rewind --target-data=/home/kingbase/cluster/HAR3/db/data --source-server="host=192.168.1.101 port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST"
datadir_source = /home/kingbase/cluster/HAR3/db/data
rewinding from last common checkpoint at 0/CF000028 on timeline 4
find last common checkpoint start time from 2023-02-22 14:29:03.926782 CST to 2023-02-22 14:29:03.985859 CST, in "0.059077" seconds.
reading source file list
reading target file list
reading WAL in target
Rewind datadir file from source
Get archive xlog list from source
Rewind archive log from source
update the control file: minRecoveryPoint is '0/D001F0B0', minRecoveryPointTLI is '5', and database state is 'in archive recovery'
rewind start wal location 0/CF000028 (file 0000000400000000000000CF), end wal location 0/D001F0B0 (file 0000000500000000000000D0). time from 2023-02-22 14:29:05.926782 CST to 2023-02-22 14:29:06.184927 CST, in "2.258145" seconds.
Done!
sed conf change #synchronous_standby_names
2023-02-22 14:29:08 file operate
cp recovery.conf...
change recovery.conf ip -> primary.ip
2023-02-22 14:29:08 no need change recovery.conf, primary node is 192.168.1.101
delete pid file if exist
del the replication_slots if exist
drop the slot [slot_node1].
drop the slot [slot_node2].
2023-02-22 14:29:08 start up the kingbase...
waiting for server to start....LOG: redirecting log output to logging collector process
HINT: Future log output will appear in directory "/home/kingbase/cluster/HAR3/db/data/sys_log".
done
server started
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
(slot_node1,)
(1 row)
2023-02-22 14:29:10 create the slot [slot_node1] success.
SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
(slot_node2,)
(1 row)
2023-02-22 14:29:10 create the slot [slot_node2] success.
2023-02-22 14:29:10 start up standby successful!
cluster is sync cluster.
SYNC RECOVER MODE ...
2023-02-22 14:29:10 remote primary node change sync
ALTER SYSTEM
SYS_RELOAD_CONF
-----------------
t
(1 row)
SYNC RECOVER MODE DONE
2023-02-22 14:29:13 attach pool...
IM Node is 1, will try [pcp_attach_node -U kingbase -W MTIzNDU2 -h 192.168.1.205 -n 1]
pcp_attach_node -- Command Successful
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | standby | 0 | false | 0
(2 rows)
2023-02-22 14:29:14 attach end..
recovery success,exit script with success
---------------------------------------------------------------------
---如上所示,原主库在failover切换后,触发auto-recovery,被恢复为新的备库加入到集群。
KingbaseES V8R3 集群运维系列 -- failover切换后集群自动恢复的更多相关文章
- KingbaseES V8R3集群运维案例之---主库系统down failover切换过程分析
案例说明: KingbaseES V8R3集群failover时两个cluster都会触发,但只有一个cluster会调用脚本去执行真正的切换流程,另一个有对应的打印,但不会调用脚本,只是走相关的 ...
- KingbaseES V8R3集群管理和维护案例之---failover切换wal日志变化分析
案例说明: 本案例通过对KingbaseES V8R3集群failover切换过程进行观察,分析了主备库切换后wal日志的变化,对应用者了解KingbaseES V8R3(R6) failover ...
- KingbaseES V8R3集群运维案例之---kingbase_monitor.sh启动”two master“案例
案例说明: KingbaseES V8R3集群,执行kingbase_monitor.sh启动集群,出现"two master"节点的故障,启动集群失败:通过手工sys_ctl启动 ...
- KingbaseES V8R3集群运维案例之---cluster.log ERROR: md5 authentication failed
案例说明: 在KingbaseES V8R3集群的cluster.log日志中,经常会出现"ERROR: md5 authentication failed:DETAIL: password ...
- KingbaseES V8R3集群维护案例之---pcp_node_refresh应用
案例说明: 在一次KingbaseES V8R3集群切换分析中,运维人员执行了pcp_node_refresh,导致集群发生了failover的切换.此文档对pcp_node_refresh工具做了应 ...
- KingbaseES V8R3集群维护案例之---在线添加备库管理节点
案例说明: 在KingbaseES V8R3主备流复制的集群中 ,一般有两个节点是集群的管理节点,分为master和standby:如对于一主二备的架构,其中有两个节点是管理节点,三个数据节点:管理节 ...
- KingbaseES V8R3集群管理维护案例之---集群迁移单实例架构
案例说明: 在生产中,需要将KingbaseES V8R3集群转换为单实例架构,可以采用以下方式快速完成集群架构的迁移. 适用版本: KingbaseES V8R3 当前数据库版本: TEST=# s ...
- SQL Server自动化运维系列——关于邮件通知那点事(.Net开发人员的福利)
需求描述 在我们的生产环境中,大部分情况下需要有自己的运维体制,包括自己健康状态的检测等.如果发生异常,需要提前预警的,通知形式一般为发邮件告知. 邮件作为一种非常便利的预警实现方式,在及时性和易用性 ...
- SQL Server自动化运维系列——监控跑批Job运行状态(Power Shell)
需求描述 在我们的生产环境中,大部分情况下需要有自己的运维体制,包括自己健康状态的检测等.如果发生异常,需要提前预警的,通知形式一般为发邮件告知. 在上一篇文章中已经分析了SQL SERVER中关于邮 ...
- saltstack自动化运维系列⑤之saltstack的配置管理详解
saltstack自动化运维系列⑤之saltstack的配置管理详解 配置管理初始化: a.服务端配置vim /etc/salt/master file_roots: base: - /srv/sal ...
随机推荐
- vmware之NAT模式配置
题外话之前的题外话,本文迁移自别的社区,三年前大学实习时写下本文,过了几年再回过头来看,虽然讲得浅显,作为入门笔记也勉强合格. ---------------------------------- ...
- win32 - 自动开关光驱
#include <tchar.h> #include <windows.h> #include <mmsystem.h> // for MCI functions ...
- [BUUCTF][Web][ACTF2020 新生赛]Exec 1
打开靶机对应url 显示可以输出ip 来进行ping操作 尝试试一下是否有命令注入的可能 127.0.0.1|ls 果然可以,输出结果 index.php PING 127.0.0.1 (127.0. ...
- 异常处理之raise A from B
raise A from B 语句用于连锁chain异常 from 后面的B可以是: - 异常类 - 异常实例 - None 如果B是异常类或者异常实例,那么B会被设置为A的__cause__属性,表 ...
- 【Azure Cloud Service】Cloud Service(Classic) 迁移失败,找不到解决方案怎么办?
问题描述 很老很老的云服务,在迁移到 Cloud Service(Extended Support)[云服务外延支持] 时,迁移的验证步骤不通过,因为资源中没有包含虚拟网络(Virtual Netwo ...
- 面试必备:一线大厂Redis缓存设计规范与性能优化
说在前面 你是否在使用Redis时,不清楚Redis应该遵循的设计规范而苦恼? 你是否在Redis出现性能问题时,不知道该如何优化而发愁? 你是否被面试官拷问过Redis的设计规范和性能优化而回答不出 ...
- STM32标准库时钟树设置
STM32的系统时钟大致可以分为以下流程 1.外部晶振提供HSE高速外部时钟信号 2.HSE经过PLL锁相环,倍频后得到PLL_CLK高速内部时钟信号 3.PLL_CLK经过分频后得到系统时钟SYSC ...
- 360 数科实践:JanusGraph 到 NebulaGraph 迁移
摘要:在本文中 360 数科的周鹏详细讲解了业务从 JanusGraph 迁移到 Nebula Graph 带来的性能提升,在机器资源不到之前 JanusGraph 配置三分之一的情况下,业务性能提升 ...
- Java 开发人员调度软件项目 (java基础编程总结项目)+javaBean+测试代码+数组知识+数据结构+继承+多态+封装+自定义异常,异常处理+构造器知识+重载+重写+接口+实现接口+关键字使用(static +equalsIgnoreCase+fianl+instanceof判断类型)+向下转型与向上转型
/** * * @Description Java 开发人员调度软件项目 (java基础编程总结项目) * +javaBean+测试代码+数组知识+数据结构+继承+多态+封装+自定义异常,异常处理 * ...
- pip 查看某个包有哪些版本并升级
查看某个包有哪些版本 pip install xxx== 升级包 pip install xxx==1.1