KingbaseES V8R3 集群运维系列 -- failover切换后集群自动恢复
案例说明:
KingbaseES V8R3集群默认在触发failover切换后,为保证数据安全,原主库需要通过人工介入后,恢复为新的备库加入到集群。在无人值守的现场环境,需要在触发failover切换后,主库可以自动恢复为新备考加入集群,提升架构的高可用性。
适用版本:
KingbaseES V8R3
集群架构:
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replicatio
n_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-----------
--------
0 | 192.168.1.101 | 54321 | up | 0.500000 | standby | 0 | true | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | primary | 0 | false | 0
(2 rows)
一、配置AUTO_PRIMARY_RECOVERY参数
Tips:
AUTO_PRIMARY_RECOVERY参数配置在HAmodule.conf文件中,需要修改db和kingbasecluster目录下相关配置文件。
[kingbase@node102 bin]$ cat ../etc/HAmodule.conf |grep -i auto
#automatic recovery log path.example:RECOVERY_LOG_DIR="./log/recovery.log"
#whether to turn on automatic recovery,0->off,1->on.example:AUTO_PRIMARY_RECOVERY="1"
AUTO_PRIMARY_RECOVERY=0
---如上所示,默认AUTO_PRIMARY_RECOVERY=0不支持主库在failover切换后,自动降为备库加入到集群。
如下图所示:配置主库自动恢复

二、failover切换测试
1、模拟主库数据库服务down
[kingbase@node102 bin]$ ./sys_ctl stop -D ../data
waiting for server to shut down.... done
server stopped
2、切换后集群节点状态
TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replicatio
n_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-----------
--------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | standby | 0 | false | 0
(2 rows)
---如上所示,failover切换后,集群恢复正常,原主库(102)作为备库加入到集群。
3、主备流复制状态
TEST=# select * from sys_stat_replication;
PID | USESYSID | USENAME | APPLICATION_NAME | CLIENT_ADDR | CLIENT_HOSTNAME | CLIENT_PORT | BACK
END_START | BACKEND_XMIN | STATE | SENT_LOCATION | WRITE_LOCATION | FLUSH_LOCATION | REPLAY_LOCAT
ION | SYNC_PRIORITY | SYNC_STATE
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------
------------------+--------------+-----------+---------------+----------------+----------------+-------------
----+---------------+------------
16942 | 10 | SYSTEM | node2 | 192.168.1.102 | | 16773 | 2023-02-22 1
4:29:08.870998+08 | | streaming | 0/D001FDF0 | 0/D001FDF0 | 0/D001FDF0 | 0/D001FDF0
| 2 | sync
(1 row)
三、查看failover切换日志
如下所示,执行failover_stream.sh触发failover切换。
1、新主库failover.log
-----------------2023-02-22 14:28:13 failover beging---------------------------------------
----failover-stats is %H = hostname of the new master node [192.168.1.101], %P = old primary node id [1], %d = node id[1], %h = host name [192.168.1.102], %O = old primary host[192.168.1.102] %m = new master node id [0], %M = old master node id [0], %D = database cluster path [/home/kingbase/cluster/HAR3/db/data].
----ping trust ip
ping trust ip 192.168.1.1 success ping times :[3], success times:[2]
----determine whether the faulty db is master or standby
master down, let 192.168.1.101 become new primary.....
2023-02-22 14:28:15 del old primary VIP on 192.168.1.102
es_client connect host:192.168.1.102 success, will stop old primary db and del the vip
stop the old primary db
DEL VIP NOW AT 2023-02-22 14:28:15 ON enp0s3
sys_ctl: PID file "/home/kingbase/cluster/HAR3/db/data/kingbase.pid" does not exist
Is server running?
execute: [/sbin/ip addr del 192.168.1.204/24 dev enp0s3]
Oprate del ip cmd end.
2023-02-22 14:28:15 add VIP on 192.168.1.101
ADD VIP NOW AT 2023-02-22 14:28:15 ON enp0s3
execute: [/sbin/ip addr add 192.168.1.204/24 dev enp0s3 label enp0s3:2]
execute: /home/kingbase/cluster/HAR3/db/bin//arping -U 192.168.1.204 -I enp0s3 -w 1
Success to send 1 packets
2023-02-22 14:28:15 promote begin...let 192.168.1.101 become master
check db if is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-02-22 14:28:16 kingbase is ok , to prepare execute promote
execute promote
server promoting
check db if is alive after promote
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-02-22 14:28:16 after execute promote , kingbase status is ok.
after execute promote, kingbase is ok.
2023-02-22 14:28:16 sync to async
ALTER SYSTEM
SYS_RELOAD_CONF
-----------------
t
(1 row)
2023-02-22 14:28:16 make checkpoint
check the db to see if it is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-02-22 14:28:16 kingbase is ok , to prepare execute checkpoint
execute checkpoint
CHECKPOINT
check the db to see if it is alive after execute checkpoint
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-02-22 14:28:16 after execute checkpoint, kingbase is ok.
after execute checkpoint, kingbase is ok.
-----------------2023-02-22 14:28:16 failover end---------------------------------------
2、原主库recovery.log
如下所示,在failover切换后,通过sys_rewind将原主库恢复为备库,并加入到集群。
---------------------------------------------------------------------
2023-02-22 14:29:01 recover beging...
my pid is 21729,officially began to perform recovery
2023-02-22 14:29:01 check read/write on mount point
2023-02-22 14:29:01 check read/write on mount point (1 / 6).
2023-02-22 14:29:01 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...
2023-02-22 14:29:01 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ... OK
2023-02-22 14:29:01 create/write the file "/home/kingbase/cluster/HAR3/db/data/rw_status_file_625758242" ...
........
2023-02-22 14:29:01 success to check read/write on mount point (1 / 6).
2023-02-22 14:29:01 check read/write on mount point ... ok
2023-02-22 14:29:01 check if the network is ok
ping trust ip 192.168.1.1 success ping times :[3], success times:[2]
determine if i am master or standby
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | down | 0.500000 | standby | 0 | false | 0
(2 rows)
i am standby in cluster,determine if recovery is needed
2023-02-22 14:29:03 now will del vip [192.168.1.204/24]
now, there is no 192.168.1.204/24 on my DEV
sys_ctl: PID file "/home/kingbase/cluster/HAR3/db/data/kingbase.pid" does not exist
Is server running?
primary node/Im node status is changed, primary ip[192.168.1.101], recovery.conf NEED_CHANGE [1] (0 is need ), I,m status is [2] (1 is down), I will be in recovery.
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | down | 0.500000 | standby | 0 | false | 0
(2 rows)
if recover node up, let it down , for rewind
2023-02-22 14:29:03 sys_rewind...
sys_rewind --target-data=/home/kingbase/cluster/HAR3/db/data --source-server="host=192.168.1.101 port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST"
datadir_source = /home/kingbase/cluster/HAR3/db/data
rewinding from last common checkpoint at 0/CF000028 on timeline 4
find last common checkpoint start time from 2023-02-22 14:29:03.926782 CST to 2023-02-22 14:29:03.985859 CST, in "0.059077" seconds.
reading source file list
reading target file list
reading WAL in target
Rewind datadir file from source
Get archive xlog list from source
Rewind archive log from source
update the control file: minRecoveryPoint is '0/D001F0B0', minRecoveryPointTLI is '5', and database state is 'in archive recovery'
rewind start wal location 0/CF000028 (file 0000000400000000000000CF), end wal location 0/D001F0B0 (file 0000000500000000000000D0). time from 2023-02-22 14:29:05.926782 CST to 2023-02-22 14:29:06.184927 CST, in "2.258145" seconds.
Done!
sed conf change #synchronous_standby_names
2023-02-22 14:29:08 file operate
cp recovery.conf...
change recovery.conf ip -> primary.ip
2023-02-22 14:29:08 no need change recovery.conf, primary node is 192.168.1.101
delete pid file if exist
del the replication_slots if exist
drop the slot [slot_node1].
drop the slot [slot_node2].
2023-02-22 14:29:08 start up the kingbase...
waiting for server to start....LOG: redirecting log output to logging collector process
HINT: Future log output will appear in directory "/home/kingbase/cluster/HAR3/db/data/sys_log".
done
server started
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
(slot_node1,)
(1 row)
2023-02-22 14:29:10 create the slot [slot_node1] success.
SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
(slot_node2,)
(1 row)
2023-02-22 14:29:10 create the slot [slot_node2] success.
2023-02-22 14:29:10 start up standby successful!
cluster is sync cluster.
SYNC RECOVER MODE ...
2023-02-22 14:29:10 remote primary node change sync
ALTER SYSTEM
SYS_RELOAD_CONF
-----------------
t
(1 row)
SYNC RECOVER MODE DONE
2023-02-22 14:29:13 attach pool...
IM Node is 1, will try [pcp_attach_node -U kingbase -W MTIzNDU2 -h 192.168.1.205 -n 1]
pcp_attach_node -- Command Successful
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | standby | 0 | false | 0
(2 rows)
2023-02-22 14:29:14 attach end..
recovery success,exit script with success
---------------------------------------------------------------------
---如上所示,原主库在failover切换后,触发auto-recovery,被恢复为新的备库加入到集群。
KingbaseES V8R3 集群运维系列 -- failover切换后集群自动恢复的更多相关文章
- KingbaseES V8R3集群运维案例之---主库系统down failover切换过程分析
案例说明: KingbaseES V8R3集群failover时两个cluster都会触发,但只有一个cluster会调用脚本去执行真正的切换流程,另一个有对应的打印,但不会调用脚本,只是走相关的 ...
- KingbaseES V8R3集群管理和维护案例之---failover切换wal日志变化分析
案例说明: 本案例通过对KingbaseES V8R3集群failover切换过程进行观察,分析了主备库切换后wal日志的变化,对应用者了解KingbaseES V8R3(R6) failover ...
- KingbaseES V8R3集群运维案例之---kingbase_monitor.sh启动”two master“案例
案例说明: KingbaseES V8R3集群,执行kingbase_monitor.sh启动集群,出现"two master"节点的故障,启动集群失败:通过手工sys_ctl启动 ...
- KingbaseES V8R3集群运维案例之---cluster.log ERROR: md5 authentication failed
案例说明: 在KingbaseES V8R3集群的cluster.log日志中,经常会出现"ERROR: md5 authentication failed:DETAIL: password ...
- KingbaseES V8R3集群维护案例之---pcp_node_refresh应用
案例说明: 在一次KingbaseES V8R3集群切换分析中,运维人员执行了pcp_node_refresh,导致集群发生了failover的切换.此文档对pcp_node_refresh工具做了应 ...
- KingbaseES V8R3集群维护案例之---在线添加备库管理节点
案例说明: 在KingbaseES V8R3主备流复制的集群中 ,一般有两个节点是集群的管理节点,分为master和standby:如对于一主二备的架构,其中有两个节点是管理节点,三个数据节点:管理节 ...
- KingbaseES V8R3集群管理维护案例之---集群迁移单实例架构
案例说明: 在生产中,需要将KingbaseES V8R3集群转换为单实例架构,可以采用以下方式快速完成集群架构的迁移. 适用版本: KingbaseES V8R3 当前数据库版本: TEST=# s ...
- SQL Server自动化运维系列——关于邮件通知那点事(.Net开发人员的福利)
需求描述 在我们的生产环境中,大部分情况下需要有自己的运维体制,包括自己健康状态的检测等.如果发生异常,需要提前预警的,通知形式一般为发邮件告知. 邮件作为一种非常便利的预警实现方式,在及时性和易用性 ...
- SQL Server自动化运维系列——监控跑批Job运行状态(Power Shell)
需求描述 在我们的生产环境中,大部分情况下需要有自己的运维体制,包括自己健康状态的检测等.如果发生异常,需要提前预警的,通知形式一般为发邮件告知. 在上一篇文章中已经分析了SQL SERVER中关于邮 ...
- saltstack自动化运维系列⑤之saltstack的配置管理详解
saltstack自动化运维系列⑤之saltstack的配置管理详解 配置管理初始化: a.服务端配置vim /etc/salt/master file_roots: base: - /srv/sal ...
随机推荐
- 【Unity3D】动画回调函数、动画事件、动画曲线
1 动画回调函数 动画回调函数是指动画在开始时.执行中.结束时回调的函数,主要有:OnStateEnter.OnStateUpdate.OnStateExit.OnStateMove.OnStat ...
- 【Unity3D】血条(HP)
1 需求实现 人机交互Input 中实现了通过键盘控制坦克运动,通过鼠标控制坦克发射炮弹,本文将在此基础上,增加血条(HP)功能.炮弹命中后,HP 值会减少,因此需要应用到 刚体组件Rigidbo ...
- patch命令
patch命令 patch指令让用户利用设置修补文件的方式.修改.更新原始文件,倘若一次仅修改一个文件,可直接在指令列中下达指令依序执行,如果配合修补文件的方式则能一次修补大批文件,这也是Linux系 ...
- ELK查询和汇总
查询表明细: ELK的KQL样例,显示时间请选择最近15天: 样例1:查询ol_lc 表增删改查,不是jy2_rw的账号明细 KQL:(ol_lc or oc.ol_lc) and (select o ...
- Redis能保证数据不丢失吗?
大家即使没用过Redis,也应该都听说过Redis的威名. Redis是一种Nosql类型的数据存储,全称Remote Dictionary Server,也就是远程字典服务器,用过Dictionar ...
- 2019年在浏览器用原生js写WebGL,绘制图形;
因为JavaScript高级程序设计(第三版)中的运行书上15.3WebGL部分的代码时在chrome和firefox浏览器下报错,在后面我网上初步找了一圈,好像没人做出真正可以用的代码;所以我就自己 ...
- 硬件开发笔记(九): 硬件开发基本流程,制作一个USB转RS232的模块(八):创建asm1117-3.3V封装库并关联原理图元器件
前言 有了原理图,可以设计硬件PCB,在设计PCB之间还有一个协同优先动作,就是映射封装,原理图库的元器件我们是自己设计的.为了更好的表述封装设计过程,本文描述了一个创建asm1117-3.3V封 ...
- window上使用Putty通过ssh远程连接并通过Xming实现X11图形界面功能
# 0.先确认远程服务器的ssh配置 >>> grep X11 /etc/ssh/sshd_config X11Forwarding yes #X11DisplayOffset 10 ...
- 【LeetCode二叉树#15】二叉搜索树的最小绝对差(巩固迭代中序遍历#2)
二叉搜索树的最小绝对差(迭代法中序遍历巩固) 力扣题目链接(opens new window) 给你一棵所有节点为非负值的二叉搜索树,请你计算树中任意两节点的差的绝对值的最小值. 示例: 提示:树中至 ...
- 【Azure 应用服务】在App Service for Windows中实现反向代理
问题描述 如何在App Service for Windows(.NET Stack)中,如何实现反向代理呢? 正向代理:客户端想要访问一个服务器,但是它可能无法直接访问这台服务器,这时候这可找一台可 ...