Case 3: Testing 'recovery = manual'

1. Check the status of the cluster nodes:

[kingbase@node1 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+---------------------------
1 | node243 | primary | * running | | default | 100 | 3 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node248 | standby | running | node243 | default | 100 | 3 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

2. Check the recovery configuration
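
The original post does not show the output of this step. As a rough sketch only, assuming repmgr.conf uses plain `key = value` lines with optional quotes, the setting could be read as follows. The sample text is hypothetical and was not taken from the test cluster, and `read_param` is an illustrative helper, not part of repmgr.

```python
import re

def read_param(conf_text, name):
    """Return the last value assigned to `name` in key=value style text, or None."""
    value = None
    for line in conf_text.splitlines():
        # Simplification: treat everything after '#' as a comment.
        line = line.split("#", 1)[0].strip()
        m = re.match(r"^(\w+)\s*=\s*(.*)$", line)
        if m and m.group(1) == name:
            value = m.group(2).strip().strip("'\"")
    return value

# Hypothetical excerpt, for illustration only.
sample = """
# repmgr.conf (excerpt)
node_id=2
recovery = 'manual'
"""

print(read_param(sample, "recovery"))
```

For this case the value is expected to be manual. The comment handling is deliberately naive and would mis-parse a value that itself contains '#'.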

3. Reboot the primary node's host

[root@node3 ~]# reboot

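4. Check the hamgr log on the standby

The log below shows that, after the primary host went down, the cluster performed a primary/standby failover and promoted the standby to primary.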

[2022-03-02 10:32:38] [NOTICE] starting monitoring of node "node248" (ID: 2)
[2022-03-02 10:32:38] [INFO] "connection_check_type" set to "ping"
[2022-03-02 10:32:38] [INFO] monitoring connection to upstream node "node243" (ID: 1)
[2022-03-02 10:32:38] [NOTICE] try to change wal catched_up state to 1
[2022-03-02 10:32:38] [INFO] primary flush lsn is 0/1F000D40, local flush lsn is 0/1F000D40
[2022-03-02 10:32:38] [NOTICE] try to change streaming_sync state to TRUE
[2022-03-02 10:34:24] [WARNING] unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2022-03-02 10:34:24] [DETAIL] PQping() returned "PQPING_REJECT"
[2022-03-02 10:34:24] [WARNING] unable to connect to upstream node "node243" (ID: 1)
[2022-03-02 10:34:24] [INFO] sleeping 6 seconds until next reconnection attempt
[2022-03-02 10:34:30] [INFO] checking state of node 1, 1 of 10 attempts
[2022-03-02 10:34:40] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-02 10:34:40] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-02 10:34:40] [INFO] sleeping 6 seconds until next reconnection attempt
......
[2022-03-02 10:35:47] [INFO] checking state of node 1, 10 of 10 attempts
[2022-03-02 10:35:47] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-02 10:35:47] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-02 10:35:47] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-02 10:35:47] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2022-03-02 10:35:47] [WARNING] wal receiver not running
[2022-03-02 10:35:47] [NOTICE] WAL receiver disconnected on all sibling nodes
[2022-03-02 10:35:47] [INFO] WAL receiver disconnected on all 0 sibling nodes
[2022-03-02 10:35:47] [INFO] 0 active sibling nodes registered
[2022-03-02 10:35:47] [INFO] primary and this node have the same location ("default")
[2022-03-02 10:35:47] [INFO] no other sibling nodes - we win by default
[2022-03-02 10:35:47] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2022-03-02 10:35:48] [NOTICE] this node is the only available candidate and will now promote itself
[2022-03-02 10:35:48] [INFO] try to ping the trusted_servers "192.168.7.1" before execute promote_command
[2022-03-02 10:35:50] [NOTICE] PING 192.168.7.1 (192.168.7.1) 56(84) bytes of data.
--- 192.168.7.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1008ms
rtt min/avg/max/mdev = 2.473/2.535/2.598/0.080 ms
[2022-03-02 10:35:50] [NOTICE] successfully ping one or more of the trusted_servers "192.168.7.1"
[2022-03-02 10:35:51] [NOTICE] PING 192.168.7.241 (192.168.7.241) 56(84) bytes of data.
--- 192.168.7.241 ping statistics ---
2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1000ms
[2022-03-02 10:35:51] [WARNING] ping host"192.168.7.241" failed
[2022-03-02 10:35:51] [DETAIL] average RTT value is not greater than zero
[2022-03-02 10:35:51] [INFO] loadvip result: 1, arping result: 1
[2022-03-02 10:35:51] [NOTICE] new primary node (ID: 2) acquire the virtual ip 192.168.7.241/24 success
[2022-03-02 10:35:51] [INFO] promote_command is:
"/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/repmgr.conf"
NOTICE: promoting standby to primary
DETAIL: promoting server "node248" (ID: 2) using sys_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2022-03-02 10:35:51] [NOTICE] try to stop old primary db (host: "192.168.7.243")
INFO: SET synchronous TO "async" on primary host
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node248" (ID: 2) was successfully promoted to primary
[2022-03-02 10:35:56] [INFO] 0 followers to notify
[2022-03-02 10:35:56] [INFO] switching to primary monitoring mode
[2022-03-02 10:35:56] [NOTICE] monitoring cluster primary "node248" (ID: 2)
[2022-03-02 10:35:56] [INFO] create a thread 0x7fdeaa4b9700 to check the cluster status
[2022-03-02 10:35:57] [INFO] node (ID: 1): no server running
[2022-03-02 10:35:57] [INFO] [thread 0x7fdeaa4b9700] the cluster has no other running primary node, exit
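
To see how long the failover took, the timestamps in the log can be compared. The sketch below is a hypothetical helper, not a repmgr tool. It assumes every relevant line starts with a `[YYYY-MM-DD HH:MM:SS] [LEVEL]` prefix, as in the excerpt above, and uses three lines copied from that log.

```python
import re
from datetime import datetime

STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*)$")

def parse(line):
    """Split a log line into (timestamp, level, message); None if it has no prefix."""
    m = STAMP.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return ts, m.group(2), m.group(3)

def seconds_between(lines, start_text, end_text):
    """Seconds from the first message containing start_text to the next one containing end_text."""
    start = None
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        ts, _level, msg = rec
        if start is None:
            if start_text in msg:
                start = ts
        elif end_text in msg:
            return (ts - start).total_seconds()
    return None

# Lines copied from the standby's log shown above.
log = [
    '[2022-03-02 10:34:24] [WARNING] unable to connect to upstream node "node243" (ID: 1)',
    '[2022-03-02 10:35:47] [WARNING] unable to reconnect to node 1 after 10 attempts',
    '[2022-03-02 10:35:56] [INFO] switching to primary monitoring mode',
]

detect = seconds_between(log, "unable to connect to upstream", "unable to reconnect")
total = seconds_between(log, "unable to connect to upstream", "switching to primary monitoring mode")
print(detect, total)
```

In this run, most of the elapsed time is spent in the 10 reconnection attempts before the standby decides to promote itself; the promotion itself completes within a few seconds.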

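5. After the old primary's host has booted normally

1) Check the cluster status from the new primary

The output below shows that the cluster is now in a 'dual-primary' state: both nodes are registered as primary, but the old primary is marked 'failed' and cannot be connected to.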

[kingbase@node1 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+----------------
1 | node243 | primary | - failed | | default | 100 | ? | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node248 | primary | * running | | default | 100 | 10 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

WARNING: following issues were detected
- unable to connect to node "node243" (ID: 1)
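
The 'dual-primary' condition can also be spotted programmatically. The following is a minimal sketch, assuming the pipe-separated layout of `repmgr cluster show` seen above; `parse_cluster_show` and `stale_primaries` are hypothetical helper names, and the sample rows are copied from that output.

```python
def parse_cluster_show(text):
    """Parse the data rows of `repmgr cluster show` into a list of dicts."""
    nodes = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Data rows have 9 columns and start with a numeric node ID.
        if len(parts) < 9 or not parts[0].isdigit():
            continue
        nodes.append({
            "id": int(parts[0]),
            "name": parts[1],
            "role": parts[2],
            # Status looks like "* running" or "- failed"; keep the last word.
            "status": parts[3].split()[-1] if parts[3] else "",
            "upstream": parts[4],
        })
    return nodes

def stale_primaries(nodes):
    """Names of nodes still registered as primary but not running."""
    return [n["name"] for n in nodes if n["role"] == "primary" and n["status"] != "running"]

SAMPLE = """\
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+----------------
1 | node243 | primary | - failed | | default | 100 | ? | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node248 | primary | * running | | default | 100 | 10 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
"""

nodes = parse_cluster_show(SAMPLE)
print(stale_primaries(nodes))
```

A non-empty result means that an old primary is still registered and has to be rejoined as a standby, which is what the following steps do.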

2) Create replication slots on the new primary (the old standby)

# Create replication slots

test=# select sys_create_physical_replication_slot('repmgr_slot_1');
sys_create_physical_replication_slot
--------------------------------------
(repmgr_slot_1,)
(1 row)

test=# select sys_create_physical_replication_slot('repmgr_slot_2');
sys_create_physical_replication_slot
--------------------------------------
(repmgr_slot_2,)
(1 row)

test=# select * from sys_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
repmgr_slot_1 | | physical | | | f | f | | | | |
repmgr_slot_2 | | physical | | | f | f | | | | |
(2 rows)

3) Perform the following recovery steps on the old primary (the new standby)

# Back up the data directory
[kingbase@node3 kingbase]$ cp data data.bk -r
# Create the standby signal file
[kingbase@node3 kingbase]$ cd data
[kingbase@node3 data]$ touch standby.signal

4) Run repmgr node rejoin on the old primary to rejoin it to the cluster

[kingbase@node3 bin]$ ./repmgr node rejoin -h 192.168.7.248 -U esrep -d esrep --force-rewind
NOTICE: sys_rewind execution required for this node to attach to rejoin target node 2
DETAIL: rejoin target server's timeline 10 forked off current database system timeline 9 before current recovery point 0/200000A0
NOTICE: executing sys_rewind
DETAIL: sys_rewind command is "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' --source-server='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: servers diverged at WAL location 0/1F000D70 on timeline 9
sys_rewind: rewinding from last common checkpoint at 0/1E000A70 on timeline 9
sys_rewind: find last common checkpoint start time from 2022-03-02 10:52:34.133058 CST to 2022-03-02 10:52:34.358066 CST, in "0.225008" seconds.
sys_rewind: update the control file: minRecoveryPoint is '0/1F011AD0', minRecoveryPointTLI is '10', and database state is 'in archive recovery'
sys_rewind: rewind start wal location 0/1E000A40 (file 00000009000000000000001E), end wal location 0/1F011AD0 (file 0000000A000000000000001F). time from 2022-03-02 10:52:34.133058 CST to 2022-03-02 10:53:06.442270 CST, in "32.309212" seconds.
sys_rewind: Done!
NOTICE: 0 files copied to /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2022-03-02 10:53:06.588331
NOTICE: starting server using "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_ctl -w -t 90 -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' -l /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/logfile start"
NOTICE: start server finish at 2022-03-02 10:53:07.313294
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
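
The WAL locations reported by sys_rewind can be turned into byte offsets to see how far back the rewind had to start. This is a small sketch, assuming KingbaseES uses the PostgreSQL-style LSN text format (two hexadecimal numbers separated by '/', the high and low 32 bits); `lsn_to_int` is a hypothetical helper.

```python
def lsn_to_int(lsn):
    """Convert an LSN such as '0/1F000D70' into a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# Values copied from the sys_rewind output above.
diverged = lsn_to_int("0/1F000D70")     # where the two servers diverged
checkpoint = lsn_to_int("0/1E000A70")   # last common checkpoint

gap = diverged - checkpoint
print(gap)
```

Here the gap is a little over 16 MiB of WAL between the last common checkpoint and the point of divergence.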

5) Check that the database service has started on the new standby

[kingbase@node3 bin]$ ps -ef |grep kingbase
kingbase 3218 1 0 10:36 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
kingbase 5817 1 0 10:49 ? 00:00:01 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf
kingbase 6730 1 0 10:53 ? 00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kingbase -D /home/kingbase/cluster/R6C5/R6C5R/kingbase/data
kingbase 6731 6730 0 10:53 ? 00:00:00 kingbase: logger
kingbase 6732 6730 0 10:53 ? 00:00:00 kingbase: startup recovering 0000000A000000000000001F
kingbase 6736 6730 0 10:53 ? 00:00:00 kingbase: checkpointer
kingbase 6737 6730 0 10:53 ? 00:00:00 kingbase: background writer
kingbase 6738 6730 0 10:53 ? 00:00:00 kingbase: stats collector
kingbase 6739 6730 0 10:53 ? 00:00:00 kingbase: walreceiver streaming 0/1F012A78
kingbase 6743 6730 0 10:53 ? 00:00:00 kingbase: esrep esrep 192.168.7.243(55941) idle
kingbase 6750 6730 0 10:53 ? 00:00:00 kingbase: esrep esrep 192.168.7.243(55947) idle

6) Check the status of the cluster nodes

[kingbase@node3 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+----------------
1 | node243 | standby | running | node248 | default | 100 | 9 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node248 | primary | * running | | default | 100 | 10 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

7) Restart the cluster as a test (optional)

[kingbase@node3 bin]$ ./sys_monitor.sh restart
2022-03-02 10:55:26 Ready to stop all DB ...
....
server started
2022-03-02 10:55:52 execute to start DB on "[192.168.7.248]" success, connect to check it.
2022-03-02 10:55:53 DB on "[192.168.7.248]" start success.
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+-----------+----------+----------+----------+---------------
1 | node243 | standby | running | ! node248 | default | 100 | 10 | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node248 | primary | * running | | default | 100 | 10 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
WARNING: following issues were detected
- node "node243" (ID: 1) is not attached to its upstream node "node248" (ID: 2)
2022-03-02 10:55:53 The primary DB is started.
......
2022-03-02 10:56:15 repmgrd on "[192.168.7.248]" start success.
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
1 | node243 | standby | running | node248 | running | 9500 | no | 1 second(s) ago
2 | node248 | primary | * running | | running | 27881 | no | n/a
[2022-03-02 10:56:18] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6C5/R6C5R/kingbase/log/kbha.log"
[2022-03-02 10:56:20] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6C5/R6C5R/kingbase/log/kbha.log"
2022-03-02 10:56:22 Done.

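The output above shows that, by running repmgr node rejoin manually, the old primary rejoined the cluster as the new standby.

Summary:

1. With recovery=standby, after the primary node's host goes down the cluster fails over to the standby. The old primary must be manually configured as a standby and its database service started; the cluster then adds it back automatically.
2. With recovery=automatic, after the primary node's host goes down the cluster fails over to the standby. No manual intervention is needed; the old primary automatically rejoins the cluster as the new standby.
3. With recovery=manual, after the primary node's host goes down the cluster fails over to the standby. Manual intervention is required: run 'repmgr node rejoin' on the old primary to add it back to the cluster as the new standby.
4. For production environments without day-to-day DBA monitoring, consider setting recovery to automatic to improve the reliability of the cluster.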
