KingbaseES R6 cluster: test cases for the repmgr.conf 'recovery' parameter (Part 3)

Case 3: testing 'recovery = manual'

1. Check the cluster node status:

[kingbase@node1 bin]$ ./repmgr cluster show

 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string

----+---------+---------+-----------+----------+----------+----------+----------+---------------------------

 1  | node243 | primary | * running |          | default  | 100      | 3        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

 2  | node248 | standby |   running | node243  | default  | 100      | 3        | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

2. Check the recovery setting
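
One way to read the current setting is sketched below. It assumes repmgr.conf uses plain `key = value` lines (optionally single-quoted, optionally followed by a `#` comment) and that the file lives at the path printed in the promote_command log line of this post. `conf_value` is a hypothetical helper written for illustration; it is not part of repmgr.

```shell
# conf_value FILE KEY
# Print the value of the last uncommented "KEY = value" line in FILE.
# Hypothetical helper; assumes simple key=value syntax, one setting per line.
conf_value() {
    cv_file=$1
    cv_key=$2
    sed -n -E "s/^[[:space:]]*${cv_key}[[:space:]]*=[[:space:]]*'?([^'#]*[^'#[:space:]])'?.*$/\1/p" "$cv_file" | tail -n 1
}

# Usage (path taken from the promote_command log line in this post):
# conf_value /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/repmgr.conf recovery
```

For this test case the helper would be expected to print `manual`; commented-out lines and keys that merely start with `recovery` are ignored.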

3. Reboot the primary node's host

[root@node3 ~]# reboot

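4. Check the hamgr log on the standby

The log below shows that after the primary host went down, the cluster performed a failover and promoted the standby to primary.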

[2022-03-02 10:32:38] [NOTICE] starting monitoring of node "node248" (ID: 2)

[2022-03-02 10:32:38] [INFO] "connection_check_type" set to "ping"

[2022-03-02 10:32:38] [INFO] monitoring connection to upstream node "node243" (ID: 1)

[2022-03-02 10:32:38] [NOTICE] try to change wal catched_up state to 1

[2022-03-02 10:32:38] [INFO] primary flush lsn is 0/1F000D40, local flush lsn is 0/1F000D40

[2022-03-02 10:32:38] [NOTICE] try to change streaming_sync state to TRUE

[2022-03-02 10:34:24] [WARNING] unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"

[2022-03-02 10:34:24] [DETAIL] PQping() returned "PQPING_REJECT"

[2022-03-02 10:34:24] [WARNING] unable to connect to upstream node "node243" (ID: 1)

[2022-03-02 10:34:24] [INFO] sleeping 6 seconds until next reconnection attempt

[2022-03-02 10:34:30] [INFO] checking state of node 1, 1 of 10 attempts

[2022-03-02 10:34:40] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"

[2022-03-02 10:34:40] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"

[2022-03-02 10:34:40] [INFO] sleeping 6 seconds until next reconnection attempt

......

[2022-03-02 10:35:47] [INFO] checking state of node 1, 10 of 10 attempts

[2022-03-02 10:35:47] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.168.7.243 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"

[2022-03-02 10:35:47] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"

[2022-03-02 10:35:47] [WARNING] unable to reconnect to node 1 after 10 attempts

[2022-03-02 10:35:47] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds

[2022-03-02 10:35:47] [WARNING] wal receiver not running

[2022-03-02 10:35:47] [NOTICE] WAL receiver disconnected on all sibling nodes

[2022-03-02 10:35:47] [INFO] WAL receiver disconnected on all 0 sibling nodes

[2022-03-02 10:35:47] [INFO] 0 active sibling nodes registered

[2022-03-02 10:35:47] [INFO] primary and this node have the same location ("default")

[2022-03-02 10:35:47] [INFO] no other sibling nodes - we win by default

[2022-03-02 10:35:47] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms

[2022-03-02 10:35:48] [NOTICE] this node is the only available candidate and will now promote itself

[2022-03-02 10:35:48] [INFO] try to ping the trusted_servers "192.168.7.1" before execute promote_command

[2022-03-02 10:35:50] [NOTICE] PING 192.168.7.1 (192.168.7.1) 56(84) bytes of data.

--- 192.168.7.1 ping statistics ---

2 packets transmitted, 2 received, 0% packet loss, time 1008ms

rtt min/avg/max/mdev = 2.473/2.535/2.598/0.080 ms

[2022-03-02 10:35:50] [NOTICE] successfully ping one or more of the trusted_servers "192.168.7.1"

[2022-03-02 10:35:51] [NOTICE] PING 192.168.7.241 (192.168.7.241) 56(84) bytes of data.

--- 192.168.7.241 ping statistics ---

2 packets transmitted, 0 received, +1 errors, 100% packet loss, time 1000ms

[2022-03-02 10:35:51] [WARNING] ping host"192.168.7.241" failed

[2022-03-02 10:35:51] [DETAIL] average RTT value is not greater than zero

[2022-03-02 10:35:51] [INFO] loadvip result: 1, arping result: 1

[2022-03-02 10:35:51] [NOTICE] new primary node (ID: 2) acquire the virtual ip 192.168.7.241/24 success

[2022-03-02 10:35:51] [INFO] promote_command is:

  "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgr  standby promote -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/etc/repmgr.conf"

NOTICE: promoting standby to primary

DETAIL: promoting server "node248" (ID: 2) using sys_promote()

NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete

[2022-03-02 10:35:51] [NOTICE] try to stop old primary db (host: "192.168.7.243")

INFO: SET synchronous TO "async" on primary host

NOTICE: STANDBY PROMOTE successful

DETAIL: server "node248" (ID: 2) was successfully promoted to primary

[2022-03-02 10:35:56] [INFO] 0 followers to notify

[2022-03-02 10:35:56] [INFO] switching to primary monitoring mode

[2022-03-02 10:35:56] [NOTICE] monitoring cluster primary "node248" (ID: 2)

[2022-03-02 10:35:56] [INFO] create a thread 0x7fdeaa4b9700 to check the cluster status

[2022-03-02 10:35:57] [INFO] node (ID: 1): no server running

[2022-03-02 10:35:57] [INFO] [thread 0x7fdeaa4b9700] the cluster has no other running primary node, exit

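5. The old primary host boots normally

1) Check the cluster status from the new primary

The output below shows that the cluster is now in a 'dual-primary' state: both nodes are registered as primary, although the old primary is marked 'failed' and cannot be reached.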

[kingbase@node1 bin]$ ./repmgr cluster show

 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string

----+---------+---------+-----------+----------+----------+----------+----------+----------------

 1  | node243 | primary | - failed  |          | default  | 100      | ?        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

 2  | node248 | primary | * running |          | default  | 100      | 10       | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

WARNING: following issues were detected

  - unable to connect to node "node243" (ID: 1)

You have new mail in /var/spool/mail/kingbase
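
The dual-primary condition shown above can also be detected from a script. A minimal sketch, assuming the `repmgr cluster show` table layout printed in this post (Role in the third `|`-separated column); `count_primaries` is a hypothetical helper, not a repmgr command:

```shell
# count_primaries
# Read `repmgr cluster show` output on stdin and print how many node rows
# have "primary" in the Role column (third |-separated field).
# The header row and the "----+----" separator row are not counted.
count_primaries() {
    awk -F'|' '
        NF >= 9 {
            role = $3
            gsub(/[[:space:]]/, "", role)
            if (role == "primary") n++
        }
        END { print n + 0 }
    '
}

# Usage:
# ./repmgr cluster show | count_primaries
```

A result greater than 1 means more than one node is registered as primary, which is the situation that `repmgr node rejoin` resolves later in this walkthrough.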

2) Create replication slots on the new primary (the old standby)

# Create the replication slots

test=# select sys_create_physical_replication_slot('repmgr_slot_1');

 sys_create_physical_replication_slot

--------------------------------------

 (repmgr_slot_1,)

(1 row)

test=# select sys_create_physical_replication_slot('repmgr_slot_2');

 sys_create_physical_replication_slot

--------------------------------------

 (repmgr_slot_2,)

(1 row)

test=# select * from sys_replication_slots;

   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn

---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------

 repmgr_slot_1 |        | physical  |        |          | f         | f      |            |      |              |             |

 repmgr_slot_2 |        | physical  |        |          | f         | f      |            |      |              |             |

(2 rows)

3) Run the following recovery steps on the old primary (which will become the new standby)

# Back up the data directory

[kingbase@node3 kingbase]$ cp data data.bk -r

# Create the standby signal file

[kingbase@node3 kingbase]$ cd data

[kingbase@node3 data]$ touch standby.signal

4) Run repmgr node rejoin on the old primary to rejoin it to the cluster

 [kingbase@node3 bin]$ ./repmgr node rejoin -h 192.168.7.248 -U esrep -d esrep --force-rewind

NOTICE: sys_rewind execution required for this node to attach to rejoin target node 2

DETAIL: rejoin target server's timeline 10 forked off current database system timeline 9 before current recovery point 0/200000A0

NOTICE: executing sys_rewind

DETAIL: sys_rewind command is "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' --source-server='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"

sys_rewind: servers diverged at WAL location 0/1F000D70 on timeline 9

sys_rewind: rewinding from last common checkpoint at 0/1E000A70 on timeline 9

sys_rewind: find last common checkpoint start time from 2022-03-02 10:52:34.133058 CST to 2022-03-02 10:52:34.358066 CST, in "0.225008" seconds.

sys_rewind: update the control file: minRecoveryPoint is '0/1F011AD0', minRecoveryPointTLI is '10', and database state is 'in archive recovery'

sys_rewind: rewind start wal location 0/1E000A40 (file 00000009000000000000001E), end wal location 0/1F011AD0 (file 0000000A000000000000001F). time from 2022-03-02 10:52:34.133058 CST to 2022-03-02 10:53:06.442270 CST, in "32.309212" seconds.

sys_rewind: Done!

NOTICE: 0 files copied to /home/kingbase/cluster/R6C5/R6C5R/kingbase/data

NOTICE: setting node 1's upstream to node 2

WARNING: unable to ping "host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"

DETAIL: PQping() returned "PQPING_NO_RESPONSE"

NOTICE: begin to start server at 2022-03-02 10:53:06.588331

NOTICE: starting server using "/home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/sys_ctl  -w -t 90 -D '/home/kingbase/cluster/R6C5/R6C5R/kingbase/data' -l /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/logfile start"

NOTICE: start server finish at 2022-03-02 10:53:07.313294

NOTICE: NODE REJOIN successful

DETAIL: node 1 is now attached to node 2
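
The manual steps in 3) and 4) can be collected into a reviewable checklist. The sketch below only prints the commands and does not execute them. The user and database name `esrep` and the `--force-rewind` flag are copied from the commands in this post; the function name and its two arguments are hypothetical.

```shell
# print_rejoin_steps KB_HOME NEW_PRIMARY_HOST
# Print (do not run) the commands this walkthrough executes on the old primary:
# back up the data directory, create standby.signal, then rejoin the cluster.
# Hypothetical helper; review the printed commands before running any of them.
print_rejoin_steps() {
    kb_home=$1
    new_primary=$2
    printf '%s\n' \
        "cp -r '$kb_home/data' '$kb_home/data.bk'" \
        "touch '$kb_home/data/standby.signal'" \
        "'$kb_home/bin/repmgr' node rejoin -h $new_primary -U esrep -d esrep --force-rewind"
}

# Usage (values taken from this post):
# print_rejoin_steps /home/kingbase/cluster/R6C5/R6C5R/kingbase 192.168.7.248
```

Printing rather than executing keeps a human in the loop, which matches the intent of recovery = manual.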

5) Verify that the database service is running on the new standby (node rejoin has already started it)

[kingbase@node3 bin]$ ps -ef |grep kingbase

kingbase  3218     1  0 10:36 ?        00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf

kingbase  5817     1  0 10:49 ?        00:00:01 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/../etc/repmgr.conf

kingbase  6730     1  0 10:53 ?        00:00:00 /home/kingbase/cluster/R6C5/R6C5R/kingbase/bin/kingbase -D /home/kingbase/cluster/R6C5/R6C5R/kingbase/data

kingbase  6731  6730  0 10:53 ?        00:00:00 kingbase: logger

kingbase  6732  6730  0 10:53 ?        00:00:00 kingbase: startup   recovering 0000000A000000000000001F

kingbase  6736  6730  0 10:53 ?        00:00:00 kingbase: checkpointer

kingbase  6737  6730  0 10:53 ?        00:00:00 kingbase: background writer

kingbase  6738  6730  0 10:53 ?        00:00:00 kingbase: stats collector

kingbase  6739  6730  0 10:53 ?        00:00:00 kingbase: walreceiver   streaming 0/1F012A78

kingbase  6743  6730  0 10:53 ?        00:00:00 kingbase: esrep esrep 192.168.7.243(55941) idle

kingbase  6750  6730  0 10:53 ?        00:00:00 kingbase: esrep esrep 192.168.7.243(55947) idle

6) Check the cluster node status

[kingbase@node3 bin]$ ./repmgr cluster show

 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string

----+---------+---------+-----------+----------+----------+----------+----------+----------------

 1  | node243 | standby |   running | node248  | default  | 100      | 9        | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

 2  | node248 | primary | * running |          | default  | 100      | 10       | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

7) Restart the cluster as a test (optional)

[kingbase@node3 bin]$ ./sys_monitor.sh restart

2022-03-02 10:55:26 Ready to stop all DB ...

....

server started

2022-03-02 10:55:52 execute to start DB on "[192.168.7.248]" success, connect to check it.

2022-03-02 10:55:53 DB on "[192.168.7.248]" start success.

 ID | Name    | Role    | Status    | Upstream  | Location | Priority | Timeline | Connection string

----+---------+---------+-----------+-----------+----------+----------+----------+---------------

 1  | node243 | standby |   running | ! node248 | default  | 100      | 10       | host=192.168.7.243 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

 2  | node248 | primary | * running |           | default  | 100      | 10       | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

WARNING: following issues were detected

  - node "node243" (ID: 1) is not attached to its upstream node "node248" (ID: 2)

2022-03-02 10:55:53 The primary DB is started.

......

2022-03-02 10:56:15 repmgrd on "[192.168.7.248]" start success.

 ID | Name    | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen

----+---------+---------+-----------+----------+---------+-------+---------+--------------------

 1  | node243 | standby |   running | node248  | running | 9500  | no      | 1 second(s) ago

 2  | node248 | primary | * running |          | running | 27881 | no      | n/a

[2022-03-02 10:56:18] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6C5/R6C5R/kingbase/log/kbha.log"

[2022-03-02 10:56:20] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6C5/R6C5R/kingbase/log/kbha.log"

2022-03-02 10:56:22 Done.

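The output above shows that after repmgr node rejoin was run manually, the old primary rejoined the cluster as the new standby.

Summary:

   1. With recovery=standby, after the primary host goes down the cluster fails over. The old primary must be manually configured as a standby and its database service started; the cluster then adds it back automatically.

   2. With recovery=automatic, after the primary host goes down the cluster fails over. No manual intervention is needed; the old primary automatically rejoins the cluster as the new standby.

   3. With recovery=manual, after the primary host goes down the cluster fails over. Manual intervention is required: run 'repmgr node rejoin' on the old primary to add it back to the cluster as the new standby.

   4. For production environments without day-to-day DBA monitoring, consider setting recovery to automatic to improve the reliability of the cluster.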
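
As a quick reference, the three behaviors summarized above can be written as a lookup. This is only a restatement of the summary in code form; `recovery_action` is a hypothetical helper and the returned strings are descriptions, not commands.

```shell
# recovery_action VALUE
# Print the operator action needed on the old primary after a failover,
# for each value of the repmgr.conf 'recovery' parameter tested in this series.
recovery_action() {
    case "$1" in
        standby)
            echo "configure the old primary as a standby and start it; the cluster then re-adds it" ;;
        automatic)
            echo "none; the old primary rejoins automatically as the new standby" ;;
        manual)
            echo "run 'repmgr node rejoin' on the old primary" ;;
        *)
            echo "unknown recovery value: $1" >&2
            return 1 ;;
    esac
}

# Usage:
# recovery_action manual
```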
