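In practice, a cluster may suffer a split-brain, in which case two nodes both run as primary (a "dual primary" condition). When this happens, an operator has to step in and decide manually which node holds the most recent data, so that data loss is kept to a minimum.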

I. Test environment

Operating system:
[kingbase@node1 bin]$ cat /etc/centos-release
CentOS Linux release 7.2.1511 (Core)

Database:
[kingbase@node1 bin]$ ./ksql -U system test
ksql (V8.0)
Type "help" for help.

test=# select version();
version
----------------------------------------------------------------------------------------
KingbaseES V008R006C003B0010 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
(1 row)

  

II. "Dual primary" failure after cluster startup

1. Symptoms

[kingbase@node1 bin]$ ./sys_monitor.sh restart
2021-03-01 13:30:03 Ready to stop all DB ...
Service process "node_export" was killed at process 8253
Service process "postgres_ex" was killed at process 8254
Service process "node_export" was killed at process 8131
Service process "postgres_ex" was killed at process 8132
2021-03-01 13:30:09 begin to stop repmgrd on "[192.168.7.248]".
2021-03-01 13:30:10 repmgrd on "[192.168.7.248]" stop success.
2021-03-01 13:30:10 begin to stop repmgrd on "[192.168.7.249]".
2021-03-01 13:30:11 repmgrd on "[192.168.7.249]" stop success.
2021-03-01 13:30:11 begin to stop DB on "[192.168.7.249]".
waiting for server to shut down..... done
server stopped
2021-03-01 13:30:13 DB on "[192.168.7.249]" stop success.
2021-03-01 13:30:13 begin to stop DB on "[192.168.7.248]".
waiting for server to shut down.... done
server stopped
2021-03-01 13:30:14 DB on "[192.168.7.248]" stop success.
2021-03-01 13:30:14 Done.
2021-03-01 13:30:14 Ready to start all DB ...
2021-03-01 13:30:14 begin to start DB on "[192.168.7.248]".
waiting for server to start.... done
server started
2021-03-01 13:30:16 execute to start DB on "[192.168.7.248]" success, connect to check it.
2021-03-01 13:30:17 DB on "[192.168.7.248]" start success.
2021-03-01 13:30:17 Try to ping trusted_servers on host 192.168.7.248 ...
2021-03-01 13:30:19 Try to ping trusted_servers on host 192.168.7.249 ...
2021-03-01 13:30:22 begin to start DB on "[192.168.7.249]".
waiting for server to start.... done
server started
2021-03-01 13:30:23 execute to start DB on "[192.168.7.249]" success, connect to check it.
2021-03-01 13:30:24 DB on "[192.168.7.249]" start success.
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+-------
1 | node248 | primary | * running | | default | 100 | 5 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node249 | primary | ! running | | default | 100 | 4 | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
WARNING: following issues were detected
- node "node249" (ID: 2) is running but the repmgr node record is inactive
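2021-03-01 13:30:24 There are more than one primary DBs([2] DBs are running), will do nothing and exit
As shown above, the cluster came up with two primaries. sys_monitor.sh does not resolve a "dual primary" failure on its own (it reports more than one primary and exits without doing anything), so an operator has to determine which node is the latest primary and then rebuild the cluster around it.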
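
The dual-primary condition above can also be detected from a script. The following is a minimal sketch, assuming the `repmgr cluster show` column layout shown in this article (Role in the third column, Status in the fourth); `count_primaries` is a hypothetical helper name, not part of KingbaseES or repmgr.

```shell
#!/bin/sh
# count_primaries: read `repmgr cluster show` output on stdin and print how
# many nodes are actually running as primary. A node is counted when its
# Status column says "running as primary", or when its Role column is
# "primary" and its Status column contains "running".
# (Hypothetical helper; column positions are assumed from the output above.)
count_primaries() {
    awk -F'|' '
        NF >= 4 {
            role = $3; status = $4
            gsub(/^[ \t]+|[ \t]+$/, "", role)
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (status ~ /running as primary/ ||
                (role == "primary" && status ~ /running/))
                n++
        }
        END { print n + 0 }
    '
}

# Example usage (run from the KingbaseES bin directory):
#   ./repmgr cluster show | count_primaries
```

A result greater than 1 indicates that more than one node is accepting writes as primary and that manual intervention is needed.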

2. Check the cluster status as seen from the original primary

node2 (original primary):


[kingbase@node2 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+----------------------+----------+----------+----------+--------
1 | node248 | standby | ! running as primary | | default | 100 | 5 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node249 | primary | * running | | default | 100 | 4 | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

WARNING: following issues were detected
- node "node248" (ID: 1) is running as primary but the repmgr node record is inactive

III. Compare the control files to see how the nodes' data differ

node1:

[kingbase@node1 bin]$ ./sys_controldata -D ../data
sys_control version number: 1201
Catalog version number: 201909212
Database system identifier: 6950158917747347623
Database cluster state: in production
sys_control last modified: Mon 01 Mar 2021 01:35:16 PM CST
Latest checkpoint location: 1/F2008980
Latest checkpoint's REDO location: 1/F2008948
Latest checkpoint's REDO WAL file: 0000000500000001000000F2
Latest checkpoint's TimeLineID: 5
Latest checkpoint's PrevTimeLineID: 5
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:8813
Latest checkpoint's NextOID: 32951
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 839
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 8813
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Mon 01 Mar 2021 01:35:16 PM CST

  

node2:

[kingbase@node2 bin]$ ./sys_controldata -D ../data
sys_control version number: 1201
Catalog version number: 201909212
Database system identifier: 6950158917747347623
Database cluster state: in production
sys_control last modified: Mon 01 Mar 2021 01:34:45 PM CST
Latest checkpoint location: 1/F2002AC0
Latest checkpoint's REDO location: 1/F2002A88
Latest checkpoint's REDO WAL file: 0000000400000001000000F2
Latest checkpoint's TimeLineID: 4
Latest checkpoint's PrevTimeLineID: 4
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:8810
Latest checkpoint's NextOID: 32951
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 839
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 8810
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Mon 01 Mar 2021 01:34:45 PM CST

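Comparing the two control files shows that the new primary (node1) is on timeline 5, ahead of the original primary (node2) on timeline 4, and that the new primary's NextXID (8813) is higher than the original primary's (8810). node1 is therefore chosen as the cluster's primary node, and the original primary (node2) is demoted to standby.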

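Note: when choosing the primary, it is best, where possible, to start the databases and check the business data directly to confirm which host really has the most recent data.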
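
The control-file comparison above can be sketched as a small shell helper. This is only a sketch, assuming the `sys_controldata` output format shown in this article; `control_summary` and `newer_node` are hypothetical helper names. The rule it applies (higher timeline wins, ties broken by the XID part of NextXID, epoch ignored) simply mirrors the reasoning used here and does not replace checking the business data.

```shell
#!/bin/sh
# control_summary: read sys_controldata output on stdin and print
# "<TimeLineID> <NextXID without epoch>".
# (Hypothetical helper; field labels are assumed from the output above.)
control_summary() {
    awk '
        # "." stands in for the apostrophe in "checkpoint'"'"'s".
        /^Latest checkpoint.s TimeLineID:/ { tli = $NF }
        /^Latest checkpoint.s NextXID:/    { n = split($NF, a, ":"); xid = a[n] }
        END { print tli, xid }
    '
}

# newer_node NAME1 TLI1 XID1 NAME2 TLI2 XID2
# Print the name of the node that looks newer: higher timeline first,
# then higher NextXID; print "undetermined" when both values are equal.
newer_node() {
    if [ "$2" -ne "$5" ]; then
        if [ "$2" -gt "$5" ]; then echo "$1"; else echo "$4"; fi
    elif [ "$3" -ne "$6" ]; then
        if [ "$3" -gt "$6" ]; then echo "$1"; else echo "$4"; fi
    else
        echo "undetermined"
    fi
}

# Example usage, on each node:
#   ./sys_controldata -D ../data | control_summary
# then, with the values from this article:
#   newer_node node1 5 8813 node2 4 8810
```

When the helper prints "undetermined", the control files alone do not settle the question and the data itself has to be inspected.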

IV. Rejoin the original primary to the cluster

Rejoin node2 to the cluster:

[kingbase@node2 bin]$ ./sys_ctl stop -D ../data
waiting for server to shut down.... done
server stopped
[kingbase@node2 bin]$ ./repmgr node rejoin -h 192.168.7.248 -U esrep -d esrep
ERROR: this node cannot attach to rejoin target node 1
DETAIL: rejoin target server's timeline 5 forked off current database system timeline 4 before current recovery point 1/F2002B70
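HINT: use --force-rewind to execute sys_rewind

The first attempt fails because node2's timeline 4 diverged from the rejoin target (node1, timeline 5). As the hint suggests, rerun the command with --force-rewind so that sys_rewind resynchronizes node2's data directory before it attaches:
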
[kingbase@node2 bin]$ ./repmgr node rejoin -h 192.168.7.248 -U esrep -d esrep --force-rewind
NOTICE: sys_rewind execution required for this node to attach to rejoin target node 1
DETAIL: rejoin target server's timeline 5 forked off current database system timeline 4 before current recovery point 1/F2002B70
NOTICE: executing sys_rewind
DETAIL: sys_rewind command is "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/sys_rewind -D '/home/kingbase/cluster/R6HA/KHA/kingbase/data' --source-server='host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: servers diverged at WAL location 1/F20000D8 on timeline 4
sys_rewind: rewinding from last common checkpoint at 1/F2000060 on timeline 4
sys_rewind: find last common checkpoint start time from 2021-03-01 14:06:28.539405 CST to 2021-03-01 14:06:28.577794 CST, in "0.038389" seconds.
sys_rewind: update the control file: minRecoveryPoint is '1/F2031590', minRecoveryPointTLI is '5', and database state is 'in archive recovery'
sys_rewind: we will remove the dir '/home/kingbase/cluster/R6HA/KHA/kingbase/data/sys_replslot/repmgr_slot_1.rewind' and all the file/dir in it.
sys_rewind: we will remove the dir '/home/kingbase/cluster/R6HA/KHA/kingbase/data/base/syssql_tmp.rewind' and all the file/dir in it.
sys_rewind: rewind start wal location 1/F2000060 (file 0000000400000001000000F2), end wal location 1/F2031590 (file 0000000500000001000000F2). time from 2021-03-01 14:06:28.539405 CST to 2021-03-01 14:06:44.221603 CST, in "15.682198" seconds.
sys_rewind: Done!
NOTICE: 0 files copied to /home/kingbase/cluster/R6HA/KHA/kingbase/data
NOTICE: setting node 2's upstream to node 1
WARNING: unable to ping "host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2021-03-01 14:06:44.800564
NOTICE: starting server using "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/sys_ctl -w -t 90 -D '/home/kingbase/cluster/R6HA/KHA/kingbase/data' -l /home/kingbase/cluster/R6HA/KHA/kingbase/bin/logfile start"
NOTICE: start server finish at 2021-03-01 14:06:46.217825
NOTICE: NODE REJOIN successful
DETAIL: node 2 is now attached to node 1

Check the cluster node status:

[kingbase@node2 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+--------
1 | node248 | primary | * running | | default | 100 | 5 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node249 | standby | running | node248 | default | 100 | 4 | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

Check the primary/standby streaming replication status:

[kingbase@node1 bin]$ ./ksql -U system test
ksql (V8.0)
Type "help" for help.

test=# select * from sys_stat_replication;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
-------+----------+---------+------------------+---------------+-----------------+-------
22853 | 16384 | esrep | node249 | 192.168.7.249 | | 38638 | 2021-03-01 14:07:24.293687+08 | | streaming | 1/F20357A8 | 1/F20357A8 | 1/F20357A8 | 1/F20357A8 | | | | 0 | async | 2021-03-01 14:07:57.851500+08
(1 row)
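
In the replication row above, sent_lsn, write_lsn, flush_lsn and replay_lsn are all 1/F20357A8, which means the standby has replayed everything the primary has sent. When the values differ, the gap in bytes can be computed from the LSN text. The following is a minimal sketch, assuming the usual `high/low` hexadecimal LSN notation in which the part before the slash is the upper 32 bits; `lsn_to_bytes` and `lsn_diff` are hypothetical helper names.

```shell
#!/bin/sh
# lsn_to_bytes: convert an LSN such as 1/F20357A8 into an absolute byte
# position (upper 32 bits before the slash, lower 32 bits after it).
# (Hypothetical helper; relies on 64-bit shell arithmetic.)
lsn_to_bytes() {
    hi=${1%%/*}
    lo=${1##*/}
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# lsn_diff A B: print how many bytes LSN A is ahead of LSN B.
lsn_diff() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Example usage with the values from the row above (sent_lsn, replay_lsn):
#   lsn_diff 1/F20357A8 1/F20357A8
```

A result of 0 for sent_lsn versus replay_lsn means the standby is fully caught up; a growing positive value means it is falling behind.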

V. Restart the cluster to verify

[kingbase@node1 bin]$ ./sys_monitor.sh restart
2021-03-01 14:09:05 Ready to stop all DB ...
There is no service "node_export" running currently.
There is no service "postgres_ex" running currently.
There is no service "node_export" running currently.
There is no service "postgres_ex" running currently.
2021-03-01 14:09:10 begin to stop repmgrd on "[192.168.7.248]".
2021-03-01 14:09:11 repmgrd on "[192.168.7.248]" already stopped.
2021-03-01 14:09:11 begin to stop repmgrd on "[192.168.7.249]".
2021-03-01 14:09:11 repmgrd on "[192.168.7.249]" already stopped.
2021-03-01 14:09:11 begin to stop DB on "[192.168.7.249]".
waiting for server to shut down.... done
server stopped
2021-03-01 14:09:13 DB on "[192.168.7.249]" stop success.
2021-03-01 14:09:13 begin to stop DB on "[192.168.7.248]".
waiting for server to shut down...... done
server stopped
2021-03-01 14:09:16 DB on "[192.168.7.248]" stop success.
2021-03-01 14:09:16 Done.
2021-03-01 14:09:16 Ready to start all DB ...
2021-03-01 14:09:16 begin to start DB on "[192.168.7.248]".
waiting for server to start.... done
server started
2021-03-01 14:09:17 execute to start DB on "[192.168.7.248]" success, connect to check it.
2021-03-01 14:09:19 DB on "[192.168.7.248]" start success.
2021-03-01 14:09:19 Try to ping trusted_servers on host 192.168.7.248 ...
2021-03-01 14:09:21 Try to ping trusted_servers on host 192.168.7.249 ...
2021-03-01 14:09:24 begin to start DB on "[192.168.7.249]".
waiting for server to start.... done
server started
2021-03-01 14:09:25 execute to start DB on "[192.168.7.249]" success, connect to check it.
2021-03-01 14:09:26 DB on "[192.168.7.249]" start success.
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+-------
1 | node248 | primary | * running | | default | 100 | 5 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node249 | standby | running | node248 | default | 100 | 5 | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2021-03-01 14:09:26 The primary DB is started.
2021-03-01 14:09:31 Success to load virtual ip [192.168.7.240/24] on primary host [192.168.7.248].
2021-03-01 14:09:31 Try to ping vip on host 192.168.7.248 ...
2021-03-01 14:09:33 Try to ping vip on host 192.168.7.249 ...
2021-03-01 14:09:36 begin to start repmgrd on "[192.168.7.248]".
[2021-03-01 14:09:37] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 14:09:37] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"

2021-03-01 14:09:37 repmgrd on "[192.168.7.248]" start success.
2021-03-01 14:09:37 begin to start repmgrd on "[192.168.7.249]".
[2021-03-01 14:09:00] [NOTICE] using provided configuration file "/home/kingbase/cluster/R6HA/KHA/kingbase/bin/../etc/repmgr.conf"
[2021-03-01 14:09:00] [NOTICE] redirecting logging output to "/home/kingbase/cluster/R6HA/KHA/kingbase/hamgr.log"

2021-03-01 14:09:38 repmgrd on "[192.168.7.249]" start success.
ID | Name | Role | Status | Upstream | repmgrd | PID | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
1 | node248 | primary | * running | | running | 24725 | no | n/a
2 | node249 | standby | running | node248 | running | 23587 | no | n/a
2021-03-01 14:09:46 Done.

Check the cluster node status:

[kingbase@node1 bin]$ ./repmgr cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+--------
1 | node248 | primary | * running | | default | 100 | 5 | host=192.168.7.248 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2 | node249 | standby | running | node248 | default | 100 | 5 | host=192.168.7.249 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

  
