KingbaseES V8R6: Automatic Database Failover Fails

Applies to:

KingbaseES V8R6.

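repmgr configuration:

First check the repmgr.conf configuration file and confirm that the parameters failover='automatic' and recovery='standby' are set identically on the primary node and on every standby node.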
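The check above can be scripted so that it is easy to repeat on every node. The following is a minimal sketch, not part of the KingbaseES tooling: the helper names get_conf_value and check_param are hypothetical, the location of repmgr.conf depends on your installation, and taking the last assignment in the file is an assumption about precedence.

```shell
#!/bin/sh
# Print the value of KEY from a repmgr.conf-style file (key='value' or key=value).
# Commented-out lines are ignored; nothing is printed if the key is absent.
get_conf_value() {
    conf_file=$1
    key=$2
    # Keep the last uncommented assignment (assumed precedence), then strip
    # any trailing comment and the surrounding quotes.
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$conf_file" 2>/dev/null \
        | tail -n 1 \
        | sed -e "s/[[:space:]]*#.*$//" -e "s/^['\"]//" -e "s/['\"][[:space:]]*$//"
}

# Compare one parameter with the expected value; return 1 on mismatch.
check_param() {
    actual=$(get_conf_value "$1" "$2")
    if [ "$actual" = "$3" ]; then
        echo "OK   $2='$actual'"
    else
        echo "DIFF $2='$actual' (expected '$3')"
        return 1
    fi
}
```

On each node, run for example check_param /path/to/repmgr.conf failover automatic and check_param /path/to/repmgr.conf recovery standby, substituting the real path of the file, and compare the results across nodes.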

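I. Symptoms:

Automatic database failover fails: when the primary goes down, none of the healthy standby nodes is selected and promoted to become the new primary.
Entries similar to the following appear in hamgr.log on the available standby node.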

[2023-02-10 11:39:46] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"

[2023-02-10 11:39:46] [INFO] sleeping up to 6 seconds until next reconnection attempt

[2023-02-10 11:39:52] [INFO] checking state of node "node20" (ID: 1), 10 of 10 attempts

[2023-02-10 11:39:52] [WARNING] unable to reconnect to node "node20" (ID: 1) after 10 attempts

[2023-02-10 11:39:52] [NOTICE] repmgrd on this node is paused

[2023-02-10 11:39:52] [DETAIL] no failover will be carried out

[2023-02-10 11:39:52] [HINT] execute "repmgr service unpause" to resume normal failover mode

Entries similar to the following appear in kbha.log on the available standby node.

[2023-02-10 10:31:34] [DEBUG] PID file "/home/kingbase/cluster/etc/hamgrd.pid" exists and seems to contain a valid PID

[2023-02-10 10:31:34] [DEBUG] repmgrd is running, can not start another one.

[2023-02-10 10:31:37] [NOTICE] PING 10.10.10.1 (10.10.10.1) 56(84) bytes of data.

--- 10.10.10.1 ping statistics ---

2 packets transmitted, 2 received, 0% packet loss, time 1029ms

rtt min/avg/max/mdev = 0.729/0.745/0.761/0.016 ms

...

[2023-02-10 11:32:37] [DEBUG] PID file "/home/kingbase/cluster/etc/hamgrd.pid" exists and seems to contain a valid PID

[2023-02-10 11:32:37] [DEBUG] repmgrd is running, can not start another one.

[2023-02-10 11:32:37] [DEBUG] the thread 428402432 is still running

ping: socket: 不允许的操作 or  Operation not permitted

[2023-02-10 11:32:38] [NOTICE]

[2023-02-10 11:32:38] [WARNING] ping host"10.10.10.1" failed

[2023-02-10 11:32:38] [DETAIL] average RTT value is not greater than zero

[2023-02-10 11:32:38] [DEBUG] ping process end early. usleep(1978998).

ping: socket: 不允许的操作 or  Operation not permitted

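Note: the log entries above are only examples. Dates, times, and environment-specific values will differ between environments.

II. Investigation:

Following the KingbaseES service continuity maintenance guide: https://help.kingbase.com.cn/v8/admin/general/maintenance/maintenance-1.html

When a database cluster fails or suffers unplanned downtime, use hamgr.log and kbha.log to locate the cause.

1. hamgr.log:

# Node 1: hamgr log on the primary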

[2023-02-10 11:37:58] [NOTICE] repmgrd (repmgrd 5.0.0) starting up

[2023-02-10 11:37:58] [INFO] connecting to database "host=10.10.10.20 user=esrep dbname=esrep port=5432 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"

INFO:  set_repmgrd_pid(): provided pidfile is /home/kingbase/cluster/kingbase/etc/hamgrd.pid

[2023-02-10 11:37:58] [NOTICE] starting monitoring of node "node20" (ID: 1)

[2023-02-10 11:37:58] [INFO] "connection_check_type" set to "mix"

[2023-02-10 11:37:58] [NOTICE] monitoring cluster primary "node20" (ID: 1)

[2023-02-10 11:37:59] [INFO] child node "node21" (ID: 2) is attached

[2023-02-10 11:38:22] [NOTICE] TERM signal received

[2023-02-10 11:38:22] [ERROR] unable to determine if server is in recovery

[2023-02-10 11:38:22] [DETAIL]

FATAL:  terminating connection due to administrator command

server closed the connection unexpectedly

	This probably means the server terminated abnormally

	before or while processing the request.

[2023-02-10 11:38:22] [DETAIL] query text is:

SELECT pg_catalog.pg_is_in_recovery()

[2023-02-10 11:38:22] [INFO] repmgrd terminating...

# Node 2: hamgr log on the standby

[2023-02-10 11:38:00] [NOTICE] starting monitoring of node "node21" (ID: 2)

[2023-02-10 11:38:00] [INFO] "connection_check_type" set to "mix"

[2023-02-10 11:38:00] [INFO] monitoring connection to upstream node "node20" (ID: 1)

[2023-02-10 11:38:00] [NOTICE] try to change wal catched_up state to 1

[2023-02-10 11:38:00] [INFO] primary flush lsn is AE/DF000590, local flush lsn is AE/DF000590

[2023-02-10 11:38:00] [NOTICE] try to change streaming_sync state to TRUE

[2023-02-10 11:38:23] [WARNING] unable to ping "host=10.10.10.20 user=esrep dbname=esrep port=5432 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"

[2023-02-10 11:38:23] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"

[2023-02-10 11:38:23] [WARNING] unable to connect to upstream node "node20" (ID: 1)

[2023-02-10 11:38:23] [INFO] checking state of node "node20" (ID: 1), 1 of 10 attempts

[2023-02-10 11:38:23] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=10.10.10.20 port=5432 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"

...

[2023-02-10 11:39:46] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"

[2023-02-10 11:39:46] [INFO] sleeping up to 6 seconds until next reconnection attempt

[2023-02-10 11:39:52] [INFO] checking state of node "node20" (ID: 1), 10 of 10 attempts

[2023-02-10 11:39:52] [WARNING] unable to reconnect to node "node20" (ID: 1) after 10 attempts

[2023-02-10 11:39:52] [NOTICE] repmgrd on this node is paused

[2023-02-10 11:39:52] [DETAIL] no failover will be carried out

[2023-02-10 11:39:52] [HINT] execute "repmgr service unpause" to resume normal failover mode

[2023-02-10 11:43:02] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"

[2023-02-10 11:43:02] [INFO] node "node21" (ID: 2) monitoring upstream node "node20" (ID: 1) in degraded state

[2023-02-10 11:43:02] [DETAIL] repmgrd paused by administrator

[2023-02-10 11:43:02] [HINT] execute "repmgr service unpause" to resume normal failover mode

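2. What hamgr.log shows:

The primary node crashed at [2023-02-10 11:38:22]. The available standby noticed at [2023-02-10 11:38:23] that the primary was unreachable. Once the number of connection attempts reached the reconnect_attempts threshold, automatic failover should have started. Instead, the standby's hamgr log at [2023-02-10 11:39:52] reports that repmgrd on the node is paused and that no failover will be carried out.

3. kbha.log:

# kbha.log on the standby node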
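To find this condition quickly in a long log, the paused message can be searched for directly. The sketch below only relies on the message text shown in the excerpt above; the function name find_paused_events is hypothetical, and the path of hamgr.log has to be supplied for your environment.

```shell
#!/bin/sh
# Print the timestamp of every "repmgrd on this node is paused" entry in a log.
find_paused_events() {
    log_file=$1
    grep -F 'repmgrd on this node is paused' "$log_file" 2>/dev/null \
        | sed -n 's/^\[\([^]]*\)\].*/\1/p'   # keep only the text inside the first [...]
}
```

Calling find_paused_events with the path of hamgr.log on a standby lists the moments at which failover was suppressed; an empty result means the message never appeared in that file.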

--- 10.10.10.1 ping statistics ---

2 packets transmitted, 2 received, 0% packet loss, time 1019ms

rtt min/avg/max/mdev = 0.729/0.737/0.745/0.008 ms

[2023-02-10 10:31:34] [DEBUG] PID file "/home/kingbase/cluster/etc/hamgrd.pid" exists and seems to contain a valid PID

[2023-02-10 10:31:34] [DEBUG] repmgrd is running, can not start another one.

[2023-02-10 10:31:37] [NOTICE] PING 10.10.10.1 (10.10.10.1) 56(84) bytes of data.

--- 10.10.10.1 ping statistics ---

2 packets transmitted, 2 received, 0% packet loss, time 1029ms

rtt min/avg/max/mdev = 0.729/0.745/0.761/0.016 ms

...

[2023-02-10 11:32:37] [DEBUG] PID file "/home/kingbase/cluster/etc/hamgrd.pid" exists and seems to contain a valid PID

[2023-02-10 11:32:37] [DEBUG] repmgrd is running, can not start another one.

[2023-02-10 11:32:37] [DEBUG] the thread 428402432 is still running

ping: socket: 不允许的操作 or  Operation not permitted

[2023-02-10 11:32:38] [NOTICE]

[2023-02-10 11:32:38] [WARNING] ping host"10.10.10.1" failed

[2023-02-10 11:32:38] [DETAIL] average RTT value is not greater than zero

[2023-02-10 11:32:38] [DEBUG] ping process end early. usleep(1978998).

ping: socket: 不允许的操作 or  Operation not permitted

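4. What kbha.log on the available standby shows:

Around [2023-02-10 10:31:37] the standby could still use ping to check that the cluster gateway was reachable. From [2023-02-10 11:32:37] onward, ping failed with "ping: socket: 不允许的操作" (Operation not permitted). The cause is that the permissions of the ping command had been changed. ping uses ICMP and must send ICMP packets, and by default only root can open the socket needed to build them. The correct permission for ping is -rwsr-xr-x, that is, a setuid file; once that bit is removed, ordinary users can no longer run the command. (On CentOS Linux, ordinary users can run ping even with -rwxr-xr-x, but the failed environment runs Kylin Linux, where ping with -rwxr-xr-x does not work for ordinary users.) The permissions on the affected node were: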

$ ls -l /bin/ping

-rwxr-xr-x. 1 root root 67680 Feb 23  2021 /bin/ping
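The same check can be done without reading the mode string by eye. This is a minimal sketch using the standard shell test -u; the function names are hypothetical. It only inspects the setuid bit: some distributions let ordinary users run ping through other mechanisms such as file capabilities, which this sketch does not examine.

```shell
#!/bin/sh
# Return 0 if the file has the set-user-ID bit, 1 otherwise.
has_setuid() {
    [ -u "$1" ]
}

# Print a one-line summary for a ping binary; non-zero status means "needs attention".
describe_ping_perms() {
    bin=$1
    if [ ! -e "$bin" ]; then
        echo "$bin: not found"
        return 2
    fi
    if has_setuid "$bin"; then
        echo "$bin: setuid bit set"
    else
        echo "$bin: setuid bit NOT set"
        return 1
    fi
}
```

For the node in this case, describe_ping_perms /bin/ping would report that the setuid bit is not set.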

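III. Root cause:

1. How KingbaseES primary/standby failover works:

The repmgrd process on each standby node monitors its upstream node. When the primary fails and repmgrd detects a connection error to the upstream node, it retries reconnect_attempts times, waiting reconnect_interval seconds between retries.
The cluster can only conclude that the primary has failed from network connection timeouts. After reconnect_attempts retries the upstream node is considered failed, and repmgrd then determines whether that upstream node was the primary or a standby.
Once the primary failure is confirmed, the standby kills its wal_receiver process (every standby does this) and promotion begins. The cluster elects one of the available standbys for promotion, runs the promote statement on it, and the failover completes.
Switchover steps: assuming the local node wins the election, it checks the trusted gateway (if a VIP is configured, the VIP is removed from the old primary and added on the new primary; this step uses the ping command and pings the VIP twice), then runs the actual promote statement, and the failover completes.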
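As a rough illustration of the retry phase, the detection window can be estimated from the two parameters. The values below are read from the log excerpt in this case (10 attempts, "sleeping up to 6 seconds"), and treating that sleep as reconnect_interval is an assumption; the function name is hypothetical.

```shell
#!/bin/sh
# Rough estimate of how long the standby keeps retrying before it declares
# the upstream node failed: reconnect_attempts * reconnect_interval (seconds).
estimate_detection_seconds() {
    attempts=$1
    interval=$2
    echo $(( attempts * interval ))
}

estimate_detection_seconds 10 6
```

This gives 60 seconds. In the log the retries actually ran from 11:38:23 to 11:39:52, about 89 seconds, because the estimate counts only the sleeps and not the time each failed connection attempt takes.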

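2. Specific cause:

From the failover process described above and the kbha.log entries, the problem is confirmed to be caused by the changed permissions of the ping command. ping uses ICMP and must send ICMP packets, and by default only root can open the socket needed to build them.

Normally the permissions of ping should be -rwsr-xr-x, that is, a setuid file. Once that bit is removed, ordinary users can no longer run the command.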

$ ls -l /bin/ping

-rwxr-xr-x. 1 root root 67680 Feb 23  2021 /bin/ping

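IV. Solution:

Because Linux distributions differ, check the permissions of the ping command when deploying a KingbaseES cluster and set them to -rwsr-xr-x, so that ping permissions cannot block automatic failover when a fault occurs.

Run one of the following commands as root:

# Either of the following commands is sufficient

chmod u+s /bin/ping

# or

chmod 4755 /bin/ping

# ping with the correct permissions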

$ ls -l /bin/ping

-rwsr-xr-x. 1 root root 67680 Feb 23  2021 /bin/ping

Verify again: after the primary node is shut down manually, automatic failover now completes normally.
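Besides checking the mode bits, it can help to confirm during deployment that ping really works for the non-root account that runs the cluster. The following is a minimal sketch under stated assumptions: the function names are hypothetical, -c and -W are the count and timeout options of the common Linux ping, and the gateway address 10.10.10.1 is the one that appears in the kbha.log excerpt.

```shell
#!/bin/sh
# Return 0 if the given ping command gets a reply from the target.
ping_works() {
    ping_cmd=$1
    target=$2
    "$ping_cmd" -c 2 -W 2 "$target" >/dev/null 2>&1
}

# Print PASS/FAIL for the current user; return 1 on failure.
precheck_ping() {
    if ping_works "$1" "$2"; then
        echo "PASS: $(id -un) can ping $2"
    else
        echo "FAIL: $(id -un) cannot ping $2 (check permissions of $1)"
        return 1
    fi
}
```

Run as the database OS user on every node, for example precheck_ping /bin/ping 10.10.10.1. A FAIL result can also mean the target is simply unreachable, so confirm with the permission check before changing anything.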
