Official documentation:

https://help.kingbase.com.cn/v8/highly/availability/cluster-use/cluster-use-2.html#id35

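Global failure recovery (multi-level automatic cluster recovery)

When the whole cluster fails, for example after a power loss followed by power-on, this feature automatically recovers the entire cluster (only level-1 automatic recovery is currently supported). Setting auto_cluster_recovery_level=1 enables the feature; setting auto_cluster_recovery_level=0 disables it. It is enabled by default. To change it, edit the parameter in repmgr.conf and then restart the repmgrd/kbha daemons for the change to take effect.

Automatic recovery only takes effect and restores the cluster when all of the following conditions hold: every node in the cluster has failed, the network between the nodes is healthy, and only one node in the cluster is in the primary state.

After the cluster fails, the kbha daemon checks the state of the other nodes. When all other nodes are reachable, their databases are all stopped, and no other database is in the primary state, kbha starts the local primary. The repmgrd daemon on the primary then tries to bring the standby nodes back through automatic failure recovery.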
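The decision rule described above can be sketched as a small predicate. This is only a model of the documented conditions, not kbha's actual implementation: the `PeerState` type, the function name, and the assumption that only the node that was previously the primary starts itself are all introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PeerState:
    """State of another cluster node as seen from the local node (hypothetical model)."""
    name: str
    reachable: bool    # the network path to this node works
    db_running: bool   # its database service is currently up
    is_primary: bool   # its database is running in the primary role


def should_start_local_primary(local_is_primary: bool,
                               peers: List[PeerState],
                               recovery_level: int = 1) -> bool:
    """Return True when the documented level-1 conditions all hold."""
    if recovery_level != 1:      # auto_cluster_recovery_level=0 disables the feature
        return False
    if not local_is_primary:     # assumption: only the former primary starts itself
        return False
    for peer in peers:
        if not peer.reachable:                    # network between nodes must be healthy
            return False
        if peer.db_running or peer.is_primary:    # every other database must be stopped
            return False
    return True


# The two-node test cluster right after power-on: node102 is reachable but stopped.
peers_after_power_on = [
    PeerState("node102", reachable=True, db_running=False, is_primary=False)
]
print(should_start_local_primary(True, peers_after_power_on))  # True
```

If any peer is unreachable or already running, the predicate returns False, which matches the requirement that all conditions hold before the cluster is recovered.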

Applicable version:

KingbaseES V8R6

Test version:

[kingbase@node101 bin]$ ./ksql -V
ksql (Kingbase) V008R006C005B0041

Cluster node information:

 ID | Name    | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node101 | primary | * running |          | running | 11054 | no      | n/a
 2  | node102 | standby |   running | node101  | running | 17731 | no      | 1 second(s) ago

I. Cluster node status

[kingbase@node101 bin]$ ./repmgr cluster show

 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node101 | primary | * running |          | default  | 100      | 43       | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node102 | standby |   running | node101  | default  | 100      | 43       | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

II. Cluster configuration parameters

[kingbase@node101 bin]$ cat ../etc/repmgr.conf|grep auto
recovery='automatic'
auto_cluster_recovery_level=1
failover='automatic'
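To check the setting from a script rather than by eye, the key=value lines can be parsed directly. The following is a minimal sketch that assumes a simple `key=value` layout with optional quotes and `#` comments; the function name is hypothetical, and it does not handle `#` inside quoted values or include directives.

```python
def read_conf_value(conf_text: str, key: str, default=None):
    """Return the value of `key` from repmgr.conf-style text, or `default`."""
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments (simplification)
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip() == key:
            return value.strip().strip("'\"")  # remove surrounding quotes
    return default


# The three lines shown by the grep above.
sample_conf = """recovery='automatic'
auto_cluster_recovery_level=1
failover='automatic'
"""

# The feature is documented as enabled by default, hence the "1" fallback.
level = read_conf_value(sample_conf, "auto_cluster_recovery_level", "1")
print("cluster auto-recovery enabled" if level == "1" else "cluster auto-recovery disabled")
```

In practice the text would come from reading ../etc/repmgr.conf; a string is used here so the sketch stays self-contained.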

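III. Cluster power-off

Test: power off the primary and standby nodes at the same time.

IV. Node state after the cluster is powered on

1. Primary node recovery state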

[kingbase@node101 bin]$ ps -ef |grep kingbase

kingbase  2456     1  0 16:27 ?        00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6HA/kha/kingbase/bin/../etc/repmgr.conf
kingbase 2507 2456 0 16:27 ? 00:00:00 sh -c /home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgr node service --action start 2>/dev/null
kingbase 2508 2507 0 16:27 ? 00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgr node service --action start
kingbase 2509 2508 0 16:27 ? 00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl -w -t 90 -D /data/kingbase/r6ha/data -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile start
kingbase 2511 2509 0 16:27 ? 00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kingbase -D /data/kingbase/r6ha/data
kingbase 2512 2511 0 16:27 ? 00:00:00 kingbase: logger
kingbase 2513 2511 0 16:27 ? 00:00:00 kingbase: startup

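---As shown above, the primary node starts the kbha daemon, which in turn runs repmgr node service --action start and sys_ctl to bring up the cluster and the database service.

Primary node recovery complete: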

[kingbase@node101 bin]$ ps -ef |grep kingbase
kingbase 2519 2511 0 16:27 ? 00:00:00 kingbase: autovacuum launcher
kingbase 2520 2511 0 16:27 ? 00:00:00 kingbase: archiver
kingbase 2521 2511 0 16:27 ? 00:00:00 kingbase: stats collector
kingbase 2522 2511 0 16:27 ? 00:00:00 kingbase: ksh writer
kingbase 2523 2511 0 16:27 ? 00:00:00 kingbase: ksh collector
kingbase 2524 2511 0 16:27 ? 00:00:00 kingbase: kwr collector
kingbase 2525 2511 0 16:27 ? 00:00:00 kingbase: logical replication launcher
kingbase 2528 2511 0 16:27 ? 00:00:00 kingbase: system esrep 192.168.1.101(41774) idle
kingbase 2530 1 0 16:27 ? 00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6HA/kha/kingbase/bin/../etc/repmgr.conf
kingbase 2532 2511 0 16:27 ? 00:00:00 kingbase: system esrep 192.168.1.101(41778) idle
kingbase 2661 2456 0 16:28 ? 00:00:00 ping -q -c3 -w2 192.168.1.1
kingbase 2664 2511 0 16:28 ? 00:00:00 kingbase: system esrep 192.168.1.101(41848) idle
kingbase 2675 2530 0 16:28 ? 00:00:00 ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 192.168.1.102 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -h 192.168.1.101 -A rejoin
kingbase 2679 2511 0 16:28 ? 00:00:00 kingbase: system esrep 192.168.1.102(11472) idle
kingbase 2681 2511 0 16:28 ? 00:00:00 kingbase: system esrep 192.168.1.102(11476) COPY

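---As shown above, after power-on the primary starts the kbha and repmgrd processes and the database service, then connects to the standby over ssh to run recovery on the standby.

2. Standby node state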

[kingbase@node102 ~]$ ps -ef |grep kingbase
kingbase 2306 1 0 16:27 ? 00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6HA/kha/kingbase/bin/../etc/repmgr.conf
kingbase 2688 1 0 16:28 ? 00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kingbase -D /data/kingbase/r6ha/data
kingbase 2689 2688 0 16:28 ? 00:00:00 kingbase: logger
kingbase 2690 2688 0 16:28 ? 00:00:00 kingbase: startup recovering 0000002B0000000500000066
kingbase 2694 2688 0 16:28 ? 00:00:00 kingbase: checkpointer
kingbase 2695 2688 0 16:28 ? 00:00:00 kingbase: background writer
kingbase 2696 2688 0 16:28 ? 00:00:00 kingbase: stats collector
kingbase 2697 2688 0 16:28 ? 00:00:00 kingbase: walreceiver streaming 5/66027EF8
kingbase 2708 2688 0 16:28 ? 00:00:00 kingbase: system esrep 192.168.1.102(56891) idle
kingbase 2710 1 0 16:28 ? 00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6HA/kha/kingbase/bin/../etc/repmgr.conf
kingbase 2712 2688 0 16:28 ? 00:00:00 kingbase: system esrep 192.168.1.102(56897) idle

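---As shown above, the cluster processes (kbha, repmgrd) and the database service on the standby node have all started.

3. Cluster status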

[kingbase@node101 bin]$ ./repmgr cluster show

 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node101 | primary | * running |          | default  | 100      | 43       | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node102 | standby |   running | node101  | default  | 100      | 43       | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

---As shown above, the cluster state is back to normal.
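The same check can be automated by parsing the pipe-delimited table that repmgr cluster show prints. This is a rough sketch under the assumption that the column order matches the output above (ID, Name, Role, Status, Upstream, ...); the function names and the definition of "healthy" (exactly one primary, every node running, every standby following that primary) are choices made for this example.

```python
def parse_cluster_show(output: str):
    """Parse `repmgr cluster show` text into a list of node dicts."""
    nodes = []
    for line in output.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Skip the header, the separator line, and blank lines.
        if len(cells) < 5 or not cells[0].isdigit():
            continue
        nodes.append({
            "id": int(cells[0]),
            "name": cells[1],
            "role": cells[2],
            "status": cells[3].lstrip("*").strip(),  # "* running" -> "running"
            "upstream": cells[4],
        })
    return nodes


def cluster_is_healthy(nodes) -> bool:
    """True if there is exactly one primary and all nodes are running and attached."""
    primaries = [n for n in nodes if n["role"] == "primary"]
    if len(primaries) != 1:
        return False
    if not all(n["status"] == "running" for n in nodes):
        return False
    standbys = [n for n in nodes if n["role"] == "standby"]
    return all(n["upstream"] == primaries[0]["name"] for n in standbys)


# Connection strings shortened for the example.
sample_output = """ ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+------------------
1 | node101 | primary | * running | | default | 100 | 43 | host=192.168.1.101 user=system dbname=esrep port=54321
2 | node102 | standby | running | node101 | default | 100 | 43 | host=192.168.1.102 user=system dbname=esrep port=54321
"""

nodes = parse_cluster_show(sample_output)
print(cluster_is_healthy(nodes))  # True
```

A monitoring job could feed the real command output into `parse_cluster_show` after power-on and alert if `cluster_is_healthy` stays False for too long.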

4. Log analysis of the recovery after power-on

Check hamgr.log on the primary:

# After the host powers on, the repmgrd process on the primary starts
[2023-02-02 11:14:35] [NOTICE] repmgrd (repmgrd 5.0.0) starting up
[2023-02-02 11:14:35] [INFO] connecting to database "host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
...... # After the primary starts the database service, it detects that the VIP is missing and loads it
[2023-02-02 11:14:57] [NOTICE] found primary node lost virtual_ip, try to acquire virtual_ip
[2023-02-02 11:14:59] [NOTICE] PING 192.168.1.254 (192.168.1.254) 56(84) bytes of data.
--- 192.168.1.254 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms
[2023-02-02 11:14:59] [WARNING] ping host"192.168.1.254" failed
[2023-02-02 11:14:59] [DETAIL] average RTT value is not greater than zero
[2023-02-02 11:14:59] [DEBUG] executing:
/home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -A loadvip
[2023-02-02 11:14:59] [DEBUG] result of command was 0 (0)
[2023-02-02 11:14:59] [DEBUG] local_command(): no output returned
[2023-02-02 11:14:59] [DEBUG] executing:
/home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -A arping
[2023-02-02 11:14:59] [DEBUG] result of command was 0 (0)
[2023-02-02 11:14:59] [DEBUG] local_command(): no output returned
[2023-02-02 11:14:59] [INFO] loadvip result: 1, arping result: 1
[2023-02-02 11:14:59] [NOTICE] acquire the virtual ip 192.168.1.254/24 success on localhost
....... # The primary checks the standby's state and connects to it remotely to run recovery
[2023-02-02 11:15:16] [INFO] child node: 2; attached: no
[2023-02-02 11:15:16] [INFO] recovery delay time reached. can do recovery now.
[2023-02-02 11:15:16] [DEBUG] update_node_record_set_active():
UPDATE repmgr.nodes SET active = FALSE WHERE node_id = 2
[2023-02-02 11:15:16] [NOTICE] mark node "node102" (ID: 2) as inactive
......
[2023-02-02 11:15:16] [DEBUG] test_ssh_connection(): executing ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 192.168.1.102 /bin/true 2>/dev/null
[2023-02-02 11:15:16] [DEBUG] remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 192.168.1.102 test -e /data/kingbase/r6ha/data/standby.signal
[2023-02-02 11:15:16] [DEBUG] remote_command(): no output returned
[2023-02-02 11:15:16] [NOTICE] [thread pid:2722] node (ID: 2; host: "192.168.1.102") is not attached, ready to auto-recovery
[2023-02-02 11:15:16] [DEBUG] get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
[2023-02-02 11:15:16] [DEBUG] executing:
/home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgr cluster show | grep -w "host" | grep -w "running" | grep -w "primary" | awk -F '|' '{print $NF}'
....... # The primary connects to the standby remotely and runs node rejoin through kbha
[2023-02-02 11:15:16] [NOTICE] [thread pid:2722] Now, the primary host ip: 192.168.1.101
[2023-02-02 11:15:16] [DEBUG] test_ssh_connection(): executing ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 192.168.1.102 /bin/true 2>/dev/null
[2023-02-02 11:15:16] [INFO] [thread pid:2722] ES connection to host "192.168.1.102" succeeded, ready to do auto-recovery
[2023-02-02 11:15:16] [DEBUG] remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 192.168.1.102 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -h 192.168.1.101 -A rejoin
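........ # Through the remote kbha rejoin, the pid file and shared memory are checked to determine whether the standby's database service is running, and sys_rewind is run to make the standby's data consistent with the primary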
[2023-02-02 11:15:14] [INFO] the Kingbase pid file is already exists, check pre-existing shared memory block (key 54321001, ID 2)
[2023-02-02 11:15:14] [INFO] pre-existing shared memory block (key 54321001, ID 2) is not in use
2023-02-02 11:15:14.465 CST [2678] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2023-02-02 11:15:14.465 CST [2678] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2023-02-02 11:15:14.465 CST [2678] DEBUG: proc_exit(0): 0 callbacks to make
2023-02-02 11:15:14.465 CST [2678] DEBUG: exit(0)
[2023-02-02 11:15:14] [INFO] unlink file /tmp/.s.KINGBASE.54321.lock
[2023-02-02 11:15:14] [NOTICE] executing repmgr command "/home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgr --dbname="host=192.168.1.101 dbname=esrep user=system port=54321" node rejoin --force-rewind"
WARNING: database is not running, but it is not shut down cleanly
DEBUG: connecting to: "user=system connect_timeout=10 dbname=esrep host=192.168.1.101 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 5/6C0000A0, rejoin target lsn is 5/6C006058
NOTICE: executing sys_rewind
DETAIL: sys_rewind command is "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_rewind -D '/data/kingbase/r6ha/data' --source-server='host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: warning: sys_rewind: target server must be shut down cleanly in control file, and could not open PID file "/data/kingbase/r6ha/data/kingbase.pid": No such file or directorypid file not found that it seems bogus. Trying to start rewind anyway...
sys_rewind: servers diverged at WAL location 0/0 on timeline 43
sys_rewind: the divergerec is invalid, set the diverged to the target server's WAL location at 5/6C000028 on timeline 43
sys_rewind: rewinding from last common checkpoint at 5/6C000028 on timeline 43
sys_rewind: find last common checkpoint start time from 2023-02-02 11:15:14.503077 CST to 2023-02-02 11:15:14.541435 CST, in "0.038358" seconds.
....... # Recovery of the standby completes and the standby rejoins the cluster
NOTICE: begin to start server at 2023-02-02 11:15:21.817078
NOTICE: starting server using "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl -w -t 90 -D '/data/kingbase/r6ha/data' -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile start"
NOTICE: start server finish at 2023-02-02 11:15:22.625625
NOTICE: NODE REJOIN successful
DETAIL: node 2 is now attached to node 1
[2023-02-02 11:15:22] [NOTICE] kbha: node (ID: 2) rejoin success.
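To measure how long recovery took, the bracketed timestamps in hamgr.log can be compared between two milestone messages. The sketch below assumes the `[YYYY-MM-DD HH:MM:SS] [LEVEL] message` layout seen in the excerpt; lines in other formats (such as the sys_rewind output) are ignored. The function names and the choice of milestone strings are illustrative only.

```python
import re
from datetime import datetime

# Matches lines such as: [2023-02-02 11:14:35] [NOTICE] repmgrd (repmgrd 5.0.0) starting up
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*)$")


def parse_log_line(line: str):
    """Return (timestamp, level, message), or None if the line has another format."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return stamp, match.group(2), match.group(3)


def seconds_between(lines, start_marker: str, end_marker: str):
    """Seconds from the first line containing start_marker to the next one containing end_marker."""
    start = None
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is None:
            continue
        stamp, _level, message = parsed
        if start is None:
            if start_marker in message:
                start = stamp
        elif end_marker in message:
            return (stamp - start).total_seconds()
    return None


# Three lines taken from the excerpt above.
sample_log = [
    "[2023-02-02 11:14:35] [NOTICE] repmgrd (repmgrd 5.0.0) starting up",
    "NOTICE: NODE REJOIN successful",
    "[2023-02-02 11:15:22] [NOTICE] kbha: node (ID: 2) rejoin success.",
]

print(seconds_between(sample_log, "starting up", "rejoin success"))  # 47.0
```

For this test run, about 47 seconds passed between repmgrd starting on the primary and the standby rejoining the cluster.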

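V. Summary

In unattended production environments where the machine-room power supply is unstable, automatic recovery after a full-cluster power loss and power-on reduces manual intervention and restores the cluster quickly.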
