KingbaseES V8R6 Cluster Operations Case Study: Recovery After a Full-Cluster Power Loss and Power-On
Official documentation:
https://help.kingbase.com.cn/v8/highly/availability/cluster-use/cluster-use-2.html#id35
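Global failure recovery (multi-level automatic cluster recovery)
This feature automatically recovers the whole cluster after a cluster-wide failure, for example when every node loses power and is later powered back on (only level 1 automatic recovery is supported at present). The feature is on when auto_cluster_recovery_level=1 and off when auto_cluster_recovery_level=0. It is on by default. To change it, edit the parameter in repmgr.conf and then restart the repmgrd/kbha daemons for the change to take effect.
Automatic recovery takes effect and recovers the cluster only when all of the following conditions hold: every node in the cluster has failed, the network between the nodes is healthy, and exactly one node in the cluster is in the primary state.
After the cluster fails, the kbha daemon checks the state of the other nodes. If every other node is reachable, every database is stopped, and no other database is in the primary state, kbha starts the local primary. The repmgrd daemon on the primary then uses automatic failure recovery to try to bring the standby nodes back.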
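The start-up decision described above can be sketched in a few lines. This is only an illustration of the stated conditions, not kbha's actual implementation: the `PeerState` model and the `should_start_local_primary` function are hypothetical names introduced here, and the real daemon may check more than this.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PeerState:
    """State of one *other* node as seen from the local node (hypothetical model)."""
    name: str
    reachable: bool    # the network path to the node is healthy
    db_running: bool   # its database service is currently up
    is_primary: bool   # its database is in the primary state


def should_start_local_primary(recovery_level: int,
                               local_is_primary: bool,
                               peers: Sequence[PeerState]) -> bool:
    """Return True only if every documented condition for auto recovery holds."""
    if recovery_level != 1:          # auto_cluster_recovery_level=0 disables the feature
        return False
    if not local_is_primary:         # only the node in the primary state starts itself
        return False
    for peer in peers:
        # Every other node must be reachable, stopped, and not a primary.
        if not peer.reachable or peer.db_running or peer.is_primary:
            return False
    return True
```

With a single reachable, stopped standby the function returns True; it returns False as soon as the feature is disabled, a peer is unreachable, or a second primary is seen.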
Applicable version:
KingbaseES V8R6
Tested version:
[kingbase@node101 bin]$ ./ksql -V
ksql (Kingbase) V008R006C005B0041
Cluster node information:
 ID | Name    | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+---------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node101 | primary | * running |          | running | 11054 | no      | n/a
 2  | node102 | standby |   running | node101  | running | 17731 | no      | 1 second(s) ago
I. Cluster node status
[kingbase@node101 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node101 | primary | * running |          | default  | 100      | 43       | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node102 | standby |   running | node101  | default  | 100      | 43       | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
II. Cluster configuration parameters
[kingbase@node101 bin]$ cat ../etc/repmgr.conf|grep auto
recovery='automatic'
auto_cluster_recovery_level=1
failover='automatic'
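For scripted checks, the same three settings can be read programmatically. The sketch below is a simplified key=value parser written for this article, not part of repmgr or KingbaseES: `parse_repmgr_conf` and `auto_cluster_recovery_enabled` are hypothetical helpers, and the parser does not handle a `#` that appears inside a quoted value.

```python
def parse_repmgr_conf(text: str) -> dict:
    """Parse simple key=value lines; comments start with '#'."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)       # split on the first '=' only
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def auto_cluster_recovery_enabled(settings: dict) -> bool:
    """The feature is on at level 1; it is on by default when the key is absent."""
    return settings.get("auto_cluster_recovery_level", "1") == "1"


SAMPLE = """
recovery='automatic'
auto_cluster_recovery_level=1
failover='automatic'
"""

conf = parse_repmgr_conf(SAMPLE)
print(conf["recovery"], conf["failover"], auto_cluster_recovery_enabled(conf))
```

Treating a missing key as enabled follows the statement above that the feature is on by default.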

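III. Powering off the cluster
Test: cut power to the primary and standby nodes at the same time.
IV. Node status after power is restored
1. Primary node recovery status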
[kingbase@node101 bin]$ ps -ef |grep kingbase
kingbase 2456 1 0 16:27 ? 00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6HA/kha/kingbase/bin/../etc/repmgr.conf
kingbase 2507 2456 0 16:27 ? 00:00:00 sh -c /home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgr node service --action start 2>/dev/null
kingbase 2508 2507 0 16:27 ? 00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgr node service --action start
kingbase 2509 2508 0 16:27 ? 00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl -w -t 90 -D /data/kingbase/r6ha/data -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile start
kingbase 2511 2509 0 16:27 ? 00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kingbase -D /data/kingbase/r6ha/data
kingbase 2512 2511 0 16:27 ? 00:00:00 kingbase: logger
kingbase 2513 2511 0 16:27 ? 00:00:00 kingbase: startup
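---As shown above, the primary node starts the kbha process, which launches sys_ctl (through repmgr node service --action start) to bring up the cluster and the database service.
Primary node recovery complete: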
[kingbase@node101 bin]$ ps -ef |grep kingbase
kingbase 2519 2511 0 16:27 ? 00:00:00 kingbase: autovacuum launcher
kingbase 2520 2511 0 16:27 ? 00:00:00 kingbase: archiver
kingbase 2521 2511 0 16:27 ? 00:00:00 kingbase: stats collector
kingbase 2522 2511 0 16:27 ? 00:00:00 kingbase: ksh writer
kingbase 2523 2511 0 16:27 ? 00:00:00 kingbase: ksh collector
kingbase 2524 2511 0 16:27 ? 00:00:00 kingbase: kwr collector
kingbase 2525 2511 0 16:27 ? 00:00:00 kingbase: logical replication launcher
kingbase 2528 2511 0 16:27 ? 00:00:00 kingbase: system esrep 192.168.1.101(41774) idle
kingbase 2530 1 0 16:27 ? 00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6HA/kha/kingbase/bin/../etc/repmgr.conf
kingbase 2532 2511 0 16:27 ? 00:00:00 kingbase: system esrep 192.168.1.101(41778) idle
kingbase 2661 2456 0 16:28 ? 00:00:00 ping -q -c3 -w2 192.168.1.1
kingbase 2664 2511 0 16:28 ? 00:00:00 kingbase: system esrep 192.168.1.101(41848) idle
kingbase 2675 2530 0 16:28 ? 00:00:00 ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 192.168.1.102 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -h 192.168.1.101 -A rejoin
kingbase 2679 2511 0 16:28 ? 00:00:00 kingbase: system esrep 192.168.1.102(11472) idle
kingbase 2681 2511 0 16:28 ? 00:00:00 kingbase: system esrep 192.168.1.102(11476) COPY

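---As shown above, after power-on the primary starts the kbha and repmgrd processes and the database service, then connects to the standby remotely over ssh to run the standby's recovery.
2. Standby node status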
[kingbase@node102 ~]$ ps -ef |grep kingbase
kingbase 2306 1 0 16:27 ? 00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -A daemon -f /home/kingbase/cluster/R6HA/kha/kingbase/bin/../etc/repmgr.conf
kingbase 2688 1 0 16:28 ? 00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kingbase -D /data/kingbase/r6ha/data
kingbase 2689 2688 0 16:28 ? 00:00:00 kingbase: logger
kingbase 2690 2688 0 16:28 ? 00:00:00 kingbase: startup recovering 0000002B0000000500000066
kingbase 2694 2688 0 16:28 ? 00:00:00 kingbase: checkpointer
kingbase 2695 2688 0 16:28 ? 00:00:00 kingbase: background writer
kingbase 2696 2688 0 16:28 ? 00:00:00 kingbase: stats collector
kingbase 2697 2688 0 16:28 ? 00:00:00 kingbase: walreceiver streaming 5/66027EF8
kingbase 2708 2688 0 16:28 ? 00:00:00 kingbase: system esrep 192.168.1.102(56891) idle
kingbase 2710 1 0 16:28 ? 00:00:00 /home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgrd -d -v -f /home/kingbase/cluster/R6HA/kha/kingbase/bin/../etc/repmgr.conf
kingbase 2712 2688 0 16:28 ? 00:00:00 kingbase: system esrep 192.168.1.102(56897) idle
---As shown above, both the cluster daemons and the database service are running on the standby node.
3. Cluster status
[kingbase@node101 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node101 | primary | * running |          | default  | 100      | 43       | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node102 | standby |   running | node101  | default  | 100      | 43       | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
---As shown above, the cluster is back to a normal state.
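The same check can be automated by parsing the text of `repmgr cluster show`. The following is a minimal sketch that assumes the pipe-separated layout shown above; `parse_cluster_show` and `cluster_is_healthy` are hypothetical helper names, and the connection strings in the sample are abbreviated for readability.

```python
def parse_cluster_show(output: str) -> list:
    """Turn the pipe-separated table into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in output.splitlines():
        if "|" not in line:          # skips blank lines and the ----+---- separator
            continue
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def cluster_is_healthy(rows: list) -> bool:
    """Healthy here means: exactly one primary, and every node plainly 'running'."""
    primaries = [r for r in rows if r["Role"] == "primary"]
    # The primary is shown as '* running'; strip the marker before comparing.
    all_running = all(r["Status"].lstrip("* ") == "running" for r in rows)
    return len(primaries) == 1 and all_running


SAMPLE = """
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+-------------------
 1  | node101 | primary | * running |          | default  | 100      | 43       | host=192.168.1.101 port=54321
 2  | node102 | standby |   running | node101  | default  | 100      | 43       | host=192.168.1.102 port=54321
"""

rows = parse_cluster_show(SAMPLE)
print([r["Name"] for r in rows], cluster_is_healthy(rows))
```

Any status other than a plain `running` (with or without the `*` marker) makes the check fail, which is a deliberately strict choice for a post-recovery verification.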
4. Analysis of the power-on recovery logs
The primary's hamgr.log:
# After the host powers on, the repmgrd process on the primary starts
[2023-02-02 11:14:35] [NOTICE] repmgrd (repmgrd 5.0.0) starting up
[2023-02-02 11:14:35] [INFO] connecting to database "host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
......
# Once its database service is up, the primary checks for the virtual IP (VIP) and loads it
[2023-02-02 11:14:57] [NOTICE] found primary node lost virtual_ip, try to acquire virtual_ip
[2023-02-02 11:14:59] [NOTICE] PING 192.168.1.254 (192.168.1.254) 56(84) bytes of data.
--- 192.168.1.254 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms
[2023-02-02 11:14:59] [WARNING] ping host"192.168.1.254" failed
[2023-02-02 11:14:59] [DETAIL] average RTT value is not greater than zero
[2023-02-02 11:14:59] [DEBUG] executing:
/home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -A loadvip
[2023-02-02 11:14:59] [DEBUG] result of command was 0 (0)
[2023-02-02 11:14:59] [DEBUG] local_command(): no output returned
[2023-02-02 11:14:59] [DEBUG] executing:
/home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -A arping
[2023-02-02 11:14:59] [DEBUG] result of command was 0 (0)
[2023-02-02 11:14:59] [DEBUG] local_command(): no output returned
[2023-02-02 11:14:59] [INFO] loadvip result: 1, arping result: 1
[2023-02-02 11:14:59] [NOTICE] acquire the virtual ip 192.168.1.254/24 success on localhost
.......
# The primary checks the standby's state and connects to it remotely to run recovery
[2023-02-02 11:15:16] [INFO] child node: 2; attached: no
[2023-02-02 11:15:16] [INFO] recovery delay time reached. can do recovery now.
[2023-02-02 11:15:16] [DEBUG] update_node_record_set_active():
UPDATE repmgr.nodes SET active = FALSE WHERE node_id = 2
[2023-02-02 11:15:16] [NOTICE] mark node "node102" (ID: 2) as inactive
......
[2023-02-02 11:15:16] [DEBUG] test_ssh_connection(): executing ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 192.168.1.102 /bin/true 2>/dev/null
[2023-02-02 11:15:16] [DEBUG] remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 192.168.1.102 test -e /data/kingbase/r6ha/data/standby.signal
[2023-02-02 11:15:16] [DEBUG] remote_command(): no output returned
[2023-02-02 11:15:16] [NOTICE] [thread pid:2722] node (ID: 2; host: "192.168.1.102") is not attached, ready to auto-recovery
[2023-02-02 11:15:16] [DEBUG] get_recovery_type(): SELECT pg_catalog.pg_is_in_recovery()
[2023-02-02 11:15:16] [DEBUG] executing:
/home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgr cluster show | grep -w "host" | grep -w "running" | grep -w "primary" | awk -F '|' '{print $NF}'
.......
# The primary connects to the standby remotely and runs the node rejoin operation through the kbha process
[2023-02-02 11:15:16] [NOTICE] [thread pid:2722] Now, the primary host ip: 192.168.1.101
[2023-02-02 11:15:16] [DEBUG] test_ssh_connection(): executing ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 192.168.1.102 /bin/true 2>/dev/null
[2023-02-02 11:15:16] [INFO] [thread pid:2722] ES connection to host "192.168.1.102" succeeded, ready to do auto-recovery
[2023-02-02 11:15:16] [DEBUG] remote_command():
ssh -o Batchmode=yes -q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 22 192.168.1.102 /home/kingbase/cluster/R6HA/kha/kingbase/bin/kbha -h 192.168.1.101 -A rejoin
........
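# Through the remote rejoin, the primary uses the PID file and shared memory to determine whether the standby's database service is running, then runs sys_rewind to make the standby's data consistent with the primary's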
[2023-02-02 11:15:14] [INFO] the Kingbase pid file is already exists, check pre-existing shared memory block (key 54321001, ID 2)
[2023-02-02 11:15:14] [INFO] pre-existing shared memory block (key 54321001, ID 2) is not in use
2023-02-02 11:15:14.465 CST [2678] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2023-02-02 11:15:14.465 CST [2678] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2023-02-02 11:15:14.465 CST [2678] DEBUG: proc_exit(0): 0 callbacks to make
2023-02-02 11:15:14.465 CST [2678] DEBUG: exit(0)
[2023-02-02 11:15:14] [INFO] unlink file /tmp/.s.KINGBASE.54321.lock
[2023-02-02 11:15:14] [NOTICE] executing repmgr command "/home/kingbase/cluster/R6HA/kha/kingbase/bin/repmgr --dbname="host=192.168.1.101 dbname=esrep user=system port=54321" node rejoin --force-rewind"
WARNING: database is not running, but it is not shut down cleanly
DEBUG: connecting to: "user=system connect_timeout=10 dbname=esrep host=192.168.1.101 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 5/6C0000A0, rejoin target lsn is 5/6C006058
NOTICE: executing sys_rewind
DETAIL: sys_rewind command is "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_rewind -D '/data/kingbase/r6ha/data' --source-server='host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'"
sys_rewind: warning: sys_rewind: target server must be shut down cleanly in control file, and could not open PID file "/data/kingbase/r6ha/data/kingbase.pid": No such file or directorypid file not found that it seems bogus. Trying to start rewind anyway...
sys_rewind: servers diverged at WAL location 0/0 on timeline 43
sys_rewind: the divergerec is invalid, set the diverged to the target server's WAL location at 5/6C000028 on timeline 43
sys_rewind: rewinding from last common checkpoint at 5/6C000028 on timeline 43
sys_rewind: find last common checkpoint start time from 2023-02-02 11:15:14.503077 CST to 2023-02-02 11:15:14.541435 CST, in "0.038358" seconds.
.......
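The rejoin output above reports a local LSN of 5/6C0000A0 and a target LSN of 5/6C006058. As a rough illustration of how far the standby was behind, the sketch below converts both values to byte positions. It assumes KingbaseES uses the PostgreSQL-style LSN notation (two hexadecimal numbers, high 32 bits and low 32 bits, separated by a slash); `lsn_to_int` is a hypothetical helper written for this example.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert 'HIGH/LOW' hexadecimal LSN text into a single byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


local_lsn = lsn_to_int("5/6C0000A0")    # standby, from the rejoin DETAIL line
target_lsn = lsn_to_int("5/6C006058")   # primary, from the same line

# A positive gap means the standby is behind the primary, which matches
# the "this server is not ahead" message in the log.
print(target_lsn - local_lsn)
```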
# Recovery of the standby completes and the standby rejoins the cluster
NOTICE: begin to start server at 2023-02-02 11:15:21.817078
NOTICE: starting server using "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl -w -t 90 -D '/data/kingbase/r6ha/data' -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile start"
NOTICE: start server finish at 2023-02-02 11:15:22.625625
NOTICE: NODE REJOIN successful
DETAIL: node 2 is now attached to node 1
[2023-02-02 11:15:22] [NOTICE] kbha: node (ID: 2) rejoin success.
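To see how long each recovery stage took, the bracketed timestamps in hamgr.log can be compared. The sketch below is an assumption-laden helper, not a KingbaseES tool: `parse_entries` and `seconds_between` are hypothetical names, only lines in the `[timestamp] [LEVEL] message` form are parsed, and the result is meaningful only when both lines come from the same host or the hosts' clocks are synchronized.

```python
import re
from datetime import datetime

# Matches lines such as: [2023-02-02 11:14:35] [NOTICE] repmgrd ... starting up
LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.*)$")


def parse_entries(log_text: str) -> list:
    """Return (timestamp, level, message) tuples for every matching line."""
    entries = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            entries.append((stamp, match.group(2), match.group(3)))
    return entries


def seconds_between(entries: list, start_marker: str, end_marker: str) -> float:
    """Seconds from the first message containing start_marker to the first containing end_marker."""
    start = next(ts for ts, _, msg in entries if start_marker in msg)
    end = next(ts for ts, _, msg in entries if end_marker in msg)
    return (end - start).total_seconds()


SAMPLE = """
[2023-02-02 11:14:35] [NOTICE] repmgrd (repmgrd 5.0.0) starting up
[2023-02-02 11:14:59] [NOTICE] acquire the virtual ip 192.168.1.254/24 success on localhost
[2023-02-02 11:15:22] [NOTICE] kbha: node (ID: 2) rejoin success.
"""

entries = parse_entries(SAMPLE)
print(seconds_between(entries, "starting up", "acquire the virtual ip"))
print(seconds_between(entries, "starting up", "rejoin success"))
```

For the three log lines quoted from this case, the sketch reports 24 seconds from repmgrd start-up to VIP acquisition and 47 seconds to the standby's successful rejoin.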
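V. Summary
When the whole cluster loses power, automatic recovery on power-on brings the cluster back quickly and with less manual intervention. This matters most in unattended production environments where the machine-room power supply is unstable.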