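KingbaseES V8R3 Cluster Operations Case: Recovering a Standby Node with sys_rewind
Case description:
After a failover in a KingbaseES V8R3 cluster, the former primary may be started by mistake without a recovery.conf, or a standby may be manually promoted to primary. When that node then has to rejoin the cluster, its timeline has diverged from the primary's, so joining the cluster directly fails. The sys_rewind tool can restore the node as a new standby and bring it back into the cluster.
Applicable version: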
KingbaseES V8R3
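Case 1: After failover, the former primary's database service is started without a recovery.conf
After the cluster failed over, the database service on the former primary was started without first creating a recovery.conf file in its data directory. A recovery.conf was created afterwards, but starting the former primary as a standby to join streaming replication then failed because its timeline no longer matched the new primary's. sys_rewind is used to bring the former primary back into the cluster. The failure symptoms are as follows:
1. Check the sys_log log on the former primary
As shown below, the standby database service fails to start because its timeline is inconsistent with the new primary's.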
[kingbase@node1 sys_log]$ tail -100f kingbase-2021-03-01_132943.log
LOG: database system was shut down in recovery at 2021-03-01 13:29:42 CST
LOG: entering standby mode
FATAL: requested timeline 2 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/12000028 on timeline 1, but in the history of the requested timeline, the server forked off from that timeline at 0/11000098.
LOG: startup process (PID 10918) exited with exit code 1
LOG: aborting startup due to startup process failure
LOG: database system is shut down
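---At startup, the database reads all X.history files in the sys_wal directory and uses the one with the highest timeline to judge whether the checkpoint recorded in the control file is reasonable. Each X.history file records the LSN at which every timeline ended. If the checkpoint LSN is A, the end LSN recorded in the history file for the corresponding timeline is B, and A > B, the checkpoint is considered invalid and this error is raised. In the log above, the latest checkpoint is at 0/12000028 on timeline 1, while timeline 2 forked off at 0/11000098, so A > B.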
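The comparison described above can be sketched in Python. This is only an illustration of the rule, not the KingbaseES implementation: `parse_lsn` and `checkpoint_is_valid` are hypothetical helper names, and the two LSN values are copied from the DETAIL line of the log excerpt.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'hi/lo' (two hex numbers) to one integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def checkpoint_is_valid(checkpoint_lsn: str, fork_lsn: str) -> bool:
    """Hypothetical check: a checkpoint on the old timeline is acceptable
    only if it does not lie beyond the point where the new timeline forked."""
    return parse_lsn(checkpoint_lsn) <= parse_lsn(fork_lsn)


# Values taken from the DETAIL line in the log above.
checkpoint = "0/12000028"  # latest checkpoint on timeline 1 (A)
fork_point = "0/11000098"  # where timeline 2 forked off timeline 1 (B)

result = checkpoint_is_valid(checkpoint, fork_point)
print("checkpoint valid:", result)
```

Because A is greater than B here, the check fails, which matches the startup failure in the log.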
2. Run sys_rewind on the former primary to rejoin the cluster
[kingbase@node1 bin]$ ./sys_rewind -D /home/kingbase/cluster/kha/db/data --source-server='host=192.168.7.243 port=54321 user=system dbname=PROD' -P -n
connected to server
datadir_source = /home/kingbase/cluster/kha/db/data
rewinding from last common checkpoint at 0/10000028 on timeline 1
find last common checkpoint start time from 2021-03-01 13:36:24.675116 CST to 2021-03-01 13:36:24.702727 CST, in "0.027611" seconds.
reading source file list
reading target file list
reading WAL in target
need to copy 298 MB (total source directory size is 363 MB)
Rewind datadir file from source
Get archive xlog list from source
Rewind archive log from source
59462/305222 kB (19%) copied
creating backup label and updating control file
syncing target data directory
rewind start wal location 0/10000028 (file 000000010000000000000010), end wal location 0/11070290 (file 000000020000000000000011). time from 2021-03-01 13:36:25.675116 CST to 2021-03-01 13:36:25.379386 CST, in "0.704270" seconds.
Done!
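For scripted recovery, the command line above could be assembled programmatically. The following is a minimal sketch that only builds and prints the command and does not execute anything. `build_rewind_command` is a hypothetical helper, and it uses only the options that appear in the transcript above (-D, --source-server, -P, -n), whose exact semantics should be confirmed against the sys_rewind help output for your version.

```python
import shlex


def build_rewind_command(data_dir, host, port, user, dbname):
    """Assemble the sys_rewind argument list from the options shown
    in the transcript above. Nothing is executed here."""
    conninfo = f"host={host} port={port} user={user} dbname={dbname}"
    return [
        "./sys_rewind",
        "-D", data_dir,
        f"--source-server={conninfo}",
        "-P",
        "-n",
    ]


cmd = build_rewind_command(
    data_dir="/home/kingbase/cluster/kha/db/data",
    host="192.168.7.243",
    port=54321,
    user="system",
    dbname="PROD",
)

# shlex.join quotes the connection string so the line is safe to paste into a shell.
print(shlex.join(cmd))
```

Keeping the arguments as a list avoids quoting mistakes around the connection string, which contains spaces.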
3. Start the new standby database service
[kingbase@node1 bin]$ ./sys_ctl start -D ../data
server starting
......
[kingbase@node1 bin]$ ps -ef|grep kingbase
.......
kingbase 11983 13899 0 13:31 pts/1 00:00:00 tail -100f kingbase-2021-03-01_132943.log
kingbase 13611 1 0 13:36 pts/0 00:00:00 /home/kingbase/cluster/kha/db/bin/kingbase -D ../data
kingbase 13614 13611 0 13:36 ? 00:00:00 kingbase: logger process
kingbase 13615 13611 0 13:36 ? 00:00:00 kingbase: startup process recovering 000000020000000000000011
kingbase 13619 13611 0 13:36 ? 00:00:00 kingbase: checkpointer process
kingbase 13620 13611 0 13:36 ? 00:00:00 kingbase: writer process
kingbase 13621 13611 0 13:36 ? 00:00:00 kingbase: wal receiver process streaming 0/11070EB8
kingbase 13622 13611 0 13:36 ? 00:00:00 kingbase: stats collector process
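The WAL file name shown for the startup process indicates which timeline the node is now replaying. The sketch below decodes such a name, assuming that KingbaseES V8R3 follows the PostgreSQL naming convention (8 hex digits each for timeline, log ID, and segment) and the default 16 MB segment size. Both are assumptions, and `decode_wal_name` and `format_lsn` are hypothetical helpers.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size (16 MB)


def decode_wal_name(name: str):
    """Split a 24-hex-digit WAL file name into (timeline, start LSN as int)."""
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL file name")
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    start = (log_id << 32) + segment * WAL_SEGMENT_SIZE
    return timeline, start


def format_lsn(value: int) -> str:
    """Render an integer LSN in the usual 'hi/lo' hex form."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


# File name from the 'startup process recovering ...' line above.
timeline, start = decode_wal_name("000000020000000000000011")
print("timeline:", timeline, "segment starts at:", format_lsn(start))
```

Under these assumptions, the node is replaying a segment on timeline 2, and the position reported by the wal receiver process (0/11070EB8) falls inside that segment.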
4. Check cluster node status
prod=# select * from sys_stat_replication ;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+--
27104 | 10 | system | node248 | 192.168.7.248 | | 22355 | 2021-03-01 13:19:11.376063+08 | | streaming | 0/11070FD0 | 0/11070FD0 | 0/11070FD0 | 0/11070FD0 | 0 | async
(1 row)
prod=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.7.243 | 54321 | up | 0.500000 | primary | 1 | true | 0
1 | 192.168.7.248 | 54321 | up | 0.500000 | standby | 0 | false | 0
(2 rows)
---As shown above, the node has been restored as a new standby, and its status is normal after joining the cluster.
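The same check can be automated by parsing the `show pool_nodes` output. The sketch below works on the text captured above. `parse_pool_nodes` and `cluster_is_healthy` are hypothetical helpers, and the health rule (every node `up`, exactly one `primary`) is an assumption made for illustration.

```python
POOL_NODES_OUTPUT = """\
 node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0 | 192.168.7.243 | 54321 | up | 0.500000 | primary | 1 | true | 0
 1 | 192.168.7.248 | 54321 | up | 0.500000 | standby | 0 | false | 0
(2 rows)
"""


def parse_pool_nodes(text):
    """Turn the pipe-delimited table into a list of dicts keyed by column name."""
    # The separator line uses '+' and the row count has no '|', so both are skipped.
    lines = [ln for ln in text.splitlines() if "|" in ln]
    header = [col.strip() for col in lines[0].split("|")]
    return [
        dict(zip(header, (col.strip() for col in ln.split("|"))))
        for ln in lines[1:]
    ]


def cluster_is_healthy(nodes):
    """Assumed rule: all nodes are up and there is exactly one primary."""
    roles = [n["role"] for n in nodes]
    return all(n["status"] == "up" for n in nodes) and roles.count("primary") == 1


nodes = parse_pool_nodes(POOL_NODES_OUTPUT)
print("healthy:", cluster_is_healthy(nodes))
```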
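Case 2: A standby promoted to a new primary rejoins the cluster
A standby node was promoted and then used for testing. After the tests, recovery.conf was configured in its data directory and the database service was started, which produced a timeline mismatch. The failure symptoms are as follows:
1. Promote the standby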
[kingbase@node102 bin]$ ./sys_ctl promote -D ../data
server promoting
2. Configure recovery.conf and start the database
1) Configure recovery.conf
[kingbase@node102 bin]$ cat ../data/recovery.conf
standby_mode='on'
primary_conninfo='port=54321 host=192.168.1.101 user=SYSTEM password=MTIzNDU2Cg== application_name=node2'
recovery_target_timeline='latest'
primary_slot_name ='slot_node2'
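Before starting the service, it can help to confirm that the file contains the settings a standby needs. The sketch below parses the recovery.conf content shown above. `parse_recovery_conf` is a hypothetical helper, and the list of required keys is an assumption based on this example, not an official requirement.

```python
import re

RECOVERY_CONF = """\
standby_mode='on'
primary_conninfo='port=54321 host=192.168.1.101 user=SYSTEM password=MTIzNDU2Cg== application_name=node2'
recovery_target_timeline='latest'
primary_slot_name ='slot_node2'
"""

# key = 'value', with optional spaces around the equals sign
LINE_RE = re.compile(r"^\s*(\w+)\s*=\s*'(.*)'\s*$")


def parse_recovery_conf(text):
    """Return the settings as a dict, skipping blank lines and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


settings = parse_recovery_conf(RECOVERY_CONF)

# Keys assumed necessary for a streaming standby in this example.
required = ("standby_mode", "primary_conninfo", "recovery_target_timeline")
missing = [key for key in required if key not in settings]
print("missing settings:", missing)
```

A complete file is not enough on its own: as this case shows, the node still cannot join the cluster while its timeline is ahead of the primary's.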
2) Start the database service
Tips:
The standby database service starts normally, but the node cannot join the cluster.
kingbase 19814 1 0 15:14 pts/0 00:00:00 /home/kingbase/cluster/HAR3/db/bin/kingbase -D ../data
kingbase 19815 19814 0 15:14 ? 00:00:00 kingbase: logger process
kingbase 19816 19814 0 15:14 ? 00:00:00 kingbase: startup process recovering 0000000600000000000000D3
kingbase 19820 19814 0 15:14 ? 00:00:00 kingbase: checkpointer process
kingbase 19821 19814 0 15:14 ? 00:00:00 kingbase: writer process
kingbase 19822 19814 0 15:14 ? 00:00:00 kingbase: stats collector process
3. Check the sys_log log
2023-02-22 15:15:04.317 CST,,,20264,,63f5c0f8.4f28,1,,2023-02-22 15:15:04 CST,,0,FATAL,XX000,"highest timeline 5 of the primary is behind recovery timeline 6",,,,,,,,,""
2023-02-22 15:15:09.328 CST,,,20281,,63f5c0fd.4f39,1,,2023-02-22 15:15:09 CST,,0,FATAL,XX000,"highest timeline 5 of the primary is behind recovery timeline 6",,,,,,,,,""
2023-02-22 15:15:14.376 CST,,,20315,,63f5c102.4f5b,1,,2023-02-22 15:15:14 CST,,0,FATAL,XX000,"highest timeline 5 of the primary is behind recovery timeline 6",,,,,,,,,""
---As shown above, the timelines of the standby and the primary are inconsistent.
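When scanning sys_log for this condition, the two timeline numbers can be pulled out of the FATAL message with a regular expression. This is a minimal sketch: `timeline_mismatch` is a hypothetical helper, and the pattern only matches the message text shown above.

```python
import re

PATTERN = re.compile(
    r"highest timeline (\d+) of the primary is behind recovery timeline (\d+)"
)


def timeline_mismatch(log_line):
    """Return (primary_timeline, recovery_timeline) if the line reports
    a timeline mismatch, otherwise None."""
    match = PATTERN.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# First FATAL line from the log above.
line = (
    '2023-02-22 15:15:04.317 CST,,,20264,,63f5c0f8.4f28,1,,'
    '2023-02-22 15:15:04 CST,,0,FATAL,XX000,'
    '"highest timeline 5 of the primary is behind recovery timeline 6",,,,,,,,,""'
)

found = timeline_mismatch(line)
print(found)
```

Here the primary is still on timeline 5 while the promoted node has moved to timeline 6, which is why it cannot follow the primary until it is rewound.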
4. Check the streaming replication status
prod=# select * from sys_stat_replication ;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+--
5. Run sys_rewind on the standby to recover it
1) Run sys_rewind
[kingbase@node102 bin]$ ./sys_rewind -D /home/kingbase/cluster/HAR3/db/data --source-server='host=192.168.1.101 port=54321 user=SYSTEM password=123456 dbname=PROD' -P -n
connected to server
datadir_source = /home/kingbase/cluster/HAR3/db/data
rewinding from last common checkpoint at 0/D1000108 on timeline 5
find last common checkpoint start time from 2023-02-22 15:18:54.216014 CST to 2023-02-22 15:18:54.233126 CST, in "0.017112" seconds.
reading source file list
reading target file list
reading WAL in target
need to copy 91 MB (total source directory size is 580 MB)
Rewind datadir file from source
Get archive xlog list from source
Rewind archive log from source
44827/93979 kB (47%) copied
creating backup label and updating control file
syncing target data directory
rewind start wal location 0/D10000D0 (file 0000000500000000000000D1), end wal location 0/D103EAA0 (file 0000000500000000000000D1). time from 2023-02-22 15:18:54.216014 CST to 2023-02-22 15:18:54.543176 CST, in "0.327162" seconds.
Done!
2) Start the standby database service
[kingbase@node102 bin]$ ./sys_ctl start -D ../data
server starting
3) Check cluster node status
TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | standby | 0 | false | 0
(2 rows)
4) Check the streaming replication status
TEST=# select * from sys_stat_replication;
PID | USESYSID | USENAME | APPLICATION_NAME | CLIENT_ADDR | CLIENT_HOSTNAME | CLIENT_PORT | BACKEND_START | BACKEND_XMIN | STATE | SENT_LOCATION | WRITE_LOCATION | FLUSH_LOCATION | REPLAY_LOCATION | SYNC_PRIORITY | SYNC_STATE
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
13034 | 10 | SYSTEM | node2 | 192.168.1.102 | | 27047 | 2023-02-22 15:19:07.660862+08 | | streaming | 0/D103F700 | 0/D103F700 | 0/D103F700 | 0/D103F700 | 2 | sync
(1 row)
---As shown above, after recovery the standby has rejoined the cluster.