KingbaseES V8R3 Cluster Operations Series: The Disk Check in network_rewind.sh Explained

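Case description:

In a KingbaseES V8R3 cluster, network_rewind.sh automatically recovers the database service when a node's database service goes down. Each time it runs, network_rewind.sh performs a read/write check on the disk that holds the database storage (data) directory; by default, if the check fails, it shuts the database down. In production, heavy disk I/O can trigger a false positive, which shuts the database down and disrupts applications. A parameter can be adjusted so that a failed check does not shut down the database service.

Applicable version:

KingbaseES V8R3

I. Cluster architecture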

TEST=# show pool_nodes;
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.500000  | primary | 0          | false             | 0
 1       | 192.168.1.102 | 54321 | up     | 0.500000  | standby | 0          | true              | 0
(2 rows)

II. Testing the default disk check

1. Simulate a disk check failure

[root@node102 db]# pwd

/home/kingbase/cluster/HAR3/db

[root@node102 db]# ls -lhd data

drwx------ 20 kingbase kingbase 4.0K Mar 17 10:20 data

[root@node102 db]# chown root.root data

[root@node102 db]# chmod 700 data

[root@node102 db]# ls -lhd data

drwx------ 20 root root 4.0K Mar 17 10:20 data

--- As shown above, the database user kingbase no longer has read or write permission on the data directory.
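Before and after a test like this, it helps to confirm who owns the data directory. The following is a minimal sketch, not part of KingbaseES: `dir_owned_by` is a hypothetical helper name, and it assumes GNU `stat` (the `-c '%U'` format) as found on typical Linux systems.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with KingbaseES): succeed only if DIR is
# owned by the expected user. Returns 2 if DIR cannot be examined at all.
dir_owned_by() {
    dir=$1
    expected=$2
    # GNU stat: %U prints the owner's user name.
    owner=$(stat -c '%U' "$dir" 2>/dev/null) || return 2
    [ "$owner" = "$expected" ]
}

# Usage on the test node (path and user taken from the session above):
#   dir_owned_by /home/kingbase/cluster/HAR3/db/data kingbase || echo "unexpected owner"
# To undo the simulated fault afterwards, run as root:
#   chown kingbase:kingbase /home/kingbase/cluster/HAR3/db/data
```

The restore command simply puts back the kingbase:kingbase ownership shown in the first `ls -lhd` output.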

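2. Check the node's recovery.log

Tips:

By default in a KingbaseES V8R3 cluster, crond invokes the network_rewind.sh script once per minute to check the state of the node's database. The details of each run are written to recovery.log.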

2023-03-17 10:21:01 recover beging...

my pid is 9836,officially began to perform recovery

2023-03-17 10:21:01 check read/write on mount point

2023-03-17 10:21:01 check read/write on mount point (1 / 6).

2023-03-17 10:21:01 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied

could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it

could not execute "ls /home/kingbase/cluster/HAR3/db/data".

2023-03-17 10:21:01 failed to check read/write on mount point (1 / 6).

2023-03-17 10:21:11 check read/write on mount point (2 / 6).

2023-03-17 10:21:11 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied

could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it

could not execute "ls /home/kingbase/cluster/HAR3/db/data".

2023-03-17 10:21:11 failed to check read/write on mount point (2 / 6).

2023-03-17 10:21:21 check read/write on mount point (3 / 6).

2023-03-17 10:21:21 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied

could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it

could not execute "ls /home/kingbase/cluster/HAR3/db/data".

2023-03-17 10:21:21 failed to check read/write on mount point (3 / 6).

2023-03-17 10:21:31 check read/write on mount point (4 / 6).

2023-03-17 10:21:31 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied

could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it

could not execute "ls /home/kingbase/cluster/HAR3/db/data".

2023-03-17 10:21:31 failed to check read/write on mount point (4 / 6).

2023-03-17 10:21:41 check read/write on mount point (5 / 6).

2023-03-17 10:21:41 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied

could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it

could not execute "ls /home/kingbase/cluster/HAR3/db/data".

2023-03-17 10:21:41 failed to check read/write on mount point (5 / 6).

2023-03-17 10:21:51 check read/write on mount point (6 / 6).

2023-03-17 10:21:51 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied

could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it

could not execute "ls /home/kingbase/cluster/HAR3/db/data".

2023-03-17 10:21:51 failed to check read/write on mount point (6 / 6).

2023-03-17 10:22:01 execute check_mount_point() failed, maybe the disk is error

2023-03-17 10:22:01 USE_CHECK_DISK = on, will exit with stop db.

exit with error and stop db.....

sys_ctl: could not open PID file "/home/kingbase/cluster/HAR3/db/data/kingbase.pid": Permission denied

2023-03-17 10:22:01 now will del vip [192.168.1.204/24]

I'm already recovery now pid[9836], return nothing to do,will exit script will success

now, there is no 192.168.1.204/24 on my DEV

......

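--- As shown above, the read/write check ran against the "/home/kingbase/cluster/HAR3/db/data" directory and failed on all six attempts, made 10 seconds apart. Because USE_CHECK_DISK was on, the script then stopped the database and removed the VIP.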
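The retry pattern visible in the log can be sketched as below. This is an illustrative reconstruction from the log output only, not the actual check_mount_point() code in network_rewind.sh: the function name `check_mount_point_sketch` and its parameters are assumptions, and the log excerpt shows only the `ls` step, so any write test the real script may perform is not modelled here.

```shell
#!/bin/sh
# Illustrative reconstruction only: NOT the real check_mount_point() from
# network_rewind.sh. It mirrors what the log shows: list the directory,
# retry a fixed number of times, and wait between attempts.
check_mount_point_sketch() {
    dir=$1
    attempts=${2:-6}     # the log shows 6 attempts
    interval=${3:-10}    # the log timestamps are 10 seconds apart
    i=1
    while [ "$i" -le "$attempts" ]; do
        echo "check read/write on mount point ($i / $attempts)."
        if ls "$dir" >/dev/null 2>&1; then
            return 0     # directory can be listed, stop retrying
        fi
        echo "failed to check read/write on mount point ($i / $attempts)."
        sleep "$interval"
        i=$((i + 1))
    done
    return 1             # every attempt failed
}

# Usage (as the database user, so that permission problems are detected):
#   check_mount_point_sketch /home/kingbase/cluster/HAR3/db/data || echo "disk check failed"
```

With the defaults, a persistent failure takes about one minute to be declared, which matches the 10:21:01 to 10:22:01 span in the log.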

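As shown in the figure below, the disk check failed and the database service was shut down.

III. Adjusting the disk check

Tips:

The parameter is described as "if failed in check_mount_point(), should stop the database? default is on, do stop db". With USE_CHECK_DISK=1 (the default), a failed check shuts down the database service; with USE_CHECK_DISK=0, the database service is left running.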
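The effect of the parameter can be modelled with a few lines of shell. This is a sketch of the behaviour described above, not code from network_rewind.sh: `on_disk_check_failure` is a hypothetical name, and the two messages are copied from the recovery.log excerpts in this article.

```shell
#!/bin/sh
# Illustrative only: models the USE_CHECK_DISK decision described above.
# Returns 0 if recovery may continue, 1 if the database would be stopped.
on_disk_check_failure() {
    use_check_disk=${1:-1}   # 1 is the documented default
    if [ "$use_check_disk" = "0" ]; then
        echo "USE_CHECK_DISK = off, do nothing."
        return 0
    fi
    echo "USE_CHECK_DISK = on, will exit with stop db."
    return 1
}

# Usage:
#   on_disk_check_failure 0   # prints the "off" message, returns 0
#   on_disk_check_failure     # default: prints the "on" message, returns 1
```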

1. Configure the disk check parameter

[root@node102 db]# cat etc/HAmodule.conf |grep -i disk

USE_CHECK_DISK=0

--- Add this parameter to HAmodule.conf on every node (the default configuration file does not contain it).
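Since the parameter is absent by default and has to be added on every node, an idempotent edit avoids duplicate lines when the step is repeated. The following is a minimal sketch: `set_conf_param` is a hypothetical helper, and it assumes HAmodule.conf uses plain KEY=VALUE lines as in the `grep` output above.

```shell
#!/bin/sh
# Hypothetical helper: set KEY=VALUE in a KEY=VALUE style config file,
# replacing an existing line for KEY or appending one if it is absent.
set_conf_param() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file"; then
        # Rewrite through a temp file; copying back with cat keeps the
        # original file's owner and permissions.
        tmp=$(mktemp) || return 1
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp" && cat "$tmp" > "$file"
        rc=$?
        rm -f "$tmp"
        return "$rc"
    fi
    printf '%s=%s\n' "$key" "$value" >> "$file"
}

# Usage, from /home/kingbase/cluster/HAR3/db on each node:
#   set_conf_param etc/HAmodule.conf USE_CHECK_DISK 0
#   grep -i disk etc/HAmodule.conf
```

The helper is meant for simple keys and values; values containing `|` or regular-expression metacharacters would need escaping.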

2. Simulate a disk check failure

[root@node102 db]# pwd

/home/kingbase/cluster/HAR3/db

[root@node102 db]# ls -lhd data

drwx------ 20 kingbase kingbase 4.0K Mar 17 10:20 data

[root@node102 db]# chown root.root data

[root@node102 db]# chmod 700 data

[root@node102 db]# ls -lhd data

drwx------ 20 root root 4.0K Mar 17 10:20 data

--- As shown above, the database user kingbase no longer has read or write permission on the data directory.

3. Check the node's recovery.log

---------------------------------------------------------------------

2023-03-17 10:33:01 recover beging...

my pid is 16274,officially began to perform recovery

2023-03-17 10:33:01 check read/write on mount point

2023-03-17 10:33:01 check read/write on mount point (1 / 6).

2023-03-17 10:33:01 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

.......

2023-03-17 10:33:51 check read/write on mount point (6 / 6).

2023-03-17 10:33:51 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied

could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it

could not execute "ls /home/kingbase/cluster/HAR3/db/data".

2023-03-17 10:33:51 failed to check read/write on mount point (6 / 6).

2023-03-17 10:34:01 execute check_mount_point() failed, maybe the disk is error

2023-03-17 10:34:01 USE_CHECK_DISK = off, do nothing.

2023-03-17 10:34:01 check read/write on mount point ... ok

2023-03-17 10:34:01 check if the network is ok

I'm already recovery now pid[16274], return nothing to do,will exit script will success

ping trust ip 192.168.1.1 success ping times :[3], success times:[2]

determine if i am master or standby

........
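To see at a glance how a node has been reacting to disk check failures, the log can be summarized with `grep`. This is a sketch under stated assumptions: `summarize_disk_check` is a hypothetical helper, it matches only message texts that appear in the excerpts in this article, and the location of recovery.log is passed in because it depends on the installation.

```shell
#!/bin/sh
# Hypothetical helper: summarize disk check results found in a recovery.log.
# It relies only on message texts that appear in the log excerpts above.
summarize_disk_check() {
    log=$1
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
    # keeps that from being treated as an error.
    failed=$(grep -c 'failed to check read/write on mount point' "$log" || true)
    stops=$(grep -c 'USE_CHECK_DISK = on, will exit with stop db' "$log" || true)
    ignored=$(grep -c 'USE_CHECK_DISK = off, do nothing' "$log" || true)
    echo "failed_attempts=$failed db_stops=$stops ignored_failures=$ignored"
}

# Usage (the path is a placeholder for wherever recovery.log is written):
#   summarize_disk_check /path/to/recovery.log
```

A non-zero `ignored_failures` with USE_CHECK_DISK=0 means the check is failing but the database is being left running, which is worth investigating even though the service stays up.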

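As shown in the figure below, the disk check failed but did not trigger a database shutdown.

IV. Summary

The disk check helps protect the data in a clustered database, but in some production environments heavy disk I/O can cause false positives. Adjust the USE_CHECK_DISK parameter to suit the production workload so that both the cluster's high availability and the safety of its data are preserved.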
