KingbaseES V8R3 Cluster Operations Series: The Disk Check in network_rewind.sh Explained

Case description:

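In a KingbaseES V8R3 cluster, network_rewind.sh automatically recovers the database service when a node's database service goes down. Each time it runs, network_rewind.sh performs a read/write (R/W) check on the disk that holds the database storage (data). By default, if the read/write check fails, the script shuts down the database. In production, heavy disk I/O load can cause the check to report a false failure, which shuts down the database and disrupts applications. A parameter can be adjusted so that the database service is not shut down when the check fails.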

Applicable version:

KingbaseES V8R3

I. Cluster architecture

TEST=# show pool_nodes;
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.500000  | primary | 0          | false             | 0
 1       | 192.168.1.102 | 54321 | up     | 0.500000  | standby | 0          | true              | 0
(2 rows)

II. Testing the default disk check behavior

1. Simulate a disk check failure

[root@node102 db]# pwd

/home/kingbase/cluster/HAR3/db

[root@node102 db]# ls -lhd data

drwx------ 20 kingbase kingbase 4.0K Mar 17 10:20 data

[root@node102 db]# chown root.root data

[root@node102 db]# chmod 700 data

[root@node102 db]# ls -lhd data

drwx------ 20 root root 4.0K Mar 17 10:20 data

---As shown above, the database user kingbase no longer has read/write permission on the data directory that holds the database storage.

2. Check the node's recovery.log

Tips:

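By default in a KingbaseES V8R3 cluster, crond invokes the network_rewind.sh script once a minute to check the state of the node's database. Detailed execution information is written to recovery.log.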

2023-03-17 10:21:01 recover beging...

my pid is 9836,officially began to perform recovery

2023-03-17 10:21:01 check read/write on mount point

2023-03-17 10:21:01 check read/write on mount point (1 / 6).

2023-03-17 10:21:01 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied

could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it

could not execute "ls /home/kingbase/cluster/HAR3/db/data".

2023-03-17 10:21:01 failed to check read/write on mount point (1 / 6).

2023-03-17 10:21:11 check read/write on mount point (2 / 6).

2023-03-17 10:21:11 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied

could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it

could not execute "ls /home/kingbase/cluster/HAR3/db/data".

2023-03-17 10:21:11 failed to check read/write on mount point (2 / 6).

2023-03-17 10:21:21 check read/write on mount point (3 / 6).

2023-03-17 10:21:21 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied

could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it

could not execute "ls /home/kingbase/cluster/HAR3/db/data".

2023-03-17 10:21:21 failed to check read/write on mount point (3 / 6).

2023-03-17 10:21:31 check read/write on mount point (4 / 6).

2023-03-17 10:21:31 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied

could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it

could not execute "ls /home/kingbase/cluster/HAR3/db/data".

2023-03-17 10:21:31 failed to check read/write on mount point (4 / 6).

2023-03-17 10:21:41 check read/write on mount point (5 / 6).

2023-03-17 10:21:41 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied

could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it

could not execute "ls /home/kingbase/cluster/HAR3/db/data".

2023-03-17 10:21:41 failed to check read/write on mount point (5 / 6).

2023-03-17 10:21:51 check read/write on mount point (6 / 6).

2023-03-17 10:21:51 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied

could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it

could not execute "ls /home/kingbase/cluster/HAR3/db/data".

2023-03-17 10:21:51 failed to check read/write on mount point (6 / 6).

2023-03-17 10:22:01 execute check_mount_point() failed, maybe the disk is error

2023-03-17 10:22:01 USE_CHECK_DISK = on, will exit with stop db.

exit with error and stop db.....

sys_ctl: could not open PID file "/home/kingbase/cluster/HAR3/db/data/kingbase.pid": Permission denied

2023-03-17 10:22:01 now will del vip [192.168.1.204/24]

I'm already recovery now pid[9836], return nothing to do,will exit script will success

now, there is no 192.168.1.204/24 on my DEV

......

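---As shown above, the read/write check on the "/home/kingbase/cluster/HAR3/db/data" directory was attempted six times, 10 seconds apart, and failed every time. Because USE_CHECK_DISK = on, the script then exited with an error and stopped the database service.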
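
The log above suggests a retry loop: several probes of the data directory, spaced a fixed interval apart, before the check is declared failed. The following is a minimal sketch of that pattern, not the actual network_rewind.sh code. The function name check_rw_with_retry, its arguments, and the probe file name are assumptions made for illustration. The write test with touch is also an assumption, since the log only shows the ls step.

```shell
#!/bin/sh
# Hypothetical sketch; NOT the actual network_rewind.sh source.
# Usage: check_rw_with_retry DIR [ATTEMPTS] [INTERVAL_SECONDS]
# The body runs in a subshell, so its variables do not leak to the caller.
check_rw_with_retry() (
    dir=$1
    attempts=${2:-6}
    interval=${3:-10}
    probe="$dir/.rw_probe.$$"
    i=1
    while [ "$i" -le "$attempts" ]; do
        echo "check read/write on mount point ($i / $attempts)."
        # Read test: list the directory.
        # Write test: create, then remove, a small probe file.
        if ls "$dir" >/dev/null 2>&1 && touch "$probe" 2>/dev/null && rm -f "$probe"; then
            echo "check read/write on mount point ... ok"
            exit 0
        fi
        echo "failed to check read/write on mount point ($i / $attempts)."
        # Wait before the next attempt (skipped after the final one).
        if [ "$i" -lt "$attempts" ]; then sleep "$interval"; fi
        i=$((i + 1))
    done
    exit 1
)
```

For example, `check_rw_with_retry /home/kingbase/cluster/HAR3/db/data 6 10` would mirror the six attempts and 10-second spacing seen in the log. Retrying gives a busy disk several chances to respond before the check is treated as failed.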

Figure: the disk check fails and the database service is shut down.

III. Adjusting the disk check behavior

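Tips:

The parameter is described as "if failed in check_mount_point(), should stop the database? default is on, do stop db". With USE_CHECK_DISK=1 (the default), a failed check shuts down the database service; with USE_CHECK_DISK=0, the database service is left running.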

1. Configure the disk check parameter

[root@node102 db]# cat etc/HAmodule.conf |grep -i disk

USE_CHECK_DISK=0

---Add this parameter to HAmodule.conf on all nodes (the default configuration file does not contain it).
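
Because the parameter has to be added on every node, a small idempotent helper avoids duplicate entries if the step is repeated. The sketch below is hypothetical: set_conf_param is not a KingbaseES tool, and it assumes HAmodule.conf uses plain KEY=VALUE lines, as the grep output above suggests.

```shell
#!/bin/sh
# Hypothetical helper (not part of KingbaseES): set KEY=VALUE in a
# shell-style config file, replacing an existing assignment or
# appending one if the key is absent.
# Assumes KEY and VALUE contain no '|' or other sed metacharacters.
set_conf_param() (
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file"; then
        # Rewrite through a temporary file, then copy back so the
        # original file keeps its owner and permissions.
        tmp=$(mktemp) || exit 1
        sed "s|^${key}=.*|${key}=${value}|" "$file" >"$tmp" && cat "$tmp" >"$file"
        status=$?
        rm -f "$tmp"
        exit "$status"
    else
        printf '%s=%s\n' "$key" "$value" >>"$file"
    fi
)
```

Run from the db directory shown above, `set_conf_param etc/HAmodule.conf USE_CHECK_DISK 0` would produce the line that the grep command reports. It still has to be run on each node.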

2. Simulate a disk check failure

[root@node102 db]# pwd

/home/kingbase/cluster/HAR3/db

[root@node102 db]# ls -lhd data

drwx------ 20 kingbase kingbase 4.0K Mar 17 10:20 data

[root@node102 db]# chown root.root data

[root@node102 db]# chmod 700 data

[root@node102 db]# ls -lhd data

drwx------ 20 root root 4.0K Mar 17 10:20 data

---As shown above, the database user kingbase no longer has read/write permission on the data directory that holds the database storage.

3. Check the node's recovery.log

---------------------------------------------------------------------

2023-03-17 10:33:01 recover beging...

my pid is 16274,officially began to perform recovery

2023-03-17 10:33:01 check read/write on mount point

2023-03-17 10:33:01 check read/write on mount point (1 / 6).

2023-03-17 10:33:01 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

.......

2023-03-17 10:33:51 check read/write on mount point (6 / 6).

2023-03-17 10:33:51 stat the directory of the mount point "/home/kingbase/cluster/HAR3/db/data" ...

ls: cannot open directory /home/kingbase/cluster/HAR3/db/data: Permission denied

could not stat the mount point "/home/kingbase/cluster/HAR3/db/data", please check it

could not execute "ls /home/kingbase/cluster/HAR3/db/data".

2023-03-17 10:33:51 failed to check read/write on mount point (6 / 6).

2023-03-17 10:34:01 execute check_mount_point() failed, maybe the disk is error

2023-03-17 10:34:01 USE_CHECK_DISK = off, do nothing.

2023-03-17 10:34:01 check read/write on mount point ... ok

2023-03-17 10:34:01 check if the network is ok

I'm already recovery now pid[16274], return nothing to do,will exit script will success

ping trust ip 192.168.1.1 success ping times :[3], success times:[2]

determine if i am master or standby

........

Figure: the disk check fails, but the database is not shut down.
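
When comparing the two runs, it can help to pull only the relevant lines out of recovery.log. The following sketch is a hypothetical helper (summarize_disk_check is an assumed name) that relies only on the message texts visible in the log excerpts above. The location of recovery.log is not given in this article, so it is passed in as an argument.

```shell
#!/bin/sh
# Hypothetical helper: summarize the disk-check outcome recorded in a
# recovery.log file, based on the message texts shown in this article.
# Usage: summarize_disk_check /path/to/recovery.log
summarize_disk_check() (
    log=$1
    # grep -c prints 0 but returns non-zero when nothing matches.
    failed=$(grep -c 'failed to check read/write on mount point' "$log") || true
    echo "failed attempts: $failed"
    # Show what the script decided to do after the check, e.g.
    # "USE_CHECK_DISK = on, will exit with stop db." or
    # "USE_CHECK_DISK = off, do nothing."
    grep 'USE_CHECK_DISK = ' "$log" || echo "no USE_CHECK_DISK decision logged"
)
```

A line containing "USE_CHECK_DISK = off, do nothing." confirms that the parameter change took effect and the failed check did not stop the database.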

IV. Summary

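The disk check helps protect the data of the cluster database, but in some production environments heavy disk I/O load can cause it to report a false failure. Adjust the "USE_CHECK_DISK" parameter to suit the production workload, so that both cluster high availability and data safety are preserved.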
