Case description:

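In a KingbaseES V8R3 primary/standby streaming replication cluster, two of the nodes normally serve as the cluster's management nodes, designated master and standby. For example, a one-primary/two-standby architecture has two management nodes and three data nodes. The management nodes run the kingbasecluster service, which monitors the state of the cluster nodes and performs operations such as primary/standby switchover.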

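This case describes in detail how to add a new management node online in a one-primary/one-standby architecture after one of the management nodes has gone down. If the failed node was the primary of the streaming replication pair, a primary/standby switchover happens automatically, so the node added online is always the standby management node.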

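Adding a management node differs from adding an ordinary data node online and is considerably more involved. For adding data nodes, see: https://www.cnblogs.com/tiany1224/p/15749993.html (KingbaseES V8R3 Cluster Maintenance Case: Adding a Data Node Online)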

Applicable versions:

KingbaseES V8R3 on general-purpose machines (dedicated machines can use this as a reference)

Database version used in this case:

TEST=# select version();
VERSION
-------------------------------------------------------------------------------------------------------------------------
Kingbase V008R003C002B0290 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
(1 row)

I. Original cluster node information

[kingbase@node101 bin]$ ./ksql -U SYSTEM -W 123456 TEST -p 9999
ksql (V008R003C002B0290)
Type "help" for help.

TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 0 | true | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | standby | 0 | false | 0
(2 rows)

TEST=# select * from sys_stat_replication;
PID | USESYSID | USENAME | APPLICATION_NAME | CLIENT_ADDR | CLIENT_HOSTNAME | CLIENT_PORT | BACKEND_START | BACKEND_XMIN | STATE | SENT_LOCATION | WRITE_LOCATION | FLUSH_LOCATION | REPLAY_LOCATION | SYNC_PRIORITY | SYNC_STATE
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
25143 | 10 | SYSTEM | node102 | 192.168.1.102 | | 61622 | 2022-06-22 10:22:48.771995+08 | | streaming | 0/1A0000D0 | 0/1A0000D0 | 0/1A0000D0 | 0/1A0000D0 | 2 | sync
(1 row)

II. The standby database host goes down

TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 3 | true | 0
1 | 192.168.1.102 | 54321 | down | 0.500000 | standby | 0 | false | 0
(2 rows)
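When a node goes down, it is the status column of show pool_nodes that changes. The bash function below is a minimal sketch (not a KingbaseES tool; the name down_nodes is hypothetical) that scans that output, prints the hostname of every node whose status is not up, and returns non-zero in that case. It assumes the pipe-separated column order shown above.

```shell
#!/usr/bin/env bash
# Print the hostname of every node whose status column is not "up" and return
# non-zero if any such node exists. Reads `show pool_nodes` output on stdin.
down_nodes() {
  awk -F'|' '
    # Data rows begin with a numeric node_id; header, separator and "(N rows)" lines do not.
    $1 ~ /^[[:space:]]*[0-9]+[[:space:]]*$/ {
      host = $2;   gsub(/[[:space:]]/, "", host)
      status = $4; gsub(/[[:space:]]/, "", status)
      if (status != "up") { print host; bad = 1 }
    }
    END { exit (bad ? 1 : 0) }
  '
}
```

For example, ./ksql -U SYSTEM -W 123456 TEST -p 9999 -c 'show pool_nodes;' | down_nodes would print 192.168.1.102 for the state above, assuming ksql accepts psql's -c option.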

III. Prepare the new host environment

Tips:

1) For system environment preparation, refer to the KingbaseES official documentation:

https://help.kingbase.com.cn/stage-api/profile/document/kes/v8r3/html/highly/clusterware/cluster-manage.html

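2) On general-purpose machines the database software does not need to be installed; on dedicated machines, install the same software version as on the primary.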

3) Create the same cluster directory structure as on the primary.

1. Cluster directory structure on the primary:

[kingbase@node101 R3HA]$ pwd
/home/kingbase/cluster/R3HA
[kingbase@node101 R3HA]$ ls -lh
total 33M
drwxrwxr-x 2 kingbase kingbase 4.0K Jun 21 19:04 archivedir
drwxrwxr-x 8 kingbase kingbase 95 Jun 13 19:28 db
-rw-r--r-- 1 kingbase root 29M Mar 29 15:10 db.zip
drwxrwxr-x 6 kingbase kingbase 48 Apr 1 2021 kingbasecluster
-rwxr-xr-x 1 kingbase root 4.3M Mar 29 15:10 kingbasecluster.zip
drwxr-xr-x 3 kingbase root 4.0K Apr 12 15:08 log
drwxr-xr-x 3 kingbase root 28 Mar 29 15:11 run
-rw------- 1 kingbase kingbase 8.0K Jun 22 10:42 template.bk

2. On the standby, create the same directory structure as on the primary

[kingbase@node102 cluster]$ mkdir -p /home/kingbase/cluster/R3HA/
[kingbase@node102 cluster]$ cd R3HA
[kingbase@node102 R3HA]$ mkdir archivedir db kingbasecluster log run
[kingbase@node102 R3HA]$ ls -lh
total 0
drwxrwxr-x 2 kingbase kingbase 6 Jun 22 10:30 archivedir
drwxrwxr-x 2 kingbase kingbase 6 Jun 22 10:30 db
drwxrwxr-x 2 kingbase kingbase 6 Jun 22 10:30 kingbasecluster
drwxrwxr-x 2 kingbase kingbase 6 Jun 22 10:30 log
drwxrwxr-x 2 kingbase kingbase 6 Jun 22 10:30 run

3. Copy the contents of the db directory from the primary to the standby (the primary keeps running; exclude the data directory)

[kingbase@node101 db]$ scp -r bin node102:/home/kingbase/cluster/R3HA/db
[kingbase@node101 db]$ scp -r etc node102:/home/kingbase/cluster/R3HA/db
[kingbase@node101 db]$ scp -r kb_scripts node102:/home/kingbase/cluster/R3HA/db
[kingbase@node101 db]$ scp -r lib node102:/home/kingbase/cluster/R3HA/db
[kingbase@node101 db]$ scp -r share node102:/home/kingbase/cluster/R3HA/db
[kingbase@node101 db]$ scp -r kingbase.log node102:/home/kingbase/cluster/R3HA/db

4. Copy everything under the kingbasecluster directory from the primary to the standby

[kingbase@node101 kingbasecluster]$ scp -r * node102:/home/kingbase/cluster/R3HA/kingbasecluster/

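5. Copy everything under the log directory from the primary to the standby (the kingbasecluster.pid file does not need to be copied; since scp -r * picks it up as well, delete it from the standby's log directory afterwards if present):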

[kingbase@node101 log]$ scp -r * node102:/home/kingbase/cluster/R3HA/log/

6. Copy everything under the run directory from the primary to the standby:

[kingbase@node101 run]$ scp -r * node102:/home/kingbase/cluster/R3HA/run

7. Copy the template.bk file from the primary to the standby

[kingbase@node101 R3HA]$ scp template.bk node102:/home/kingbase/cluster/R3HA
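Steps 3 to 7 can also be done in a single pass. The sketch below assumes GNU tar and uses a hypothetical helper name, copy_cluster_tree. It copies the R3HA tree while leaving out db/data (cloned later by sys_basebackup), kingbasecluster.pid, the two .zip packages and the archived WAL files, which matches what the individual scp commands above leave behind.

```shell
#!/usr/bin/env bash
# Copy a cluster tree from SRC to DEST with GNU tar, skipping items that must
# not be cloned from the primary. Exclusion patterns are unanchored wildcards.
copy_cluster_tree() {
  local src=$1 dest=$2
  mkdir -p "$dest"
  tar -C "$src" \
      --exclude='db/data' \
      --exclude='kingbasecluster.pid' \
      --exclude='*.zip' \
      --exclude='archivedir/*' \
      -cf - . | tar -C "$dest" -xpf -
}
```

To copy to the new host instead of a local path, the same archive can be piped through ssh, e.g. tar -C /home/kingbase/cluster/R3HA --exclude='db/data' --exclude='kingbasecluster.pid' --exclude='*.zip' --exclude='archivedir/*' -cf - . | ssh node102 'tar -C /home/kingbase/cluster/R3HA -xpf -'.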

IV. Set up primary/standby streaming replication

1. Run sys_basebackup to clone the standby

[kingbase@node102 bin]$ ./sys_basebackup -h 192.168.1.101 -U SYSTEM -W 123456  -p 54321 -F p -x -v -P -D /home/kingbase/cluster/R3HA/db/data
transaction log start point: 0/1B000028 on timeline 7
108616/108616 kB (100%), 1/1 tablespace
transaction log end point: 0/1B0000F8
sys_basebackup: base backup completed

2. Set the permissions of the standby's data directory

[kingbase@node102 db]$ chmod 700 data
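The chmod 700 above matters because PostgreSQL-derived servers typically refuse to start when the data directory is accessible to group or others. Assuming KingbaseES V8R3 behaves the same way, a quick check like the following (hypothetical helper, GNU stat) catches the problem before sys_ctl start is run.

```shell
#!/usr/bin/env bash
# Return non-zero (with a message on stderr) unless DIR has mode 700.
check_data_dir_mode() {
  local dir=$1 mode
  mode=$(stat -c '%a' "$dir") || return 1   # GNU stat: octal permission bits
  if [ "$mode" != "700" ]; then
    echo "data directory $dir has mode $mode, expected 700" >&2
    return 1
  fi
}
```

Usage: check_data_dir_mode /home/kingbase/cluster/R3HA/db/data && ./sys_ctl start -D /home/kingbase/cluster/R3HA/db/data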

3. Configure recovery.conf

[kingbase@node102 data]$ cp ../etc/recovery.done ./recovery.conf

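# Edit recovery.conf:
# (Note: application_name and primary_slot_name below were left as node101/slot_node101.
# Before the outage node102 streamed as application_name node102, so node102/slot_node102
# would normally be used here; this is the likely reason the standby later appears in
# sys_stat_replication as node101 with SYNC_STATE async instead of sync.)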
[kingbase@node102 data]$ cat recovery.conf
standby_mode='on'
primary_conninfo='port=54321 host=192.168.1.101 user=SYSTEM password=MTIzNDU2 application_name=node101'
recovery_target_timeline='latest'
primary_slot_name ='slot_node101'
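Rather than hand-editing a copied template, recovery.conf can be rendered per node. This is a sketch under stated assumptions: the function name is hypothetical; it assumes the naming convention visible in the original cluster (node102 streamed as application_name node102, with a matching slot_node102); and it assumes the password field holds the Base64 form of the password, since MTIzNDU2 is exactly the Base64 encoding of 123456. Verify that against your own recovery.done template before relying on it.

```shell
#!/usr/bin/env bash
# Render recovery.conf for a standby named NODE that follows PRIMARY_IP.
# Assumptions: application_name = NODE, slot = slot_NODE, password Base64-encoded.
render_recovery_conf() {
  local node=$1 primary_ip=$2 port=$3 user=$4 password=$5 enc
  enc=$(printf '%s' "$password" | base64 | tr -d '\n')
  cat <<EOF
standby_mode='on'
primary_conninfo='port=${port} host=${primary_ip} user=${user} password=${enc} application_name=${node}'
recovery_target_timeline='latest'
primary_slot_name ='slot_${node}'
EOF
}
```

Usage on node102: render_recovery_conf node102 192.168.1.101 54321 SYSTEM 123456 > /home/kingbase/cluster/R3HA/db/data/recovery.conf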

4. Start the standby database service with sys_ctl

[kingbase@node102 bin]$ ./sys_ctl start -D /home/kingbase/cluster/R3HA/db/data
server starting
[kingbase@node102 bin]$ LOG: redirecting log output to logging collector process
HINT: Future log output will appear in directory "/home/kingbase/cluster/R3HA/db/data/sys_log".

# Check the standby database processes
[kingbase@node102 bin]$ ps -ef|grep kingbase
kingbase 17885 1 0 11:10 pts/0 00:00:00 /home/kingbase/cluster/R3HA/db/bin/kingbase -D /home/kingbase/cluster/R3HA/db/data
kingbase 17886 17885 0 11:10 ? 00:00:00 kingbase: logger process
kingbase 17887 17885 0 11:10 ? 00:00:00 kingbase: startup process recovering 00000007000000000000001C
kingbase 17891 17885 0 11:10 ? 00:00:00 kingbase: checkpointer process
kingbase 17892 17885 0 11:10 ? 00:00:00 kingbase: writer process
kingbase 17893 17885 0 11:10 ? 00:00:00 kingbase: stats collector process
kingbase 17894 17885 0 11:10 ? 00:00:00 kingbase: wal receiver process streaming 0/1C000060
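The process listing above is the quickest sign that the standby is replicating: a wal receiver process in the streaming state, with the current LSN as the last field. This sketch (hypothetical function name) extracts that LSN from ps -ef output; the [w] in the pattern keeps awk from matching its own command line when the output is piped straight from ps.

```shell
#!/usr/bin/env bash
# Print the LSN reported by a streaming WAL receiver in `ps -ef` output (stdin);
# return non-zero if no such process is found.
wal_receiver_lsn() {
  awk '/kingbase: [w]al receiver process streaming/ { print $NF; found = 1 }
       END { exit (found ? 0 : 1) }'
}
```

Usage: ps -ef | wal_receiver_lsn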

5. Check the streaming replication status

# Check node status (the standby is still down at this point)
TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 4 | true | 0
1 | 192.168.1.102 | 54321 | down | 0.500000 | standby | 0 | false | 0
(2 rows)

# Check streaming replication status
TEST=# select * from sys_stat_replication;
PID | USESYSID | USENAME | APPLICATION_NAME | CLIENT_ADDR | CLIENT_HOSTNAME | CLIENT_PORT | BACKEND_START | BACKEND_XMIN | STATE | SENT_LOCATION | WRITE_LOCATION | FLUSH_LOCATION | REPLAY_LOCATION | SYNC_PRIORITY | SYNC_STATE
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
25242 | 10 | SYSTEM | node101 | 192.168.1.102 | | 33495 | 2022-06-22 11:11:02.085186+08 | | streaming | 0/1C000060 | 0/1C000060 | 0/1C000060 | 0/1C000060 | 0 | async
(1 row)

# Check replication slot information
TEST=# select * from sys_replication_slots;
SLOT_NAME | PLUGIN | SLOT_TYPE | DATOID | DATABASE | ACTIVE | ACTIVE_PID | XMIN | CATALOG_XMIN | RESTART_LSN | CONFIRMED_FLUSH_LSN
--------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
slot_node101 | | physical | | | t | 25242 | 2112 | | 0/1C000060 |
slot_node102 | | physical | | | f | | 2112 | | 0/1A0000D0 |
(2 rows)

# The output above shows that streaming replication is working normally.
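The slot listing above also shows slot_node102 as inactive (ACTIVE = f) with RESTART_LSN 0/1A0000D0. In PostgreSQL, whose replication-slot behaviour KingbaseES V8R3 appears to share, an inactive physical slot stops the primary from recycling WAL past its restart LSN, so such slots are worth tracking. A minimal sketch (hypothetical function name), assuming the column order shown above:

```shell
#!/usr/bin/env bash
# Print the name of every slot whose ACTIVE column is "f" in
# `select * from sys_replication_slots;` output read from stdin.
inactive_slots() {
  awk -F'|' '
    NF >= 10 {
      name = $1;   gsub(/[[:space:]]/, "", name)
      if (name == "SLOT_NAME") next          # skip the header row
      active = $6; gsub(/[[:space:]]/, "", active)
      if (active == "f") print name
    }'
}
```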

V. Configure kingbasecluster management on the standby

Tips:

1) Because the kingbasecluster configuration files were copied from the primary, they must be modified on the standby.

2) The files to modify are kingbasecluster.conf and HAmodule.conf.

1. Configure the kingbasecluster.conf file
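The kingbasecluster.conf copied from node101 still describes the cluster from node101's point of view. This case does not list which parameters were changed, so the sketch below does not guess them; it only lists every non-comment line that mentions a given cluster IP so each node-specific value can be reviewed by hand. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# For each IP given, print "[IP] LINENO:line" for every non-comment line of CONF
# that contains that IP (fixed-string match, so dots are literal).
review_ip_lines() {
  local conf=$1; shift
  local ip
  for ip in "$@"; do
    grep -nF -- "$ip" "$conf" | grep -v '^[0-9]*:[[:space:]]*#' | sed "s|^|[$ip] |"
  done
}
```

Usage on node102: review_ip_lines kingbasecluster.conf 192.168.1.101 192.168.1.102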

2. Modify HAmodule.conf (both the copy under kingbasecluster and the copy under db)

[kingbase@node102 etc]$ vi HAmodule.conf
#the current node ip.example:KB_LOCALHOST_IP="192.168.28.128"
KB_LOCALHOST_IP="192.168.1.102"
#record the names of local node.example:NODE_NAME="node1"
NODE_NAME="node102"
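Since the same two keys must be changed in both copies of HAmodule.conf, a small sed-based sketch avoids editing each file by hand. The function name is hypothetical, and it relies on GNU sed's -i.bak form, which edits in place and keeps a backup. Only lines that begin with KB_LOCALHOST_IP= or NODE_NAME= are touched, so the example comments are left alone.

```shell
#!/usr/bin/env bash
# Set KB_LOCALHOST_IP and NODE_NAME in each HAmodule.conf given after IP and NAME.
set_ha_identity() {
  local ip=$1 name=$2 f
  shift 2
  for f in "$@"; do
    sed -i.bak \
      -e "s|^KB_LOCALHOST_IP=.*|KB_LOCALHOST_IP=\"${ip}\"|" \
      -e "s|^NODE_NAME=.*|NODE_NAME=\"${name}\"|" "$f"
  done
}
```

Usage (the etc paths are assumed from the prompt above): set_ha_identity 192.168.1.102 node102 /home/kingbase/cluster/R3HA/kingbasecluster/etc/HAmodule.conf /home/kingbase/cluster/R3HA/db/etc/HAmodule.conf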

3. Start kingbasecluster manually as root

[root@node102 bin]# ./kingbasecluster -n >/home/kingbase/cluster/R3HA/log/cluster.log 2>&1 &
[1] 23929

4. Check that the kingbasecluster service has started

[root@node102 bin]# netstat -an |grep 9999
tcp 0 0 0.0.0.0:9999 0.0.0.0:* LISTEN
tcp6 0 0 :::9999 :::* LISTEN

# The service port is listening, so kingbasecluster started successfully.
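Because kingbasecluster is started in the background, the port may take a moment to open. Instead of re-running netstat by hand, a script can poll until the port accepts TCP connections. This minimal sketch uses bash's /dev/tcp redirection; the function name and the default of 30 one-second attempts are illustrative choices.

```shell
#!/usr/bin/env bash
# Return 0 as soon as HOST:PORT accepts a TCP connection, or 1 after TRIES attempts.
wait_for_port() {
  local host=$1 port=$2 tries=${3:-30} i
  for ((i = 0; i < tries; i++)); do
    # Opening /dev/tcp/HOST/PORT in a subshell performs a TCP connect attempt.
    if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```

Usage: wait_for_port 127.0.0.1 9999 30 && echo "kingbasecluster is listening"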

5. Attach (register) the standby node to the cluster

[kingbase@node102 bin]$ ./pcp_attach_node -h 192.168.1.101 -U kingbase 1
Password:
pcp_attach_node -- Command Successful

# After the standby is attached, its node status is "up".
[kingbase@node101 bin]$ ./ksql -U SYSTEM -W 123456 TEST -p 9999
ksql (V008R003C002B0290)
Type "help" for help.

TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | primary | 6 | true | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | standby | 0 | false | 0
(2 rows)

VI. Cluster restart test (in production, run this only when there is no business traffic)

1. Stop the kingbasecluster service on the standby

[root@node102 bin]# cd /home/kingbase/cluster/R3HA/kingbasecluster/bin
[root@node102 bin]# ./kingbasecluster -m fast stop
2022-06-22 11:41:01: pid 28321: LOG: stop request sent to kingbasecluster. waiting for termination...
.done.
[1]+ Done ./kingbasecluster -n > /home/kingbase/cluster/R3HA/log/cluster.log 2>&1

2. Restart the cluster

[kingbase@node101 bin]$ ./kingbase_monitor.sh restart
-----------------------------------------------------------------------
2022-06-22 11:42:50 KingbaseES automation beging...
......
......................
all started..
...
now we check again
=======================================================================
| ip | program| [status]
[ 192.168.1.101]| [kingbasecluster]| [active]
[ 192.168.1.102]| [kingbasecluster]| [active]
[ 192.168.1.101]| [kingbase]| [active]
[ 192.168.1.102]| [kingbase]| [active]
=======================================================================

# As shown above, the cluster started successfully.

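3. Configure the cron jobs on the standby (the listing below is from the primary node101; node102 needs the same two entries in /etc/cron.d/KINGBASECRON)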

[root@node101 ~]# cat /etc/cron.d/KINGBASECRON

*/1 * * * * kingbase  /home/kingbase/cluster/R3HA/db/bin/network_rewind.sh
*/1 * * * * root /home/kingbase/cluster/R3HA/kingbasecluster/bin/restartcluster.sh
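Adding the two entries listed above on the standby can be made idempotent, so running the step twice does not duplicate the jobs. A minimal sketch (hypothetical function name) that appends each line only if it is not already present and then sets mode 644, since most cron daemons skip /etc/cron.d files that are group- or world-writable:

```shell
#!/usr/bin/env bash
# Ensure CRONFILE contains the two KingbaseES cron entries exactly once.
ensure_cron_entries() {
  local cronfile=$1 line
  touch "$cronfile"
  # Append each entry only if an identical line is not already present.
  while IFS= read -r line; do
    grep -qxF -- "$line" "$cronfile" || printf '%s\n' "$line" >> "$cronfile"
  done <<'EOF'
*/1 * * * * kingbase  /home/kingbase/cluster/R3HA/db/bin/network_rewind.sh
*/1 * * * * root /home/kingbase/cluster/R3HA/kingbasecluster/bin/restartcluster.sh
EOF
  chmod 644 "$cronfile"
}
```

Usage on node102, as root: ensure_cron_entries /etc/cron.d/KINGBASECRON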

VII. Cluster failover test (in production, run this during off-peak hours)

1. Set the owner and permissions of the arping file on both primary and standby (not needed on dedicated machines)

[root@node102 bin]# cd /home/kingbase/cluster/R3HA/db/bin
[root@node102 bin]# chown root:root arping
[root@node102 bin]# chmod u+s arping
[root@node102 bin]# ls -lh arping
-rwsr-xr-x 1 root root 33K Apr 1 2021 arping
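Because this step must be repeated on both hosts, a quick check before the failover test confirms that arping matches the listing above (owner root, setuid bit set). A minimal sketch, with a hypothetical function name and GNU stat:

```shell
#!/usr/bin/env bash
# Return 0 only if FILE exists, has the setuid bit, and is owned by root.
arping_ready() {
  local f=$1
  [ -f "$f" ] || return 1
  [ -u "$f" ] || return 1                     # setuid bit set
  [ "$(stat -c '%U' "$f")" = "root" ]         # GNU stat: owner name
}
```

Usage: arping_ready /home/kingbase/cluster/R3HA/db/bin/arping || echo "fix arping ownership/permissions"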

2. Failover test

1) Stop the database service on the primary

[kingbase@node101 bin]$ ./sys_ctl stop -D ../data
waiting for server to shut down.... done
server stopped

2) Check the node status after the switchover

[kingbase@node102 bin]$ ./ksql -U SYSTEM -W 123456 TEST -p 9999
ksql (V008R003C002B0290)
Type "help" for help.

TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | down | 0.500000 | standby | 0 | false | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | primary | 0 | true | 0
(2 rows)

# As shown above, node102 has been switched to primary.

3. Recover the former primary and rejoin it to the cluster as a standby

1) Configure recovery.conf

[kingbase@node101 data]$ cp ../etc/recovery.done ./recovery.conf

[kingbase@node101 data]$ cat recovery.conf
standby_mode='on'
primary_conninfo='port=54321 host=192.168.1.102 user=SYSTEM password=MTIzNDU2 application_name=node101'
recovery_target_timeline='latest'
primary_slot_name ='slot_node101'

2) Create replication slots on the new primary

TEST=# select * from sys_replication_slots;
SLOT_NAME | PLUGIN | SLOT_TYPE | DATOID | DATABASE | ACTIVE | ACTIVE_PID | XMIN | CATALOG_XMIN | RESTART_LSN | CONFIRMED_FLUSH_LSN
-----------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
(0 rows)

TEST=# select sys_create_physical_replication_slot('slot_node101');
SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
(slot_node101,)
(1 row)

TEST=# select sys_create_physical_replication_slot('slot_node102');
SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
(slot_node102,)
(1 row)

TEST=# select * from sys_replication_slots;
SLOT_NAME | PLUGIN | SLOT_TYPE | DATOID | DATABASE | ACTIVE | ACTIVE_PID | XMIN | CATALOG_XMIN | RESTART_LSN | CONFIRMED_FLUSH_LSN
--------------+--------+-----------+--------+----------+--------+------------+------+--------------+-------------+---------------------
slot_node101 | | physical | | | f | | | | |
slot_node102 | | physical | | | f | | | | |
(2 rows)
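If this step may be repeated, for example after a second failover, the slot creation can be written so it only creates slots that are missing. The sketch below (hypothetical function name) just generates the SQL. It assumes that the sys_-prefixed function and view shown above accept PostgreSQL-style SELECT ... WHERE NOT EXISTS, which has not been verified on KingbaseES V8R3.

```shell
#!/usr/bin/env bash
# Print one idempotent slot-creation statement per node name (slot_<node>).
slot_sql() {
  local node slot
  for node in "$@"; do
    slot="slot_${node}"
    printf "SELECT sys_create_physical_replication_slot('%s') WHERE NOT EXISTS (SELECT 1 FROM sys_replication_slots WHERE slot_name = '%s');\n" "$slot" "$slot"
  done
}
```

Usage on the new primary (assuming ksql reads SQL from stdin like psql): slot_sql node101 node102 | ./ksql -U SYSTEM -W 123456 TEST -p 54321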

3) Start the database service on the former primary (new standby)

[kingbase@node101 bin]$ ./sys_ctl start -D ../data
server starting
[kingbase@node101 bin]$ LOG: redirecting log output to logging collector process
HINT: Future log output will appear in directory "/home/kingbase/cluster/R3HA/db/data/sys_log".

4) Check the streaming replication and node status on the new primary

# Check streaming replication status
TEST=# select * from sys_stat_replication;
PID | USESYSID | USENAME | APPLICATION_NAME | CLIENT_ADDR | CLIENT_HOSTNAME | CLIENT_PORT | BACKEND_START | BACKEND_XMIN | STATE | SENT_LOCATION | WRITE_LOCATION | FLUSH_LOCATION | REPLAY_LOCATION | SYNC_PRIORITY | SYNC_STATE
-------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
17271 | 10 | SYSTEM | node101 | 192.168.1.101 | | 17540 | 2022-06-22 14:09:44.848740+08 | | streaming | 0/22050C18 | 0/22050C18 | 0/22050C18 | 0/2204FFD8 | 0 | async
(1 row)

# Check cluster node status
[kingbase@node102 bin]$ ./ksql -U SYSTEM -W 123456 TEST -p 9999
ksql (V008R003C002B0290)
Type "help" for help.

TEST=# show pool_nodes;
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
0 | 192.168.1.101 | 54321 | up | 0.500000 | standby | 0 | false | 0
1 | 192.168.1.102 | 54321 | up | 0.500000 | primary | 0 | true | 0
(2 rows)

# The output above shows that the cluster failover has completed.
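A failover test passes only if every node is up again and exactly one node holds the primary role. This minimal sketch (hypothetical function name) checks both conditions against show pool_nodes output, assuming the column order shown above, and prints the primary's hostname on success.

```shell
#!/usr/bin/env bash
# Read `show pool_nodes` output on stdin; succeed and print the primary's host
# only if all nodes are "up" and exactly one node has role "primary".
check_topology() {
  awk -F'|' '
    $1 ~ /^[[:space:]]*[0-9]+[[:space:]]*$/ {
      host = $2;   gsub(/[[:space:]]/, "", host)
      status = $4; gsub(/[[:space:]]/, "", status)
      role = $6;   gsub(/[[:space:]]/, "", role)
      if (status != "up") bad = 1
      if (role == "primary") { primaries++; primary = host }
    }
    END {
      if (bad || primaries != 1) exit 1
      print primary
    }'
}
```

For the state above, ./ksql -U SYSTEM -W 123456 TEST -p 9999 -c 'show pool_nodes;' | check_topology would print 192.168.1.102, again assuming ksql accepts -c.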

VIII. Summary

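Adding a new management node online to a KingbaseES V8R3 cluster is fairly complex, especially the modification of the kingbasecluster configuration files. Pay close attention to the details; otherwise both cluster startup and switchover will be affected.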
