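Resource pooling: switchover for same-city Dorado dual clusters (non-unified log)

Resource pooling supports the following deployment modes for same-city Dorado dual clusters: dd simulation (manual deployment, no cm), cm simulation (manual dd-simulation deployment with cm), disk array (manual deployment), and deployment with the cluster management tool.

1. Switchover between clusters

Based on the deployment described in "Resource pooling + same-city Dorado dual clusters (non-unified log)", switchover between clusters is designed as follows:

1.1. Primary and standby cluster states

Prerequisite: a resource pooling same-city dual-cluster environment has already been deployed.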

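Cluster center    | Side         | Node type              | local role | run mode
------------------+--------------+------------------------+------------+---------------------------------------------------
Production center | Primary side | Primary node 0         | primary    | primary (resource pooling + traditional primary)
Production center | Primary side | Standby node 1         | standby    | normal (resource pooling + traditional standalone)
DR center         | Standby side | Main standby node 0    | standby    | standby (resource pooling + traditional standby)
DR center         | Standby side | Cascade standby node 1 | standby    | normal (resource pooling + traditional standalone)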

local role is the local_role value returned by the system function pg_stat_get_stream_replications:

openGauss=# select * from pg_stat_get_stream_replications();

local_role | static_connections | db_state | detail_information

------------+--------------------+----------+--------------------

Primary | 1 | Normal | Normal

(1 row)

Tips: run mode is the running mode of the database kernel (primary, standby, or normal), that is, the primary/standby mode type indicated by t_thrd.postmaster_cxt.HaShmData->current_mode or t_thrd.xlog_cxt.server_mode.

1.2.failover

/home/omm/ss_hatest/dn0 and the similar paths used below are database dn directories, as follows:

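Cluster center    | Side         | Node type              | local role   | dn directory
------------------+--------------+------------------------+--------------+--------------------------
Production center | Primary side | Primary node 0         | primary      | /home/omm/ss_hatest/dn0
Production center | Primary side | Standby node 1         | standby      | /home/omm/ss_hatest/dn1
DR center         | Standby side | Main standby node 0    | Main Standby | /home/omm/ss_hatest1/dn0
DR center         | Standby side | Cascade standby node 1 | standby      | /home/omm/ss_hatest1/dn1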

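Failover between the two clusters is the process in which the primary cluster fails and the standby cluster is promoted to primary cluster. The procedure is as follows:

(1) Kill the primary cluster: kill all nodes of the primary cluster.

(2) Stop the standby cluster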

gs_ctl stop -D /home/omm/ss_hatest1/dn0

gs_ctl stop -D /home/omm/ss_hatest1/dn1

(3) Set cluster_run_mode on the standby cluster

gs_guc set -Z datanode -D /home/omm/ss_hatest1/dn0 -c "cluster_run_mode=cluster_primary"

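(4) Switch the primary and secondary ends of remote synchronous replication

With the cm simulation deployment (blog: "Resource pooling same-city Dorado dual-cluster deployment, part 2: cm simulation deployment"), the direction of the synchronous replication pair does not need to be switched on the storage management console.

With the om deployment (blog: "Resource pooling same-city Dorado dual-cluster deployment, part 4: om deployment"), the direction of the synchronous replication pair must be switched on the storage management console before the cluster is started. Log in to the management console of the standby storage and go to data protection -> luns -> remote replication pairs -> find the lun used for remote synchronous replication of xlog -> More -> Primary/Standby Switchover. Once this is done, Local Resource changes from Secondary to Primary.

(5) Restart the standby cluster's nodes in primary-cluster mode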

gs_ctl start -D /home/omm/ss_hatest1/dn0 -M primary

gs_ctl start -D /home/omm/ss_hatest1/dn1

(6) Query the new primary cluster

gs_ctl query -D /home/omm/ss_hatest1/dn0
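The steps above can be collected into a small helper. The following is only a sketch: DRY_RUN, run and promote_standby_cluster are hypothetical names introduced here, not part of openGauss, while the gs_ctl and gs_guc invocations are copied from steps (2) to (6). Killing the old primary cluster and, for om deployments, switching the replication pair direction on the storage console remain manual steps. By default the helper only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: DRY_RUN, run and promote_standby_cluster are hypothetical helpers.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"      # dry run: only show the command
    else
        "$@"           # execute the command for real
    fi
}

# $1 = dn directory of main standby node 0, $2 = dn directory of standby node 1
promote_standby_cluster() {
    local dn0="$1" dn1="$2"
    run gs_ctl stop -D "$dn0" || return 1
    run gs_ctl stop -D "$dn1" || return 1
    run gs_guc set -Z datanode -D "$dn0" -c "cluster_run_mode=cluster_primary" || return 1
    # om deployment: switch the replication pair direction on the storage console
    # before the nodes are started (manual step, not automated here).
    run gs_ctl start -D "$dn0" -M primary || return 1
    run gs_ctl start -D "$dn1" || return 1
    run gs_ctl query -D "$dn0"
}
```

Calling promote_standby_cluster /home/omm/ss_hatest1/dn0 /home/omm/ss_hatest1/dn1 prints the six commands in order; setting DRY_RUN=0 beforehand would execute them instead.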

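1.3.switchover

Switchover between the two clusters is the process in which the primary cluster is demoted to standby cluster and the standby cluster is promoted to primary cluster. The procedure is as follows:

(1) Stop the primary cluster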

gs_ctl stop -D /home/omm/ss_hatest/dn0

gs_ctl stop -D /home/omm/ss_hatest/dn1

(2) Stop the standby cluster

gs_ctl stop -D /home/omm/ss_hatest1/dn0

gs_ctl stop -D /home/omm/ss_hatest1/dn1

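(3) Set cluster_run_mode on the standby cluster (same command as in step (3) of failover)

gs_guc set -Z datanode -D /home/omm/ss_hatest1/dn0 -c "cluster_run_mode=cluster_primary"

(4) Restart the standby cluster's nodes in primary-cluster mode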

gs_ctl start -D /home/omm/ss_hatest1/dn0 -M primary

gs_ctl start -D /home/omm/ss_hatest1/dn1

(5) Query the new primary cluster

gs_ctl query -D /home/omm/ss_hatest1/dn0

(6) Set cluster_run_mode=cluster_standby on the original primary cluster

gs_guc set -Z datanode -D /home/omm/ss_hatest/dn0 -c "cluster_run_mode=cluster_standby"

(7) Restart the original primary cluster's nodes in standby-cluster mode

gs_ctl start -D /home/omm/ss_hatest/dn0 -M standby

gs_ctl start -D /home/omm/ss_hatest/dn1

(8) Query the new standby cluster

gs_ctl query -D /home/omm/ss_hatest/dn0
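The two query steps can be turned into a pass/fail check. This is a minimal sketch, assuming the query output contains a line of the form "local_role : <role>" as in the samples in this article; check_role is a hypothetical helper name. It reads the output of gs_ctl query on standard input and compares the first local_role it finds with the expected role.

```shell
#!/usr/bin/env bash
# check_role <expected role>: hypothetical helper, reads `gs_ctl query` output on stdin
check_role() {
    local expected="$1" actual
    # take the first local_role line (the HA state section) and trim the value
    actual="$(awk -F':' '/local_role/ { gsub(/^[ \t]+|[ \t]+$/, "", $2); print $2; exit }')"
    if [ "$actual" = "$expected" ]; then
        echo "OK: local_role is $actual"
    else
        echo "MISMATCH: expected $expected, got ${actual:-<none>}" >&2
        return 1
    fi
}
```

For example, gs_ctl query -D /home/omm/ss_hatest1/dn0 | check_role Primary checks node 0 of the new primary cluster. The role to expect on node 0 of the new standby cluster is presumably Main Standby, but that is an assumption to confirm against the actual output.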

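2. Switchover within the primary cluster

2.1.failover

This section describes switchover within a cluster for the cm simulation deployment. For dual clusters deployed with om, switchover within a cluster works the same way as in the original resource pooling setup. Failover within the primary cluster is the process in which the primary node of the primary cluster fails and a standby node is promoted to primary node. The procedure is as follows:

(1) Check node status

Query the status of each node:

Primary cluster, primary node 0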

gs_ctl query -D /home/omm/ss_hatest/dn0

HA state:

local_role : Primary

static_connections : 1

db_state : Normal

detail_information : Normal

Senders info:

sender_pid : 1456376

local_role : Primary

peer_role : StandbyCluster_Standby

peer_state : Normal

state : Streaming

sender_sent_location : 2/5C8

sender_write_location : 2/5C8

sender_flush_location : 2/5C8

sender_replay_location : 2/5C8

receiver_received_location : 2/5C8

receiver_write_location : 2/5C8

receiver_flush_location : 2/5C8

receiver_replay_location : 2/5C8

sync_percent : 100%

sync_state : Async

sync_priority : 0

sync_most_available : Off

channel : ...:6600-->...:43350

Receiver info:

No information

Primary cluster, standby node 1

gs_ctl query -D /home/omm/ss_hatest/dn1

HA state:

local_role : Standby

static_connections : 0

db_state : Normal

detail_information : Normal

Senders info:

No information

Receiver info:

No information

Standby cluster, main standby node 0

gs_ctl query -D /home/omm/ss_hatest1/dn0

HA state:

local_role : Main Standby

static_connections : 1

db_state : Normal

detail_information : Normal

Senders info:

No information

Receiver info:

receiver_pid : 1901181

local_role : Standby

peer_role : Primary

peer_state : Normal

state : Normal

sender_sent_location : 2/A458

sender_write_location : 2/A458

sender_flush_location : 2/A458

sender_replay_location : 2/A458

receiver_received_location : 2/A458

receiver_write_location : 2/A458

receiver_flush_location : 2/A458

receiver_replay_location : 2/A458

sync_percent : 100%

channel : ...:41952<--...:6600

Standby cluster, standby node 1

gs_ctl query -D /home/omm/ss_hatest1/dn1

HA state:

local_role : Standby

static_connections : 0

db_state : Normal

detail_information : Normal

Senders info:

No information

Receiver info:

No information
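Reading these four outputs by eye is error-prone, so a small extractor can help. The sketch below assumes the layout shown above: three sections named "HA state", "Senders info" and "Receiver info", each followed by "name : value" lines. ha_field is a hypothetical helper name, not an openGauss tool.

```shell
#!/usr/bin/env bash
# ha_field <section> <field>: hypothetical helper, reads `gs_ctl query` output on
# stdin and prints the value of one field inside the given section.
ha_field() {
    awk -v section="$1" -v field="$2" '
        # a section header switches the current section
        /^[ \t]*(HA state|Senders info|Receiver info)[ \t]*:/ {
            current = $0
            sub(/^[ \t]+/, "", current)
            sub(/[ \t]*:.*$/, "", current)
            next
        }
        current == section {
            idx = index($0, ":")          # split at the first colon only
            if (idx == 0) next
            name = substr($0, 1, idx - 1)
            value = substr($0, idx + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == field) { print value; exit }
        }'
}
```

For example, gs_ctl query -D /home/omm/ss_hatest/dn0 | ha_field "Senders info" peer_role would print the peer role seen by the primary, and an empty result for a "Receiver info" field corresponds to "No information".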

(2) Configure parameters

postgresql.conf of the primary cluster nodes

Primary cluster, primary node 0

port = 6600

xlog_file_path = '/home/zx/ss_hatest/dorado_shared_disk'

xlog_lock_file_path = '/home/zx/ss_hatest/shared_lock_primary'

application_name = 'dn_master_0'

cross_cluster_replconninfo1='localhost=... localport=6600 remotehost=... remoteport=9600'

cross_cluster_replconninfo2='localhost=... localport=6600 remotehost=... remoteport=9700'

cluster_run_mode = 'cluster_primary'

ha_module_debug = off

ss_log_level = 255

ss_log_backup_file_count = 100

ss_log_max_file_size = 1GB

Primary cluster, standby node 1

port = 6700

xlog_file_path = '/home/zx/ss_hatest/dorado_shared_disk'

xlog_lock_file_path = '/home/zx/ss_hatest/shared_lock_primary'

application_name = 'dn_master_1'

cross_cluster_replconninfo1='localhost=... localport=6700 remotehost=... remoteport=9600'

cross_cluster_replconninfo2='localhost=... localport=6700 remotehost=... remoteport=9700'

cluster_run_mode = 'cluster_primary'

ha_module_debug = off

ss_log_level = 255

ss_log_backup_file_count = 100

ss_log_max_file_size = 1GB

postgresql.conf of the standby cluster nodes

Standby cluster, main standby node 0

port = 9600

xlog_file_path = '/home/zx/ss_hatest/dorado_shared_disk'

xlog_lock_file_path = '/home/zx/ss_hatest/shared_lock_primary'

application_name = 'dn_standby_0'

cross_cluster_replconninfo1='localhost=... localport=9600 remotehost=... remoteport=6600'

cross_cluster_replconninfo2='localhost=... localport=9600 remotehost=... remoteport=6700'

cluster_run_mode = 'cluster_standby'

ha_module_debug = off

ss_log_level = 255

ss_log_backup_file_count = 100

ss_log_max_file_size = 1GB

Standby cluster, standby node 1

port = 9700

xlog_file_path = '/home/zx/ss_hatest/dorado_shared_disk'

xlog_lock_file_path = '/home/zx/ss_hatest/shared_lock_primary'

application_name = 'dn_standby_1'

cross_cluster_replconninfo1='localhost=... localport=9700 remotehost=... remoteport=6600'

cross_cluster_replconninfo2='localhost=... localport=9700 remotehost=... remoteport=6700'

cluster_run_mode = 'cluster_standby'

ha_module_debug = off

ss_log_level = 255

ss_log_backup_file_count = 100

ss_log_max_file_size = 1GB

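All nodes of both clusters must be configured in advance with xlog_file_path, xlog_lock_file_path, cross_cluster_replconninfo1, and cluster_run_mode, the parameters that establish the disaster recovery relationship.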
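Before attempting a switchover, it can be worth confirming that every node's postgresql.conf actually contains these four parameters. The following is a minimal sketch; check_dr_params is a hypothetical helper name, and it only checks that an uncommented setting exists, not that its value is correct.

```shell
#!/usr/bin/env bash
# check_dr_params <postgresql.conf>: hypothetical helper that lists missing DR parameters
check_dr_params() {
    local conf="$1" missing=0 p
    for p in xlog_file_path xlog_lock_file_path cross_cluster_replconninfo1 cluster_run_mode; do
        # an active setting starts the line (optionally indented) and is not commented out
        if ! grep -Eq "^[[:space:]]*${p}[[:space:]]*=" "$conf"; then
            echo "missing: $p"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_dr_params /home/omm/ss_hatest/dn0/postgresql.conf prints one line per missing parameter and returns a non-zero status if any is missing.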

(3) Export the environment variable CM_CONFIG_PATH used for switchover

export CM_CONFIG_PATH=/opt/omm/openGauss-server/src/test/ss/cm_config.ini

(4) Simulate failover

Node 0 is currently the primary node. Run kill -9 pid, where pid is the process ID of primary node 0.

Modify cm_config.ini

REFORMER_ID = 1

BITMAP_ONLINE = 2

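Note: this simulates a failure of primary node 0. REFORMER_ID simulates the reform lock being acquired by standby node 1, which is therefore the node that will perform the failover. BITMAP_ONLINE simulates the set of online nodes obtained by cm, here only node 1 (bitmap = 2 = 0b10).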
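The bitmap arithmetic can be sketched as follows: bit i of BITMAP_ONLINE is set when node i is online. bitmap_online and write_cm_config are hypothetical helper names, and the two-line "KEY = value" file layout is assumed from the cm_config.ini contents shown in this article.

```shell
#!/usr/bin/env bash
# bitmap_online <node id>...: hypothetical helper, sets bit i for every online node i
bitmap_online() {
    local bitmap=0 id
    for id in "$@"; do
        bitmap=$(( bitmap | (1 << id) ))
    done
    echo "$bitmap"
}

# write_cm_config <file> <reformer id> <online node id>...
# Writes the two keys in the layout assumed from this article.
write_cm_config() {
    local file="$1" reformer="$2"
    shift 2
    printf 'REFORMER_ID = %s\nBITMAP_ONLINE = %s\n' "$reformer" "$(bitmap_online "$@")" > "$file"
}
```

For the failover simulated above, write_cm_config "$CM_CONFIG_PATH" 1 1 would write REFORMER_ID = 1 and BITMAP_ONLINE = 2. With both nodes online (ids 0 and 1) the bitmap would be 3.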

2.2.switchover

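Based on the cm simulation deployment, switchover within the primary cluster is the process in which the primary node of the primary cluster is demoted to standby node and a standby node is promoted to primary node. The procedure is as follows:

(1) Check node status: same as the check for failover

(2) Configure parameters: same as the configuration for failover

(3) Run the switchover command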

[omm@nodename dn0]$ gs_ctl switchover -D /home/zx/ss_hatest/dn1

[2023-04-24 15:49:04.785][3815633][][gs_ctl]: gs_ctl switchover ,datadir is /home/zx/ss_hatest/dn1

[2023-04-24 15:49:04.786][3815633][][gs_ctl]: switchover term (1)

[2023-04-24 15:49:04.954][3815633][][gs_ctl]: waiting for server to switchover....[2023-04-24 15:49:06.122][3815633][][gs_ctl]: Getting state from gaussdb.state!

.[2023-04-24 15:49:07.123][3815633][][gs_ctl]: Getting state from gaussdb.state!

.[2023-04-24 15:49:08.125][3815633][][gs_ctl]: Getting state from gaussdb.state!

.[2023-04-24 15:49:09.126][3815633][][gs_ctl]: Getting state from gaussdb.state!

.[2023-04-24 15:49:10.198][3815633][][gs_ctl]: Getting state from gaussdb.state!

...

[2023-04-24 15:49:13.353][3815633][][gs_ctl]: done

[2023-04-24 15:49:13.353][3815633][][gs_ctl]: switchover completed (/home/zx/ss_hatest/dn1)

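Note: /home/zx/ss_hatest/dn1 is the database directory of standby node 1 in the primary cluster. The switchover demotes primary node 0 of the primary cluster to standby and promotes standby node 1 of the primary cluster to primary.

Check the directory /opt/omm/openGauss-server/src/test/ss/: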

[omm@nodename ss]$ ll

total 56

-rwxrwxrwx 1 zx zx 3749 Apr 24 14:29 build_ss_database_common.sh

-rwxrwxrwx 1 zx zx 2952 Apr 24 14:29 build_ss_database.sh

-rw------- 1 zx zx 34 Apr 24 15:49 cm_config.ini

-rw------- 1 zx zx 33 Apr 24 15:49 cm_config.ini_bak

cm_config.ini is the node list newly generated after the switchover; the primary node's REFORMER_ID is 1

BITMAP_ONLINE = 3

REFORMER_ID = 1

cm_config.ini_bak is the node list from before the switchover; the primary node's REFORMER_ID is 0

REFORMER_ID = 0

BITMAP_ONLINE = 3
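To compare the two files, the values can be read back and the bitmap decoded into node ids. This is a sketch under the assumption that both files use the "KEY = value" layout shown above; cm_config_value and online_nodes are hypothetical helper names.

```shell
#!/usr/bin/env bash
# cm_config_value <file> <key>: hypothetical helper, prints the value of one key
cm_config_value() {
    awk -F'=' -v key="$2" '
        { k = $1; gsub(/[ \t]/, "", k) }
        k == key { v = $2; gsub(/[ \t\r]/, "", v); print v; exit }' "$1"
}

# online_nodes <bitmap>: prints the ids of the nodes whose bit is set
online_nodes() {
    local bitmap="$1" id=0 out=""
    while [ "$bitmap" -gt 0 ]; do
        if [ $(( bitmap & 1 )) -eq 1 ]; then
            out="${out:+$out }$id"
        fi
        bitmap=$(( bitmap >> 1 ))
        id=$(( id + 1 ))
    done
    echo "$out"
}
```

For the file generated after the switchover, cm_config_value cm_config.ini REFORMER_ID gives 1, and online_nodes applied to BITMAP_ONLINE = 3 lists nodes 0 and 1, so both nodes stay online and only the reformer changes.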

(4) Query the status of both clusters

Primary cluster, standby node 0

[omm@nodename dn0]$ gs_ctl query -D /home/zx/ss_hatest/dn0

[2023-04-24 15:52:33.134][3862235][][gs_ctl]: gs_ctl query ,datadir is /home/zx/ss_hatest/dn0

HA state:

local_role : Standby

static_connections : 2

db_state : Normal

detail_information : Normal

Senders info:

No information

Receiver info:

No information

Primary cluster, primary node 1

[zx@node1host54 dn0]$ gs_ctl query -D /home/zx/ss_hatest/dn1

[2023-04-24 15:52:35.777][3862851][][gs_ctl]: gs_ctl query ,datadir is /home/zx/ss_hatest/dn1

HA state:

local_role : Primary

static_connections : 2

db_state : Normal

detail_information : Normal

Senders info:

sender_pid : 3817397

local_role : Primary

peer_role : StandbyCluster_Standby

peer_state : Normal

state : Streaming

sender_sent_location : 2/43EA678

sender_write_location : 2/43EA678

sender_flush_location : 2/43EA678

sender_replay_location : 2/43EA678

receiver_received_location : 2/43EA678

receiver_write_location : 2/43EA678

receiver_flush_location : 2/43EA678

receiver_replay_location : 2/43EA678

sync_percent : 100%

sync_state : Async

sync_priority : 0

sync_most_available : Off

channel : ...:9700-->...:37904

Receiver info:

No information

Standby cluster, main standby node 0

[zx@node1host54 pg_log]$ gs_ctl query -D /home/zx/ss_hatest1/dn0

[2023-04-24 15:53:44.305][3878378][][gs_ctl]: gs_ctl query ,datadir is /home/zx/ss_hatest1/dn0

HA state:

local_role : Main Standby

static_connections : 2

db_state : Normal

detail_information : Normal

Senders info:

No information

Receiver info:

receiver_pid : 3816277

local_role : Standby

peer_role : Primary

peer_state : Normal

state : Normal

sender_sent_location : 2/43EA798

sender_write_location : 2/43EA798

sender_flush_location : 2/43EA798

sender_replay_location : 2/43EA798

receiver_received_location : 2/43EA798

receiver_write_location : 2/43EA798

receiver_flush_location : 2/43EA798

receiver_replay_location : 2/43EA798

sync_percent : 100%

channel : ...:37904<--...:9700

Standby cluster, cascade standby node 1

[omm@nodename pg_log]$ gs_ctl query -D /home/zx/ss_hatest1/dn1

[2023-04-24 15:53:46.779][3879076][][gs_ctl]: gs_ctl query ,datadir is /home/zx/ss_hatest1/dn1

HA state:

local_role : Standby

static_connections : 1

db_state : Normal

detail_information : Normal

Senders info:

No information

Receiver info:

No information

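Note: after a successful switchover, main standby node 0 of the standby cluster automatically reconnects to new primary node 1 of the primary cluster. The disaster recovery relationship is re-established, synchronous replication works normally, and replay on the main standby of the standby cluster is normal.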
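One way to quantify "replay is normal" is to compare a sender location with a receiver location. The sketch below assumes the locations use the PostgreSQL-style hi/lo hexadecimal LSN format (for example 2/43EA678), where the part before the slash is the upper 32 bits. lsn_to_int and lsn_lag are hypothetical helper names.

```shell
#!/usr/bin/env bash
# lsn_to_int <hi/lo>: hypothetical helper, converts an LSN such as 2/43EA678 to an integer
lsn_to_int() {
    local hi="${1%%/*}" lo="${1##*/}"
    echo $(( (16#$hi << 32) | 16#$lo ))
}

# lsn_lag <sender lsn> <receiver lsn>: number of bytes the receiver is behind
lsn_lag() {
    echo $(( $(lsn_to_int "$1") - $(lsn_to_int "$2") ))
}
```

A lag of 0 between sender_sent_location and receiver_replay_location taken from the same query means the standby has replayed everything that was sent. Values taken from two queries run at different times, as in the outputs above, are not directly comparable.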

Notice: not recommended for direct use in production environments.

Author: Shirley_zhengx
