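Case description:

This case analyzes the failover of a KingbaseES V8R3 cluster after the database service on the primary goes down, and walks through each step of the switchover. It can serve as a reference when analyzing failover faults in KingbaseES V8R3 clusters.

Applicable version:

KingbaseES V8R3

Cluster architecture: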

 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.333333  | standby | 0          | true              | 0
 1       | 192.168.1.102 | 54321 | up     | 0.333333  | primary | 0          | false             | 0
 2       | 192.168.1.103 | 54321 | up     | 0.333333  | standby | 0          | false             | 0

I. Stop the database service on the primary

[kingbase@node102 bin]$ ./sys_ctl stop -D ../data

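II. cluster.log on the master node

1. This node is the kingbasecluster master node. It usually runs on the same host as the streaming replication primary; in this case, however, the logs show the kingbasecluster master on 192.168.1.101 while the primary database is on 192.168.1.102.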

2023-05-07 02:18:21: pid 11666: WARNING:  checking setuid bit of arping command
2023-05-07 02:18:21: pid 11666: DETAIL: arping[/home/kingbase/cluster/HAR3/db/bin//arping] doesn't have setuid bit
2023-05-07 02:18:21: pid 11666: LOG: Backend status file /home/kingbase/cluster/HAR3/run/kingbasecluster/kingbasecluster_status does not exist
......
2023-05-07 02:18:22: pid 11706: LOG: watchdog node state changed from [INITIALIZING] to [STANDING FOR MASTER]

2. Health checks against the primary database fail until the retry threshold is reached (HEALTH_CHECK_MAX_RETRIES=6).

2023-05-07 02:23:24: pid 11666: LOG:  health checking retry count 1
2023-05-07 02:23:24: pid 11666: LOG: failed to connect to kingbase server on "192.168.1.102:54321", getsockopt() detected error "Connection refused"
.......
2023-05-07 02:24:04: pid 11666: LOG: health checking retry count 5
2023-05-07 02:24:04: pid 11666: LOG: failed to connect to kingbase server on "192.168.1.102:54321", getsockopt() detected error "Connection refused"
2023-05-07 02:24:04: pid 11666: ERROR: failed to make persistent db connection
2023-05-07 02:24:04: pid 11666: DETAIL: connection to host:"192.168.1.102:54321" failed
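The retry timestamps in this excerpt are enough to estimate how far apart the health-check attempts are. The following is a minimal Python sketch, not part of KingbaseES; the helper names `parse_retries` and `average_retry_interval` are illustrative, and the regular expression assumes the timestamp and message format shown in the lines quoted above.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2023-05-07 02:23:24: pid 11666: LOG:  health checking retry count 1
RETRY_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): pid \d+: LOG:\s+"
    r"health checking retry count (\d+)"
)


def parse_retries(lines):
    """Return (timestamp, retry_count) tuples for every retry line found."""
    found = []
    for line in lines:
        m = RETRY_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            found.append((ts, int(m.group(2))))
    return found


def average_retry_interval(retries):
    """Average seconds per retry between the first and last sample.

    Returns None when fewer than two distinct retry counts are available.
    """
    if len(retries) < 2:
        return None
    (first_ts, first_n), (last_ts, last_n) = retries[0], retries[-1]
    if last_n == first_n:
        return None
    return (last_ts - first_ts).total_seconds() / (last_n - first_n)


# The two retry lines quoted from the master node's cluster.log.
sample = [
    "2023-05-07 02:23:24: pid 11666: LOG:  health checking retry count 1",
    "2023-05-07 02:24:04: pid 11666: LOG: health checking retry count 5",
]
print(average_retry_interval(parse_retries(sample)))  # 40 s over 4 retries -> 10.0
```

For the quoted lines (retry 1 at 02:23:24, retry 5 at 02:24:04) this gives about 10 seconds between attempts, so roughly a minute passes between the first failed check and the failover request.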

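3. Before the failover runs, the master node must hold the failover lock. A failover lock request arrives from the kingbasecluster standby node and is denied, because by default only the master node can hold the failover lock.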

2023-05-07 02:24:14: pid 11706: LOG:  received the failover command lock request from remote kingbasecluster node "192.168.1.102:9999 Linux node102"
2023-05-07 02:24:14: pid 11706: LOG: remote kingbasecluster node "192.168.1.102:9999 Linux node102" is requesting to become a lock holder for failover ID: 0
2023-05-07 02:24:14: pid 11706: LOG: request to become a lock holder is denied to remote kingbasecluster node "192.168.1.102:9999 Linux node102"
2023-05-07 02:24:14: pid 11706: DETAIL: only master/coordinator can become a lock holder
2023-05-07 02:24:14: pid 11666: LOG: Kingbasecluster-II parent process has received failover request
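The denial in the log can be pictured with a toy model. The sketch below is only an illustration of the rule stated in the DETAIL line ("only master/coordinator can become a lock holder"); the class `WatchdogNode`, the function `request_failover_lock`, and the simplified two-state model are hypothetical and do not reflect the actual kingbasecluster implementation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WatchdogNode:
    name: str   # node label as printed in cluster.log
    state: str  # simplified watchdog state: "MASTER" or "STANDBY"


def request_failover_lock(requester: WatchdogNode) -> Tuple[bool, str]:
    """Toy rule: only the watchdog master may become the failover lock holder."""
    if requester.state != "MASTER":
        return False, "only master/coordinator can become a lock holder"
    return True, "lock granted"


# Node labels taken from the log excerpts in this case.
master = WatchdogNode("192.168.1.101:9999 Linux node101", "MASTER")
standby = WatchdogNode("192.168.1.102:9999 Linux node102", "STANDBY")

print(request_failover_lock(standby))  # denied, as in the log above
print(request_failover_lock(master))   # granted
```

In this model the standby on node102 is refused and only the master on node101 goes on to run the failover script, which matches the sequence of log lines above.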

4. The master node runs the failover_stream.sh script to trigger the failover. (The detailed failover steps for this point in time are recorded in failover.log.)

2023-05-07 02:24:14: pid 11666: LOG:  execute command: /home/kingbase/cluster/HAR3/kingbasecluster/bin/failover_stream.sh 192.168.1.101 1 1 192.168.1.102 192.168.1.102 0 0 /home/kingbase/cluster/HAR3/db/data
2023-05-07 02:24:14: pid 11706: LOG: received the failover command lock request from remote kingbasecluster node "192.168.1.102:9999 Linux node102"
2023-05-07 02:24:14: pid 11706: LOG: remote kingbasecluster node "192.168.1.102:9999 Linux node102" is checking the status of [FAILOVER] lock for failover ID 0
2023-05-07 02:24:14: pid 11706: LOG: FAILOVER lock is currently LOCKED
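The positional arguments on the `execute command` line are easier to read once they are labeled. The sketch below maps them to the placeholders printed in the `failover-stats` line of failover.log (%H, %P, %d, %h, %O, %m, %M, %D, in that order). The function `parse_failover_command` and the dictionary key names are illustrative; reading %d and %h as "the node being detached" is an interpretation that fits this log, not something the source states explicitly.

```python
import shlex

# Placeholder order inferred from the failover-stats line in failover.log.
# The descriptive names on the right are illustrative, not official.
PLACEHOLDERS = [
    ("%H", "new_master_host"),
    ("%P", "old_primary_node_id"),
    ("%d", "failed_node_id"),      # assumption: the detached node's id
    ("%h", "failed_host"),         # assumption: the detached node's host
    ("%O", "old_primary_host"),
    ("%m", "new_master_node_id"),
    ("%M", "old_master_node_id"),
    ("%D", "data_dir"),
]


def parse_failover_command(log_line):
    """Split an 'execute command:' log line into labeled script arguments."""
    marker = "execute command: "
    idx = log_line.find(marker)
    if idx < 0:
        return None
    argv = shlex.split(log_line[idx + len(marker):])
    script, args = argv[0], argv[1:]
    if len(args) != len(PLACEHOLDERS):
        raise ValueError(f"expected {len(PLACEHOLDERS)} arguments, got {len(args)}")
    result = {"script": script}
    for (_, name), value in zip(PLACEHOLDERS, args):
        result[name] = value
    return result


line = (
    "2023-05-07 02:24:14: pid 11666: LOG:  execute command: "
    "/home/kingbase/cluster/HAR3/kingbasecluster/bin/failover_stream.sh "
    "192.168.1.101 1 1 192.168.1.102 192.168.1.102 0 0 "
    "/home/kingbase/cluster/HAR3/db/data"
)
for key, value in parse_failover_command(line).items():
    print(f"{key:22} {value}")
```

Read this way, the command says that node 1 (192.168.1.102), which was also the old primary, has failed and that 192.168.1.101 (node 0) is the promotion target.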

5. After the switchover completes, the other standby databases are recovered and connected to the new primary node.

2023-05-07 02:25:28: pid 11706: LOG:  received the failover command lock request from remote kingbasecluster node "192.168.1.102:9999 Linux node102"
2023-05-07 02:25:28: pid 11706: LOG: remote kingbasecluster node "192.168.1.102:9999 Linux node102" is checking the status of [FAILOVER] lock for failover ID 55
2023-05-07 02:25:28: pid 11706: LOG: FAILOVER lock is currently LOCKED
.......
2023-05-07 02:25:45: pid 11666: LOG: starting fail back. reconnect host 192.168.1.103(54321)

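III. failover.log on the master node

Tips:

The timestamps in this log match the times at which cluster.log records the execution of failover_stream.sh, so the two logs can be read side by side.

1. Run the failover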

-----------------2023-05-07 02:24:14 failover beging---------------------------------------
----failover-stats is %H = hostname of the new master node [192.168.1.101], %P = old primary node id [1], %d = node id[1], %h = host name [192.168.1.102], %O = old primary host[192.168.1.102] %m = new master node id [0], %M = old master node id [0], %D = database cluster path [/home/kingbase/cluster/HAR3/db/data].
----ping trust ip
ping trust ip 192.168.1.1 success ping times :[3], success times:[2]
----determine whether the faulty db is master or standby
master down, let 192.168.1.101 become new primary.....
2023-05-07 02:24:16 del old primary VIP on 192.168.1.102
es_client connect host:192.168.1.102 success, will stop old primary db and del the vip
stop the old primary db
sys_ctl: PID file "/home/kingbase/cluster/HAR3/db/data/kingbase.pid" does not exist
Is server running?
DEL VIP NOW AT 2023-05-07 02:24:02 ON enp0s3
execute: [/sbin/ip addr del 192.168.1.204/24 dev enp0s3]
Oprate del ip cmd end.
2023-05-07 02:24:16 add VIP on 192.168.1.101
ADD VIP NOW AT 2023-05-07 02:24:17 ON enp0s3
execute: [/sbin/ip addr add 192.168.1.204/24 dev enp0s3 label enp0s3:2]
execute: /home/kingbase/cluster/HAR3/db/bin//arping -U 192.168.1.204 -I enp0s3 -w 1
Success to send 1 packets
2023-05-07 02:24:17 promote begin...let 192.168.1.101 become master
check db if is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-05-07 02:24:17 kingbase is ok , to prepare execute promote
execute promote
server promoting
check db if is alive after promote
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-05-07 02:24:17 after execute promote , kingbase status is ok.
after execute promote, kingbase is ok.
2023-05-07 02:24:17 sync to async
ALTER SYSTEM
SYS_RELOAD_CONF
-----------------
t
(1 row) 2023-05-07 02:24:17 make checkpoint
check the db to see if it is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-05-07 02:24:17 kingbase is ok , to prepare execute checkpoint
execute checkpoint
CHECKPOINT
check the db to see if it is alive after execute checkpoint
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" -c "select 33333;"
2023-05-07 02:24:17 after execute checkpoint, kingbase is ok.
after execute checkpoint, kingbase is ok.
-----------------2023-05-07 02:24:17 failover end---------------------------------------
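To see how long each invocation of the script took, the begin and end markers can be paired up. The sketch below is a minimal example; `failover_runs` is an illustrative name, and the pattern assumes the marker format shown in this log, including the script's own spelling "beging". It uses `re.search` because an end marker can be fused onto the tail of a previous output line.

```python
import re
from datetime import datetime

# "beging" is the spelling used by the script's own marker lines.
MARKER_RE = re.compile(
    r"-+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) failover (beging|end)-+"
)


def failover_runs(lines):
    """Pair begin/end markers and return (start, end, seconds) per run."""
    runs = []
    start = None
    for line in lines:
        m = MARKER_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if m.group(2) == "beging":
            start = ts
        elif start is not None:
            runs.append((start, ts, (ts - start).total_seconds()))
            start = None
    return runs


# Marker lines copied from the failover.log excerpts in this case.
sample = [
    "-----------------2023-05-07 02:24:14 failover beging---------------------------------------",
    "-----------------2023-05-07 02:24:17 failover end---------------------------------------",
    "-----------------2023-05-07 02:25:28 failover beging---------------------------------------",
    "(1 row) -----------------2023-05-07 02:25:30 failover end---------------------------------------",
]
for start, end, seconds in failover_runs(sample):
    print(start.time(), "->", end.time(), f"{seconds:.0f}s")
```

On the markers in this case, the promotion run lasts 3 seconds (02:24:14 to 02:24:17) and the later standby run lasts 2 seconds (02:25:28 to 02:25:30).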

2. Recover the other standby node

-----------------2023-05-07 02:25:28 failover beging---------------------------------------
----failover-stats is %H = hostname of the new master node [192.168.1.101], %P = old primary node id [0], %d = node id[2], %h = host name [192.168.1.103], %O = old primary host[192.168.1.101] %m = new master node id [0], %M = old master node id [0], %D = database cluster path [/home/kingbase/cluster/HAR3/db/data].
----ping trust ip
ping trust ip 192.168.1.1 success ping times :[3], success times:[2]
----determine whether the faulty db is master or standby
standby down, master still 192.168.1.101
The sys_stat_replication view result is : []
2023-05-07 02:25:30 sync to async
ALTER SYSTEM
SYS_RELOAD_CONF
-----------------
t
(1 row) -----------------2023-05-07 02:25:30 failover end---------------------------------------
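The two `failover-stats` lines explain why the first run reports "master down" and the second "standby down". The sketch below extracts the placeholder values and compares %d with %P. That comparison is an assumption about how the script classifies the faulty node: it agrees with both runs in this log, but the script's actual test is not shown here. The function names are illustrative.

```python
import re

# Captures pairs such as "%P = old primary node id [1]" -> ("%P", "1").
FIELD_RE = re.compile(r"(%[A-Za-z])\s*=\s*[^\[]*\[([^\]]*)\]")


def parse_failover_stats(line):
    """Return a dict mapping each %X placeholder to its bracketed value."""
    return {key: value for key, value in FIELD_RE.findall(line)}


def faulty_role(stats):
    """Assumed rule: the faulty node was the primary if %d equals %P."""
    return "primary" if stats["%d"] == stats["%P"] else "standby"


first_run = (
    "----failover-stats is %H = hostname of the new master node [192.168.1.101], "
    "%P = old primary node id [1], %d = node id[1], %h = host name [192.168.1.102], "
    "%O = old primary host[192.168.1.102] %m = new master node id [0], "
    "%M = old master node id [0], "
    "%D = database cluster path [/home/kingbase/cluster/HAR3/db/data]."
)
second_run = (
    "----failover-stats is %H = hostname of the new master node [192.168.1.101], "
    "%P = old primary node id [0], %d = node id[2], %h = host name [192.168.1.103], "
    "%O = old primary host[192.168.1.101] %m = new master node id [0], "
    "%M = old master node id [0], "
    "%D = database cluster path [/home/kingbase/cluster/HAR3/db/data]."
)

for label, line in (("first run", first_run), ("second run", second_run)):
    stats = parse_failover_stats(line)
    print(label, stats["%h"], faulty_role(stats))
```

In the first run %d and %P are both 1, so the faulty node is the primary and 192.168.1.101 is promoted. In the second run %P is already 0 (the new primary) and %d is 2, so the script treats 192.168.1.103 as a detached standby and only switches replication from sync to async.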

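IV. cluster.log on the standby node

1. This node is the kingbasecluster standby node. It usually runs on the same host as a streaming replication standby; in this case it runs on 192.168.1.102, the host of the failed primary database.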

---- Sun May 7 02:18:06 CST 2023 monitor up ----
2023-05-07 02:18:06: pid 31862: WARNING: checking setuid bit of arping command
.......
2023-05-07 02:18:07: pid 31886: LOG: setting the remote node "192.168.1.101:9999 Linux node101" as watchdog cluster master
2023-05-07 02:18:08: pid 31886: LOG: watchdog node state changed from [INITIALIZING] to [STANDBY]
2023-05-07 02:18:08: pid 31886: LOG: successfully joined the watchdog cluster as standby node
.......

2. Health checks against the primary database fail until the retry threshold is reached (HEALTH_CHECK_MAX_RETRIES=6).

2023-05-07 02:23:09: pid 31862: LOG:  health checking retry count 1
2023-05-07 02:23:09: pid 31862: LOG: failed to connect to kingbase server on "192.168.1.102:54321", getsockopt() detected error "Connection refused"
2023-05-07 02:23:09: pid 31862: ERROR: failed to make persistent db connection
2023-05-07 02:23:09: pid 31862: DETAIL: connection to host:"192.168.1.102:54321" failed
.......
2023-05-07 02:23:59: pid 31862: LOG: health checking retry count 6
2023-05-07 02:23:59: pid 31862: LOG: failed to connect to kingbase server on "192.168.1.102:54321", getsockopt() detected error "Connection refused"
........

3. The kingbasecluster standby node sends the master node a request to hold the failover lock and waits for the master's reply.

2023-05-07 02:23:59: pid 31886: LOG:  failover request from local kingbasecluster node received on IPC interface is forwarded to master watchdog node "192.168.1.101:9999 Linux node101"
2023-05-07 02:23:59: pid 31886: DETAIL: waiting for the reply...
.......
2023-05-07 02:25:16: pid 31886: LOG: failover command lock request from local kingbasecluster node received on IPC interface is forwarded to master watchdog node "192.168.1.101:9999 Linux node101"
2023-05-07 02:25:16: pid 31886: DETAIL: waiting for the reply...

4. After the failover completes, the other standby node is recovered and connected to the new primary node.

failover done. shutdown host 192.168.1.103(54321)2023-05-07 02:25:16: pid 31862: LOG:  failover done. shutdown host 192.168.1.103(54321)
.......
2023-05-07 02:25:30: pid 31862: LOG: failback done. reconnect host 192.168.1.103(54321)

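V. Summary

Failover flow of a KingbaseES V8R3 cluster:

1. After the cluster starts, the kingbasecluster services elect a master node and a standby node, which exchange heartbeats through the watchdog.

2. When the master and standby nodes detect that the number of failed health checks against the primary's database service has reached the threshold, a failover is triggered.

3. Before the failover, the master node must hold the failover lock. If the primary's host goes down or reboots (taking the kingbasecluster master with it), the kingbasecluster standby node becomes the master and acquires the failover lock.

4. Once it holds the failover lock, the master node runs failover_stream.sh to trigger the failover. If the master node's host hangs, failover_stream.sh may never run and the failover fails.

5. After the failover completes, one streaming replication standby is promoted to primary (by default, the standby that is also a management node), and the other standbys are recovered against the new primary.

6. Detailed information about the failover can be obtained from cluster.log and failover.log.

7. Only the master kingbasecluster that holds the lock can perform primary election, switchover, and similar operations. A standby kingbasecluster can perform them only after it has been re-elected as the new master.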
