KingbaseES R6 集群主机锁冲突导致的主备切换案例
案例说明:
主库在业务高峰期间,客户执行建表等DDL操作,主库产生“AccessExclusiveLock ”锁,导致大量的事务产生锁冲突,大量的会话堆积,客户端session访问主库失败。备库和主库之间的PQping的心跳通讯测试也受到阻塞,在指定阈值(10次)范围内,尝试连接失败后,触发主备切换。
一、案例故障信息
1)主库sys_log日志
=如下图所示,在主库上产生了“AccessExclusiveLock ”锁,出现大量锁冲突=
2022-03-23 11:36:00 CST [3165572]: [2241-1] user=bjoa,db=gzwwwoa,app=Kingbase8 JDBC Driver,client=192.3.89.9LOG: process 3165572 still waiting for AccessExclusiveLock on tuple (551717,56) of relation 20499 of database 16461 after 1012.724 ms
2022-03-23 11:36:00 CST [3165572]: [2242-1] user=bjoa,db=gzwwwoa,app=Kingbase8 JDBC Driver,client=192.3.89.9DETAIL: Process holding the lock: 3219816. Wait queue: 3172465, 3220570, 3211000, 3194976, 3195098, 3189287, 3194983, 3148868, 3219348, 3219365, 3219380, 3219672, 3220580, 3218107, 3219360, 3217745, 3211003, 3219456, 3219687, 3217717, 3211892, 3079421, 3218763, 3220067, 3217706, 3217630, 3217372, 3217375, 3194995, 3220143, 3124556, 3117826, 3194934, 3194832, 3218969, 3217628, 3119221, 3219787, 3219385, 3219345, 3217769, 3216508, 3214193, 3218108, 3216459, 3220223, 3218058, 3220627, 3220413, 3215490, 3220591, 3215493, 3120204, 3218892, 3168511, 3194835, 3220588, 3220546, 3217718, 3217629, 3217386, 3220586, 3217778, 3220178, 3218789, 3217707, 3218839, 3150225, 3167040, 3219539, 3169071, 3194836, 3195048, 3194988, 3220523, 3220100, 3217711, 3215655, 3218068, 3168514, 3220589, 3220583, 3218721, 3133538, 3216464, 3218089, 3219051, 3212018, 3161515, 3220060, 3216453, 3217761, 3220176, 3216469, 3218843, 3218492, 3212020, 3210997, 3217715, 3172444, 3217674, 3157246, 3217682, 3220212, 3191504, 3216431, 3220410, 3194889, 3219677, 3220099, 3218810, 3219364, 3218099, 3182051, 3220574, 3218029, 3212166, 3219418, 3220106, 3219772, 3216447, 3215484, 3214194, 3219366, 3195081, 3170956, 3159849, 3170355, 3219405, 3218967, 3217626, 3218115, 3173965, 3220056, 3216482, 3220381, 3219349, 3194845, 3220389, 3217064, 3161902, 3220057, 3218055, 3211895, 3217770, 3220119, 3220227, 3220529, 3220052, 3219065, 3176922, 3194369, 3218054, 3220065, 3212136, 3218105, 3217775, 3218060, 3195091, 3058330, 3220571, 3217739, 3220511, 3220059, 3219400, 3219412, 3219379, 3211893, 3220192, 3220557, 3219676, 3220112, 3219420, 3165574, 3220216, 3217767, 3219533, 3220602, 3194828, 3220097, 3219783, 3219327, 3219329, 3212247, 3220053, 3219766, 3192177, 3219673, 3212162, 3220568, 3212106, 3220527, 3215487, 3220046, 3220537, 3121709, 3153617, 3217422, 3218073, 3218804, 3216434, 3220168, 3140776, 3156690, 3220199, 3212201, 3194380, 3194821, 3220516, 3217371, 3212015, 3194824, 3219346, 3216427, 3216506, 3219347, 3218102, 3215687, 3194816, 3220600, 3210998, 3218483, 3220593, 3220364, 3220407, 3220396, 3218046, 3220544, 3219417, 3218968, 3220380, 3136699, 3220555, 3218110, 3171869, 3220111, 3220084, 3220569, 3195041, 3217768, 3212151, 3179922, 3220167, 3219388, 3215489, 3194699, 3220162, 3160067, 3194809, 3220115, 3220285, 3218771, 3218837, 3218104, 3216437, 3219769, 3217676, 3215481, 3215645, 3218768, 3220418, 3218812, 3175055, 3219799, 3218777, 3153616, 3215488, 3217771, 3218770, 2983432, 3218806, 3216483, 3220130, 3215674, 3218772, 3215672, 3219746, 3211897, 3217374, 3218823, 3220133, 3220228, 3219745, 3220181, 3217765, 3219810, 3215686, 3217727, 3170682, 3219536, 3194808, 3217390, 3212163, 3179926, 3218059, 3218026, 3218824, 3216432, 3212128, 3220506, 3169074, 3220556, 3172449, 3220522, 3220366, 3233821, 3218795, 3218808, 3220287, 3220338, 3219780, 3165572, 3218832, 3220538, 3220499, 3220224, 3215654, 3218841, 3217776, 3220374, 3219675, 3218756, 3215677, 3195087, 3219754.
2022-03-23 11:36:00 CST [3165572]: [2243-1] user=bjoa,db=gzwwwoa,app=Kingbase8 JDBC Driver,client=192.3.89.9CONTEXT: SQL statement "DELETE FROM tbtemp_gettodolistmsg1"
PL/SQL function fngettodolistmsg(integer,integer,refcursor) line 268 at SQL statement
2)备库hamgr.log日志
=如以下日志所示,备库在通过PQping() 函数和主库执行心跳通讯测试,在超过阈值(10次)后,触发主备切换。=
[2022-03-23 11:36:51] [WARNING] unable to ping "host=192.3.89.16 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
[2022-03-23 11:36:51] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-23 11:36:51] [WARNING] unable to connect to upstream node "node16" (ID: 1)
[2022-03-23 11:36:51] [INFO] sleeping 6 seconds until next reconnection attempt
[2022-03-23 11:36:57] [INFO] checking state of node 1, 1 of 10 attempts
[2022-03-23 11:37:07] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.3.89.16 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-23 11:37:07] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-23 11:37:07] [INFO] sleeping 6 seconds until next reconnection attempt
......
[2022-03-23 11:38:57] [INFO] checking state of node 1, 10 of 10 attempts
[2022-03-23 11:39:07] [WARNING] unable to ping "user=esrep connect_timeout=10 dbname=esrep host=192.3.89.16 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr"
[2022-03-23 11:39:07] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2022-03-23 11:39:07] [WARNING] unable to reconnect to node 1 after 10 attempts
[2022-03-23 11:39:07] [NOTICE] setting "wal_retrieve_retry_interval" to 86405000 milliseconds
[2022-03-23 11:39:07] [INFO] sleeping 5 seconds
[2022-03-23 11:39:12] [NOTICE] killing WAL receiver with PID 1794980
[2022-03-23 11:39:13] [INFO] WAL receiver with pid 1794980 killed
[2022-03-23 11:39:14] [NOTICE] killing WAL receiver with PID 1657239
[2022-03-23 11:39:15] [INFO] WAL receiver with pid 1657239 killed
[2022-03-23 11:39:16] [NOTICE] WAL receiver disconnected on all sibling nodes
[2022-03-23 11:39:16] [INFO] WAL receiver disconnected on all 0 sibling nodes
[2022-03-23 11:39:16] [INFO] 0 active sibling nodes registered
[2022-03-23 11:39:16] [INFO] primary and this node have the same location ("default")
[2022-03-23 11:39:16] [INFO] no other sibling nodes - we win by default
[2022-03-23 11:39:16] [NOTICE] setting "wal_retrieve_retry_interval" to 5000 ms
[2022-03-23 11:39:16] [NOTICE] this node is the only available candidate and will now promote itself
[2022-03-23 11:39:16] [INFO] try to ping the trusted_servers "192.3.89.1" before execute promote_command
[2022-03-23 11:39:18] [NOTICE] PING 192.3.89.1 (192.3.89.1) 56(84) bytes of data.
--- 192.3.89.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.152/0.159/0.166/0.007 ms
[2022-03-23 11:39:18] [NOTICE] successfully ping one or more of the trusted_servers "192.3.89.1"
[2022-03-23 11:39:20] [NOTICE] try to stop old primary db (host: "192.3.89.16")
[2022-03-23 11:39:20] [NOTICE] PING 192.3.89.60 (192.3.89.60) 56(84) bytes of data.
--- 192.3.89.60 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1023ms
rtt min/avg/max/mdev = 0.423/0.575/0.728/0.152 ms
[2022-03-23 11:39:20] [WARNING] the virtual ip is already on other host, try to release it on old primary node (host: "192.3.89.16")
[2022-03-23 11:39:21] [INFO] stop database ...
[2022-03-23 11:39:21] [INFO] ES connection to host "192.3.89.16" succeeded, ready to release vip on it
[2022-03-23 11:39:23] [INFO] kill signal 9 to pid 2091799
[2022-03-23 11:39:23] [INFO] stop db done.
[2022-03-23 11:39:23] [NOTICE] old primary node (host: "192.3.89.16") does not hold the virtual ip 192.3.89.60/24 before we release it
[2022-03-23 11:39:23] [NOTICE] will acquire the virtual ip again
[2022-03-23 11:39:25] [NOTICE] PING 192.3.89.60 (192.3.89.60) 56(84) bytes of data.
--- 192.3.89.60 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1001ms
[2022-03-23 11:39:25] [WARNING] ping host"192.3.89.60" failed
[2022-03-23 11:39:25] [DETAIL] average RTT value is not greater than zero
[2022-03-23 11:39:25] [INFO] loadvip result: 1, arping result: 1
[2022-03-23 11:39:25] [NOTICE] new primary node (ID: 2) acquire the virtual ip 192.3.89.60/24 success
[2022-03-23 11:39:25] [INFO] promote_command is:
"/home/kingbase/cluster/cluster/KingbaseCluster/kingbase/bin/repmgr standby promote -f /home/kingbase/cluster/cluster/KingbaseCluster/kingbase/etc/repmgr.conf"
NOTICE: promoting standby to primary
DETAIL: promoting server "node17" (ID: 2) using sys_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node17" (ID: 2) was successfully promoted to primary
三、案例分析
1)AccessExclusiveLock 锁,一般是因为执行了以下操作(DDL)产生
2)PQping()函数
repmgr备库对主库状态的监控是通过PQping()函数,PQping()函数是libpq接口下的函数:libpq库是PostgreSQL的C接口。它是一组库函数,允许客户端程序与PostgreSQL交互。简单地说,Libpq 可做三件事:与数据库建立连接、发送 操作数据库的语句并处理结果。
3) 主备切换分析
主备之间通过libpq接口的PQping()函数做心跳测试,由于主库产生锁冲突,导致访问主库的会话堆积,致使备库连接主库PQping()失败,在超过阈值(10次)后,发生主备切换。
4)确定现场业务操作
四、总结
本次主备切换可以判断为,客户在业务高峰期,执行DDL操作,主库大量的锁冲突,会话阻塞,导致主备之间心跳通讯测试失败而引起。在生产环境业务高峰期间,应禁止执行DDL等操作,应该安排在业务低峰期间执行。
KingbaseES R6 集群主机锁冲突导致的主备切换案例的更多相关文章
- KingbaseES R6 集群物理copy方式手工添加新备库节点
案例说明: 对于主库数据量比较大的环境,在添加新节点是可以采用在线clone方式创建新的备库节点,也可以在离线的状态下,直接拷贝其中一个备库的所有集群相关目录来创建新的备库节点.本案例介绍了通过离线物 ...
- KingbaseES R6 集群创建流复制只读副本库案例
一.环境概述 [kingbase@node2 bin]$ ./ksql -U system test ksql (V8.0) Type "help" for help. test= ...
- KingbaseES R6 集群修改物理IP和VIP案例
在用户的实际环境里,可能有时需要修改主机的IP,这就涉及到集群的配置修改.以下以例子的方式,介绍下KingbaseES R6集群如何修改IP. 一.案例测试环境 操作系统: [KINGBASE@nod ...
- KingbaseES R6 集群 recovery 参数对切换的影响
案例说明:在KingbaseES R6集群中,主库节点出现宕机(如重启或关机),会产生主备切换,但是当主库节点系统恢复正常后,如何对原主库节点进行处理,保证集群数据的一致性和安全,可以通过对repmg ...
- KingbaseES R6 集群通过备库clone在线添加新节点
案例说明: KingbaseES R6集群可以通过图形化方式在线添加新节点,但是在添加新节点clone环节时,是从主库copy数据到新的节点,这样在生产环境,如果数据量大,将会对主库的网络I/O造成压 ...
- KingbaseES R6 集群repmgr.conf参数'recovery'测试案例(一)
KingbaseES R6集群repmgr.conf参数'recovery'测试案例(一) 案例说明: 在KingbaseES R6集群中,主库节点出现宕机(如重启或关机),会产生主备切换,但是当主库 ...
- KingbaseES R6 集群修改data目录
案例说明: 本案例是在部署完成KingbaseES R6集群后,由于业务的需求,集群需要修改data(数据存储)目录的测试.本案例分两种修改方式,第一种是离线修改data目录,即关闭整个集群后,修改数 ...
- KingbaseES R6 集群启动‘incorrect command permissions for the virtual ip’故障案例
案例说明: KingbaseES R6集群启动时,出现"incorrect command permissions for the virtual ip"故障,本案例介绍了如何分析 ...
- KingbaseES R6 集群sys_monitor.sh change_password一键修改集群用户密码
案例说明: kingbaseES R6集群用户密码修改,需要修改两处: 1)修改数据库用户密码(alter user): 2)修改.encpwd文件中用户密码: 可以通过sys_monitor.sh ...
随机推荐
- 跟着Vam一起学习Typescript(第一期)
一.安装环境与配置1.命令行安装 npm i -g typescript 2.快捷打开Vs Code编辑器 创建一个项目文件夹,在该文件夹下打开命令行工具,使用code .命令快速打开编辑器(如果计算 ...
- Python自动化办公:批量将文件按分类保存,文件再多,只需一秒钟解决
序言 (https://jq.qq.com/?_wv=1027&k=GmeRhIX0) 当我们电脑里面的文本或者或者文件夹太多了,有时候想找到自己想要的文件,只能通过去搜索文件名,要是名字忘记 ...
- Java创建TXT文件并写入 内容
public static void main(String[] args) { String filePath = "E:/" + "1.txt"; Stri ...
- NC207028 第k小数
NC207028 第k小数 题目 题目描述 给你一个长度为 \(n\) 的序列,求序列中第 \(k\) 小数的多少. 输入描述 多组输入,第一行读入一个整数 \(T\) 表示有 \(T\) 组数据. ...
- win10设置Python程序定时运行(设置计划任务)
今天来设置一下定时执行Pycharm内的脚本: 这个要基于win10 的任务计划程序(设置 > 控制面板 > 系统和安全 > 管理工具 > 任务计划程序) 1. create ...
- javascript基本属性访问对象的属性和方法
var myName = "Shelley"; //字符串基本类型 alert(myName.length); //隐式创建String对象,数值与myName相同,并执行len ...
- 【docker专栏4】使用docker安装nginx提供web服务
一般学习一项技术,会先用一个最简单的例子或最典型的例子来向大家讲解入门内容,所以此文为大家介绍使用docker安装nginx容器服务.从基础使用的角度来讲,此文几乎涵盖了docker最核心的内容:镜像 ...
- 本机通过IP地址连接Ubuntu18.04+ on Vmware
一.Vmware-顶部菜单栏-编辑-虚拟网络编辑器: 点一下 添加一个NAT模式的网络:要记住名称,比如这里我的是VMnet8 子网ip可以自己写,建议全程就都按我这个写,后续方便校对. 点一下 NA ...
- day11 Java反射机制
java反射机制 反射是java中的动态机制,它允许我们在程序运行期间再确定类的实例化,方法的调用,属性的调用等,而不是传统意义上的在编码期间确定. 因此,反射可以大大的提高代码的灵活度,但是随之而来 ...
- 优雅哥学 Webpack - 01 - Webpack 5 快速体验
程序员优雅哥简介:十年程序员,呆过央企外企私企,做过前端后端架构.分享vue.Java等前后端技术和架构 本文摘要:主要讲解 webpack 5 初体验.从webpack 5 初识到便写代码.优雅哥将 ...