作者:梁行 万里数据库DBA,擅长数据库性能问题诊断、事务与锁问题的分析等,负责处理客户MySQL日常运维中的问题,对开源数据库相关技术非常感兴趣。

  • GreatSQL社区原创内容未经授权不得随意使用,转载请联系小编并注明来源。

一、现象说明

在排查问题时发现MySQL主备做了切换,而查看MySQL服务是正常的,DBA也没有做切换操作,服务器也无维护操作,万幸的是业务还没有受到大的波及。这到底是是为什么呢?

假设原主服务器地址为:172.16.87.72,原备主服务器地址为:172.16.87.123。

二、排查思路

通过监控查看MySQL的各个指标

查看双主(keepalived)服务切换日志

MySQL错误日志信息

三、问题排查

3.1 通过监控系统查看MySQL监控指标,判断故障发生的具体时间(通过流量判断大致切换时间点)

通过监控查看现在主库MySQL(支撑业务)的监控指标,看一天的、十五天的 (以每秒查询数量速率为例)

发现从22号开始陡增,初步怀疑,可能22号发生了主备切换,再看一下备库的折线图,进一步确认 查看双主(keepalived)服务切换日志,确定主从切换时间

3.2 查看keepalived判断为什么会做主从切换

先登录87.72查看keepalived的切换日志,日志信息如下:注:系统是ubuntu而且keepalived没有指定输出日志输出,所以keepalived会将日志输出到系统默认日志文件syslog.log中

shell> cat syslog.6|grep "vrrp"
Oct 23 15:50:34 bjuc-mysql1 Keepalived_vrrp: VRRP_Script(check_mysql) failed
Oct 23 15:50:34 bjuc-mysql1 syslog-ng[31634]: Error opening file for writing; filename='/var/log/tang/Keepalived_vrrp.log', error='No such file or directory (2)'
Oct 23 15:50:35 bjuc-mysql1 Keepalived_vrrp: VRRP_Instance(VI_74) Received higher prio advert
Oct 23 15:50:35 bjuc-mysql1 Keepalived_vrrp: VRRP_Instance(VI_74) Entering BACKUP STATE
Oct 23 15:50:35 bjuc-mysql1 syslog-ng[31634]: Error opening file for writing; filename='/var/log/tang/Keepalived_vrrp.log', error='No such file or directory (2)'
Oct 23 15:50:35 bjuc-mysql1 syslog-ng[31634]: Error opening file for writing; filename='/var/log/tang/Keepalived_vrrp.log', error='No such file or directory (2)'
Oct 23 15:50:35 bjuc-mysql1 syslog-ng[31634]: Error opening file for writing; filename='/var/log/tang/Keepalived_vrrp.log', error='No such file or directory (2)'
Oct 23 15:50:35 bjuc-mysql1 syslog-ng[31634]: Error opening file for writing; filename='/var/log/tang/Keepalived_vrrp.log', error='No such file or directory (2)'
Oct 23 15:50:56 bjuc-mysql1 Keepalived_vrrp: VRRP_Script(check_mysql) succeeded

日志分析结果

从日志中可以发现,在Oct 23 15:50:34 因为检测脚本检测MySQL服务失败,导致了vip切换 再去87.123上查看keepalvied的切换,确定一下切换时间

Oct 23 15:50:35 prod-UC-yewu-db-slave2 Keepalived_vrrp: VRRP_Instance(VI_74) forcing a new MASTER election
Oct 23 15:50:36 prod-UC-yewu-db-slave2 Keepalived_vrrp: VRRP_Instance(VI_74) Transition to MASTER STATE
Oct 23 15:50:37 prod-UC-yewu-db-slave2 Keepalived_vrrp: VRRP_Instance(VI_74) Entering MASTER STATE

通过以上日志信息,我们发现双主是在Oct 23 15:50:35发生了切换

3.3 疑问?那么检测脚本为什么会检测失败呢?

查看检测脚本内容

shell> cat check_mysql.sh
#!/bin/bash
export PATH=$PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
USER=xxx
PASSWORD=xxx
PORT=3306
IP=127.0.0.1 mysql --connect-timeout=1 -u$USER -p$PASSWORD -h$IP -P$PORT -e "select 1;"
exit $?

我们发现,检测脚本很简单,执行“select 1”命令,超时时间为1s,这个sql不影响MySQL任何操作,耗费性能很小,那为什么还会失败呢?我们去MySQL查看一下日志,看看有没有发现

3.4 查看MySQL日志

191023 15:50:54 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:55 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:55 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:56 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:56 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:57 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:58 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:58 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:50:59 [ERROR] /usr/sbin/mysqld: Sort aborted: Server shutdown in progress
191023 15:51:28 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'statusnet'. Query: 'UPDATE ticket SET max_id=max_id+step WHERE biz_tag='p2p_file'', Error_code: 1205
191023 15:54:43 [ERROR] Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
191023 15:54:43 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
191023 15:54:43 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.001432' position 973809430
191023 16:04:13 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
191023 16:04:13 [Note] Slave I/O thread killed while reading event
191023 16:04:13 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.001432', position 998926454
191023 16:04:20 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.001432' at position 973809430, relay log './mysql-relay-bin.000188' position: 54422
191023 16:04:20 [Note] 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='./mysql-relay-bin.000188', relay_log_pos='54422', master_log_name='mysql-bin.001432', master_log_pos='973809430' and new position at relay_log_file='./mysql-relay-bin.000188', relay_log_pos='54661', master_log_name='mysql-bin.001432', master_log_pos='973809669'
191023 16:04:20 [Note] Slave I/O thread: connected to master 'zhu@x:3306',replication started in log 'mysql-bin.001432' at position 998926454

从日志中可以发现,MySQL在15:50:54报 Server shutdown in progress错误,在15:54:43 主从关系挂了,在16:04:20的时候主从关系自动恢复正常。那么这个报错是什么呢,Server shutdown in progress是什么意思呢?这个报错信息是因为在MySQL 5.5时,如果主动kill一个长的慢查询,而且这个sql还用到了sort buffer,就会出现类似MySQL重启的情况,连接也会关闭。

官方bug地址:https://bugs.mysql.com/bug.php?id=18256

赶紧去线上环境看看自己的环境是不是MySQL 5.5

mysql> \s
--------------
mysql Ver 14.14 Distrib 5.5.22, for debian-linux-gnu (x86_64) using readline 6.2 Connection id: 76593111
Current database:
Current user: xxx@127.0.0.1
SSL: Not in use
Current pager: stdout
Using outfile: ''
Using delimiter: ;
Server version: 5.5.22-0ubuntu1-log (Ubuntu)
Protocol version: 10
Connection: 127.0.0.1 via TCP/IP
Server characterset: utf8
Db characterset: utf8
Client characterset: utf8
Conn. characterset: utf8
TCP port: 3306
Uptime: 462 days 11 hours 50 min 11 sec Threads: 9 Questions: 14266409536 Slow queries: 216936 Opens: 581411 Flush tables: 605 Open tables: 825 Queries per second avg: 357.022

我们看到线上环境是MySQL 5.5版本,运行时间为462 days 11 hours 50 min 11 sec,证明MySQL进程没有重启,那么通过刚才error日志(主从关系重启)验证了MySQL的这个bug so,keepalived在这一个发现MySQL不可用,就产生了failed。

3.5 那么我们再深挖一下,是什么慢sql导致被kill呢?

我们继续查找相应时间点的slow日志

# Time: 191023 15:50:39
# User@Host: dstatusnet[dstatusnet] @ [x]
# Query_time: 35.571088 Lock_time: 0.742899 Rows_sent: 1 Rows_examined: 72
SET timestamp=1571817039;
SELECT g.id,g.group_pinyin, g.external_type,g.external_group_id,g.mount_id,g.is_display as group_type,g.nickname,g.site_id,g.group_logo,g.created,g.modified,g.misc,g.name_update_flag as name_flag, g.logo_level,g.group_status,g.conversation, g.only_admin_invite,g.reach_limit_count,g.water_mark FROM user_group AS g LEFT JOIN group_member as gm ON g.id = gm.group_id WHERE g.is_display in (1,2,3) AND gm.profile_id = 7382782 AND gm.join_state != 2 AND UNIX_TIMESTAMP(g.modified) > 1571722305 ORDER BY g.modified ASC LIMIT 0, 1000;
# User@Host: dstatusnet[dstatusnet] @ [x]
# Query_time: 34.276504 Lock_time: 8.440094 Rows_sent: 1 Rows_examined: 1
SET timestamp=1571817039;
SELECT conversation, is_display as group_type, group_status FROM user_group WHERE id=295414;
# Time: 191023 15:50:40
# User@Host: dstatusnet[dstatusnet] @ [x]
# Query_time: 31.410245 Lock_time: 0.561083 Rows_sent: 0 Rows_examined: 7
SET timestamp=1571817040;
SELECT id, clientScope, msgTop FROM uc_application where site_id=106762 AND msgTop=1;
# User@Host: dstatusnet[dstatusnet] @ [x]
# Query_time: 38.442446 Lock_time: 2.762945 Rows_sent: 2 Rows_examined: 38
SET timestamp=1571817040;
SELECT g.id,g.group_pinyin, g.external_type,g.external_group_id,g.mount_id,g.is_display as group_type,g.nickname,g.site_id,g.group_logo,g.created,g.modified,g.misc,g.name_update_flag as name_flag, g.logo_level,g.group_status,g.conversation, g.only_admin_invite,g.reach_limit_count,g.water_mark FROM user_group AS g LEFT JOIN group_member as gm ON g.id = gm.group_id WHERE g.is_display in (1,2,3) AND gm.profile_id = 9867566 AND gm.join_state != 2 AND UNIX_TIMESTAMP(g.modified) > 1571796860 ORDER BY g.modified ASC LIMIT 0, 1000;
# Time: 191023 15:50:46
# User@Host: dstatusnet[dstatusnet] @ [x]
# Query_time: 40.481602 Lock_time: 8.530303 Rows_sent: 0 Rows_examined: 1
SET timestamp=1571817046;
update heartbeat set modify=1571817004 where session_id='89b480936c1384305e35b531221b705332a4aebf';
# Time: 191023 15:50:41
# User@Host: dstatusnet[dstatusnet] @ [x]
# Query_time: 32.748755 Lock_time: 0.000140 Rows_sent: 0 Rows_examined: 3
SET timestamp=1571817041;
SELECT id,nickname,mount_id,is_display,external_type,misc,site_id,conversation as conv_id,only_admin_invite,reach_limit_count,water_mark, group_logo,logo_level,created,modified,name_update_flag as name_flag, group_pinyin, group_status,external_group_id,event_id FROM user_group WHERE id in(1316262,1352179,1338753) AND UNIX_TIMESTAMP(modified) > 1571023021 ORDER BY created ASC;
# Time: 191023 15:50:46
# User@Host: dstatusnet[dstatusnet] @ [x]
# Query_time: 40.764872 Lock_time: 4.829707 Rows_sent: 0 Rows_examined: 7
SET timestamp=1571817046;
SELECT id,nickname,mount_id,is_display,external_type,misc,site_id,conversation as conv_id,only_admin_invite,reach_limit_count,water_mark, group_logo,logo_level,created,modified,name_update_flag as name_flag, group_pinyin, group_status,external_group_id,event_id FROM user_group WHERE id in(673950,1261923,1261748,1262047,1038545,1184675,1261825) AND UNIX_TIMESTAMP(modified) > 1571744114 ORDER BY created ASC;

如上所示: 发现在这个时间段这条sql被执行了多次,时间都超过了30~60s,这个属于严重慢sql了

3.6 那么是什么操作执行了kill呢?

经过询问及排查,发现我们通过pt-kill用到了自动查杀功能,配置的查询超时时间为30s

shell> ps -aux|grep 87.72
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root 3813 0.0 0.0 218988 8072 ? Ss Oct19 9:54 perl /usr/bin/pt-kill --busy-time 30 --interval 10 --match-command Query --ignore-user rep|repl|dba|ptkill --ignore-info 40001 SQL_NO_CACHE --victims all --print --kill --daemonize --charset utf8 --log=/tmp/ptkill.log.kill_____xxx_____172.16.87.72 -h172.16.87.72 -uxxxx -pxxxx shell> cat /tmp/ptkill.log.kill_____xxx_____172.16.87.72
2019-10-23T15:45:37 KILL 75278028 (Query 39 sec) statusnet SELECT t.conv_id, t.notice_id, t.thread_type, t.parent_conv_id, t.seq, t.reply_count, t.last_reply_seq1, t.last_reply_seq2, t.revocation_seq, t.creator_id, t.status, t.operator_id, t.extra1 FROM notice_thread t WHERE parent_conv_id = 3075773 and notice_id in (1571816367355,1571814620251,1571814611880,1571814601809,1571814589604,1571814578823,1571814568507,1571814559399,1571812785487,1571812769810) 2019-10-23T15:45:37 KILL 75277932 (Query 39 sec) statusnet SELECT t.conv_id, t.notice_id, t.thread_type, t.parent_conv_id, t.seq, t.reply_count, t.last_reply_seq1, t.last_reply_seq2, t.revocation_seq, t.creator_id, t.status, t.operator_id, t.extra1 FROM notice_thread t WHERE parent_conv_id = 6396698 and notice_id in (18606128) 2019-10-23T15:45:37 KILL 75191686 (Query 39 sec) statusnet SELECT id,nickname,mount_id,is_display,external_type,misc,site_id,conversation as conv_id,only_admin_invite,reach_limit_count,water_mark, group_logo,logo_level,created,modified,name_update_flag as name_flag, group_pinyin, group_status,external_group_id,event_id FROM user_group WHERE id in(988091,736931,843788,739447,737242,1252702,736180,1150776) AND UNIX_TIMESTAMP(modified) > 1571740050 ORDER BY created ASC

发现果然是pt-kill主动去干掉了,结合之前的发现,真相已经在我们眼前了

3.7 那为什么这个sql跑了这么长时间,就单单这个时间报了呢?

我们结合其他时间的slow日志和当前的时间做个对比

# User@Host: dstatusnet[dstatusnet] @  [x]
# Query_time: 2.046068 Lock_time: 0.000079 Rows_sent: 1 Rows_examined: 1
SET timestamp=1572306496;
SELECT ... FROM user_group WHERE id in(693790) AND UNIX_TIMESTAMP(modified) > 0 ORDER BY created ASC;

我们发现一个规律,当where id in(单个值时)时,基本上是两秒,多个值时,时间就不固定了,这个结论有待其他证据进行进一步确认

四、那么我们如何解决呢?

解决及避免办法:

  • 可以选择升级MySQL的版本(5.7的最高版本)

  • 结合业务优化sql

Enjoy GreatSQL

本文由博客一文多发平台 OpenWrite 发布!

故障案例 | 慢SQL引发MySQL高可用切换排查全过程的更多相关文章

  1. MySQL高可用方案--MHA部署及故障转移

    架构设计及必要配置 主机环境 IP                 主机名             担任角色 192.168.192.128  node_master    MySQL-Master| ...

  2. MySQL高可用架构故障自动转移插件MHA

    mha高可用架构是目前mysql高可用故障转移比较成熟的解决方案.MHA插件复杂监控mysql主节点的健康情况.在主节点宕机后,MHA把binlog通过ssh传到从节点进行重做补齐.并提升其中一个从节 ...

  3. MySQL 高可用:mysql+Lvs+Keepalived 负载均衡及故障转移

    系统信息: mysql主库 mysql从库 VIP 192.168.1.150 mysql 主主同步都设置 auto-increment-offset,auto-increment-increment ...

  4. MySQL 高可用架构在业务层面的应用分析

    MySQL 高可用架构在业务层面的应用分析 http://mp.weixin.qq.com/s?__biz=MzAxNjAzMTQyMA==&mid=208312443&idx=1&a ...

  5. (5.8)mysql高可用系列——MySQL中的GTID复制(实践篇)

    一.基于GTID的异步复制(一主一从)无数据/少数据搭建 二.基于GTID的无损半同步复制(一主一从)(mysql5.7)基于大数据量的初始化 正文: [0]概念 [0.5]GTID 复制(mysql ...

  6. (5.1)mysql高可用系列——高可用架构方案概述

    关键词:mysql高可用概述,mysql高可用架构 常用高可用方案 20190918 现在业内常用的MySQL高可用方案有哪些?目前来说,用的比较多的开源方案分内置高可用与外部实现,内置高可用有如下: ...

  7. MySQL高可用集群方案

    一.Mysql高可用解决方案 方案一:共享存储 一般共享存储采用比较多的是 SAN/NAS 方案. 方案二:操作系统实时数据块复制 这个方案的典型场景是 DRBD,DRBD架构(MySQL+DRBD+ ...

  8. MySQL高可用架构之MHA

    简介: MHA(Master High Availability)目前在MySQL高可用方面是一个相对成熟的解决方案,它由日本DeNA公司youshimaton(现就职于Facebook公司)开发,是 ...

  9. [转载] MySQL高可用方案选型参考

    原文: http://imysql.com/2015/09/14/solutions-of-mysql-ha.shtml?hmsr=toutiao.io&utm_medium=toutiao. ...

随机推荐

  1. OAuth2.0笔记

    OAuth2.0笔记 角色 一般资源服务器和授权服务器是一个 资源拥有者 客户端应用 资源服务器 授权服务器 客户端类型 OAuth 2.0规范定义了两种客户端类型: 保密的:web应用 公有的:用户 ...

  2. 345. Reverse Vowels of a String - LeetCode

    Question 345. Reverse Vowels of a String Solution 思路:交换元音,第一次遍历,先把出现元音的索引位置记录下来,第二遍遍历元音的索引并替换. Java实 ...

  3. 程序分析与优化 - 4 工作列表(worklist)算法

    本章是系列文章的第四章,介绍了worklist算法.Worklist算法是图分析的核心算法,可以说学会了worklist算法,编译器的优化方法才算入门.这章学习起来比较吃力,想要融汇贯通的同学,建议多 ...

  4. SpringMVC请求流程源码分析

    一.SpringMVC使用 1.工程创建 创建maven工程. 添加java.resources目录. 引入Spring-webmvc 依赖. <dependency> <group ...

  5. ES6 - promise(3)

    上一篇熟悉了promise的具体过程: promise的过程: 启动异步任务 => 返回promise对象 =>给promise对象绑定回调函数(甚至可以在异步任务结束后指定多个). 从p ...

  6. Node.js躬行记(21)——花10分钟入门Node.js

    Node.js 不是一门语言,而是一个基于 V8 引擎的运行时环境,下图是一张架构图. 由图可知,Node.js 底层除了 JavaScript 代码之外,还有大量的 C/C++ 代码. 常说 Nod ...

  7. 商户编号[Merchant Id]是什么

    1. Merchant Id是什么 2. Merchant Id 是有哪几个部分构成的 2.1 收单机构代码 2.2 商户地区代码 2.3 Merchant Category Code(MCC) 本文 ...

  8. dubbo容错机制

    dubbo的容错机制 Failover Cluster(默认) 失败自动切换,当出现失败,重试其它服务器.通常用于读操作,但重试会带来更长延迟. Failfast Cluster 快速失败,只发起一次 ...

  9. JDBC、ORM、JPA、Spring Data JPA,傻傻分不清楚?一文带你厘清个中曲直,给你个选择SpringDataJPA的理由!

    序言 Spring Data JPA作为Spring Data中对于关系型数据库支持的一种框架技术,属于ORM的一种,通过得当的使用,可以大大简化开发过程中对于数据操作的复杂度. 本文档隶属于< ...

  10. Linux shell 2>&1的意思

    在脚本里经常看到 ./xxx.sh > /dev/null 2>&1 ./xxx.sh > log.file 2>&1 在shell中输入输出都有对应的文件描述 ...