最近有一台CentOS 6.4的服务器发生多次crash,kernel version 是Linux 2.6.32-431.29.2.el6.x86_64。从vmcore-dmesg日志内容及crash工具解析vmcore文件来看,与已知bug比较匹配。

【分析过程】

1、查看vmcore-dmesg日志

<3>CIFS VFS: Error -11 sending data on socket to server

<4>NOHZ: local_softirq_pending 100

<3>CIFS VFS: Error connecting to socket. Aborting operation

<3>CIFS VFS: cifs_mount failed w/return code = -512

<3>BUG: scheduling while atomic: mysqld/23447/0xffff8801

<4>Modules linked in: des_generic ecb md4 nls_utf8 cifs xfs exportfs tcp_diag inet_diag bluetooth rfkill sunrpc bonding 8021q garp stp llc ipv6 microcode power_meter iTCO_wdt iTCO_vendor_support i2c_i801 lpc_ich mfd_core igb dca i2c_algo_bit i2c_core ptp pps_core sg ses enclosure ext4 jbd2 mbcache sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

<4>Pid: 23447, comm: mysqld Not tainted 2.6.32-431.29.2.el6.x86_64 #1

<4>Call Trace:

<4> [<ffffffff8105da66>] ? __schedule_bug+0x66/0x70

<4> [<ffffffff81529440>] ? thread_return+0x640/0x770

<4> [<ffffffffa000439c>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]

<4> [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0

<4> [<ffffffff815295e3>] ? io_schedule+0x73/0xc0

<4> [<ffffffff811c933d>] ? __blockdev_direct_IO_newtrunc+0xb7d/0x1270

<4> [<ffffffff8116e2da>] ? kmem_getpages+0xba/0x170

<4> [<ffffffff81142029>] ? zone_statistics+0x99/0xc0

<4> [<ffffffffa0352180>] ? xfs_get_blocks_direct+0x0/0x20 [xfs]

<4> [<ffffffff81121eb5>] ? mempool_alloc_slab+0x15/0x20

<4> [<ffffffff811c9aa7>] ? __blockdev_direct_IO+0x77/0xe0

<4> [<ffffffffa0352180>] ? xfs_get_blocks_direct+0x0/0x20 [xfs]

<4> [<ffffffffa0351b00>] ? xfs_end_io_direct_write+0x0/0x80 [xfs]

<4> [<ffffffffa03518dc>] ? xfs_vm_direct_IO+0xec/0x170 [xfs]

<4> [<ffffffffa0352180>] ? xfs_get_blocks_direct+0x0/0x20 [xfs]

<4> [<ffffffffa0351b00>] ? xfs_end_io_direct_write+0x0/0x80 [xfs]

<4> [<ffffffff81226cd6>] ? security_inode_need_killpriv+0x16/0x20

<4> [<ffffffff81120322>] ? generic_file_direct_write+0xc2/0x190

<4> [<ffffffffa03583ba>] ? xfs_file_dio_aio_write+0x12a/0x280 [xfs]

<4> [<ffffffffa035883f>] ? xfs_file_aio_write+0x17f/0x190 [xfs]

<4> [<ffffffff81188b4a>] ? do_sync_write+0xfa/0x140

<4> [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40

<4> [<ffffffff811a7d74>] ? notify_change+0x234/0x340

<4> [<ffffffff81226d36>] ? security_file_permission+0x16/0x20

<4> [<ffffffff81188e48>] ? vfs_write+0xb8/0x1a0

<4> [<ffffffff811898d2>] ? sys_pwrite64+0x82/0xa0

<1>BUG: unable to handle kernel paging request at 00000001a1d9d1a0

2、用crash工具解析vmcore文件,可以看到系统在执行task_tick_fair调用update_curr函数时发生异常。

关于update_curr函数的描述可参考:

https://blog.csdn.net/kickxxx/article/details/9704575

https://blog.csdn.net/arm7star/article/details/78648718

crash> bt

PID: 23447  TASK: ffff881063e38040  CPU: 10  COMMAND: "mysqld"

#0 [ffff8810b8803980] machine_kexec at ffffffff81038f3b

#1 [ffff8810b88039e0] crash_kexec at ffffffff810c5af2

#2 [ffff8810b8803ab0] oops_end at ffffffff8152ca50

#3 [ffff8810b8803ae0] no_context at ffffffff8104a00b

#4 [ffff8810b8803b30] __bad_area_nosemaphore at ffffffff8104a295

#5 [ffff8810b8803b80] bad_area_nosemaphore at ffffffff8104a363

#6 [ffff8810b8803b90] __do_page_fault at ffffffff8104aabf

#7 [ffff8810b8803cb0] do_page_fault at ffffffff8152e99e

#8 [ffff8810b8803ce0] page_fault at ffffffff8152bd55

[exception RIP: update_curr+324]

RIP: ffffffff8105a6e4  RSP: ffff8810b8803d98  RFLAGS: 00010082

RAX: ffff881063e38040  RBX: 0000000044034048  RCX: ffff88107fef3140

RDX: 0000000000019410  RSI: 0000000000000000  RDI: ffff881063e38078

RBP: ffff8810b8803dc8   R8: ffffffff8160c325   R9: 0000000000000421

R10: 0000000000000010  R11: 0000000000000421  R12: ffff8810b88168a8

R13: 00000000000d8be5  R14: 0000000000000001  R15: ffff881063e38040

ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018

#9 [ffff8810b8803dd0] task_tick_fair at ffffffff8105ab2b

#10 [ffff8810b8803e00] scheduler_tick at ffffffff8105dc21

#11 [ffff8810b8803e40] update_process_times at ffffffff8108404e

#12 [ffff8810b8803e70] tick_sched_timer at ffffffff810acad6

#13 [ffff8810b8803ea0] __run_hrtimer at ffffffff8109f6be

#14 [ffff8810b8803ef0] hrtimer_interrupt at ffffffff8109fa26

#15 [ffff8810b8803f70] local_apic_timer_interrupt at ffffffff81031f1d

#16 [ffff8810b8803f90] smp_apic_timer_interrupt at ffffffff815325e5

#17 [ffff8810b8803fb0] apic_timer_interrupt at ffffffff8100bb93

--- <IRQ stack> ---

#18 [ffff880144035528] apic_timer_interrupt at ffffffff8100bb93

[exception RIP: vprintk+593]

RIP: ffffffff81072a91  RSP: ffff8801440355d8  RFLAGS: 00000246

RAX: 00000000000114c0  RBX: ffff880144035668  RCX: 0000000000007c8c

RDX: ffff8810b8800000  RSI: 0000000000000046  RDI: 0000000000000246

RBP: ffffffff8100bb8e   R8: 000000000001bdf6   R9: 00000000fffffffb

R10: 0000000000000001  R11: 000000000000000a  R12: 81eb2000812991a9

R13: 0000000000000001  R14: ffffffffa003f1fe  R15: ffffffffffffffff

ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018

#19 [ffff880144035670] printk at ffffffff81528845

#20 [ffff8801440356d0] printk_address at ffffffff81010da1

#21 [ffff8801440356e0] print_trace_address at ffffffff81010f2c

#22 [ffff880144035710] print_context_stack at ffffffff81011141

#23 [ffff880144035780] dump_trace at ffffffff81010110

#24 [ffff880144035820] show_trace_log_lvl at ffffffff81010eb5

#25 [ffff880144035850] show_trace at ffffffff81010ee5

#26 [ffff880144035860] dump_stack at ffffffff8152868e

#27 [ffff8801440358a0] __schedule_bug at ffffffff8105da66

...skipping...

#10 [ffff8810b8803e00] scheduler_tick at ffffffff8105dc21

#11 [ffff8810b8803e40] update_process_times at ffffffff8108404e

#12 [ffff8810b8803e70] tick_sched_timer at ffffffff810acad6

#13 [ffff8810b8803ea0] __run_hrtimer at ffffffff8109f6be

#14 [ffff8810b8803ef0] hrtimer_interrupt at ffffffff8109fa26

#15 [ffff8810b8803f70] local_apic_timer_interrupt at ffffffff81031f1d

#16 [ffff8810b8803f90] smp_apic_timer_interrupt at ffffffff815325e5

#17 [ffff8810b8803fb0] apic_timer_interrupt at ffffffff8100bb93

--- <IRQ stack> ---

#18 [ffff880144035528] apic_timer_interrupt at ffffffff8100bb93

[exception RIP: vprintk+593]

RIP: ffffffff81072a91  RSP: ffff8801440355d8  RFLAGS: 00000246

RAX: 00000000000114c0  RBX: ffff880144035668  RCX: 0000000000007c8c

RDX: ffff8810b8800000  RSI: 0000000000000046  RDI: 0000000000000246

RBP: ffffffff8100bb8e   R8: 000000000001bdf6   R9: 00000000fffffffb

R10: 0000000000000001  R11: 000000000000000a  R12: 81eb2000812991a9

R13: 0000000000000001  R14: ffffffffa003f1fe  R15: ffffffffffffffff

ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018

#19 [ffff880144035670] printk at ffffffff81528845

#20 [ffff8801440356d0] printk_address at ffffffff81010da1

#21 [ffff8801440356e0] print_trace_address at ffffffff81010f2c

#22 [ffff880144035710] print_context_stack at ffffffff81011141

#23 [ffff880144035780] dump_trace at ffffffff81010110

#24 [ffff880144035820] show_trace_log_lvl at ffffffff81010eb5

#25 [ffff880144035850] show_trace at ffffffff81010ee5

#26 [ffff880144035860] dump_stack at ffffffff8152868e

#27 [ffff8801440358a0] __schedule_bug at ffffffff8105da66

#28 [ffff8801440358c0] thread_return at ffffffff81529440

#29 [ffff880144035980] io_schedule at ffffffff815295e3

#30 [ffff8801440359a0] __blockdev_direct_IO_newtrunc at ffffffff811c933d

#31 [ffff880144035b50] __blockdev_direct_IO at ffffffff811c9aa7

#32 [ffff880144035bd0] xfs_vm_direct_IO at ffffffffa03518dc [xfs]

#33 [ffff880144035c60] generic_file_direct_write at ffffffff81120322

#34 [ffff880144035cd0] xfs_file_dio_aio_write at ffffffffa03583ba [xfs]

#35 [ffff880144035d60] xfs_file_aio_write at ffffffffa035883f [xfs]

#36 [ffff880144035dc0] do_sync_write at ffffffff81188b4a

#37 [ffff880144035ef0] vfs_write at ffffffff81188e48

#38 [ffff880144035f30] sys_pwrite64 at ffffffff811898d2

#39 [ffff880144035f80] system_call_fastpath at ffffffff8100b072

RIP: 00000038fe00f1a3  RSP: 00007f0bd85c3a48  RFLAGS: 00010206

RAX: 0000000000000012  RBX: ffffffff8100b072  RCX: 0000000000000000

RDX: 0000000000100000  RSI: 00007f0b60c40000  RDI: 0000000000000479

RBP: 00007f0bd85c39a0   R8: 0000000000100000   R9: 000000007aa00000

R10: 000000007aa00000  R11: 0000000000000293  R12: 0000000001350c40

R13: 00007f0bad491b98  R14: 000000007aa00000  R15: 00007f0bad491b98

ORIG_RAX: 0000000000000012  CS: 0033  SS: 002b

3、这个问题官方有详细的解决方案描述,Thread overran stack, or stack corrupted while running XFS,可参考
https://access.redhat.com/solutions/54544

【结论】

CentOS6.4 Linux 2.6.32-431.29.2.el6.x86_64这个版本之前有两个相关的bug,693280和918359,如下

https://gist.githubusercontent.com/namikawa/8677d8742994e3c19ee1/raw/06da129ab4483c0c3ede65319c784b5698e49a1c/CentOS_6.5_kernel_CHANGELOG

问题在kernel-2.6.32-431.46.1.el6后修复,建议升级到CentOS 7.4后可以解决。

CentOS 6.4在运行XFS时系统crash的bug分析的更多相关文章

  1. iOS学习之Objective-C 2.0 运行时系统编程

    0 导言 本主主要内容包括: 1.概述2.参考3.运行时系统的版本和平台4.和运行时系统的交互5.消息6.动态方法解析7.消息转发8.类型编码9.属性声明 1 概述 Objective-C语言将决定尽 ...

  2. 关于在64位win7下运行Virtualbox安装系统时出错(提示VBoxDD.DLL错误)的解决方

    安装没有问题,安装了最新版VirtualBox-4.3.18-96516-Win,一点运行想安装系统时就出错. 这是提示的错误: 运行Virtualbox去安装系统时出错:Failed to open ...

  3. [Xcode 实际操作]八、网络与多线程-(18)PerformSelector消息处理方法:由运行时系统,负责去调用对象的指定方法

    目录:[Swift]Xcode实际操作 本文将演示PerformSelector消息处理方法. 在项目文件夹上点击鼠标右键弹出文件菜单. [New File]->[Swift File]-> ...

  4. Java 内存管理、JVM 工作原理与 Java 运行时系统

    Java 虚拟机规范中说明:所有的对象实例(all class instances)以及数组都要在堆上分配: the heap is the runtime data area from which ...

  5. Objective-C NSObject与运行时系统

    创建: 2018/01/26 完成: 2018/02/02 遗留: TODO  NSObject  运行时系统 程序运行时候起作用的系统? NSObject是运行时系统的接口(API)  属性 只有一 ...

  6. CentOS 6.5安装部署Zabbix监控系统

    CentOS 6.5安装部署Zabbix监控系统 先说一点废话,我没有用centos7做实验,讲真,centos 7我也不常用,喜欢新版本的同学其实可以尝试下,注意一点的就是centos 6.5只支持 ...

  7. Ubuntu 13.04/CentOS 6.4 下C++开发时的相关设置

    Ubuntu 13.04/CentOS 6.4 下C++开发时的相关设置 一.基本设置 首先,为了可以使我们的c++ 可以找到 iostream类,std标准库,我们需要在C/C++ General- ...

  8. 运行WampServer时,提示Exception Exception in module wampmanager.exe at 000F15A0.解决办法

    出现问题:运行WampServer时,提示Exception Exception in module wampmanager.exe at 000F15A0.解决办法 出现问题原因: ①:缺少Visu ...

  9. win7以管理员身份运行bat提示系统找不到指定的路径。

    windows7“以管理员身份运行”bat提示“系统找不到指定的路径.” 使用批处理安装服务,直接双击运行没有权限,右键“以管理员身份运行”却提示“系统找不到指定的路径.”,反复查看路径是正确的. 打 ...

随机推荐

  1. Django 利用管理器实现文章归档

    Django管理器:class Manager 管理器是Django的模型进行数据库查询的接口,Django应用的每个模型都拥有至少一个管理器.默认情况下,Django为每个模型类添加一个名为obje ...

  2. [ VB ] OrElse, AndAlso [ C# ] ||, && 运算符

    条件演算子も当然のように C# と VB では記述方法が異なる.比較すると下表のようになる. VB              C#OrElse        ||AndAlso     &&a ...

  3. nodejs的process模块如何获取其他进程的pid

    var cmd=process.platform=='win32'?'tasklist':'ps aux'; var exec = require('child_process').exec; var ...

  4. ES系列五、ES6.3常用api之搜索类api

    1.搜索api 1.1.routing:路由 执行搜索时,它将广播到所有索引/索引分片(副本之间的循环).可以通过提供routing参数来控制将搜索哪些分片.例如,在索引book时,路由值可以是nam ...

  5. SharePoint 2010:“&”作为SharePoint账号密码引起的错误

    一朋友修改了SharePoint 2010系统账号密码,导致无法登陆.他的环境如下: 两台服务器:AD+SharePoint 2010 ,Sql Server 2008 r2 目标站点开启了Form登 ...

  6. 默认以管理员身份运行VS2013/15/17

    方法如下: 1.右击VS的快捷方式,选择[属性],打开属性对话框,再点击[高级]按钮,如下图所示: 2.再勾选[用管理员身份运行],点击[确定]即可: 然后就可以双击VS快捷方式,直接以管理员身份运行 ...

  7. RMAN-06059(转)

    change archivelog all crosscheck;

  8. 通达OA数据库优化方案之_历史数据清理

    整体思路: 1.备份现有系统数据,提供一个可以查询历史的入口(可以查询2015年6月份以前的所有OA产生的流程) 2.删除生产环境中2015年6月份以前的流程 为避免影响考勤,暂定在本月考勤定稿后实施 ...

  9. 使用Eclipse创建Web Services

    正文: 项目源文件: 百度云盘/博客园/project/wsServerExample/wsServerExample.rar 参考文献: http://www.ibm.com/developerwo ...

  10. WM8960音频播放

    第一节 音频播放原理首先需要申明一下,本章的代码来自网络,参考了亚嵌教育李明老师(论坛ID:limingth)的帖子: http://www.arm9home.net/read.php?tid=205 ...