An “Oops” is what the kernel throws at us when it finds something faulty, or an exception, in the kernel code. It’s somewhat like the segfaults of user-space. An Oops dumps its message on the console; it contains the processor status and the CPU registers of when the fault occurred. The offending process that triggered this Oops gets killed without releasing locks or cleaning up structures. The system may not even resume its normal operations sometimes; this is called an unstable state. Once an Oops has occurred, the system cannot be trusted any further.

Let’s try to generate an Oops message with sample code, and try to understand the dump.

Setting up the machine to capture an Oops

The running kernel should be compiled with CONFIG_DEBUG_INFO, and syslogd should be running. To generate and understand an Oops message, Let’s write a  sample kernel module,oops.c:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>
 
static void create_oops() {
        *(int *)0 = 0;
}
 
static int __init my_oops_init(void) {
        printk("oops from the module\n");
        create_oops();
       return (0);
}
static void __exit my_oops_exit(void) {
        printk("Goodbye world\n");
}
 
module_init(my_oops_init);
module_exit(my_oops_exit);

The associated Makefile for this module is as follows:

obj-m   := oops.o
KDIR    := /lib/modules/$(shell uname -r)/build
PWD     := $(shell pwd)
SYM=$(PWD)
 
all:
        $(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules

Once executed, the module generates the following Oops:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffffa03e1012>] my_oops_init+0x12/0x21 [oops]
PGD 7a719067 PUD 7b2b3067 PMD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/virtual/misc/kvm/uevent
CPU 1
Pid: 2248, comm: insmod Tainted: P           2.6.33.3-85.fc13.x86_64
RIP: 0010:[<ffffffffa03e1012>]  [<ffffffffa03e1012>] my_oops_init+0x12/0x21 [oops]
RSP: 0018:ffff88007ad4bf08  EFLAGS: 00010292
RAX: 0000000000000018 RBX: ffffffffa03e1000 RCX: 00000000000013b7
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000246
RBP: ffff88007ad4bf08 R08: ffff88007af1cba0 R09: 0000000000000004
R10: 0000000000000000 R11: ffff88007ad4bd68 R12: 0000000000000000
R13: 00000000016b0030 R14: 0000000000019db9 R15: 00000000016b0010
FS:  00007fb79dadf700(0000) GS:ffff880001e80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000007a0f1000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process insmod (pid: 2248, threadinfo ffff88007ad4a000, task ffff88007a222ea0)
Stack:
ffff88007ad4bf38 ffffffff8100205f ffffffffa03de060 ffffffffa03de060
 0000000000000000 00000000016b0030 ffff88007ad4bf78 ffffffff8107aac9
 ffff88007ad4bf78 00007fff69f3e814 0000000000019db9 0000000000020000
Call Trace:
[<ffffffff8100205f>] do_one_initcall+0x59/0x154
[<ffffffff8107aac9>] sys_init_module+0xd1/0x230
[<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
Code: <c7> 04 25 00 00 00 00 00 00 00 00 31 c0 c9 c3 00 00 00 00 00 00 00
RIP  [<ffffffffa03e1012>] my_oops_init+0x12/0x21 [oops]
RSP <ffff88007ad4bf08>
CR2: 0000000000000000

Understanding the Oops dump

Let’s have a closer look at the above dump, to understand some of the important bits of information.

BUG: unable to handle kernel NULL pointer dereference at (null)

The first line indicates a pointer with a NULL value.

IP: [<ffffffffa03e1012>] my_oops_init+0x12/0x21 [oops]

IP is the instruction pointer.

Oops: 0002 [#1] SMP

This is the error code value in hex. Each bit has a significance of its own:

  • bit 0 == 0 means no page found, 1 means a protection fault
  • bit 1 == 0 means read, 1 means write
  • bit 2 == 0 means kernel, 1 means user-mode
  • [#1] — this value is the number of times the Oops occurred. Multiple Oops can be triggered as a cascading effect of the first one.
CPU 1

This denotes on which CPU the error occurred.

Pid: 2248, comm: insmod Tainted: P           2.6.33.3-85.fc13.x86_64

The Tainted flag points to P here. Each flag has its own meaning. A few other flags, and their meanings, picked up from kernel/panic.c:

  • P — Proprietary module has been loaded.
  • F — Module has been forcibly loaded.
  • S — SMP with a CPU not designed for SMP.
  • R — User forced a module unload.
  • M — System experienced a machine check exception.
  • B — System has hit bad_page.
  • U — Userspace-defined naughtiness.
  • A — ACPI table overridden.
  • W — Taint on warning.
RIP: 0010:[<ffffffffa03e1012>]  [<ffffffffa03e1012>] my_oops_init+0x12/0x21 [oops]

RIP is the CPU register containing the address of the instruction that is getting executed. 0010comes from the code segment register. my_oops_init+0x12/0x21 is the <symbol> + the offset/length.

RSP: 0018:ffff88007ad4bf08  EFLAGS: 00010292
RAX: 0000000000000018 RBX: ffffffffa03e1000 RCX: 00000000000013b7
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000246
RBP: ffff88007ad4bf08 R08: ffff88007af1cba0 R09: 0000000000000004
R10: 0000000000000000 R11: ffff88007ad4bd68 R12: 0000000000000000
R13: 00000000016b0030 R14: 0000000000019db9 R15: 00000000016b0010

This is a dump of the contents of some of the CPU registers.

Stack:
ffff88007ad4bf38 ffffffff8100205f ffffffffa03de060 ffffffffa03de060
 0000000000000000 00000000016b0030 ffff88007ad4bf78 ffffffff8107aac9
 ffff88007ad4bf78 00007fff69f3e814 0000000000019db9 0000000000020000

The above is the stack trace.

Call Trace:
[<ffffffff8100205f>] do_one_initcall+0x59/0x154
[<ffffffff8107aac9>] sys_init_module+0xd1/0x230
[<ffffffff81009b02>] system_call_fastpath+0x16/0x1b

The above is the call trace — the list of functions being called just before the Oops occurred.

Code: <c7> 04 25 00 00 00 00 00 00 00 00 31 c0 c9 c3 00 00 00 00 00 00 00

The Code is a hex-dump of the section of machine code that was being run at the time the Oops occurred.

Debugging an Oops dump

The first step is to load the offending module into the GDB debugger, as follows:

[root@DELL-RnD-India oops]# gdb oops.ko
GNU gdb (GDB) Fedora (7.1-18.fc13)
Reading symbols from /code/oops/oops.ko...done.
(gdb) add-symbol-file oops.o 0xffffffffa03e1000
add symbol table from file "oops.o" at
    .text_addr = 0xffffffffa03e1000

Next, add the symbol file to the debugger. The add-symbol-file command’s first argument isoops.o and the second argument is the address of the text section of the module. You can obtain this address from /sys/module/oops/sections/.init.text (where oops is the module name):

(gdb) add-symbol-file oops.o 0xffffffffa03e1000
add symbol table from file "oops.o" at
    .text_addr = 0xffffffffa03e1000
(y or n) y
Reading symbols from /code/oops/oops.o...done.

From the RIP instruction line, we can get the name of the offending function, and disassemble it.

(gdb) disassemble my_oops_init
Dump of assembler code for function my_oops_init:
   0x0000000000000038 <+0>:    push   %rbp
   0x0000000000000039 <+1>:    mov    $0x0,%rdi
   0x0000000000000040 <+8>:    xor    %eax,%eax
   0x0000000000000042 <+10>:    mov    %rsp,%rbp
   0x0000000000000045 <+13>:    callq  0x4a <my_oops_init+18>
   0x000000000000004a <+18>:    movl   $0x0,0x0
   0x0000000000000055 <+29>:    xor    %eax,%eax
   0x0000000000000057 <+31>:    leaveq
   0x0000000000000058 <+32>:    retq
End of assembler dump.

Now, to pin point the actual line of offending code, we add the starting address and the offset. The offset is available in the same RIP instruction line. In our case, we are adding0x0000000000000038 + 0x012 =  0x000000000000004a. This points to the movl instruction.

(gdb) list *0x000000000000004a
0x4a is in my_oops_init (/code/oops/oops.c:6).
1    #include <linux/kernel.h>
2    #include <linux/module.h>
3    #include <linux/init.h>
4    
5    static void create_oops() {
6        *(int *)0 = 0;
7    }

This gives the code of the offending function.

References

The kerneloops.org website can be used to pick up a lot of Oops messages to debug. The Linux kernel documentation directory has information about Oops — kernel/Documentation/oops-tracing.txt. This, and numerous other online resources, were used while creating this article.

[fw]Understanding a Kernel Oops!的更多相关文章

  1. Understanding a Kernel Oops!

    Understanding a kernel panic and doing the forensics to trace the bug is considered a hacker's job. ...

  2. Linux Kernel Oops异常分析

    1.PowerPC小系统内核异常分析 1.1  异常打印 Unable to handle kernel paging request for data at address 0x36fef31eFa ...

  3. 如何解读Linux Kernel OOPS信息

    OOPS信息解读 root@firefly:~/mnt/module# insmod oops_module.ko [ 867.140514] Unable to handle kernel NULL ...

  4. Understanding Linux Kernel version 3 读书笔记

    P30, preemptive  kernel .kernel threading 和Multithreaded application support没太好理解,我想如果设计个多线程的程序来运行运行 ...

  5. Linux Kernel PANIC(三)--Soft Panic/Oops调试及实例分析【转】

    转自:https://blog.csdn.net/gatieme/article/details/73715860 版权声明:本文为博主原创文章,遵循 CC 4.0 BY-SA 版权协议,转载请附上原 ...

  6. Kernel boot options

    There are three ways to pass options to the kernel and thus control its behavior: When building the ...

  7. 学习kernel编程的建议

    我把我学习kernel编程的过程介绍给大家,希望大家有个参考. 学习kernel编程需要阅读大量的kernel方面的书籍,在此我列举一下我读过的kernel书籍(按时间先后顺序),并给一些建议. 1. ...

  8. 绕过kernel模块版本校验检测

    kernel module version check bypass . 举例说明 . 内核是怎么实现的 . 怎样去突破 . 总结 . 举例说明 Linux内核版本很多,升级很快,2个小内核版本中内核 ...

  9. 依据linux Oops信息准确定位错误代码所在行

    在linux下调tvp5150am1的过程中,遇到了一kernel oops,内容如下: [   66.714603] Unable to handle kernel paging request a ...

随机推荐

  1. pg_controldata - 显示一个 PostgreSQL 集群的控制信息

    SYNOPSIS pg_controldata [ datadir] DESCRIPTION 描述 pg_controldata 打印那些在 initdb 过程中初始化的信息,比如表版本和服务器的区域 ...

  2. 基于cdn方式的vue+element-ui的单页面架构

    一.下载vue2.x,下载element-ui.js以及css 二.html文件 <!DOCTYPE html> <html> <head> <meta ch ...

  3. 利用docker制作一个带有redis软件的镜像,供其他人使用

    1. 宿主机在etc/apt/下创建一个haha的文件夹 2.宿主机将haha文件夹映射到容器的虚拟系统中etc/apt/ 3. 此时,可以在宿主机和容器虚拟机中同步创建和删除文件 4. 将宿主机中的 ...

  4. window.onload()和$(document).ready的区别( $(document).ready == $(function(){ }) )

    首先$(function(){}) 和 $(document).ready(function(){}) 是一个方法,$(function(){})为简写(用的多) $(document).ready和 ...

  5. hdu 6127 : Hard challenge (2017 多校第七场 1008)(计算几何)

    题目链接 题意:二维平面上有n个点(没有重叠,都不在原点,任意两点连线不过原点),每个点有一个权值,用一条过原点的直线把他们划分成两部分,使两部分的权值和的乘积最大.输出最大的乘积. 极角排序后,将原 ...

  6. 使用Windows service创建一个简单的定时器

    一.需求 我们有时候可能会想要做一些定时任务,例如每隔一段时间去访问某个网站,或者下载一些东西到我们服务器上等等之类的事情,这时候windows service 是一个不错的选择. 二.实现 1.打开 ...

  7. hdu 4641K-string SAM的O(n^2)算法 以及 SAM+并查集优化

    转载:http://www.cnblogs.com/hxer/p/5675149.html 题意:有一个长度为n(n < 5e4)的字符串,Q(Q<=2e5)次操作:操作分为:在末尾插入一 ...

  8. [CF959F]Mahmoud and Ehab and yet another xor task题解

    搞n个线性基,然后每次在上一次的基础上插入读入的数,前缀和线性基,或者说珂持久化线性基. 然后一个num数组记录当时线性基里有多少数 然后每次前缀操作一下就珂以了 代码 #include <cs ...

  9. ConcurrentModificationException异常

    介绍一个我今天在开发中遇到的异常:ConcurrentModificationException异常,当然它是一个非受检的异常,也就是运行时异常. 当我们在遍历集合对象的时候,不能够将集合删除.最佳实 ...

  10. instagram

    https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables={ ...