监控作业发现一台服务器(Red Hat Enterprise Linux Server release 5.7)从凌晨1:32开始,有一小段时间无法响应,数据库也连接不上,后面又正常了。早上检查了监听日志,并没有发现错误信息。但是检查告警日志,发现有下面错误信息:

Thread 1 advanced to log sequence 19749 (LGWR switch)

  Current log# 2 seq# 19749 mem# 0: /u01/oradata/epps/redo02.log

  Current log# 2 seq# 19749 mem# 1: /u03/oradata/epps/redo021.log

Wed Aug 17 01:37:08 2016

kkjcre1p: unable to spawn jobq slave process 

Wed Aug 17 01:37:08 2016

Errors in file /u01/app/oracle/admin/epps/bdump/epps_cjq0_4094.trc:

 

Wed Aug 17 01:56:44 2016

kkjcre1p: unable to spawn jobq slave process 

Wed Aug 17 01:56:44 2016

Errors in file /u01/app/oracle/admin/epps/bdump/epps_cjq0_4094.trc:

more /u01/app/oracle/admin/epps/bdump/epps_cjq0_4094.trc

u01/app/oracle/admin/epps/bdump/epps_cjq0_4094.trc

racle Database 10g Release 10.2.0.4.0 - 64bit Production

RACLE_HOME = /u01/app/oracle/product/10.2.0/db_1

ystem name:    Linux

ode name:      ceglnx01

elease:        2.6.32-200.13.1.el5uek

ersion:        #1 SMP Wed Jul 27 21:02:33 EDT 2011

achine:        x86_64

nstance name: epps

edo thread mounted by this instance: 1

racle process number: 17

nix process pid: 4094, image: oracle@ceglnx01 (CJQ0)

 

** 2016-08-13 01:52:16.178

** SERVICE NAME:(SYS$BACKGROUND) 2016-08-13 01:52:16.155

** SESSION ID:(177.9) 2016-08-13 01:52:16.155

aited for process J000 to be spawned for 60 seconds

** 2016-08-13 01:52:26.187

aited for process J000 to be spawned for 70 seconds

** 2016-08-13 01:52:36.197

aited for process J000 to be spawned for 80 seconds

** 2016-08-13 01:52:46.207

aited for process J000 to be spawned for 90 seconds

** 2016-08-13 01:52:56.217

aited for process J000 to be spawned for 100 seconds

** 2016-08-13 01:53:06.227

aited for process J000 to be spawned for 110 seconds

** 2016-08-13 01:57:15.512

** 2016-08-14 01:30:52.159

aited for process J000 to be spawned for 60 seconds

** 2016-08-14 01:31:02.186

aited for process J000 to be spawned for 70 seconds

** 2016-08-14 01:31:12.196

aited for process J000 to be spawned for 80 seconds

** 2016-08-14 01:31:22.206

aited for process J000 to be spawned for 90 seconds

** 2016-08-14 01:31:32.217

aited for process J000 to be spawned for 100 seconds

** 2016-08-14 01:31:42.227

aited for process J000 to be spawned for 110 seconds

** 2016-08-14 01:35:36.155

** 2016-08-14 01:44:46.866

aited for process J000 to be spawned for 60 seconds

** 2016-08-14 01:44:56.876

aited for process J000 to be spawned for 70 seconds

** 2016-08-14 01:45:06.886

aited for process J000 to be spawned for 80 seconds

** 2016-08-14 01:45:16.896

aited for process J000 to be spawned for 90 seconds

** 2016-08-14 01:45:26.906

aited for process J000 to be spawned for 100 seconds

** 2016-08-14 01:45:36.916

aited for process J000 to be spawned for 110 seconds

** 2016-08-14 01:55:49.977

** 2016-08-14 01:58:22.156

aited for process J000 to be spawned for 60 seconds

** 2016-08-14 01:58:32.166

aited for process J000 to be spawned for 70 seconds

** 2016-08-14 01:58:42.176

aited for process J000 to be spawned for 80 seconds

** 2016-08-14 01:58:52.187

aited for process J000 to be spawned for 90 seconds

** 2016-08-14 01:59:02.196

aited for process J000 to be spawned for 100 seconds

** 2016-08-14 01:59:12.206

aited for process J000 to be spawned for 110 seconds

** 2016-08-14 02:17:55.867

** 2016-08-16 01:30:52.151

aited for process J000 to be spawned for 60 seconds

** 2016-08-16 01:31:02.180

aited for process J000 to be spawned for 70 seconds

** 2016-08-16 01:31:12.191

aited for process J000 to be spawned for 80 seconds

** 2016-08-16 01:31:22.202

aited for process J000 to be spawned for 90 seconds

** 2016-08-16 01:31:32.213

aited for process J000 to be spawned for 100 seconds

** 2016-08-16 01:31:42.223

aited for process J000 to be spawned for 110 seconds

** 2016-08-16 01:35:09.610

** 2016-08-16 01:51:50.369

aited for process J000 to be spawned for 60 seconds

** 2016-08-16 01:52:00.379

aited for process J000 to be spawned for 70 seconds

** 2016-08-16 01:52:10.389

aited for process J000 to be spawned for 80 seconds

** 2016-08-16 01:52:20.399

aited for process J000 to be spawned for 90 seconds

** 2016-08-16 01:52:30.409

aited for process J000 to be spawned for 100 seconds

** 2016-08-16 01:52:40.419

aited for process J000 to be spawned for 110 seconds

** 2016-08-16 01:56:11.022

** 2016-08-17 01:22:21.154

aited for process J000 to be spawned for 60 seconds

** 2016-08-17 01:22:31.190

aited for process J000 to be spawned for 70 seconds

** 2016-08-17 01:22:41.199

aited for process J000 to be spawned for 80 seconds

** 2016-08-17 01:22:51.209

aited for process J000 to be spawned for 90 seconds

** 2016-08-17 01:23:01.219

aited for process J000 to be spawned for 100 seconds

** 2016-08-17 01:23:11.229

aited for process J000 to be spawned for 110 seconds

** 2016-08-17 01:37:08.240

** 2016-08-17 01:38:08.401

aited for process J000 to be spawned for 60 seconds

** 2016-08-17 01:38:18.412

aited for process J000 to be spawned for 70 seconds

** 2016-08-17 01:38:28.421

aited for process J000 to be spawned for 80 seconds

** 2016-08-17 01:38:38.431

aited for process J000 to be spawned for 90 seconds

** 2016-08-17 01:38:48.441

aited for process J000 to be spawned for 100 seconds

** 2016-08-17 01:38:58.451

aited for process J000 to be spawned for 110 seconds

** 2016-08-17 01:56:44.526

几乎都是1:30左右发生的,很有意思的是错误出现的时间(工厂系统管理员反馈Symantec BE 收取RMAN备份的那个作业开始异常的时间也刚刚是13号出现,白天我去测试作业收带,发现又都是正常的,当时一直按照官方文档https://www.veritas.com/support/en_US/article.TECH73015在排除、验证),现在看来很有可能Symantec BE作业的异常也是这个引起的。检查了RMAN的备份记录,几乎在12:00到12:40直接就已经操作完成了。 J000一般为作业的进程。检查了一下,仅仅是一个系统收集统计信息的作业在这个时间点运行。暂且看不出什么问题。

关于Waited for process xxx to be spawned for xxx seconds, 在oracle metalink看到下面有意思的资料:

You need to review both the traditional as well as incident trace files to understand the load on the system and major waits on the database. Typically

 

The incident trace file will show you the database wide waits that the child process encountering when coming up.

The traditional trace file will show you the details of the load on the machine (below ones):

Load average

Memory consumption

Output of PS (process state)

Output of GDB (to view the function stack)

In order to investigate the issue, it might also help to understand the general stages of the process startup:

 

Queued phase

Forking Phase

Execution Phase

Initialization phase

In general, the forking and execution phases are directly linked to system load and resources. To check what phase the process startup is, open the traditional trace file (not incident) and look for the wording "Waited for process"

 

Waited for process XYZ to be spawned for nnn seconds

Waited for process XYZ to initialize for nnn seconds

 

If the message contains "to be spawned", it means the process is still at queued or forking phase (1 & 2)

If the message contains "to initialize", it means the process is at execution or Initialize phase (3 & 4)

 

Other useful information can be obtained from the trace files:

 

Open the traditional trace file and review the section which starts with 'Process diagnostic dump for' - 'load average','memory information','process state - ps output' and also 'gdb output' will provide initial insight on the load on the system.

Open the incident trace file and find the section 'PROCESS STATE' and within that section look for 'Current Wait Stack'. This will provide the database wide events that the child process encountered and may provide clues and generic direction on how to proceed.

 

Why does the error occur? - Potential causes and solutions/actions

 

The root cause of this issue mainly falls under the following 2 categories.

Contention among processes: The process which is coming up might require resources that are actually being contended for by different other processes and sometimes the parent process itself might contend for the same resource as the child process (indirectly).

OS and network level issues: The machine on which this is happening might be CPU/memory saturated and this may delay the process spawning. Network latency when your storage is on a network file system may also delay the process spawning.

Some of the common known issues and potential solutions are listed below:

 

Lack of OS resources or incorrect configuration

This error may be observed due to lack of OS resources or incorrect configuration, typically memory or swap space may be insufficient to spawn a new process. 

Accordingly, the following checks may help to identify the issue:

Check OS Error Log file for the time when the error is generated

OS Messages log can provide an indication whether there is a problem with the Operating System Itself

* AIX: the output of the "errpt" command and the "errpt -a" command 

* Linux: /var/log/messages

* Solaris: /var/adm/messages

* HP-UX: /var/adm/syslog/syslog.log

* Tru64: /var/adm/messages

* Windows: Save Application Log and System Log as txt files Using Event Viewer

Run HCVE script 

The HCVE script verifies whether OS resources are set as recommended by Oracle. Instructions on how to download and run the script are outlined in Document 250262.1.

Please note, that the script will only check whether your system is configured as per the recommended 'minimum' values. Depending on your environment these values may not be sufficient though.

Run OS Watcher 

OS Watcher is an Oracle tool that allows you to monitor the system from an OS perspective. Document 301137.1 outlines the usage of this tool.

Check the defined user limitation (ulimit) settings (UNIX-only)

Check the ulimit settings as the oracle user (or the owner of the oracle software) using 

# ulimit -a

Minimum values can be found in Document 169706.1.

Please note, that the values mentioned in the note are bare minimum values for a standard environment. Depending on your environment setup, you may need to increase these values accordingly.

Recommended actions:

 

Review the resource-related suggestions as reported from the output of the HCVE script and make changes accordingly. The following 2 articles may help in understanding these suggestion better:

Document 169706.1: Oracle Database on Unix AIX,HP-UX,Linux,Mac OS X,Solaris,Tru64 Unix Operating Systems Installation and Configuration Requirements Quick Reference (8.0.5 to 11.2)

Document 225349.1: Address Windowing Extensions (AWE) or VLM on Windows Platforms

(Typically on windows-system with more than 4Gb of RAM, enabling the /3GB switch in the boot.ini is required)

Check the define user limitation (ulimit) settings (UNIX-only)

 

ASLR LINUX feature is being used

ASLR is a feature that is activated by default on some of the newer LINUX distributions. It is designed to load shared memory objects in random addresses. In Oracle, multiple processes map a shared memory object at the same address across the processes. With ASLR turned on, Oracle cannot guarantee the availability of this shared memory address. This conflict in the address space means that a process trying to attach a shared memory object to a specific address may not be able to do so, resulting in a failure in SHMAT subroutine.

This problem is mainly reported in Redhat 5 and Oracle 11.2.0.2. You can verify whether ASLR is being used as follows:

# /sbin/sysctl -a | grep randomize

kernel.randomize_va_space = 1

-> If the parameter is set to any value other than 0 then ASLR is in use.

 

Recommended actions:

Disable the use of the ASLR Feature. 

On Redhat 5, to permanently disable ASLR, add/modify this parameter in /etc/sysctl.conf

 

kernel.randomize_va_space=0

kernel.exec-shield=0

 

You need to reboot, for kernel.exec-shield parameter to take effect. 

Note that both kernel parameters are required for ASLR to be switched off.

 

Incorrect Database Settings

Check whether PGA_AGGREGATE_TARGET is set to TRUE/FALSE

The parameter PGA_AGGREGATE_TARGET is a numeric value and not a Boolean value and therefore must be set to a number for it to function correctly.

Check whether PRE_PAGE_SGA is set to TRUE

PRE_PAGE_SGA instructs Oracle to read the entire SGA into active memory at instance startup. Operating system page table entries are then prebuilt for each page of the SGA. This setting can increase the amount of time necessary for instance startup, but it is likely to decrease the amount of time necessary for Oracle to reach its full performance capacity after startup. PRE_PAGE_SGA can increase the process startup duration, because every process that starts must access every page in the SGA, this can cause the PMON process to take longer to start and exceed the timeout which is by default 120 seconds causing the instance startup to fail.

Check output of SQL> select * from v$resource_limit;

V$resource_limit dynamic view provides the details of resources like sessions, processes, locks etc. It has the initialization values for the resources, maximum limit reached after the last database startup and current utilization of the resource.

Recommended actions:

 

Properly set PGA_AGGREGATE_TARGET to a numeric value.

Set PRE_PAGE_SGA to FALSE (recommended).

Check if limits were reached and accordingly increase the value of the resource in concern.

 

Other Causes or Known Issues

Other potential causes known to cause issues could be

Too much activity on your machine

NFS latency issues

Disk latency issue (that affects I/O)

Network latency

 

Recommended actions:

 

Since almost all these reasons are not related to the RDBMS itself, it is recommended you involve your Network, Storage and System Administrators in this investigation. 

也就是说J000这个进程老是在forking phase要花很长时间排队。上面资料也介绍了发生这种错误的场景,我们先检查一下系统资源是否存在不足的情况。

# sar -r

如上所示,内存各项指标不大,CPU利用率在这个时间段确实有所上升。但是总体来说,还是非常低的。

$sar -u

于是我们检查系统的message日志,如下所示,发现这个时间点出现了"INFO: task bdi-default:40 blocked for more than 120 seconds"错误(当然13-17号,只有17号出现了这个错误):

Aug 17 01:32:13 localhost kernel: INFO: task bdi-default:40 blocked for more than 120 seconds.

Aug 17 01:32:13 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Aug 17 01:32:13 localhost kernel: bdi-default   D 0000000000000000     0    40      2 0x00000000

Aug 17 01:32:13 localhost kernel:  ffff88042e7b5bf0 0000000000000046 0000000000000000 ffff88042e5a8080

Aug 17 01:32:13 localhost kernel:  ffff88042e7b2340 ffffffff81aa84c0 ffff88042e7b2710 0000000547046b50

Aug 17 01:32:13 localhost kernel:  0000000000000001 0000000000000000 0000000000000000 7fffffffffffffff

Aug 17 01:32:13 localhost kernel: Call Trace:

Aug 17 01:32:13 localhost kernel:  [<ffffffff814527e9>] schedule_timeout+0x36/0xe7

Aug 17 01:32:13 localhost kernel:  [<ffffffff8102bc30>] ? flat_send_IPI_mask+0x11/0x13

Aug 17 01:32:13 localhost kernel:  [<ffffffff8104367e>] ? need_resched+0x23/0x2d

Aug 17 01:32:13 localhost kernel:  [<ffffffff8145265b>] wait_for_common+0xb7/0x12c

Aug 17 01:32:13 localhost kernel:  [<ffffffff8104cf2f>] ? default_wake_function+0x0/0x19

Aug 17 01:32:13 localhost kernel:  [<ffffffff81452773>] wait_for_completion+0x1d/0x1f

Aug 17 01:32:13 localhost kernel:  [<ffffffff81076aab>] kthread_create+0x9c/0x143

Aug 17 01:32:13 localhost kernel:  [<ffffffff810edce9>] ? bdi_start_fn+0x0/0xd1

Aug 17 01:32:13 localhost kernel:  [<ffffffff810654cb>] ? lock_timer_base+0x2b/0x4f

Aug 17 01:32:13 localhost kernel:  [<ffffffff81065562>] ? try_to_del_timer_sync+0x73/0x81

Aug 17 01:32:13 localhost kernel:  [<ffffffff81065589>] ? del_timer_sync+0x19/0x26

Aug 17 01:32:13 localhost kernel:  [<ffffffff8145286a>] ? schedule_timeout+0xb7/0xe7

Aug 17 01:32:13 localhost kernel:  [<ffffffff81065596>] ? process_timeout+0x0/0x10

Aug 17 01:32:13 localhost kernel:  [<ffffffff810edf8b>] bdi_forker_task+0x1d1/0x263

Aug 17 01:32:13 localhost kernel:  [<ffffffff8103feee>] ? __wake_up_common+0x46/0x79

Aug 17 01:32:13 localhost kernel:  [<ffffffff810eddba>] ? bdi_forker_task+0x0/0x263

Aug 17 01:32:13 localhost kernel:  [<ffffffff810eddba>] ? bdi_forker_task+0x0/0x263

Aug 17 01:32:13 localhost kernel:  [<ffffffff81076a07>] kthread+0x6e/0x76

Aug 17 01:32:13 localhost kernel:  [<ffffffff81012dea>] child_rip+0xa/0x20

Aug 17 01:32:13 localhost kernel:  [<ffffffff81076999>] ? kthread+0x0/0x76

Aug 17 01:32:13 localhost kernel:  [<ffffffff81012de0>] ? child_rip+0x0/0x20

Aug 17 01:34:13 localhost kernel: INFO: task bdi-default:40 blocked for more than 120 seconds.

Aug 17 01:34:13 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Aug 17 01:34:13 localhost kernel: bdi-default   D 0000000000000000     0    40      2 0x00000000

Aug 17 01:34:13 localhost kernel:  ffff88042e7b5bf0 0000000000000046 0000000000000000 ffff88042e5a8080

Aug 17 01:34:13 localhost kernel:  ffff88042e7b2340 ffffffff81aa84c0 ffff88042e7b2710 0000000547046b50

Aug 17 01:34:13 localhost kernel:  0000000000000001 0000000000000000 0000000000000000 7fffffffffffffff

Aug 17 01:34:13 localhost kernel: Call Trace:

Aug 17 01:34:13 localhost kernel:  [<ffffffff814527e9>] schedule_timeout+0x36/0xe7

Aug 17 01:34:13 localhost kernel:  [<ffffffff8102bc30>] ? flat_send_IPI_mask+0x11/0x13

Aug 17 01:34:13 localhost kernel:  [<ffffffff8104367e>] ? need_resched+0x23/0x2d

Aug 17 01:34:13 localhost kernel:  [<ffffffff8145265b>] wait_for_common+0xb7/0x12c

Aug 17 01:34:13 localhost kernel:  [<ffffffff8104cf2f>] ? default_wake_function+0x0/0x19

Aug 17 01:34:13 localhost kernel:  [<ffffffff81452773>] wait_for_completion+0x1d/0x1f

Aug 17 01:34:13 localhost kernel:  [<ffffffff81076aab>] kthread_create+0x9c/0x143

Aug 17 01:34:13 localhost kernel:  [<ffffffff810edce9>] ? bdi_start_fn+0x0/0xd1

Aug 17 01:34:13 localhost kernel:  [<ffffffff810654cb>] ? lock_timer_base+0x2b/0x4f

Aug 17 01:34:13 localhost kernel:  [<ffffffff81065562>] ? try_to_del_timer_sync+0x73/0x81

Aug 17 01:34:13 localhost kernel:  [<ffffffff81065589>] ? del_timer_sync+0x19/0x26

Aug 17 01:34:13 localhost kernel:  [<ffffffff8145286a>] ? schedule_timeout+0xb7/0xe7

Aug 17 01:34:13 localhost kernel:  [<ffffffff81065596>] ? process_timeout+0x0/0x10

Aug 17 01:34:13 localhost kernel:  [<ffffffff810edf8b>] bdi_forker_task+0x1d1/0x263

Aug 17 01:34:13 localhost kernel:  [<ffffffff8103feee>] ? __wake_up_common+0x46/0x79

Aug 17 01:34:13 localhost kernel:  [<ffffffff810eddba>] ? bdi_forker_task+0x0/0x263

Aug 17 01:34:13 localhost kernel:  [<ffffffff810eddba>] ? bdi_forker_task+0x0/0x263

Aug 17 01:34:13 localhost kernel:  [<ffffffff81076a07>] kthread+0x6e/0x76

Aug 17 01:34:13 localhost kernel:  [<ffffffff81012dea>] child_rip+0xa/0x20

Aug 17 01:34:13 localhost kernel:  [<ffffffff81076999>] ? kthread+0x0/0x76

Aug 17 01:34:13 localhost kernel:  [<ffffffff81012de0>] ? child_rip+0x0/0x20

Aug 17 01:36:13 localhost kernel: INFO: task bdi-default:40 blocked for more than 120 seconds.

Aug 17 01:36:13 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

关于bdi-defualt进程,我从Linux内核进程详解之二:bdi-default这篇文章摘取了下面一段:

bdi,即是backing device info的缩写,根据英文单词全称可知其通指备用存储设备相关描述信息,这在内核代码里用一个结构体backing_dev_info来表示:http://lxr.linux.no/#linux+v2.6.38.8/include/linux/backing-dev.h#L62

bdi,备用存储设备,简单点说就是能够用来存储数据的设备,而这些设备存储的数据能够保证在计算机电源关闭时也不丢失。这样说来,软盘存储设备、光驱存储设备、USB存储设备、硬盘存储设备都是所谓的备用存储设备(后面都用bdi来指示),而内存显然不是,具体看下面这个链接:http://www.gordonschools.aberdeenshire.sch.uk/Departments/Computing/StandardGrade/SystemsWeb/6BackingStorage.htm

相对于内存来说,bdi设备(比如最常见的硬盘存储设备)的读写速度是非常慢的,因此为了提高系统整体性能,Linux系统对bdi设备的读写内容进行了缓冲,那些读写的数据会临时保存在内存里,以避免每次都直接操作bdi设备,但这就需要在一定的时机(比如每隔5秒、脏数据达到的一定的比率等)把它们同步到bdi设备,否则长久的呆在内存里容易丢失(比如机器突然宕机、重启),而进行间隔性同步工作的进程之前名叫pdflush,但后来在Kernel 2.6.2x/3x(没注意具体是哪个小版本的改动,比如:http://kernelnewbies.org/Linux_2_6_35#head-57d43d498509746df08f48b1f040d20d88d2b984,http://lwn.net/Articles/396757/)对此进行了优化改进,产生有多个内核进程,bdi-default、flush-x:y等

检查确认如下

# ps -ef | grep bdi-default | grep 40

root        40     2  0 Jan17 ?        00:00:31 [bdi-default]

root      4026   752  0 17:07 pts/3    00:00:00 grep bdi-default

看来确实是这个bdi-default进程的异常引起的。接下来就需要查证为什么出现“task bdi-default:40 blocked for more than 120 seconds”,关于这个网上有很多资料:

由于bug引起的(与我们的环境不符,直接pass)

Slow writes to ext4 partition - INFO: task flush-253:7:2137 blocked for more than 120 seconds

我自己也遇到过几次有关这样的案例,不过是不同的进程引起的,也有所区别。

TNS-12518 & Linux Error:32:Broken pipe

Linux服务器宕机案例第二则

在官方资料有介绍,但是我无订阅账号,无法查看具体内容。

https://access.redhat.com/solutions/31453

System becomes unresponsive with message “INFO: task : blocked for more than 120 seconds里面有如下介绍

INFO: task <process>:<pid> blocked for more than 120 seconds

What is the reason behind above messages and what kind of information is needed to troubleshoot them?

  • These messages typically mean that the system is experiencing disk or memory congestion and processes are being starved of available resources.

  • These messages serve as a warning that something may not be operating optimally. They do not necessarily indicate a serious problem and any blocked processes should eventually proceed when the system recovers.
  • To troubleshoot further, generate a vmcore
  • If the hung task is a third-party application, please engage the application vendor in the issue meanwhile.
  • If the hung task messages are known to be erroneous, they can be disabled. This is not advised, and it will not prevent any hangs. To disable the messages, use

sysctl kernel.hung_task_timeout_secs=0

  • If it causes the crash and the causes are known to be erroneous, please disable it by putting the below in /etc/sysctl.conf. Once have the below value in the file, please run ‘sysctl -p’ to apply the change.

kernel.hung_task_panic = 0

Root Cause:

The khungtaskd kernel thread monitors the process states and checks if there are any processes stuck in uninterruptible state for the period set in “kernel.hung_task_timeout_secs” sysctl parameter (default is 120 seconds). If so it will log the message along with a call trace of the process that is blocked. The hung task watchdog will by default only report the first 10 instances of detected hung tasks and then will disable reporting. This is to prevent the system log from being flooded with messages. This means that even though no more blocked task messages are being reported there may still be processes becoming stuck for long periods

Cautions:

  • If hung_task_panic is enabled, this can cause false positives to unnecessarily panic your machine.

    • Although usually, a task being blocked for > 2 minutes is a good indication that something has gone wrong inside the kernel, there are circumstances in which a task can be legitimately blocked for this long, for example:

      • flushing large amounts of file system data to a slow device duringumount() or sync() can take a longer than the defaulthung_task_timeout_secs

      • while the default hung_task_timeout_secs is 2x the default IO timeout value, different types of devices and different environments can change the default IO timeout value causing false hung task event triggers as well as panics if hung_task_panic is enabled.
  • As a general rule, the hung_task_timeout_secs should be set to a value larger than the longest expected IO timeout value to prevent false triggering of the stall task logic and its event output within messages. This is especially true ifhung_task_panic is enabled as false triggering of the stalled task logic will cause induced kernel panics.

打算先重启观察一下,如果不行再试试设置kernel.hung_task_panic = 0,结果昨天重启过后,一切正常了。真怀疑是VMware环境在某些条件下触发的一个bug!

参考资料:

http://www.lenky.info/archives/2012/02/1125

https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/

http://www.cnblogs.com/kerrycode/p/4164838.html

http://www.cnblogs.com/kerrycode/p/5653586.html

https://linuxraj.wordpress.com/2015/11/17/system-becomes-unresponsive-with-message-info-task-blocked-for-more-than-120-seconds/

https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=359621643018467&id=1989698.1&displayIndex=7&_afrWindowMode=0&_adf.ctrl-state=s96s1un90_182#title1

https://bugzilla.redhat.com/show_bug.cgi?id=652262

Linux 日志报错 xxx blocked for more than 120 seconds的更多相关文章

  1. linux 出错 “INFO: task java: xxx blocked for more than 120 seconds.” 的3种解决方案

    1 问题描述 最近搭建的一个linux最小系统在运行到241秒时在控制台自动打印如下图信息,并且以后每隔120秒打印一次. 仔细阅读打印信息发现关键信息是“hung_task_timeout_secs ...

  2. Linux系统出现hung_task_timeout_secs和blocked for more than 120 seconds的解决方法

    Linux系统出现系统没有响应. 在/var/log/message日志中出现大量的 “echo 0 > /proc/sys/kernel/hung_task_timeout_secs" ...

  3. php日志报错child exited with code 0 after seconds from start

    因为日志文件老是有这种提示: [27-May-2015 15:13:48] NOTICE: [pool www] child 3998 started [27-May-2015 15:13:59] N ...

  4. hung_task_timeout_secs和blocked for more than 120 seconds的解决方法

    Linux系统出现hung_task_timeout_secs和blocked for more than 120 seconds的解决方法 Linux系统出现系统没有响应. 在/var/log/me ...

  5. 11gR2数据库日志报错:Fatal NI connect error 12170、

    11gR2数据库日志报错:Fatal NI connect error 12170.TNS-12535.TNS-00505 [问题点数:100分,结帖人MarkIII]             不显示 ...

  6. 服务器卡死,重启报错: INFO: task blocked for more than 120 seconds

    问题:服务器负载很高,但是CPU利用率不高.服务器经常夯住,网站打不开,SSH连接非常不稳定,输入命令夯住. 重启服务器报错: INFO: task blocked for more than 120 ...

  7. 【zabbix监控问题】记录zabbix控制面板报错及日志报错的解决方法

    问题1: 上图是我已经解决了的截图.在百度查询的资料中,说是把zabbix_agentd.conf文件中server监听的主机127.0.0.1去掉,但是我去掉之后问题仍然没有解决,最后在这篇博客上发 ...

  8. linux 终端报错 Out of memory: Kill process[PID] [process name] score问题分析

    从Out of memory来看是内存超出了,后面的 Kill process[PID] [process name] score好像和进程有关了,下面我们就一起来看看linux 终端报错 Out o ...

  9. Linux 下报错:A Java RunTime Environment (JRE) or Java Development Kit (JDK) must解决方案

    一.报错环境:在Linux mint下,前几天还用得很好的的eclipse,今天开机不知为什么这样. Linux 下报错:A Java RunTime Environment (JRE) or Jav ...

随机推荐

  1. LeetCode - 404. Sum of Left Leaves

    Find the sum of all left leaves in a given binary tree. Example: 3 / \ 9 20 / \ 15 7 There are two l ...

  2. MVC发布后项目存在于根目录中的子目录中时的css与js、图片路径问题

    加载固定资源js与css <script src="@Url.Content("~/Scripts/js/jquery.min.js")" type=&q ...

  3. 一个基于mysql构建的队列表

    通常大家都会使用redis作为应用的任务队列表,redis的List结构,在一段进行任务的插入,在另一端进行任务的提取. 任务的插入 $redis->lPush("key:task:l ...

  4. 轻松理解AOP思想(面向切面编程)

    本文旨在帮助还没有理解AOP的童鞋看透弄懂AOP,也欢迎高手批评指正. 先说一个Spring是什么吧,大家都是它是一个框架,但框架这个词对新手有点抽象,以致于越解释越模糊,不过它确实是个框架的,但那是 ...

  5. rem、em、px的区别

    px 像素(Pixel).相对长度单位.像素px是相对于显示器屏幕分辨率而言的. 特点: 1. IE无法调整那些使用px作为单位的字体大小: 2. 国外的大部分网站能够调整的原因在于其使用了em或re ...

  6. MySQL一个语句查出各种整形占用字节数及最大最小值

    直接上码: as min_num union , union , union , union , union ,) union ,) union ,) union ,) union ,); +---- ...

  7. Rafy 领域实体框架 - 领域模型设计器(建模工具)设计方案

    去年4月,我们为 Rafy 框架添加了领域模型设计器组件.时隔一年,谨以本文,简要说明该领域模型设计器的设计思想. 设计目标 Rafy 实体框架中以领域驱动设计作为指导思想.所以在开发时,以领域建模为 ...

  8. System.Security.Cryptography.CryptographicException,密钥集不存在

    非常感谢,已经解决了.是当前用户没有权限访问证书的私钥文件的问题,之前尝试去解决,但是在对:C:\Documents and Settings\All Users\Application Data\M ...

  9. C#中 字符串转换为计算公式,并计算结果

     根据总结,大概分为以下三种: 第一种: 用DataTable中的Compute方法. 例如:" 1*2*3 " 代码如下: var a = new System.Data.Dat ...

  10. form表单提交和阻止

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/ ...