In the early hours of the morning I got a call from a colleague: the application was throwing errors when accessing the Oracle database. The symptoms confirmed on site at the time were:

1. The application could not reach the database, and a test with SQL Developer could not connect either, failing with ORA-12570: TNS:packet reader failure.

2. Checking the listener with lsnrctl status got no response at all, which is something you very rarely see.

3. The database status was OPEN. Checking system resources with nmon (screenshot below) showed that CPU utilization was low but CPU Wait% was very high, which points to abnormal I/O, most likely I/O waits and contention.

CPU Wait%: the percentage of time in the sampling interval during which CPUs were idle while waiting for I/O to complete. Wait% is a form of CPU idle time: when a CPU is idle and some process is in the D state (uninterruptible sleep), that time is counted into Wait%. It is a ratio rather than an absolute time, so for the same amount of I/O wait, the more CPUs a server has, the lower the Wait%; it reflects the proportion of I/O to computation, and for I/O-intensive workloads Wait% is normally high. (A quick shell-level cross-check of this figure is sketched after the alert-log excerpt below.)

4. Opening my mailbox, I found a pile of emails sent by the alert-log monitoring job. Checking the alert log showed a large number of ORA errors, an excerpt of which is below:

ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00239: timeout waiting for control file enqueue: held by 'inst 1, osid 5166' for more than 900 seconds
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00239: timeout waiting for control file enqueue: held by 'inst 1, osid 5166' for more than 900 seconds
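Going back to point 3 for a moment: the Wait% figure from nmon can be cross-checked straight from the shell. The commands below are only a rough sketch (the sampling interval and count are arbitrary):

# "wa" in vmstat is the same CPU-waiting-on-I/O percentage that nmon reports as Wait%,
# and the "b" column counts processes sitting in uninterruptible (D) sleep.
vmstat 5 3
# Rough count of processes currently stuck in the D state:
ps -eo stat | grep -c '^D'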

Regarding the error "ORA-00494: enqueue [CF] held for too long (more than 900 seconds)...", let's first look at its description:

[oracle@DB-Server ~]$ oerr ora 494
00494, 00000, "enqueue %s held for too long (more than %s seconds) by 'inst %s, osid %s'"
// *Cause: The specified process did not release the enqueue within
//         the maximum allowed time.
// *Action: Reissue any commands that failed and contact Oracle Support
//          Services with the incident information.

 

An ORA-00494 error like this generally goes along with an instance crash; see the MOS note Database Crashes With ORA-00494 (Doc ID 753290.1):

 

This error can also be accompanied by ORA-600 [2103] which is basically the same problem - a process was unable to obtain the CF enqueue within the specified timeout (default 900 seconds).

This behavior can be correlated with server high load and high concurrency on resources, IO waits and contention, which keep the Oracle background processes from receiving the necessary resources.

Cause#1: The lgwr has killed the ckpt process, causing the instance to crash.

From the alert.log we can see:

The database has waited too long for a CF enqueue, so the next error is reported:

ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 38356'

Then the LGWR killed the blocker, which was in this case the CKPT process which then causes the instance to crash.

Checking the alert.log further we can see that the frequency of redo log files switch is very high (almost every 1 min).

Cause#2: Checking the I/O State in the AWR report we find that:

Average Read time (Av Rd(ms)) for the database files located on the mount point "/oracle/oa1l/data/" shows an I/O issue, as per the data collection which was performed.

Cause#3: The problem has been investigated in Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4'

and unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED'

The ORA-00494 error occurs during periods of super-high stress, activity to the point where the server becomes unresponsive due to overloaded disk I/O, CPU or RAM.

From the analysis above, all three causes are plausible, but we need more information and evidence to confirm which one actually led to the ORA-00494 errors and, in turn, to the instance crash.

1: The alert log contains "ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'". CF is the control file schema global enqueue. If a process cannot obtain the CF enqueue within the allowed time (900 seconds by default), the process holding the CF enqueue is killed. The timeout is governed by the hidden parameter _controlfile_enqueue_timeout:

SQL> COL NAME  FOR A45 ;
SQL> COL VALUE FOR A32 ;
SQL> COL DESCR FOR A80 ;
SQL> SELECT x.ksppinm  NAME
  2       , y.ksppstvl VALUE
  3       , x.ksppdesc DESCR
  4  FROM SYS.x$ksppi x, SYS.x$ksppcv y
  5  WHERE x.indx = y.indx
  6  AND x.ksppinm LIKE '%&par%';
Enter value for par: controlfile_enqueue
old   6: AND x.ksppinm LIKE '%&par%'
new   6: AND x.ksppinm LIKE '%controlfile_enqueue%'

NAME                                          VALUE                            DESCR
--------------------------------------------- -------------------------------- --------------------------------------------------------------------------------
_controlfile_enqueue_timeout                  900                              control file enqueue timeout in seconds
_controlfile_enqueue_holding_time             120                              control file enqueue max holding time in seconds
_controlfile_enqueue_dump                     FALSE                            dump the system states after controlfile enqueue timeout
_kill_controlfile_enqueue_blocker             TRUE                             enable killing controlfile enqueue blocker on timeout

SQL>
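As an aside, one stop-gap sometimes discussed in connection with Doc ID 753290.1 is to stop the blocker from being killed automatically, or to lengthen the timeout. The statements below are only a sketch of what that would look like (an assumption on my part, not something applied in this incident), and underscore parameters should only ever be changed under Oracle Support's guidance:

# Sketch only -- not applied here; values and scope are illustrative.
sqlplus -S / as sysdba <<'EOF'
-- stop the CF enqueue blocker from being killed automatically on timeout
ALTER SYSTEM SET "_kill_controlfile_enqueue_blocker"=FALSE SCOPE=SPFILE;
-- or give the holder more time before the 900-second timeout fires
ALTER SYSTEM SET "_controlfile_enqueue_timeout"=1800 SCOPE=SPFILE;
EOF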

Checking the redo log switch frequency shows that switching was very infrequent between 00:00 and 02:00 on 2016-11-09 and between 22:00 and 24:00 on 2016-11-08, which rules out heavy DML activity during that window. Based on the analysis so far we still cannot completely rule out Cause#1, so let's move on to other evidence. The query used to check the switch frequency is:

SELECT
TO_CHAR(FIRST_TIME,'YYYY-MM-DD') DAY,
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'00',1,0)),'99') "00",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'01',1,0)),'99') "01",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'02',1,0)),'99') "02",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'03',1,0)),'99') "03",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'04',1,0)),'99') "04",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'05',1,0)),'99') "05",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'06',1,0)),'99') "06",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'07',1,0)),'99') "07",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'08',1,0)),'99') "08",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'09',1,0)),'99') "09",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'10',1,0)),'99') "10",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'11',1,0)),'99') "11",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'12',1,0)),'99') "12",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'13',1,0)),'99') "13",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'14',1,0)),'99') "14",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'15',1,0)),'99') "15",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'16',1,0)),'99') "16",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'17',1,0)),'99') "17",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'18',1,0)),'99') "18",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'19',1,0)),'99') "19",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'20',1,0)),'99') "20",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'21',1,0)),'99') "21",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'22',1,0)),'99') "22",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'23',1,0)),'99') "23"
FROM
V$LOG_HISTORY
GROUP BY
TO_CHAR(FIRST_TIME,'YYYY-MM-DD')
ORDER BY 1 DESC;

2: Regarding Cause#3, "The problem has been investigated in Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4' and unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED'": the alert log does contain ORA-00239, but there are no errors such as ORA-603 or ORA-00470. According to the MOS note Disk I/O Contention/Slow Can Lead to ORA-239 and Instance Crash (Doc ID 1068799.1):

I/O contention or slowness leads to control file enqueue timeout.

One particular situation that can be seen is LGWR timeout while waiting for control file enqueue, and the blocker is CKPT :

From the AWR:

1) high "log file parallel write" and "control file sequential read" waits

2) Very slow Tablespace I/O, Av Rd(ms) of 1000-4000 ms (when lower than 20 ms is acceptable)

3) very high %iowait : 98.57%.

4) confirmed IO peak during that time

Please note: Remote archive destination is also a possible cause. Networking issues can also cause this type of issue when a remote archive destination is in use for a standby database.

This server had been running normally for many years, so we lean toward an I/O problem as the cause, especially since CPU Wait% was extremely high at the time, which suggests severe I/O waits and contention.
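While the instance is still reachable, the same I/O picture can also be confirmed from inside the database by looking at the waits the note highlights. This is a minimal sketch against v$system_event (instance-lifetime averages, not an AWR interval), run from the shell:

# Average latency of the I/O-related waits called out in Doc ID 1068799.1;
# averages in the hundreds of milliseconds point to the same storage problem.
sqlplus -S / as sysdba <<'EOF'
SET LINESIZE 120 PAGESIZE 100
COL event FORMAT A35
SELECT event,
       total_waits,
       ROUND(time_waited_micro/1000/NULLIF(total_waits,0),1) AS avg_ms
FROM   v$system_event
WHERE  event IN ('log file parallel write',
                 'control file sequential read',
                 'control file parallel write')
ORDER  BY avg_ms DESC;
EOF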

3: Let's look at some of the reports generated by the monitoring tool OSWatcher for this period. As shown below, the CPUs were largely idle.

 

Operating System CPU Utilization

CPU Wait IO also started climbing after 10:45 PM (22:45), while CPU utilization stayed low throughout, peaking at a little over 20%.

 

Operating System CPU Other

Next, let's look at Operating System I/O. As the screenshots below show, the I/O devices became extremely busy from around 23:00 onward, so it is very likely that an I/O anomaly is what pushed the database into the ORA-00494 errors.

Operating System I/O

 

Operating System I/O Throughput
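For reference, the raw samples behind these OSWatcher charts can also be read straight from the archive. The sketch below assumes a default oswbb layout (per-category directories under archive/, samples separated by "zzz" timestamp lines); the install path and file names are assumptions and will differ locally:

# Pull the raw iostat samples collected around 22:00-24:00 on Nov 8
# (archive location and file naming assume a default oswbb install).
cd /u01/app/oswbb/archive/oswiostat
grep -E -A 20 'zzz.*Nov +8 2[23]:' *.dat | less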

Next we check the operating system logs, as shown below:

As the screenshot shows, messages of the form "INFO: task kjournald:xxx blocked for more than 120 seconds." appear in large numbers starting at 23:22. They were caused by the PlateSpin replication job (later confirmed to have started at 22:40). At this point we lean toward the second cause as what crashed the instance. The system administrator later confirmed that the PlateSpin replication job had itself failed. Putting everything together, we strongly suspect the PlateSpin job triggered the I/O anomaly, and when I/O stopped responding, whether briefly or for an extended period, the database instance crashed.
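These messages come from the kernel's hung-task detector, whose default threshold is 120 seconds. A quick way to pull them out of the syslog (paths assume a RHEL/OEL-style setup) is:

# Hung-task detector threshold in seconds, if exposed on this kernel:
cat /proc/sys/kernel/hung_task_timeout_secs
# All occurrences around the incident window:
grep -n "blocked for more than 120 seconds" /var/log/messages | head -20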

 

Follow-up handling and resolution

At this point shutdown immediate could not shut the database down; there was no response at all. The only option was shutdown abort. But when starting the instance again, startup failed with the following errors:

ORA-01102: cannot mount database in EXCLUSIVE mode
ORA-00205: error in identifying control file, check alert log for more info
ORA-00202: control file: '/u01/app/oracle/oradata/epps/control01.ctl'
ORA-27086: unable to lock file - already in use

I won't expand on this error here; see the separate post on ORA-01102: cannot mount database in EXCLUSIVE mode. After killing most of the leftover processes, three processes could not be killed even with kill -9, as the screenshot below shows:
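ORA-01102/ORA-27086 here simply meant that processes of the crashed instance still held the instance lock and the control file. A minimal sketch of that check (the SID 'epps' is inferred from the control file path above, so adjust names and paths to the local setup):

# Leftover background processes of the old instance:
ps -ef | grep -E 'ora_[a-z0-9]+_epps' | grep -v grep
# Who still holds the exclusive-mount lock file and the control file:
fuser -u $ORACLE_HOME/dbs/lk* 2>/dev/null
fuser -u /u01/app/oracle/oradata/epps/control01.ctl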

 

kill -9 sends SIGKILL to terminate a process, but it has no effect in the following two cases:

 

a. The process is in the "Zombie" state (shown as defunct by ps). It has already released all of its resources but has not yet been reaped by its parent; the zombie entry lingers until the parent acknowledges it (or the system is rebooted), but its presence does not hurt system performance.

b. The process is in "kernel mode" waiting for a resource it cannot obtain. A process stuck in kernel mode does not act on signals, so the only way to get rid of processes that stay there is to reboot the system. In Linux a process runs either in user mode or in kernel mode, and only a process in user mode can be terminated with kill.
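A quick way to tell which of the two cases applies is to look at the STAT column of the surviving processes; a rough sketch:

# STAT beginning with "Z" = zombie/defunct (already dead, waiting to be reaped);
# STAT beginning with "D" = uninterruptible sleep in the kernel -- SIGKILL is not
# acted on until the process leaves that state.
ps -eo pid,ppid,stat,wchan:30,cmd | grep -E 'ora_|STAT' | grep -v grep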

Since these processes were stuck in kernel mode, would not wake up on their own, and did not respond to signals, we had no choice but to reboot the server; after the reboot the problem was resolved. We then agreed with the system administrator to disable the PlateSpin job for the time being and, after taking a fresh full backup on Sunday, to keep watching its impact on I/O.
