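In the early hours of the morning a colleague called to report that an application was getting errors when accessing the Oracle database. The symptoms confirmed on site at the time were: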

1. The application could not reach the database, and a test connection from SQL Developer failed as well, with ORA-12570 TNS:packet reader failure.

2. Checking the listener with lsnrctl status produced no response at all, which is very rare.

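3. The database status was OPEN. System resources were checked with nmon: as the screenshot showed, CPU utilization was not high, but CPU Wait% was very high. This means I/O was not behaving normally, and there were probably I/O waits and contention.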

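Note on CPU Wait%: it is the proportion of the sampling interval during which CPUs were idle while waiting for I/O to complete. Wait% is a form of idle time: when a CPU is idle and at least one process is in the D state (uninterruptible sleep), that time is counted towards Wait%. Wait% is a ratio, not an absolute time, so for the same amount of I/O wait time, a server with more CPUs shows a lower Wait%. It reflects the balance between I/O and computation; I/O-intensive applications generally show a higher Wait%.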
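To make the "ratio, not a time" point concrete, here is a minimal sketch of how an I/O wait percentage can be derived from two samples of the aggregate cpu line in Linux /proc/stat. It assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal); the sample counters are invented for illustration, and nmon's own calculation may differ in detail.

```python
IOWAIT = 4  # position of iowait among: user, nice, system, idle, iowait, irq, softirq, steal


def parse_cpu_line(line):
    """Return the first eight counters (user..steal) of the aggregate 'cpu' line."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    # guest and guest_nice are already included in user and nice, so stop at steal.
    return [int(value) for value in fields[1:9]]


def iowait_percent(before, after):
    """Share of the elapsed CPU time between two samples that was spent in iowait."""
    start = parse_cpu_line(before)
    end = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(start, end)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[IOWAIT] / total


if __name__ == "__main__":
    # Hypothetical samples taken a few seconds apart.
    sample_1 = "cpu  1000 0 500 8000 500 0 0 0 0 0"
    sample_2 = "cpu  1100 0 550 8050 1300 0 0 0 0 0"
    print(f"iowait: {iowait_percent(sample_1, sample_2):.1f}%")
```

Because the denominator is the total time accumulated across all CPUs, the same absolute amount of I/O wait yields a smaller percentage on a machine with more CPUs.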

4. The mailbox contained a large number of emails sent by the alert-log monitoring job. The alert log itself held many ORA errors; an excerpt follows:

ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00239: timeout waiting for control file enqueue: held by 'inst 1, osid 5166' for more than 900 seconds
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'
ORA-00239: timeout waiting for control file enqueue: held by 'inst 1, osid 5166' for more than 900 seconds
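When the alert log contains many such entries, it helps to tally which process is named as the holder of the CF enqueue. The sketch below is one way to do that with regular expressions built from the two message formats quoted above; the function name blockers is made up for this example.

```python
import re
from collections import Counter

# Patterns modelled on the ORA-00494 and ORA-00239 message texts shown in the alert log.
CF_HELD = re.compile(
    r"ORA-00494: enqueue \[(?P<enq>\w+)\] held for too long "
    r"\(more than (?P<secs>\d+) seconds\) by 'inst (?P<inst>\d+), osid (?P<osid>\d+)'"
)
CF_WAIT = re.compile(
    r"ORA-00239: timeout waiting for control file enqueue: "
    r"held by 'inst (?P<inst>\d+), osid (?P<osid>\d+)' for more than (?P<secs>\d+) seconds"
)


def blockers(lines):
    """Count how often each (instance, OS process id) is reported as the enqueue holder."""
    counts = Counter()
    for line in lines:
        match = CF_HELD.search(line) or CF_WAIT.search(line)
        if match:
            counts[(int(match.group("inst")), int(match.group("osid")))] += 1
    return counts


if __name__ == "__main__":
    excerpt = [
        "ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'",
        "ORA-00239: timeout waiting for control file enqueue: held by 'inst 1, osid 5166' for more than 900 seconds",
    ]
    for (inst, osid), hits in blockers(excerpt).items():
        print(f"inst {inst}, osid {osid}: reported {hits} times")
```

The osid is an operating system process id, so it can then be looked up with ps on the database host to see which Oracle process is holding the enqueue.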

Let us first look at the description of the error "ORA-00494: enqueue [CF] held for too long (more than 900 seconds)...":

[oracle@DB-Server ~]$ oerr ora 494
00494, 00000, "enqueue %s held for too long (more than %s seconds) by 'inst %s, osid %s'"
// *Cause:  The specified process did not release the enqueue within
//          the maximum allowed time.
// *Action: Reissue any commands that failed and contact Oracle Support
//          Services with the incident information.

 

An ORA-00494 in the alert log usually signals an instance crash; see the Oracle Support note Database Crashes With ORA-00494 (Doc ID 753290.1):

 

This error can also be accompanied by ORA-600 [2103] which is basically the same problem - a process was unable to obtain the CF enqueue within the specified timeout (default 900 seconds).

This behavior can be correlated with server high load and high concurrency on resources, IO waits and contention, which keep the Oracle background processes from receiving the necessary resources.

Cause#1: The lgwr has killed the ckpt process, causing the instance to crash.

From the alert.log we can see:

The database has waited too long for a CF enqueue, so the next error is reported:

ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 38356'

Then the LGWR killed the blocker, which was in this case the CKPT process which then causes the instance to crash.

Checking the alert.log further we can see that the frequency of redo log files switch is very high (almost every 1 min).

Cause#2: Checking the I/O State in the AWR report we find that:

Average Read per ms (Av Rd(ms)) for the database files which are located on this mount point " /oracle/oa1l/data/" is facing I/O issue as per the data collection which was perform

Cause#3: The problem has been investigated in Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4'

and unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED'

The ORA-00494 error occurs during periods of super-high stress, activity to the point there the server becomes unresponsive due to overloaded disk I/O, CPU or RAM.

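From the analysis above, all three causes are possible. More information and evidence are needed to determine which one actually produced the ORA-00494 error and crashed the database instance.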

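1: The alert log contains "ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 5166'". CF stands for the control file schema global enqueue. If a process cannot obtain the CF enqueue within the specified time (900 seconds by default), the process holding the enqueue is killed. The timeout is set by the hidden parameter _controlfile_enqueue_timeout: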

SQL> COL NAME  FOR A45 ;
SQL> COL VALUE FOR A32 ;
SQL> COL DESCR FOR A80 ;
SQL> SELECT x.ksppinm  NAME
  2       , y.ksppstvl VALUE
  3       , x.ksppdesc DESCR
  4  FROM SYS.x$ksppi x, SYS.x$ksppcv y
  5  WHERE x.indx = y.indx
  6  AND x.ksppinm LIKE '%&par%';
Enter value for par: controlfile_enqueue
old   6: AND x.ksppinm LIKE '%&par%'
new   6: AND x.ksppinm LIKE '%controlfile_enqueue%'

NAME                                          VALUE                            DESCR
--------------------------------------------- -------------------------------- --------------------------------------------------------------------------------
_controlfile_enqueue_timeout                  900                              control file enqueue timeout in seconds
_controlfile_enqueue_holding_time             120                              control file enqueue max holding time in seconds
_controlfile_enqueue_dump                     FALSE                            dump the system states after controlfile enqueue timeout
_kill_controlfile_enqueue_blocker             TRUE                             enable killing controlfile enqueue blocker on timeout

SQL>

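Checking the redo log switch frequency with the query below shows that it was very low both from 22:00 to 24:00 on 2016-11-08 and from 00:00 to 02:00 on 2016-11-09, which rules out a burst of heavy DML. Even so, this alone does not let us completely rule out Cause#1, so we go on to examine other information.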

SELECT
TO_CHAR(FIRST_TIME,'YYYY-MM-DD') DAY,
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'00',1,0)),'99') "00",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'01',1,0)),'99') "01",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'02',1,0)),'99') "02",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'03',1,0)),'99') "03",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'04',1,0)),'99') "04",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'05',1,0)),'99') "05",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'06',1,0)),'99') "06",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'07',1,0)),'99') "07",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'08',1,0)),'99') "08",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'09',1,0)),'99') "09",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'10',1,0)),'99') "10",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'11',1,0)),'99') "11",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'12',1,0)),'99') "12",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'13',1,0)),'99') "13",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'14',1,0)),'99') "14",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'15',1,0)),'99') "15",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'16',1,0)),'99') "16",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'17',1,0)),'99') "17",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'18',1,0)),'99') "18",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'19',1,0)),'99') "19",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'20',1,0)),'99') "20",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'21',1,0)),'99') "21",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'22',1,0)),'99') "22",
TO_CHAR(SUM(DECODE(TO_CHAR(FIRST_TIME,'HH24'),'23',1,0)),'99') "23"
FROM
V$LOG_HISTORY
GROUP BY
TO_CHAR(FIRST_TIME,'YYYY-MM-DD')
ORDER BY 1 DESC;
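The query above pivots the FIRST_TIME values of V$LOG_HISTORY into one row per day with 24 hourly counters. The same aggregation can be sketched in a few lines of Python, which may make the logic easier to follow; the timestamps in the example are invented, not taken from the incident.

```python
from collections import defaultdict
from datetime import datetime


def switches_per_hour(first_times):
    """Map each day (YYYY-MM-DD) to a list of 24 log-switch counts, one per hour."""
    table = defaultdict(lambda: [0] * 24)
    for stamp in first_times:
        table[stamp.strftime("%Y-%m-%d")][stamp.hour] += 1
    return dict(table)


if __name__ == "__main__":
    # Hypothetical FIRST_TIME values, one per log switch.
    history = [
        datetime(2016, 11, 8, 22, 5),
        datetime(2016, 11, 8, 22, 40),
        datetime(2016, 11, 9, 0, 15),
    ]
    for day, counts in sorted(switches_per_hour(history).items(), reverse=True):
        print(day, " ".join(f"{count:2d}" for count in counts))
```

A row with single-digit counts in every hour, as in this incident, points away from redo generation as the source of the pressure on the control file.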

2: Regarding Cause#3, "The problem has been investigated in Bug 7692631 - 'DATABASE CRASHES WITH ORA-494 AFTER UPGRADE TO 10.2.0.4' and unpublished Bug 7914003 'KILL BLOCKER AFTER ORA-494 LEADS TO FATAL BG PROCESS BEING KILLED'":

The alert log contains ORA-00239 but no errors such as ORA-00603 or ORA-00470. The Oracle Support note Disk I/O Contention/Slow Can Lead to ORA-239 and Instance Crash (Doc ID 1068799.1) describes the situation as follows:

I/O contention or slowness leads to control file enqueue timeout.

One particular situation that can be seen is LGWR timeout while waiting for control file enqueue, and the blocker is CKPT :

From the AWR:

1) high "log file parallel write" and "control file sequential read" waits

2) Very slow Tablespace I/O, Av Rd(ms) of 1000-4000 ms (when lower than 20 ms is acceptable)

3) very high %iowait : 98.57%.

4) confirmed IO peak during that time

Please note: Remote archive destination is also a possible cause. Networking issues can also cause this type of issue when a remote archive destination is in use for a standby database.

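This server had been running normally for many years, so we leaned towards an I/O problem rather than a software bug as the cause. That fits the very high CPU Wait% observed at the time, which suggests severe I/O waits and contention.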

3: Next, the reports generated by the monitoring tool OSWatcher for this period. As shown below, the CPUs were largely idle.

 

Operating System CPU Utilization

CPU time spent waiting for I/O (Wait IO) also rose after 10:45 PM (22:45). CPU utilization stayed low throughout, peaking at a little over 20%.

 

Operating System CPU Other

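Next, the Operating System I/O charts. As the screenshots show, the system's I/O devices became very busy from 11 PM onward. From this we can judge that an I/O anomaly very likely caused the ORA-00494 errors.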

Operating System I/O

 

Operating System I/O Throughput

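Then we checked the operating system log.

As the screenshot showed, messages of the form "INFO: task kjournald:xxx blocked for more than 120 seconds." begin at 23:22 and appear in large numbers. They were caused by the PlateSpin replication job (later confirmed to have started at 22:40). At this point we leaned towards the second cause (Cause#2, an I/O problem) as the reason for the instance crash. The system administrator later confirmed that the PlateSpin replication job itself had also failed. Taken together, we strongly suspect that the PlateSpin job triggered the I/O anomaly, and that the database instance crashed when I/O stopped responding, whether briefly or for an extended period.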
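Scanning the operating system log for these hung-task messages by hand is tedious, so a small parser can help line up their first occurrence with the database errors. The sketch below assumes the classic syslog layout ("Mon DD HH:MM:SS host kernel: ..."); the host name, PID and timestamps in the sample are invented for illustration.

```python
import re

# Matches the kernel hung-task warning behind a classic syslog timestamp.
HUNG_TASK = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) .*"
    r"INFO: task (?P<task>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(lines):
    """Return (timestamp, task name, pid) for every hung-task warning, in log order."""
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append((match.group("stamp"), match.group("task"), int(match.group("pid"))))
    return hits


if __name__ == "__main__":
    # Hypothetical /var/log/messages excerpt.
    log = [
        "Nov  8 23:20:01 DB-Server crond[4321]: (root) CMD (run-parts /etc/cron.hourly)",
        "Nov  8 23:22:10 DB-Server kernel: INFO: task kjournald:1234 blocked for more than 120 seconds.",
        "Nov  8 23:24:10 DB-Server kernel: INFO: task kjournald:1234 blocked for more than 120 seconds.",
    ]
    found = hung_tasks(log)
    if found:
        stamp, task, pid = found[0]
        print(f"first hung-task warning: {stamp} ({task}, pid {pid}), {len(found)} in total")
```

Comparing the first warning with the start time of the suspected job and with the first ORA-00494 entry gives a rough timeline of how the I/O stall propagated.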

 

Follow-up and resolution

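At this point shutdown immediate could not shut the database down; it simply did not respond. The only option was shutdown abort followed by starting the instance again, but startup failed with the following errors: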

ORA-01102: cannot mount database in EXCLUSIVE mode
ORA-00205: error in identifying control file, check alert log for more info
ORA-00202: control file: '/u01/app/oracle/oradata/epps/control01.ctl'
ORA-27086: unable to lock file - already in use

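This error is not covered in detail here; see the article "ORA-01102: cannot mount database in EXCLUSIVE mode". After most of the leftover processes had been killed, three processes remained that could not be killed even with kill -9, as the screenshot showed: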

 

kill -9 sends the SIGKILL signal to terminate a process, but it has no effect in the following two cases:

 

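a. The process is a zombie (ps reports it as defunct). It has already exited and released its resources; only its process-table entry remains until the parent process collects its exit status. Since it is already dead it cannot be killed again. It disappears once the parent reaps it, or once the parent itself exits and init reaps it, and its presence does not affect system performance.

b. The process is in kernel mode and blocked waiting for a resource that is not available (uninterruptible sleep, the D state). A process in this state does not act on signals, SIGKILL included, until the wait completes. If the resource never becomes available, the only way to clear the process is to reboot the system. A Linux process runs in either user mode or kernel mode, and kill can only terminate a process once it is able to handle the signal.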
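Both cases can be recognised from the process state column of ps before reaching for kill -9. The following sketch classifies the output of a command such as ps -eo pid,stat,comm by the first letter of the STAT field (D for uninterruptible sleep, Z for zombie); the sample listing, including its PIDs, is made up for illustration.

```python
def stuck_processes(ps_output):
    """Group PIDs by state for processes that SIGKILL will not remove (D and Z)."""
    result = {"D": [], "Z": []}
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue
        pid, stat = parts[0], parts[1]
        # The first character of STAT is the state; later characters are modifiers.
        if stat[0] in result:
            result[stat[0]].append(int(pid))
    return result


if __name__ == "__main__":
    # Hypothetical output of: ps -eo pid,stat,comm
    listing = """\
  PID STAT COMMAND
  101 D    oracle
  102 Ds   oracle
  103 Z    sh <defunct>
  104 Ss   sshd
"""
    states = stuck_processes(listing)
    print("uninterruptible (D):", states["D"])
    print("zombie (Z):", states["Z"])
```

Processes listed under D are the ones that, as in this incident, can leave a reboot as the only way out if the I/O they are waiting on never completes.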

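Because these processes were stuck in kernel mode, were unlikely to wake up on their own, and did not respond to signals, there was no choice but to reboot the system. After the reboot the problem was resolved. It was then agreed with the system administrator to suspend the PlateSpin job for the time being, take a fresh full backup on Sunday, and continue to observe the impact on I/O.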
