RAC single-node instance terminated abnormally: key error ORA-29770
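The monitoring system detected that one instance of a RAC cluster had shut down abnormally shortly after 1 a.m. Fortunately, the business was not affected.
The next step was to analyze the cause.
This RAC runs in a virtualized environment; the OS is SUSE 11.
Checking the Oracle alert log: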
Mon Dec 04 01:16:20 2017
Thread 1 advanced to log sequence 26371 (LGWR switch)
Current log# 2 seq# 26371 mem# 0: +DATA/mgr/onlinelog/group_2.258.887111011
Current log# 2 seq# 26371 mem# 1: +FRA/mgr/onlinelog/group_2.258.887111015
Mon Dec 04 01:16:20 2017
Archived Log entry 52713 added for thread 1 sequence 26370 ID 0x6c454da0 dest 1:
Mon Dec 04 01:24:14 2017
LMS0 (ospid: 17259) has not called a wait for 97 secs. <<<<<<<<<<<===========
Mon Dec 04 01:24:29 2017
Errors in file /oracle/product/diag/rdbms/mgr/mgr1/trace/mgr1_lmhb_17273.trc (incident=252137):
ORA-29770: global enqueue process LMS0 (OSID 17259) is hung for more than 70 seconds
Incident details in: /oracle/product/diag/rdbms/mgr/mgr1/incident/incdir_252137/mgr1_lmhb_17273_i252137.trc <<<<<<<<<<<===========
Mon Dec 04 01:24:40 2017
ERROR: Some process(s) is not making progress. <<<<<<<<<<<===========
LMHB (ospid: 17273) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior <<<<<<<<<<<===========
ERROR: Some process(s) is not making progress.
LMHB (ospid: 17273): terminating the instance due to error 29770 <<<<<<<<<<<===========
Mon Dec 04 01:24:41 2017
System state dump requested by (instance=1, osid=17273 (LMHB)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/product/diag/rdbms/mgr/mgr1/trace/mgr1_diag_17245.trc
Mon Dec 04 01:24:41 2017
ORA-1092 : opitsk aborting process
Dumping diagnostic data in directory=[cdmp_20171204012441], requested by (instance=1, osid=17273 (LMHB)), summary=[abnormal instance termination].
Mon Dec 04 01:24:46 2017
ORA-1092 : opitsk aborting process
Termination issued to instance processes. Waiting for the processes to exit
Mon Dec 04 01:24:53 2017
ORA-1092 : opitsk aborting process
Mon Dec 04 01:24:53 2017
License high water mark = 43
Mon Dec 04 01:24:54 2017
Instance termination failed to kill one or more processes
Instance terminated by LMHB, pid = 17273
USER (ospid: 3513): terminating the instance
Instance terminated by USER, pid = 3513
Mon Dec 04 01:25:18 2017
Starting ORACLE instance (normal)
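According to the alert log, at 01:24:14 LMS0 had not called a wait for 97 seconds. LMHB then wrote a trace file and raised ORA-29770, and at 01:24:40 it terminated the instance. The instance was started again at 01:25:18.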
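The timeline above can be pulled out of an alert log mechanically. The following is a minimal sketch, assuming the alert log uses the timestamp layout shown in the excerpt; the function name `lmhb_timeline` and the keyword list are illustrative choices, not part of any Oracle tool.

```python
import re
from datetime import datetime

# Timestamp lines in the alert log look like "Mon Dec 04 01:24:14 2017".
TS_FORMAT = "%a %b %d %H:%M:%S %Y"
TS_RE = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \d{4}$")

# Message fragments taken from the alert log excerpt (hypothetical selection).
KEYWORDS = (
    "has not called a wait",
    "ORA-29770",
    "terminating the instance",
    "Starting ORACLE instance",
)


def lmhb_timeline(alert_log_text):
    """Return (timestamp, message) pairs for lines that contain a keyword."""
    events = []
    current_ts = None
    for raw in alert_log_text.splitlines():
        line = raw.strip()
        if TS_RE.match(line):
            # Remember the most recent timestamp; messages follow it.
            current_ts = datetime.strptime(line, TS_FORMAT)
        elif current_ts is not None and any(k in line for k in KEYWORDS):
            events.append((current_ts, line))
    return events


if __name__ == "__main__":
    sample = (
        "Mon Dec 04 01:24:14 2017\n"
        "LMS0 (ospid: 17259) has not called a wait for 97 secs.\n"
        "Mon Dec 04 01:24:40 2017\n"
        "LMHB (ospid: 17273) is terminating the instance.\n"
        "Mon Dec 04 01:25:18 2017\n"
        "Starting ORACLE instance (normal)\n"
    )
    for ts, msg in lmhb_timeline(sample):
        print(ts.strftime("%H:%M:%S"), msg)
```

Matching on plain substrings keeps the sketch tolerant of the annotation arrows that were added to the excerpt by hand.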
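Some background on the LMS process:
The LMSn processes maintain the status of data files and of every cached block in the Global Resource Directory (GRD). They transfer messages and data blocks between RAC instances; this service is the Global Cache Service (GCS), and LMS is a key part of Cache Fusion. LMS is arguably the most active background process in RAC and consumes a fair amount of CPU. Each instance usually runs several LMS processes. The default number varies between Oracle versions; for most versions it is MIN(CPU_COUNT/2, 2).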
Oracle error message:
ORA-29770: global enqueue process LMS0 (OSID 17259) is hung for more than 70 seconds
The error produced a trace file:
Relevant Information Collection
---------------------------------------
System name: Linux
Node name: mgrdb01
Release: 3.0.76-0.11-default
Version: #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990)
Machine: x86_64
VM name: VMWare Version: 6 <<<<<<<=============
Instance name: mgr1
Redo thread mounted by this instance: 1
Oracle process number: 17
Unix process pid: 17273, image: oracle@mgrdb01 (LMHB) <<<<<<<=============
*** 2017-12-04 01:23:14.315
*** SESSION ID:(1599.1) 2017-12-04 01:23:14.315
*** CLIENT ID:() 2017-12-04 01:23:14.315
*** SERVICE NAME:(SYS$BACKGROUND) 2017-12-04 01:23:14.315
*** MODULE NAME:() 2017-12-04 01:23:14.315
*** ACTION NAME:() 2017-12-04 01:23:14.315
*** TRACE FILE RECREATED AFTER BEING REMOVED ***
*** 2017-12-04 01:23:14.314
==============================
LMS0 (ospid: 17259) has not moved for 41 sec (1512321794.1512321753) <<<<<<<=============
kjfmGCR_HBCheckAll: LMS0 (ospid: 17259) has status 2
: Not in wait; last wait ended 37 secs ago.
: last wait_id 570620887 at 'gcs remote message'. <<<<<<<=============
==============================
Dumping PROCESS LMS0 (ospid: 17259) States
==============================
===[ Callstack ]===
*** 2017-12-04 01:23:14.316
Process diagnostic dump for oracle@mgrdb01 (LMS0), OS id=17259,
pid: 13, proc_ser: 1, sid: 1223, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
0.000000 secs at [ 01:23:13 ]
NOTE: scheduling delay has not been sampled for 0.935154 secs
0.000000 secs from [ 01:23:09 - 01:23:14 ], 5 sec avg
0.000000 secs from [ 01:22:15 - 01:23:14 ], 1 min avg
0.000000 secs from [ 01:18:15 - 01:23:14 ], 5 min avg
loadavg : 22.91 7.20 2.77
Memory (Avail / Total) = 92815.82M / 129069.95M
Swap (Avail / Total) = 16384.00M / 16384.00M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 D oracle 17259 1 0 58 - - 12123770 sleep_ Sep25 ? 14:24:46 ora_lms0_mgr1 <<<<<<<============= LMS0 is in state 'D' (uninterruptible sleep, typically waiting on I/O)
*** 2017-12-04 01:23:19.322
Short stack dump: ORA-32516: cannot wait for process 'Unix process pid: 17259, image: oracle@mgrdb01 (LMS0)' to finish executing ORADEBUG command 'SHORT_STACK'; wait time exceeds 4920 ms
-------------------------------------------------------------------------------
Process diagnostic dump actual duration=5.010000 sec
(max dump time=5.000000 sec)
*** 2017-12-04 01:23:19.322
kjgcr_SlaveReqBegin: message queued to slave
kjgcr_Main: KJGCR_ACTION - id 3
CPU is high. Top oracle users listed below: <<<<<<<============= CPU is high , but no oracle user consuming high CPU .
Session Serial CPU
1882 1 0
1 28043 0
95 1 0
97 24343 0
189 1 0
*** 2017-12-04 01:23:24.390
kjgcr_Main: Reset called for action high cpu, identify users, count 0
*** 2017-12-04 01:23:24.391
kjgcr_Main: Reset called for action high cpu, kill users, count 0
*** 2017-12-04 01:23:24.391
kjgcr_Main: Reset called for action high cpu, activate RM plan, count 0
*** 2017-12-04 01:23:24.391
kjgcr_Main: Reset called for action high cpu, set BG into RT, count 0
*** 2017-12-04 01:23:34.391
==============================
LMS0 (ospid: 17259) has not moved for 61 sec (1512321814.1512321753) <<<<<<<=============
kjfmGCR_HBCheckAll: LMS0 (ospid: 17259) has status 2
: Not in wait; last wait ended 57 secs ago.
: last wait_id 570620887 at 'gcs remote message'.
......
*** 2017-12-04 01:23:54.401
Process diagnostic dump for oracle@mgrdb01 (LMS0), OS id=17259,
pid: 13, proc_ser: 1, sid: 1223, sess_ser: 1
-------------------------------------------------------------------------------
os thread scheduling delay history: (sampling every 1.000000 secs)
0.000000 secs at [ 01:23:53 ]
NOTE: scheduling delay has not been sampled for 0.859464 secs
0.000000 secs from [ 01:23:49 - 01:23:54 ], 5 sec avg
0.000000 secs from [ 01:22:55 - 01:23:54 ], 1 min avg
0.000000 secs from [ 01:18:55 - 01:23:54 ], 5 min avg
loadavg : 31.82 11.43 4.38
Memory (Avail / Total) = 92795.89M / 129069.95M
Swap (Avail / Total) = 16384.00M / 16384.00M
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
0 D oracle 17259 1 0 58 - - 12123770 sleep_ Sep25 ? 14:24:46 ora_lms0_mgr1 <<<<<<<============= still 'D'
*** 2017-12-04 01:23:59.409
Short stack dump: ORA-32515: cannot issue ORADEBUG command 'SHORT_STACK' to process 'Unix process pid: 17259, image: oracle@mgrdb01 (LMS0)'; prior command execution time exceeds 4930 ms
-------------------------------------------------------------------------------
Process diagnostic dump actual duration=5.010000 sec
(max dump time=5.000000 sec)
*** 2017-12-04 01:23:59.410
kjgcr_Main: KJGCR_ACTION - id 5
*** 2017-12-04 01:24:14.409
==============================
LMS0 (ospid: 17259) has not moved for 101 sec (1512321854.1512321753)
kjfmGCR_HBCheckAll: LMS0 (ospid: 17259) has status 6
==================================================
=== LMS0 (ospid: 17259) Heartbeat Report
==================================================
LMS0 (ospid: 17259) has no heartbeats for 101 sec. (threshold 70 sec)
: Not in wait; last wait ended 97 secs ago.
: last wait_id 570620887 at 'gcs remote message'.
==============================
Dumping PROCESS LMS0 (ospid: 17259) States <<<<<<<=============
==============================
===[ System Load State ]===
CPU Total 32 Core 32 Socket 4
Load normal: Cur 9144 Highmark 40960 (35.71 160.00)
===[ Latch State ]===
Not in Latch Get
===[ Session State Object ]===
----------------------------------------
SO: 0xb98ef85d8, type: 4, owner: 0xb88bf79d0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0xb88bf79d0, name=session, file=ksu.h LINE:12624, pg=0
(session) sid: 1223 ser: 1 trans: (nil), creator: 0xb88bf79d0
flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x409) -/-/INC
DID: , short-term DID:
txn branch: (nil)
oct: 0, prv: 0, sql: (nil), psql: (nil), user: 0/SYS
ksuxds FALSE at location: 0
service name: SYS$BACKGROUND
Current Wait Stack:
Not in wait; last wait ended 1 min 37 sec ago <<<<<<<============= not in wait.
Wait State:
fixed_waits=0 flags=0x21 boundary=(nil)/-1
Session Wait History:
elapsed time of 1 min 37 sec since last wait
0: waited for 'gcs remote message'<<<<<<<=============
waittime=0x1, poll=0x0, event=0x0
wait_id=570620887 seq_num=7644 snap_id=1
wait times: snap=0.001200 sec, exc=0.001200 sec, total=0.001200 sec
wait times: max=0.010000 sec
wait counts: calls=1 os=1
occurred after 0.000218 sec of elapsed time
1: waited for 'gcs remote message'
waittime=0x1, poll=0x0, event=0x0
wait_id=570620886 seq_num=7643 snap_id=1
wait times: snap=0.006362 sec, exc=0.006362 sec, total=0.006362 sec
wait times: max=0.010000 sec
wait counts: calls=1 os=1
occurred after 0.000022 sec of elapsed time
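The "has not moved for N sec (a.b)" lines in the trace carry two numbers that appear to be Unix epoch seconds: the time of the check and the time LMS0 last made progress. The sketch below decodes them under that assumption; `decode_heartbeat` is a hypothetical helper, and the UTC+8 offset is assumed from the CST stamps on this server.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the server clock is China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))

# Matches e.g. "has not moved for 41 sec (1512321794.1512321753)".
HB_RE = re.compile(r"has not moved for (\d+) sec \((\d+)\.(\d+)\)")


def decode_heartbeat(line):
    """Return (stalled_seconds, last_moved, checked_at), or None if no match.

    stalled_seconds is recomputed from the two epoch values rather than
    taken from the text, so it can be compared with the reported figure.
    """
    m = HB_RE.search(line)
    if m is None:
        return None
    _reported, now_epoch, last_epoch = (int(g) for g in m.groups())
    stalled = now_epoch - last_epoch
    last_moved = datetime.fromtimestamp(last_epoch, CST)
    checked_at = datetime.fromtimestamp(now_epoch, CST)
    return stalled, last_moved, checked_at


if __name__ == "__main__":
    line = "LMS0 (ospid: 17259) has not moved for 41 sec (1512321794.1512321753)"
    stalled, last_moved, checked_at = decode_heartbeat(line)
    print(stalled, last_moved.strftime("%H:%M:%S"), checked_at.strftime("%H:%M:%S"))
    # prints: 41 01:22:33 01:23:14
```

Under this reading, all three heartbeat reports (41, 61 and 101 seconds) point back to the same moment, 01:22:33, as the last time LMS0 made progress.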
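The trace file shows that LMS0 is in state 'D'.
What does D mean?
Linux process state D (TASK_UNINTERRUPTIBLE) is uninterruptible sleep.
The most common cause of the D state is waiting on I/O: when I/O against a disk is very heavy, subsequent I/O requests have to wait, and the processes issuing them sit in the D state.
In other words, LMS0 was in uninterruptible sleep while the trace reported high CPU and a rapidly rising load average (22.91, then 31.82), yet none of the listed Oracle sessions was consuming CPU. On Linux the load average also counts processes in the D state, so a high load here does not by itself mean the CPUs were busy.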
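To check for D-state processes in captured ps output, the state column can be read directly. This is a minimal sketch that assumes the 15-column `ps -l` style layout shown in the trace; `parse_ps_line` and `is_uninterruptible` are hypothetical helper names.

```python
# Column layout of the ps output as it appears in the LMHB trace.
PS_COLUMNS = ["F", "S", "UID", "PID", "PPID", "C", "PRI", "NI", "ADDR",
              "SZ", "WCHAN", "STIME", "TTY", "TIME", "CMD"]


def parse_ps_line(line):
    """Split one ps line into a dict keyed by the header names."""
    # CMD is the last column and may contain spaces, so cap the split count.
    fields = line.split(None, len(PS_COLUMNS) - 1)
    return dict(zip(PS_COLUMNS, fields))


def is_uninterruptible(line):
    """True if the process state column starts with 'D'."""
    return parse_ps_line(line).get("S", "").startswith("D")


if __name__ == "__main__":
    lms0 = ("0 D oracle 17259 1 0 58 - - 12123770 sleep_ "
            "Sep25 ? 14:24:46 ora_lms0_mgr1")
    info = parse_ps_line(lms0)
    print(info["PID"], info["S"], info["WCHAN"], is_uninterruptible(lms0))
```

The WCHAN column is worth keeping alongside the state, since it names the kernel function the process is sleeping in (truncated to `sleep_` in this trace).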
At this point the OS layer becomes the prime suspect, so the next step was to check the OSWatcher data:
zzz ***Mon Dec 4 01:21:02 CST 2017
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
5 0 0 95113492 1518640 31129980 0 0 3 5 0 0 1 0 99 0 0
0 0 0 95107952 1518640 31130044 0 0 80 1 6505 8514 2 1 97 0 0
0 0 0 95107652 1518640 31130020 0 0 80 34 4694 7057 1 0 99 0 0
zzz ***Mon Dec 4 01:21:32 CST 2017
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 95120764 1518644 31130072 0 0 3 5 0 0 1 0 99 0 0
1 0 0 95116052 1518644 31130136 0 0 80 1 6378 8855 2 1 97 0 0
0 0 0 95115644 1518644 31130132 0 0 16 34 5174 7705 1 0 99 0 0
zzz ***Mon Dec 4 01:24:53 CST 2017
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 6 0 96085040 1518752 31135792 0 0 3 5 0 0 1 0 99 0 0
2 0 0 96155748 1518772 31120520 0 0 0 9642 11662 9591 9 5 84 3 0
1 0 0 97170696 1518780 30175204 0 0 0 2749 5527 6026 3 2 95 0 0
zzz ***Mon Dec 4 01:25:23 CST 2017
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 117056400 1518796 10318440 0 0 3 5 0 0 1 0 99 0 0
0 0 0 117052664 1518796 10318456 0 0 0 230 3792 4556 0 1 99 0 0
1 0 0 116783816 1518796 10587120 0 0 0 841 3856 4530 1 2 97 0 0
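vmstat samples every 30 seconds, but there are no records at all between 01:21:32 and 01:24:53, a gap of 201 seconds in which about six samples are missing. This indicates that OS performance was very poor during that period.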
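Gaps like this are easy to miss by eye in a long OSWatcher file. The following sketch scans the "zzz ***" snapshot headers and reports any interval noticeably longer than the sampling period; the function names and the 1.5x tolerance are assumptions chosen for illustration.

```python
import re
from datetime import datetime

# OSWatcher snapshot headers look like "zzz ***Mon Dec 4 01:21:02 CST 2017".
# The timezone token is skipped, since every sample in a file shares it.
ZZZ_RE = re.compile(
    r"^zzz \*\*\*(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) \w+ (\d{4})"
)


def snapshot_times(text):
    """Extract the snapshot timestamps in file order."""
    times = []
    for line in text.splitlines():
        m = ZZZ_RE.match(line.strip())
        if m:
            stamp = f"{m.group(1)} {m.group(2)}"
            times.append(datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"))
    return times


def find_gaps(times, interval_s=30, tolerance=1.5):
    """Return (start, end, seconds) for gaps longer than interval_s * tolerance."""
    gaps = []
    for prev, cur in zip(times, times[1:]):
        delta = (cur - prev).total_seconds()
        if delta > interval_s * tolerance:
            gaps.append((prev, cur, int(delta)))
    return gaps


if __name__ == "__main__":
    headers = (
        "zzz ***Mon Dec 4 01:21:02 CST 2017\n"
        "zzz ***Mon Dec 4 01:21:32 CST 2017\n"
        "zzz ***Mon Dec 4 01:24:53 CST 2017\n"
        "zzz ***Mon Dec 4 01:25:23 CST 2017\n"
    )
    for start, end, seconds in find_gaps(snapshot_times(headers)):
        print(start.strftime("%H:%M:%S"), "->", end.strftime("%H:%M:%S"), seconds, "s")
    # prints: 01:21:32 -> 01:24:53 201 s
```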
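It was so poor that even the simple top, iostat and ps snapshots could not be produced, and this directly disrupted LMS process communication.
The problem was traced to the OS layer. We were able to rule out Oracle as the cause, but unfortunately the OS engineers did not find the root cause.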