11.2.0.3 RAC(VCS)节点crash以及hang的问题分析
昨天某个客户的一套双节RAC当中一个节点crash,同一时候最后导致另外一个节点也hang住,仅仅能shutdown abort.
且出现shutdown abort实例之后,还有部分进程无法通过kill -9 进行kill的情况。
当中有lgwr。arch等进程.
首先我们来看下,在下午出现crash的节点的alert log信息:
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
|
Tue Apr 22 17:16:04 2014Deleted Oracle managed file /oa1/arch/AUTHORCL/archivelog/2014_03_14/o1_mf_2_4878_9l51y1cc_.arcDeleted Oracle managed file /oa1/arch/AUTHORCL/archivelog/2014_03_14/o1_mf_2_4879_9l529hc6_.arcArchived Log entry 10847 addedfor thread 1 sequence 5314 ID 0xffffffffae21a60f dest 1:Tue Apr 22 17:25:05 2014IPC Send timeout detected. Sender: ospid 27573 [oracle@xhdb-server3 (LMON)]Receiver: inst 2 binc 95439 ospid 13752Communications reconfiguration: instance_number 2Tue Apr 22 17:26:49 2014LMON (ospid: 27573) has not called a waitfor 89 secs.Errors in file /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_lmhb_27613.trc (incident=14129):ORA-29770: global enqueue process LMON (OSID 27573) is hungfor more than 70 secondsIncident details in: /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/incident/incdir_14129/authorcl1_lmhb_27613_i14129.trcTue Apr 22 17:26:58 2014Sweep [inc][14129]: completedSweep [inc2][14129]: completedERROR: Some process(s) is not making progress.LMHB (ospid: 27613) is terminating the instance.Please check LMHB trace filefor more details.Please also check the CPU load, I/O loadand other system propertiesfor anomalous behaviorERROR: Some process(s) is not making progress.Tue Apr 22 17:26:58 2014System state dump requested by (instance=1, osid=27613 (LMHB)), summary=[abnormal instance termination].LMHB (ospid: 27613): terminating the instance due to error 29770System State dumped to trace file /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_diag_27561.trcTue Apr 22 17:27:00 2014ORA-1092 : opitsk aborting processTue Apr 22 17:27:01 2014License high water mark = 144Tue Apr 22 17:27:08 2014Termination issued to instance processes. Waitingfor the processes toexitInstance termination failed to kill oneor more processesInstance terminated by LMHB, pid = 27613Tue Apr 22 17:27:15 2014USER (ospid: 1378): terminating the instanceTermination issued to instance processes. Waitingfor the processes toexitTue Apr 22 17:27:25 2014Instance termination failed to kill oneor more processesInstance terminated by USER, pid = 1378Tue Apr 22 21:51:56 2014Adjusting the default value of parameter parallel_max_serversfrom 640 to 135 due to the value of parameter processes (150)Starting ORACLE instance (normal) |
我们能够看到。最早在Apr 22 17:25:05 2014 时间点,即抛出LMON IPC send timeout的错误了。
Receiver: inst 2 binc 95439 ospid 13752 这里的receiver进程为节点2的13752进程,即节点2的LMON进程。
对于LMON进程。主要是监控RAC的GES信息。当然其作用不只局限于此。其还负责检查集群中各个Node的健康
情况,当有节点出现问题是,负责进行reconfig以及GRD(global resource Directory)的恢复等等。
我们知道
RAC的脑裂机制。假设IO fencing是Oracle本身来完毕,也就是说由CLusterware来完毕。
那么Lmon进程检查到
实例级别出现脑裂时,会通知clusterware来进行脑裂操作,然而其并不会等待clusterware的处理结果。当等待
超过一定时间,那么LMON进程会自己主动触发IMR(instance membership recovery)。这实际上也就是我们所说的
Instance membership reconfig。
其次,lmon进程主要通过2种心跳机制来检查推断集群节点的健康状态:
1) 网络心跳 (主要是通过ping进行检測)
2) 控制文件磁盘心跳,事实上就是每一个节点的ckpt进程每3s更新一次controlfile的机制。
所以这里大家能够看出,Lmon进程是须要操作controlfile的。
否则无法推断第2点。
尽管从上面的错误来看,该实例是被LMHB进程给终止掉的,这里我们须要说明一下,LMBH进程的原理。
LMBH进程是Oracle 11R2版本号引入的一个进程。该进程的作用的监控LMD,LMON,LCK,LMS等核心进程。防止这些Oracle
关键性后台进程spin或不被堵塞。
该进程会定时的将监控的打印信息输出在trace文件里,便于我们进行诊断。
这也是11gR2一个亮点。当LMBH进程发现其它核心进程出现异常时,会尝试发起一些kill动作。怎样有进程被堵塞的话。
假设一定时间内仍然无法解决,那么将触发保护,将实例强行终止掉。当然,这是为了保证RAC节点数据的完整性和一致性。
这里比較郁闷的是。这个diag的dump并没有产没 /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_diag_27561.trc
我们首先来看下Node1的Lmon进程的信息:
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
|
*** ACTION NAME:() 2014-04-22 17:26:56.052*** 2014-04-22 17:26:49.401==============================LMON (ospid: 27573) has not movedfor 105 sec (1398158808.1398158703)kjfmGCR_HBCheckAll: LMON (ospid: 27573) has status 6===================================================== LMON (ospid: 27573) Heartbeat Report==================================================LMON (ospid: 27573) has no heartbeatsfor 105 sec. (threshold 70 sec): Not in wait; last wait ended 89 secs ago. -------------等待了89秒: last wait_id 165313538 at'enq: CF - contention'.==============================Dumping PROCESS LMON (ospid: 27573) States=================================[ System Load State ]===CPU Total 16 Core 16 Socket 16Load normal: Cur 988 Highmark 20480 (3.85 80.00)===[ Latch State ]===Not in Latch Get===[ Session State Object ]===----------------------------------------SO: 0x52daba340, type: 4, owner: 0x52f5d8330, flag: INIT/-/-/0x00if: 0x3 c: 0x3proc=0x52f5d8330, name=session, file=ksu.h LINE:12624 ID:, pg=0(session) sid: 1057 ser: 1 trans: 0x0, creator: 0x52f5d8330flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-flags2: (0x409) -/-/INCDID: , short-term DID:txn branch: 0x0oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYSksuxds FALSE at location: 0service name: SYS$BACKGROUNDCurrent Wait Stack:Not in wait; last wait ended 1 min 29 sec agoThere are 1 sessions blocked by this session.Dumping one waiter:inst: 1, sid: 297, ser: 6347wait event:'name-service call wait'p1:'waittime'=0x32p2:''=0x0p3:''=0x0row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0min_blocked_time: 0 secs, waiter_cache_ver: 30272Wait State:fixed_waits=0 flags=0x20 boundary=0x0/-1Session Wait History:elapsed time of 1 min 29 sec since last wait ---LMON进程等待enq: CF - contention。等待了1分29秒,即89秒0: waitedfor 'enq: CF - contention'name|mode=0x43460006, 0=0x0, operation=0x3wait_id=165313538 seq_num=35946 snap_id=1wait times: snap=1.027254 sec, exc=1.027254 sec, total=1.027254 secwait times: max=1.000000 secwait counts: calls=1 os=1occurred after 0.000109 sec of elapsed time。。。。。。 |
例如以下是该进程的资源使用情况:
|
1
2
3
4
5
6
|
*** 2014-04-22 17:26:57.229loadavg : 3.94 3.80 3.99swap info: free_mem = 36949.53M rsv = 24548.22Malloc = 23576.62M avail = 45643.61M swap_free = 46615.21MF S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD0 O oa 27573 1 6 79 20 ?
|
我们能够看到,系统在该时间点load并不高。Memory也非常充足。
这里有一个问题。该节点LMON进程hung的原因是什么? 从日志分析来看,是因为无法获得enq: CF – contention。
我们知道ckpt 进程会定时更新操作controlfile,且就须要获得该enqueue。 全部这里我大胆的如果,是因为ckpt持有CF的latch
不释放,导致LMON进程无法获得. 依据这一点我搜mos 发现一个bug。可惜该bug说已经在11.2.0.3中fixed了。
Bug 10276173 LMON hang possible while trying to get access to CKPT progress record
该bug描写叙述说,当在进行reconfig时。lmon会尝试去获得ckpt processes record。会等待enq: CF -contention。会导致hung.
依据文档来看,显然这跟我们的实际情况不符。
以下我们来结合Node2的日志进行综合分析:
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
|
Tue Apr 22 17:25:06 2014IPC Send timeout detected. Receiver ospid 13752 [oracle@xhdb-server4 (LMON)]Tue Apr 22 17:26:59 2014Dumping diagnostic data in directory=[cdmp_20140422172658], requested by (instance=1, osid=27613 (LMHB)), summary=[abnormal instance termination].Tue Apr 22 17:29:22 2014WARNING: aiowait timed out 1 timesTue Apr 22 17:29:53 2014Errors in file /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/trace/authorcl2_dia0_13750.trc (incident=3681):ORA-32701: Possible hangs up to hang ID=3 detectedIncident details in: /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/incident/incdir_3681/authorcl2_dia0_13750_i3681.trcTue Apr 22 17:30:24 2014DIA0 terminating blocker (ospid: 16818 sid: 578 ser#: 39069) of hang with ID = 3requested by master DIA0 process on instance 2Hang Resolution Reason: Automatic hang resolution was performed to free asignificant number of affected sessions.by terminating session sid: 578 ospid: 16818DIA0 successfully terminated session sid:578 ospid:16818 with status 31.DIA0 successfully resolved a LOCAL, HIGH confidence hang with ID=3.Tue Apr 22 17:30:33 2014LMS2 (ospid: 13764) has detected no messaging activity from instance 1LMS2 (ospid: 13764) issues an IMR to resolve the situationPlease check LMS2 trace filefor more detail.Tue Apr 22 17:31:48 2014LMD0 (ospid: 13754) has detected no messaging activity from instance 1LMD0 (ospid: 13754) issues an IMR to resolve the situationPlease check LMD0 trace filefor more detail.Tue Apr 22 17:32:03 2014Errors in file /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/trace/authorcl2_dia0_13750.trc (incident=3682):ORA-32701: Possible hangs up to hang ID=3 detectedIncident details in: /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/incident/incdir_3682/authorcl2_dia0_13750_i3682.trcTue Apr 22 17:32:16 2014IPC Send timeout detected. Sender: ospid 23666 [oracle@xhdb-server4 (TNS V1-V3)]Receiver: inst 1 binc 380222 ospid 27575IPC Send timeout to 1.0 inc 20for msg type 12 from opid 154IPC Send timeout: Terminating pid 154 osid 23666Tue Apr 22 17:32:20 2014IPC Send timeout detected. Sender: ospid 13746 [oracle@xhdb-server4 (PING)]Receiver: inst 1 binc 380164 ospid 27565Tue Apr 22 17:32:34 2014DIA0 terminating blocker (ospid: 16818 sid: 578 ser#: 39069) of hang with ID = 3requested by master DIA0 process on instance 2Hang Resolution Reason: Automatic hang resolution was performed to free asignificant number of affected sessions.by terminating the processDIA0 successfully terminated process ospid:16818.DIA0 successfully resolved a LOCAL, HIGH confidence hang with ID=3.Tue Apr 22 17:32:35 2014LMS1 (ospid: 13760) has detected no messaging activity from instance 1LMS1 (ospid: 13760) issues an IMR to resolve the situationPlease check LMS1 trace filefor more detail.Tue Apr 22 17:32:44 2014IPC Send timeout detected. Sender: ospid 13754 [oracle@xhdb-server4 (LMD0)]Receiver: inst 1 binc 380222 ospid 27575IPC Send timeout to 1.0 inc 20for msg type 65521 from opid 12Tue Apr 22 17:33:11 2014LMS0 (ospid: 13756) has detected no messaging activity from instance 1LMS0 (ospid: 13756) issues an IMR to resolve the situationPlease check LMS0 trace filefor more detail.Tue Apr 22 17:34:29 2014IPC Send timeout detected. Sender: ospid 13764 [oracle@xhdb-server4 (LMS2)]Receiver: inst 1 binc 380309 ospid 27585IPC Send timeout to 1.1 inc 20for msg type 65522 from opid 15Tue Apr 22 17:36:31 2014 |
我们能够看到Node2 在Apr 22 17:26:59 2014 节点Node1的LMBH终止instance的信息了。然后在后面抛出hung的信息,
只是Oracle自己主动攻克了hung的session。 以下我们来看下Node2上lmon进程的trace内容:
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
|
*** 2014-04-22 17:26:59.377Process diagnostic dumpfor oracle@xhdb-server4 (LMON), OS id=13752,pid: 11, proc_ser: 1, sid: 353, sess_ser: 1-------------------------------------------------------------------------------current sql: <none>Current Wait Stack:0: waitingfor 'control file sequential read'file#=0x0, block#=0x23, blocks=0x1wait_id=272969233 seq_num=24337 snap_id=1wait times: snap=7 min 42 sec, exc=7 min 42 sec, total=7 min 42 sec ---已经等待了7分42秒wait times: max=infinite, heur=7 min 42 secwait counts: calls=0 os=0in_wait=1 iflags=0x5a0There are 1 sessions blocked by this session.Dumping one waiter:inst: 2, sid: 1092, ser: 49369wait event:'name-service call wait'p1:'waittime'=0x32p2:''=0x0p3:''=0x0row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0min_blocked_time: 0 secs, waiter_cache_ver: 6248Wait State:fixed_waits=0 flags=0x22 boundary=0x0/-1Session Wait History: |
从lmon的trace信息我们能够看出。该进程正在等待control file sequential read,且已经等待了7分42秒。
依据trace的时间点,我们能够向前推进7分42秒,换句话讲,从17:19:18秒就開始等待了。
既然是controlfile的等待,那么我们就有必要来看下Node2节点上的ckpt进程在干什么了? 例如以下是ckpt进程的信息:
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
|
Redo thread mounted by this instance: 2Oracle process number: 26Unix process pid: 13788, image: oracle@xhdb-server4 (CKPT)*** 2014-04-22 17:26:59.882*** SESSION ID:(833.1) 2014-04-22 17:26:59.882*** 2014-04-22 17:26:59.882Process diagnostic dumpfor oracle@xhdb-server4 (CKPT), OS id=13788,pid: 26, proc_ser: 1, sid: 833, sess_ser: 1-------------------------------------------------------------------------------current sql: <none>Current Wait Stack:0: waitingfor 'control file sequential read'file#=0x0, block#=0x1, blocks=0x1wait_id=14858985 seq_num=48092 snap_id=1wait times: snap=7 min 40 sec, exc=7 min 40 sec, total=7 min 40 sec ----等待了7分40秒wait times: max=infinite, heur=7 min 40 secwait counts: calls=0 os=0in_wait=1 iflags=0x5a0There are 2 sessions blocked by this session.Dumping one waiter:inst: 2, sid: 291, ser: 59157wait event:'DFS lock handle'p1:'type|mode'=0x43490005p2:'id1'=0xap3:'id2'=0x2row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0min_blocked_time: 352 secs, waiter_cache_ver: 6248Wait State:fixed_waits=0 flags=0x22 boundary=0x0/-1 |
我们能够看到,Node2的ckpt进程等待control file sequential read,等待了7分40秒。
同一时候大家还能够看到,ckpt
进程堵塞了2个进程,也就是说ckpt进程有2个waiter,当中一个waiter的信息是:sid:291,ser:59157
且该waiter进程的等待事件竟然是DFS lock handle,这是一个比較危急的event。 这里我们还无法确认这个waiter是什么?
同一时候ckpt进程为啥等待这么长的时间 ?
大家知道11g引入的hung auto resolution,那么我们就来看下Node1上的diag的信息:
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
|
Unix process pid: 27571, image: oracle@xhdb-server3 (DIA0)*** 2014-04-22 17:22:01.536*** SESSION ID:(961.1) 2014-04-22 17:22:01.536*** CLIENT ID:() 2014-04-22 17:22:01.536*** SERVICE NAME:(SYS$BACKGROUND) 2014-04-22 17:22:01.536*** MODULE NAME:() 2014-04-22 17:22:01.536*** ACTION NAME:() 2014-04-22 17:22:01.536One ormore possible hangs have been detected on your system.These could be genuine hangs in which no further progress willbe made without intervention,or it may be very slow progressin the system due to high load.Previously detectedand output hangs are not displayed again.Instead, the 'Victim Information' section will indicate thatthe victim is from an'Existing Hang' under the'Previous Hang'column.'Verified Hangs'below indicate one ormore hangs that were foundand final and they occurred. Since the current hang resolution state is'PROCESS',any hangs requiring sessionor process termination will beautomatically resolved.Any hangs with a 'Hang Resolution Action' 'Unresolvable'will be ignored. These types of hangs will either be resolvedby another layer in the RDBMSor cannot be resolvedas they mayrequire Deadlocks (also named cycles) are currently NOT resolved evenifhang resolution is enabled. The'Hang Type' of DLCK in the'Verified Hangs'output identifies these hangs.Below that are the complete hang chains from the time the hangwas detected.The following information will assist Oracle Support Servicesin further analysis of the root cause of the hang.*** 2014-04-22 17:22:01.537Verified Hangs in the SystemRoot Chain Total HangHang Hang Inst Root #hung #hung Hang Hang ResolutionID Type Status Num Sess Sess Sess Conf Span Action----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------2 HANG VICSELTD 2 833 2 2 HIGH LOCAL IGNRD:InstKillNotAHang Ignored Reason: Since instance termination is not allowed, automatichang resolution cannot be performed.inst# SessId Ser# OSPID PrcNm Event----- ------ ----- --------- ----- -----2 291 59157 10646 M000 DFS lock handle ----大家注意这里的sid和ser#以及PrcNm2 833 1 13788 CKPT control file sequential read |
这里提到M000进程,大家应该知道这是跟AWR快照有关系的进程,该进程事实上是被CKPT所堵塞。我们也能够来看下该进程
的情况,例如以下:
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
|
*** 2014-04-22 17:27:00.778Process diagnostic dumpfor oracle@xhdb-server4 (M000), OS id=10646,pid: 57, proc_ser: 143, sid: 291, sess_ser: 59157-------------------------------------------------------------------------------current sql:select tablespace_id, rfno, allocated_space, file_size, file_maxsize, changescn_base, changescn_wrap, flag,inst_id from sys.ts$, GV$FILESPACE_USAGEwhere ts# = tablespace_id andonline$ != 3 and or and and and or and Current Wait Stack:0: waitingfor 'DFS lock handle'type|mode=0x43490005, id1=0xa, id2=0x2wait_id=6 seq_num=7 snap_id=1wait times: snap=6 min 12 sec, exc=6 min 12 sec, total=6 min 12 secwait times: max=infinite, heur=6 min 12 secwait counts: calls=818 os=818in_wait=1 iflags=0x15a2There is at least one session blocking this session.Dumping 2 direct blocker(s):inst: 2, sid: 833, ser: 1inst: 1, sid: 482, ser: 1Dumpingfinal blocker:inst: 2, sid: 833, ser: 1 -----终于的blocker是833,也就是Node2节点的CKPT进程。
There are 1 sessions blocked by this session.Dumping one waiter:inst: 1, sid: 581, ser: 36139wait event:'DFS lock handle'p1:'type|mode'=0x43490005p2:'id1'=0xap3:'id2'=0x5 |
从这里看,root sess却是833,也就是我们Node2节点的CKPT进程。 到这里也许有人会说,问题的解决办法
应该非常明白了。因为Node2 ckpt的异常,到这Node2 节点Lmon进程异常,因为须要和Node1的Lmon进程
进行通信,导致Node1 的lmon进程出现IPc send timeout的情况。
事实上不然,到最后至始至终我们都没有全然弄清楚为何CKPT进程会等待这么长时间 ?
到这里我不得不怀疑IO的问题了,再回过头来分析Node1的diag trace时,发现一个有趣的事情:
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
|
*** 2014-04-22 17:24:08.363Verified Hangs in the SystemRoot Chain Total HangHang Hang Inst Root #hung #hung Hang Hang ResolutionID Type Status Num Sess Sess Sess Conf Span Action----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------7 HANG VICSELTD 2 801 2 3 HIGH LOCAL IGNRD:InstKillNotAHang Ignored Reason: Since instance termination is not allowed, automatichang resolution cannot be performed.inst# SessId Ser# OSPID PrcNm Event----- ------ ----- --------- ----- -----2 996 39917 6598 FG log file sync2 801 1 13786 LGWR log file parallel write。。
-------------------------------------------------------------------------------Chain 2:-------------------------------------------------------------------------------Oracle session identified by:{instance: 1 (authorcl.authorcl1)os id: 22323process id: 64, oracle@xhdb-server3session id: 14session serial #: 29257}is waitingfor 'log file sync'with wait info:{p1:'buffer#'=0x2abp2:'sync scn'=0x1ed4d8betime in wait: 4 min 18 sectimeout after: neverwait id: 13287blocking: 0 sessionswait history:* time between current waitand wait #1: 0.000388 sec1. event:'SQL*Net message from client'time waited: 0.000486 secwait id: 13286 p1:'driver id'=0x54435000p2:'#bytes'=0x1* time between wait #1and #2: 0.000027 sec2. event:'SQL*Net message to client'time waited: 0.000003 secwait id: 13285 p1:'driver id'=0x54435000p2:'#bytes'=0x1* time between wait #2and #3: 0.001494 sec3. event:'SQL*Net message from client'time waited: 0.002089 secwait id: 13284 p1:'driver id'=0x54435000p2:'#bytes'=0x1}andis blocked by=> Oracle session identified by:{instance: 1 (authorcl.authorcl1)os id: 27634process id: 20, oracle@xhdb-server3 (LGWR)session id: 386session serial #: 1}which is waitingfor 'log file parallel write'with wait info:{p1:'files'=0x1p2:'blocks'=0x2p3:'requests'=0x1time in wait: 4 min 32 sec -----等待了4分32秒timeout after: neverwait id: 51742574blocking: 17 sessions ------堵塞了17个Sessionwait history:* time between current waitand wait #1: 0.000176 sec1. event:'rdbms ipc message'time waited: 0.047194 secwait id: 51742573 p1:'timeout'=0x22* time between wait #1and #2: 0.000399 sec2. event:'log file parallel write'time waited: 0.004006 secwait id: 51742572 p1:'files'=0x1p2:'blocks'=0x2p3:'requests'=0x1* time between wait #2and #3: 0.000318 sec3. event:'rdbms ipc message'time waited: 2.654606 secwait id: 51742571 p1:'timeout'=0x12c}Chain 2 Signature:'log file parallel write'<='log file sync' |
我们能够看到,Node1上,lgwr进程堵塞了17个会话,等待log file parallel write。一直持续了4分32秒。
假设将时间点2014-04-22 17:24:08,向前推进4分32秒,那么就是2014-04-22 17:19:36.
我们再来检查Node2的操作系统日志你会发现有意思的事情:
|
1
2
3
4
5
6
7
8
9
10
11
|
Apr 22 17:19:16 xhdb-server4 vxio: [ID 442312 kern.notice] NOTICE: VxVM VVR vxio V-5-0-209 Log overflow protection on rlink oa_zs_bj triggered DCM protectionApr 22 17:19:16 xhdb-server4 vxio: [ID 636438 kern.warning] WARNING: VxVM VVR vxio V-5-0-1905 Replication stoppedfor RVG rvg_oa due to SRL overflow, DCM protection is triggered. To start replication, perform DCM resynchronization using"vradmin resync" command.Apr 22 19:41:19 xhdb-server4 su: [ID 810491 auth.crit]'su root' failed for oa on /dev/pts/5Apr 22 21:31:38 xhdb-server4 su: [ID 810491 auth.crit]'su grid' failed for root on /dev/pts/14Apr 22 21:36:03 xhdb-server4 AgentFramework[5814]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13067 Thread(4) Agent is calling cleanfor resource(cssd_oaora) because the resource became OFFLINE unexpectedly, on its own.Apr 22 21:36:03 xhdb-server4 Had[5704]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13067 (xhdb-server4) Agent is calling cleanfor resource(cssd_oaora) because the resource became OFFLINE unexpectedly, on its own.Apr 22 21:36:06 xhdb-server4 AgentFramework[5814]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13068 Thread(4) Resource(cssd_oaora) - clean completed successfully.Apr 22 21:40:51 xhdb-server4 reboot: [ID 662345 auth.crit] rebooted by rootApr 22 21:41:14 xhdb-server4 amf: [ID 451996 kern.notice] AMF NOTICE V-292-1-67 Signal receivedwhile waiting for event on reaper 'CFSMount'. Returning.Apr 22 21:41:14 xhdb-server4 amf: [ID 451996 kern.notice] AMF NOTICE V-292-1-67 Signal receivedwhile waiting for event on reaper 'CFSfsckd'. Returning.Apr 23 00:31:59 xhdb-server4 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_147440-25 64-bit |
我们能够看到,在17:19:16秒时vertias的VVR出现了异常。这也就是为什么我们在前面分析看到Node2在17:19:18时出现control file sequential read
等待的原因。尽管从vcs的日志什么信息:
|
1
2
3
4
5
6
|
2014/04/22 14:39:12 VCS INFO V-16-1-53504 VCS Engine Alive message!!2014/04/22 18:39:13 VCS INFO V-16-1-53504 VCS Engine Alive message!!2014/04/22 21:36:03 VCS ERROR V-16-2-13067 (xhdb-server4) Agent is calling cleanfor resource(cssd_oaora) because the resource became OFFLINE unexpectedly, on its own.2014/04/22 21:36:06 VCS INFO V-16-2-13068 (xhdb-server4) Resource(cssd_oaora) - clean completed successfully.2014/04/22 21:36:09 VCS INFO V-16-1-10307 Resource cssd_oaora (Owner: Unspecified, Group: sg_oaora) is offline on xhdb-server4 (Not initiated by VCS)2014/04/22 21:36:09 VCS NOTICE V-16-1-10446 Group sg_oaora is offline on system xhdb-server4 |
所以。最后我的感觉是根本原因是vcs的问题。
尽管vertias的project师一直解释这里的Log overflow protection没有太大的关系。
针对这个问题,欢迎大家探讨。
补充:关于ora-29770导致实例crash的问题。Oracle确实有不少的bug,可是针对这个情况,眼下没有发现符合的。例如以下是来自Oracle MOS的搜索结果:
Bug 11890804:LMHB crashes instance with ORA-29770 after long “control file sequential read” waits
Bug 8888434: LMHB crashes the instance with LMON waiting on controlfile read
Bug 11890804: LMHB TERMINATE INSTANCE WHEN LMON WAIT CHANGE FROM CF READ AFTER 60 SEC
Bug 13467673: CSS MISSCOUNT AND ALL ASM DOWN WITH ORA-29770 BY LMHB
Bug 13390052: KJFMGCR_HBCHECKALL MESSAGES ARE CONTINUOUSLY LOGGED IN LMHB TRACE FILE.
Bug 13322797: LMHB TERMINATES THE INSTANCE DUE TO ERROR 29770
Bug 13061883: LMHB IS TERMINATING THE INSTANCE DURING SHUTDOWN IMMEDIATE
Bug 12886605: ESSC: LMHB TERMINATE INSTANCE DUE TO 29770 – LMON WAIT ENQ: AM – DISK OFFLINE
Bug 12757321: LMHB TERMINATING THE INSTANCE DUE TO ERROR 29770
Bug 10296263: LMHB (OSPID: 15872): TERMINATING THE INSTANCE DUE TO ERROR 29770
Bug 10431752: SINGLE NODE RAC: LMHB TERMINATES INSTANCE DUE TO 29770
Bug 11656856: LMHB (OSPID: 27701): TERMINATING THE INSTANCE DUE TO ERROR 29770
Bug 10411143: INSTANCE CRASHES WITH IPC SEND TIMEOUT AND LMHB TERMINATES WITH ORA-29770
Bug 11704041: DATABASE INSTANCE CRASH BY LMHB PROCESS
Bug 10412545: ORA-29770 LMHB TERMINATE INSTANCE DUE TO VARIOUS LONG CSS WAIT
Bug 10147827: INSTANCE TERMINATED BY LMHB WITH ERROR ORA-29770
Bug 10016974: ORA-29770 LMD IS HUNG FOR MORE THAN 70 SECONDS AND LMHB TERMINATE INSTANCE
Bug 9376100: LMHB TERMINATING INSTANCE DUE ERROR 29770
11.2.0.3 RAC(VCS)节点crash以及hang的问题分析的更多相关文章
- Oracle 11.2.0.4 RAC安装最新PSU补丁
环境:两节点RAC(RHEL 6.4 + GI 11.2.0.4 + Oracle 11.2.0.4) 需求:安装最新PSU补丁11.2.0.4.7 1.下载补丁和最新OPatch 2.检查数据库当前 ...
- ORACLE LINUX 6.3 + ORACLE 11.2.0.3 RAC + VBOX安装文档
ORACLE LINUX 6.3 + ORACLE 11.2.0.3 RAC + VBOX安装文档 2015-10-21 12:51 525人阅读 评论(0) 收藏 举报 分类: Oracle RA ...
- HPDL380G8平台11.2.0.3 RAC实施手册
HPDL380G8平台11.2.0.3 RAC实施手册 1 前言 此文档详细描述了Oracle 11gR2 数据库在HPDL380G上的安装RAC的检查及安装步骤.文档中#表示root用户执行,$ ...
- RHEL7或CentOS7安装11.2.0.4 RAC碰到的问题
RHEL7或CentOS7安装11.2.0.4 RAC碰到的问题 随着Linux 版本的普及,但Oracle数据库主流版本仍是11gR2, 的支持不很完美,在Linux 上安装会遇到几处问题,以此记录 ...
- Oracle 11g R2(11.2.0.4) RAC 数据文件路径错误解决--ORA-01157 ORA-01110: 数据文件
Oracle 11g R2(11.2.0.1) RAC 数据文件路径错误解决--ORA-01157 ORA-01110: 数据文件 oracle 11g R2(11.2.0.4) rac--scan ...
- oracle 11.2.0.4 rac 打补丁
本次安装pus环境是11.2.0.4 rac,打的patch为11.2.0.4.180717 (Includes Database PSU),gi补丁和数据库补丁一起打 安装最新opatch版本 un ...
- Oracle 11.2.0.4 RAC重建EM案例
环境:Oracle 11.2.0.4 RAC 重建EM 背景:客户之前的EM已经被损坏,需要重建EM 重建EM的方案有很多,其中最简单的方法是:直接使用emca重建,oracle用户下,只需一条命令搞 ...
- Deploy 11.2.0.3 RAC+DG on Windows 2008 R2 Step by Step
环境规划: 节点1: tc1 192.168.56.101 内存:2G 节点2: tc2 192.168.56.102 内存:2G 物理备库:tcdg192.168.56.108内存:1.5G 操作系 ...
- 11.2.0.4 RAC 手动打补丁
1. 下载补丁和最新OPatchGI PSU : p25869727_112040_Linux-x86-64.zipOPatch : p6880880_112000_Linux-x86-64.zip ...
随机推荐
- codeM编程大赛E题 (暴力+字符串匹配(kmp))
题目大意:S(n,k)用k(2-16)进制表示1-n的数字所组成的字符串,例如S(16,16)=123456789ABCDEF10: 解题思路: n最大50000,k最大100000,以为暴力会超时. ...
- sonarQube代码管理工具
第一步:安装环境:jdk 1.8 idea mysql5.6以上 sonarqube5.6.6 第二歩:下载好sonarqube后,解压打开bin目录,启动相应OS目录下的StartSonar ...
- Windows 下 MySql 5.7.20安装及data和my.ini文件的配置(转)
Windows 下 MySql 5.7.20安装及data和my.ini文件的配置 本文通过图文并茂的形式给大家介绍了MySql 5.7.20安装及data和my.ini文件的配置方法. my ...
- 如何启用内置的Win10 OpenSSH客户端(转)
如何启用内置的Win10 OpenSSH客户端 关注Windows10,锁定Win10之家(http://www.ghost580.com/windows10/) 如何启用内置的Win10 OpenS ...
- PAT 1066. Root of AVL Tree (25)
An AVL tree is a self-balancing binary search tree. In an AVL tree, the heights of the two child sub ...
- 【二分答案+2-SAT】Now or later UVALive - 3211
题目链接:https://cn.vjudge.net/contest/209473#problem/J 题目大意: 有n架飞机,每架飞机有两个可降落时间点a,b(a<b)(即一架飞机可以选择在时 ...
- Activity-Flag标志位
Activity-Flag标志位 学习自 <Android开发艺术探索> 标志位漫谈 var intent: Intent = Intent(this, Test2Activity::cl ...
- C#简单的通用分页
通用分页技术分析 需要返回不同的类型的数据--采用泛型实现该操作 需要提供不同的方法 上一页 上一页 第一页 最后一页 跳转到指定页 Demo 代码 using System; using Syste ...
- java 实现生产者-消费者模式
生产和消费者模式有很多种,现在介绍几种常见的方式 wait/notify实现生产和消费者模式 1.使用wait/notify实现生产和消费者模式: public class Depot { // 实际 ...
- 如何利用Reveal神器查看各大APP UI搭建层级
作者 乔同X2016.08.22 19:45 写了3195字,被42人关注,获得了73个喜欢 如何利用Reveal神器查看各大APP UI搭建层级 字数413 阅读110 评论0 喜欢5 title: ...