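In the previous post, analysis of the listener log showed that when the rac1 database could not be reached, the following entry appeared: listener_20160628.log:28-JUN-2016 07:55:47 * service_died * LsnrAgt * 12537. The cause was that rac2 had suddenly lost power shortly before (28-JUN-2016 07:55:30 * service_update; Tue Jun 28 16:47:37 2016). The open question is why this mattered: when node 2 drops out, node 1 should take over the listener automatically, and front-end applications should not notice anything.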

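In addition, the logs show several other service_died errors during the six months the system has been running. This post records the analysis of what caused the service_died error.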

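Check the CRS log

Running find /u1 -name alert gives the location of the CRS log directly: '/u1/app/grid/diag/crs/test-rac1/crs/alert'.

The following commands extract the log entries for the day of the problem.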

[root@test-rac1 alert]# grep -n '2016-06-28' log.xml | head -1
121600:<msg time='2016-06-28T07:55:46.994+08:00' org_id='oracle' comp_id='crs'
[root@test-rac1 alert]# grep -n '2016-06-27' log.xml | head -1
[root@test-rac1 alert]# grep -n '2016-06-29' log.xml | head -1
123264:<msg time='2016-06-29T04:33:38.602+08:00' org_id='oracle' comp_id='crs'
[root@test-rac1 alert]# sed -n '121600,123264p' log.xml > log_20160628.xml
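The same slice can be taken in one pass without looking up line numbers by hand. The sketch below is a minimal illustration, assuming every message in log.xml starts with a line of the form <msg time='YYYY-MM-DD...' as in the excerpts here; extract_day is a hypothetical helper name, not an Oracle tool. Note that the sed range above ends on line 123264, which is the first line of the next day, so that slice appears to carry one extra line; the sketch stops before it.

```shell
#!/usr/bin/env bash
# extract_day DAY FILE
# Print every line from the first <msg> stamped with DAY (YYYY-MM-DD)
# up to, but not including, the first <msg> of a different day.
# Hypothetical helper for illustration only.
extract_day() {
    local day="$1" file="$2"
    awk -v day="$day" '
        # A new message starts with <msg time= followed by a quote
        # and an ISO timestamp, so the date sits at columns 12-21.
        /^<msg time=/ {
            d = substr($0, 12, 10)
            if (d == day) {
                inside = 1
            } else if (inside) {
                exit            # first message after the wanted day
            }
        }
        inside { print }
    ' "$file"
}
```

Usage would look like: extract_day 2016-06-28 log.xml > log_20160628.xml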

This yields the following error messages

<msg time='2016-06-28T07:55:46.994+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:46.991 [CRSD(2962)]CRS-2771: Maximum restart attempts reached for resource &apos;ora.scan1.vip&apos;; will not restart.
</txt>
</msg>
<msg time='2016-06-28T07:55:47.450+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:47.449 [CRSD(2962)]CRS-2771: Maximum restart attempts reached for resource &apos;ora.test-rac1.vip&apos;; will not restart.
</txt>
</msg>
<msg time='2016-06-28T07:55:47.974+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:47.974 [ORAROOTAGENT(3053)]CRS-5017: The resource action &quot;ora.net1.network start&quot; encountered the following error:
2016-06-28 07:55:47.974+CRS-5008: Invalid attribute value: eth0 for the network interface
. For details refer to &quot;(:CLSN00107:)&quot; in &quot;/u1/app/grid/diag/crs/test-rac1/crs/trace/crsd_orarootagent_root.trc&quot;.
</txt>
</msg>
<msg time='2016-06-28T07:55:47.992+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:47.990 [CRSD(2962)]CRS-2878: Failed to restart resource &apos;ora.net1.network&apos;
</txt>
</msg>
<msg time='2016-06-28T07:55:48.005+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:48.005 [CRSD(2962)]CRS-2769: Unable to failover resource &apos;ora.net1.network&apos;.
</txt>
</msg>
<msg time='2016-06-28T07:55:49.119+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:49.115 [ORAAGENT(3044)]CRS-5016: Process &quot;/u1/app/12.1.0/grid/bin/lsnrctl&quot; spawned by agent &quot;ORAAGENT&quot; for action &quot;check&quot; failed: details at &quot;(:CLSN00010:)&quot; in &quot;/u1/app/grid/diag/crs/test-rac1/crs/trace/crsd_oraagent_grid.trc&quot;
</txt>
</msg>
<msg time='2016-06-28T07:55:49.132+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:49.131 [ORAROOTAGENT(3053)]CRS-5017: The resource action &quot;ora.net1.network start&quot; encountered the following error:
2016-06-28 07:55:49.131+CRS-5008: Invalid attribute value: eth0 for the network interface
. For details refer to &quot;(:CLSN00107:)&quot; in &quot;/u1/app/grid/diag/crs/test-rac1/crs/trace/crsd_orarootagent_root.trc&quot;.
</txt>
</msg>
<msg time='2016-06-28T07:55:49.144+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:49.144 [CRSD(2962)]CRS-2878: Failed to restart resource &apos;ora.LISTENER_SCAN1.lsnr&apos;
</txt>
</msg>
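The XML wrapper and the escaped quotes make these messages hard to scan. As a rough aid, the sketch below pulls out only the lines that carry a CRS-nnnn code, decodes the XML entities, and counts how often each code occurs. Both function names are hypothetical, and the sketch assumes the one-message-per-<txt> layout shown above.

```shell
#!/usr/bin/env bash
# crs_messages FILE
# Print the lines that contain a CRS-nnnn code, with the leading <txt>
# tag removed and the XML entities decoded back to plain characters.
crs_messages() {
    grep -E 'CRS-[0-9]+' "$1" |
        sed -e 's/^<txt>//' \
            -e "s/&apos;/'/g" -e 's/&quot;/"/g' \
            -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g'
}

# count_crs_codes FILE
# Print each CRS code with its number of occurrences, most frequent first.
count_crs_codes() {
    grep -oE 'CRS-[0-9]+' "$1" | sort | uniq -c | sort -rn
}
```

Usage would look like: crs_messages log_20160628.xml | less, followed by count_crs_codes log_20160628.xml to see which codes dominate the day.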

Check the orarootagent trace log

Location: /u1/app/grid/diag/crs/test-rac1/crs/trace

Take the times of the listener errors and look up the trace entries written at those times.

listener_20160628.log:28-JUN-2016 07:55:47 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 07:56:56 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 07:57:09 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:13:54 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:15:05 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:28:04 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:36:42 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:41:56 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:42:08 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:56:58 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:57:10 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 11:02:46 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 11:04:53 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 11:05:06 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 14:18:03 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 16:24:42 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 16:26:56 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 16:30:21 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 16:30:26 * service_died * LsnrAgt * 12537
2016-06-28 07:55:46.806663 :CLSDYNAM:450848512: [ora.scan1.vip]{2:11414:10026} [check] Failed to check 192.19.88.76 on eth0
2016-06-28 07:55:46.806700 :CLSDYNAM:450848512: [ora.scan1.vip]{2:11414:10026} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-06-28 07:55:46.806741 :CLSDYNAM:450848512: [ora.scan1.vip]{2:11414:10026} [check] VipAgent::checkIp returned false
2016-06-28 07:55:46.811719 : AGFW:463456000: {2:11414:10026} ora.scan1.vip 1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:55:46.823646 : AGFW:463456000: {0:5:245} Generating new Tint for unplanned state change. Original Tint: {2:11414:10026}
2016-06-28 07:55:46.823776 : AGFW:463456000: {0:5:245} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3749652
2016-06-28 07:55:46.933964 :CLSDYNAM:675165952: [ora.net1.network]{1:3322:2} [check] Network Res Check Action returned 1 return ONLINE
2016-06-28 07:55:47.182374 : AGFW:463456000: {1:3322:2} Agent received the message: AGENT_HB[Engine] ID 12293:223268
2016-06-28 07:55:47.436435 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:2} [check] Failed to check 192.19.88.82 on eth0
2016-06-28 07:55:47.436484 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:2} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-06-28 07:55:47.436518 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:2} [check] VipAgent::checkIp returned false
2016-06-28 07:55:47.440127 : USRTHRD:450848512: {1:3322:2} Thread:[SendFail2SrvThread:] start { acquire thndMX:f8023680
2016-06-28 07:55:47.440156 : USRTHRD:450848512: {1:3322:2} Thread:[SendFail2SrvThread:] start pThnd:f8000a30
2016-06-28 07:55:47.440246 : USRTHRD:450848512: {1:3322:2} Thread:[SendFail2SrvThread:] start 2 release thndMX:f8023680 }
2016-06-28 07:55:47.440817 : AGFW:463456000: {1:3322:2} ora.test-rac1.vip 1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:55:47.440980 : AGFW:463456000: {0:5:246} Generating new Tint for unplanned state change. Original Tint: {1:3322:2}
2016-06-28 07:55:47.441095 : AGFW:463456000: {0:5:246} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3749662
2016-06-28 07:55:47.445142 : USRTHRD:465557248: {1:3322:2} VipAgent::sendFail2Srv {
2016-06-28 07:55:47.461505 : USRTHRD:465557248: {1:3322:2} VipAgent::sendFail2Srv }
2016-06-28 07:55:47.461662 : USRTHRD:465557248: {1:3322:2} Thread:[SendFail2SrvThread:] isRunning is reset to false here
2016-06-28 07:55:47.941795 :CLSDYNAM:450848512: [ora.net1.network]{1:3322:2} [check] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:55:47.943998 : AGFW:463456000: {1:3322:2} ora.net1.network test-rac1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:55:47.944024 : AGFW:463456000: {1:3322:2} Switching online monitor to offline one
2016-06-28 07:55:47.944108 : AGFW:463456000: {1:3322:2} Starting offline monitor
2016-06-28 07:55:47.944197 : AGFW:463456000: {1:3322:2} Started implicit monitor for [ora.net1.network test-rac1 1] interval=60000 delay=60000
2016-06-28 07:55:47.944248 : AGFW:463456000: {0:5:247} Generating new Tint for unplanned state change. Original Tint: {1:3322:2}
2016-06-28 07:55:47.944357 : AGFW:463456000: {0:5:247} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3749668
2016-06-28 07:55:47.955510 : AGFW:463456000: {0:5:247} Agent received the message: RESOURCE_START[ora.net1.network test-rac1 1] ID 4098:223304
2016-06-28 07:55:47.955543 : AGFW:463456000: {0:5:247} Preparing START command for: ora.net1.network test-rac1 1
2016-06-28 07:55:47.955561 : AGFW:463456000: {0:5:247} ora.net1.network test-rac1 1 state changed from: OFFLINE to: STARTING
2016-06-28 07:55:47.957213 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] (:CLSN00107:) clsn_agent::start {
2016-06-28 07:55:47.958107 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] NetworkAgent::init enter {
2016-06-28 07:55:47.958500 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] VendorType=0
2016-06-28 07:55:47.958556 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] Checking if eth0 Interface is fine
2016-06-28 07:55:47.958730 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:55:47.973654 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] Agent::commonStart Exception UserErrorException
2016-06-28 07:55:47.973987 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] clsnUtils::error Exception type=2 string=
CRS-5017: The resource action "ora.net1.network start" encountered the following error:
CRS-5008: Invalid attribute value: eth0 for the network interface
2016-06-28 07:55:47.974495 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] (:CLSN00107:) clsn_agent::start }
2016-06-28 07:55:47.974525 : AGFW:450848512: {0:5:247} Command: start for resource: ora.net1.network test-rac1 1 completed with status: FAIL
2016-06-28 07:55:47.974822 : AGFW:463456000: {0:5:247} Agent sending reply for: RESOURCE_START[ora.net1.network test-rac1 1] ID 4098:223304
2016-06-28 07:55:47.975059 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] NetworkAgent::init enter {
2016-06-28 07:55:47.975252 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] VendorType=0
2016-06-28 07:55:47.975296 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] Checking if eth0 Interface is fine
2016-06-28 07:55:47.975524 : AGFW:463456000: {0:5:247} Agent sending reply for: RESOURCE_START[ora.net1.network test-rac1 1] ID 4098:223304
2016-06-28 07:55:47.975551 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:55:47.976004 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] exception in init
2016-06-28 07:55:47.976497 : AGFW:463456000: {0:5:247} ora.net1.network test-rac1 1 state changed from: STARTING to: OFFLINE
2016-06-28 07:55:47.976523 : AGFW:463456000: {0:5:247} Switching online monitor to offline one
2016-06-28 07:55:47.976587 : AGFW:463456000: {0:5:247} Starting offline monitor
2016-06-28 07:55:47.976700 : AGFW:463456000: {0:5:247} Started implicit monitor for [ora.net1.network test-rac1 1] interval=60000 delay=60000
2016-06-28 07:55:47.976840 : AGFW:463456000: {0:5:247} Agent sending last reply for: RESOURCE_START[ora.net1.network test-rac1 1] ID 4098:223304
2016-06-28 07:55:47.982037 : AGFW:463456000: {0:5:247} Agent received the message: RESOURCE_CLEAN[ora.net1.network test-rac1 1] ID 4100:223311
2016-06-28 07:55:47.982071 : AGFW:463456000: {0:5:247} Preparing CLEAN command for: ora.net1.network test-rac1 1
2016-06-28 07:55:47.982089 : AGFW:463456000: {0:5:247} ora.net1.network test-rac1 1 state changed from: OFFLINE to: CLEANING
2016-06-28 07:55:47.982930 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [clean] (:CLSN00106:) clsn_agent::clean {
2016-06-28 07:55:47.982997 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [clean] clean {
2016-06-28 07:55:47.983018 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [clean] clean }
2016-06-28 07:55:47.983060 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [clean] (:CLSN00106:) clsn_agent::clean }
2016-06-28 07:55:47.983079 : AGFW:459253504: {0:5:247} Command: clean for resource: ora.net1.network test-rac1 1 completed with status: SUCCESS
2016-06-28 07:55:47.983607 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] NetworkAgent::init enter {
2016-06-28 07:55:47.983716 : AGFW:463456000: {0:5:247} Agent sending reply for: RESOURCE_CLEAN[ora.net1.network test-rac1 1] ID 4100:223311
2016-06-28 07:55:47.983802 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] VendorType=0
2016-06-28 07:55:47.983844 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] Checking if eth0 Interface is fine
2016-06-28 07:55:47.984009 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:55:47.984601 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] exception in init
2016-06-28 07:55:47.985001 : AGFW:463456000: {0:5:247} ora.net1.network test-rac1 1 state changed from: CLEANING to: OFFLINE
2016-06-28 07:55:47.985026 : AGFW:463456000: {0:5:247} Switching online monitor to offline one
2016-06-28 07:55:47.985086 : AGFW:463456000: {0:5:247} Starting offline monitor
2016-06-28 07:55:47.985143 : AGFW:463456000: {0:5:247} Started implicit monitor for [ora.net1.network test-rac1 1] interval=60000 delay=60000
2016-06-28 07:55:47.985691 : AGFW:463456000: {0:5:247} Agent sending last reply for: RESOURCE_CLEAN[ora.net1.network test-rac1 1] ID 4100:223311
2016-06-28 07:56:55.862677 :CLSDYNAM:675165952: [ora.test-rac2.vip]{1:3322:17228} [check] Failed to check 192.19.88.83 on eth0
2016-06-28 07:56:55.862713 :CLSDYNAM:675165952: [ora.test-rac2.vip]{1:3322:17228} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-06-28 07:56:55.862775 :CLSDYNAM:675165952: [ora.test-rac2.vip]{1:3322:17228} [check] VipAgent::checkIp returned false
2016-06-28 07:56:55.863613 : AGFW:463456000: {1:3322:17228} ora.test-rac2.vip 1 1 state changed from: PARTIAL to: OFFLINE
2016-06-28 07:56:55.863701 : AGFW:463456000: {0:5:251} Generating new Tint for unplanned state change. Original Tint: {1:3322:17228}
2016-06-28 07:56:55.863813 : AGFW:463456000: {0:5:251} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3750208
2016-06-28 07:56:56.196759 :CLSDYNAM:459253504: [ora.net1.network]{1:3322:17228} [check] Network Res Check Action returned 1 return ONLINE
2016-06-28 07:56:56.363286 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:17228} [check] Failed to check 192.19.88.82 on eth0
2016-06-28 07:56:56.363324 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:17228} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-06-28 07:56:56.363408 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:17228} [check] VipAgent::checkIp returned false
2016-06-28 07:56:56.363737 : USRTHRD:450848512: {1:3322:17228} Thread:[SendFail2SrvThread:] start { acquire thndMX:f8034b50
2016-06-28 07:56:56.363760 : USRTHRD:450848512: {1:3322:17228} Thread:[SendFail2SrvThread:] start pThnd:f80235e0
2016-06-28 07:56:56.363855 : USRTHRD:450848512: {1:3322:17228} Thread:[SendFail2SrvThread:] start 2 release thndMX:f8034b50 }
2016-06-28 07:56:56.364491 : AGFW:463456000: {1:3322:17228} ora.test-rac1.vip 1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:56:56.364598 : AGFW:463456000: {0:5:252} Generating new Tint for unplanned state change. Original Tint: {1:3322:17228}
2016-06-28 07:56:56.364658 : USRTHRD:452949760: {1:3322:17228} VipAgent::sendFail2Srv {
2016-06-28 07:56:56.364681 : AGFW:463456000: {0:5:252} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3750217
2016-06-28 07:56:56.387222 : USRTHRD:452949760: {1:3322:17228} VipAgent::sendFail2Srv }
2016-06-28 07:56:56.387282 : USRTHRD:452949760: {1:3322:17228} Thread:[SendFail2SrvThread:] isRunning is reset to false here
2016-06-28 07:56:56.863825 :CLSDYNAM:675165952: [ora.scan1.vip]{1:3322:17228} [check] Failed to check 192.19.88.76 on eth0
2016-06-28 07:56:56.863856 :CLSDYNAM:675165952: [ora.scan1.vip]{1:3322:17228} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-06-28 07:56:56.863883 :CLSDYNAM:675165952: [ora.scan1.vip]{1:3322:17228} [check] VipAgent::checkIp returned false
2016-06-28 07:56:56.864616 : AGFW:463456000: {1:3322:17228} ora.scan1.vip 1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:56:56.864711 : AGFW:463456000: {0:5:253} Generating new Tint for unplanned state change. Original Tint: {1:3322:17228}
2016-06-28 07:56:56.864780 : AGFW:463456000: {0:5:253} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3750220
2016-06-28 07:56:57.197099 :CLSDYNAM:675165952: [ora.net1.network]{1:3322:17228} [check] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:56:57.197197 :CLSDYNAM:675165952: [ora.net1.network]{1:3322:17228} [check] ifName = eth1:1
2016-06-28 07:56:57.197608 : AGFW:463456000: {1:3322:17228} ora.net1.network test-rac1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:56:57.197639 : AGFW:463456000: {1:3322:17228} Switching online monitor to offline one
2016-06-28 07:56:57.197714 : AGFW:463456000: {1:3322:17228} Starting offline monitor
2016-06-28 07:56:57.197804 : AGFW:463456000: {1:3322:17228} Started implicit monitor for [ora.net1.network test-rac1 1] interval=60000 delay=60000
2016-06-28 07:56:57.197841 : AGFW:463456000: {0:5:254} Generating new Tint for unplanned state change. Original Tint: {1:3322:17228}
2016-06-28 07:56:57.197897 : AGFW:463456000: {0:5:254} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3750226
2016-06-28 07:56:57.203615 : AGFW:463456000: {0:5:254} Agent received the message: RESOURCE_START[ora.net1.network test-rac1 1] ID 4098:223726
2016-06-28 07:56:57.203639 : AGFW:463456000: {0:5:254} Preparing START command for: ora.net1.network test-rac1 1
2016-06-28 07:56:57.203651 : AGFW:463456000: {0:5:254} ora.net1.network test-rac1 1 state changed from: OFFLINE to: STARTING
2016-06-28 07:56:57.204319 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] (:CLSN00107:) clsn_agent::start {
2016-06-28 07:56:57.204507 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] NetworkAgent::init enter {
2016-06-28 07:56:57.204735 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] VendorType=0
2016-06-28 07:56:57.204795 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] Checking if eth0 Interface is fine
2016-06-28 07:56:57.205009 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:56:57.205101 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] ifName = eth1:1
2016-06-28 07:56:57.205528 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] Agent::commonStart Exception UserErrorException
2016-06-28 07:56:57.205805 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] clsnUtils::error Exception type=2 string=
CRS-5017: The resource action "ora.net1.network start" encountered the following error:
CRS-5008: Invalid attribute value: eth0 for the network interface
. For details refer to "(:CLSN00107:)" in "/u1/app/grid/diag/crs/test-rac1/crs/trace/crsd_orarootagent_root.trc".
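The trace is easier to follow when reduced to its state transitions: ora.scan1.vip and ora.test-rac1.vip go OFFLINE first, then ora.net1.network goes OFFLINE and cycles through STARTING and CLEANING without ever coming back ONLINE. The sketch below extracts just those "state changed" lines; state_changes is a hypothetical helper, and it assumes the AGFW line layout shown in this excerpt.

```shell
#!/usr/bin/env bash
# state_changes FILE
# Reduce an agent trace to one line per resource state transition:
#   DATE TIME RESOURCE OLD -> NEW
# Assumes lines ending in "state changed from: OLD to: NEW".
state_changes() {
    awk '/state changed from:/ {
        res = "?"
        # The resource name is the first field that starts with "ora."
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^ora\./) { res = $i; break }
        }
        # $2 is HH:MM:SS.ffffff; keep whole seconds only
        printf "%s %s %s %s -> %s\n", $1, substr($2, 1, 8), res, $(NF-2), $NF
    }' "$1"
}
```

Usage would look like: state_changes crsd_orarootagent_root.trc | grep '2016-06-28 07:5'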

Verification tests

Manual relocation

Check whether the VIP can be relocated manually.

  • Start a monitoring script on node 1
cat mon_crs_status.sh
while [[ 1 ]]
do
date
ifconfig -a
echo ''
ping -c 2 test-rac1
ping -c 2 test-rac2
ping -c 2 test-rac1-vip
ping -c 2 test-rac2-vip
ping -c 2 test-cluster
crsctl stat res -t
sleep 5
done
  • On RAC1, run it with the following commands
nohup ./mon_crs_status.sh > mon_crs_log &
ps -ef | grep mon_crs_status.sh
  • Capture network packets
tcpdump -s 0 -c 100  -w test1.cap
  • Stop the VIP on node 2 (together with the instance, ASM and listener)
srvctl stop instance -d testdb -i TESTDB2
srvctl stop asm -n test-rac2 -force
srvctl stop listener -n test-rac2
srvctl stop nodeapps -n test-rac2
  • From the local machine, test whether the rac2 VIP has moved
tnsping rac_testdb2
tnsping rac_testdb
  • Relocate the VIP manually
crs_relocate ora.test-rac2.vip
  • Test again whether the rac2 VIP has moved
tnsping rac_testdb2
tnsping rac_testdb
  • Start the rac2 services
srvctl start nodeapps -n test-rac2
srvctl start asm -n test-rac2
srvctl start instance -d testdb -i TESTDB2
crsctl stat res -t

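Shut down the RAC2 virtual machine and check whether the VIP fails over

  • On RAC1, run the monitoring script with the following commands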
nohup ./mon_crs_status.sh > mon_crs_log &
ps -ef | grep mon_crs_status.sh
  • Capture network packets
tcpdump -s 0 -c 100  -w test2.cap
  • Power off the rac2 server
poweroff
  • From the local machine, test whether the rac2 VIP has moved
tnsping rac_testdb2
tnsping rac_testdb
  • Start the server, then start the instance
crsctl check crs
crsctl start crs

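Test results (noon, July 6)

  1. With the RAC services stopped manually, the cluster service remained available. rac2-vip could not be relocated manually.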
[grid@test-rac2 ~]$ crs_relocate ora.test-rac2.vip
CRS-5708: Resource 'ora.test-rac2.vip' is not relocatable (current and target state not running)
CRS-0223: Resource 'ora.test-rac2.vip' has placement error.
[grid@test-rac2 ~]$ srvctl relocate vip -vip test-rac2
PRCR-1090 : Failed to relocate resource ora.test-rac2.vip. It is not running.
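  2. With the rac2 virtual machine powered off, the cluster service remained available. rac2-vip failed over to node rac1 automatically, but it stayed in the INTERMEDIATE state.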
[oracle@test-rac1 scripts]$ crsctl stat res -t
--------------------------------------------------------------------------------
Name Target State Server State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE test-rac1 STABLE
ora.test-rac1.vip
1 ONLINE ONLINE test-rac1 STABLE
ora.test-rac2.vip
1 ONLINE INTERMEDIATE test-rac1 FAILED OVER,STABLE
ora.scan1.vip
1 ONLINE ONLINE test-rac1 STABLE
--------------------------------------------------------------------------------
Wed Jul  6 12:50:53 CST 2016
***********************start*************************************
eth0 Link encap:Ethernet HWaddr 00:15:5D:75:0B:15
inet addr:192.19.88.70 Bcast:192.19.88.255 Mask:255.255.255.0
inet6 addr: fe80::215:5dff:fe75:b15/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:18106573 errors:0 dropped:0 overruns:0 frame:0
TX packets:16997728 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5394898582 (5.0 GiB) TX bytes:14791039101 (13.7 GiB)

eth0:1 Link encap:Ethernet HWaddr 00:15:5D:75:0B:15
inet addr:192.19.88.76 Bcast:192.19.88.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

eth0:2 Link encap:Ethernet HWaddr 00:15:5D:75:0B:15
inet addr:192.19.88.83 Bcast:192.19.88.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

eth0:3 Link encap:Ethernet HWaddr 00:15:5D:75:0B:15
inet addr:192.19.88.82 Bcast:192.19.88.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

eth1 Link encap:Ethernet HWaddr 00:15:5D:75:0B:16
inet addr:192.168.2.5 Bcast:192.168.2.255 Mask:255.255.255.0
inet6 addr: fe80::215:5dff:fe75:b16/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:109282486 errors:0 dropped:0 overruns:0 frame:0
TX packets:68084084 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:82205348347 (76.5 GiB) TX bytes:41966966280 (39.0 GiB)

eth1:1 Link encap:Ethernet HWaddr 00:15:5D:75:0B:16
inet addr:169.254.31.89 Bcast:169.254.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

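Conclusion: whether the node is stopped manually or the virtual machine is shut down directly, the cluster service is unaffected. Failover of the single node's VIP still has a problem: a configuration needs to be found that brings the VIP back to ONLINE after it has moved. The hardware power-loss test will be done once the software tests pass.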
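When the monitoring log grows long, the resources stuck in INTERMEDIATE can be picked out automatically. The sketch below is a minimal filter over crsctl stat res -t output, assuming the layout shown above (a line with the resource name, followed by its status line); intermediate_resources is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# intermediate_resources
# Read "crsctl stat res -t" output on stdin and print each status line
# that reports INTERMEDIATE, prefixed with the resource it belongs to.
intermediate_resources() {
    awk '
        /^ora\./ { name = $1; next }      # remember the current resource
        name != "" && /INTERMEDIATE/ {
            $1 = $1                        # collapse runs of blanks
            print name ": " $0
        }
    '
}
```

Usage would look like: crsctl stat res -t | intermediate_resources, or the same filter applied to the saved mon_crs_log.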

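Analysis combined with the operating system log

The experiments above show that the database service continues when a single node is stopped, so RAC failover itself works. The next step is to bring in /var/log/messages from the Linux server.

Network interface restart

The operating system log shows that the removal of network interfaces is strongly correlated with the listener stopping.

Operating system log: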

Jun 28 07:55:47 test-rac1 kernel: lo: Disabled Privacy Extensions
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #2 eth1, fe80::215:5dff:fe75:b16#123, interface stats: received=0, sent=0, dropped=0, active_time=415431 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #4 eth0, fe80::215:5dff:fe75:b15#123, interface stats: received=0, sent=0, dropped=0, active_time=415431 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #6 eth0, 192.19.88.70#123, interface stats: received=0, sent=0, dropped=0, active_time=415431 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #7 eth1, 192.168.2.5#123, interface stats: received=461, sent=461, dropped=0, active_time=415431 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #8 eth1:1, 169.254.31.89#123, interface stats: received=0, sent=0, dropped=0, active_time=415389 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #9 eth0:1, 192.19.88.82#123, interface stats: received=0, sent=0, dropped=0, active_time=415350 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #11 eth0:3, 192.19.88.76#123, interface stats: received=0, sent=0, dropped=0, active_time=415108 secs
Jun 28 07:55:51 test-rac1 ntpd[2054]: Listening on interface #13 eth0, fe80::215:5dff:fe75:b15#123 Enabled

Listener log:

listener_20160628.log:28-JUN-2016 07:55:47 * service_died * LsnrAgt * 12537

Operating system log:

Jun 28 16:24:28 test-rac1 kernel: lo: Disabled Privacy Extensions
Jun 28 16:24:33 test-rac1 ntpd[2054]: Deleting interface #133 eth1, 192.168.2.5#123, interface stats: received=0, sent=8, dropped=0, active_time=7583 secs
Jun 28 16:24:33 test-rac1 ntpd[2054]: Deleting interface #135 eth0:1, 192.19.88.82#123, interface stats: received=0, sent=0, dropped=0, active_time=7281 secs
Jun 28 16:24:33 test-rac1 ntpd[2054]: Deleting interface #136 eth0:2, 192.19.88.76#123, interface stats: received=0, sent=0, dropped=0, active_time=7227 secs
Jun 28 16:24:33 test-rac1 ntpd[2054]: Deleting interface #137 eth0:3, 192.19.88.83#123, interface stats: received=0, sent=0, dropped=0, active_time=7178 secs
Jun 28 16:24:37 test-rac1 ntpd[2054]: Listening on interface #138 eth1, 192.168.2.5#123 Enabled

Listener log:

listener_20160628.log:28-JUN-2016 16:24:42 * service_died * LsnrAgt * 12537
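To check this correlation across the whole day rather than two hand-picked moments, the ntpd deletions can be listed compactly and compared with the service_died times from the listener log. The sketch below assumes the syslog line format shown above; ntpd_deletions is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# ntpd_deletions FILE
# Print one line per interface that ntpd dropped:
#   MONTH DAY TIME INTERFACE ADDRESS
# Assumes lines of the form
#   "... ntpd[PID]: Deleting interface #N IFACE, ADDR#PORT, interface stats: ..."
ntpd_deletions() {
    awk '/ntpd\[[0-9]+\]: Deleting interface/ {
        ifc = $9;   sub(/,$/, "", ifc)          # strip trailing comma
        addr = $10; sub(/#[0-9]+,$/, "", addr)  # strip "#port,"
        print $1, $2, $3, ifc, addr
    }' "$1"
}
```

Usage would look like: ntpd_deletions /var/log/messages, with the output read side by side with grep service_died listener_20160628.log.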

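NTP service

According to the logs above, the ntpd process deleted the network interfaces. The preliminary analysis is as follows. The database servers use the virtual machine 192.168.2.4 as their NTP server, and that virtual machine runs on physical host 2. Whenever physical host 2 tried to start, ntpd tried to connect to the NTP server; because the host actually failed to boot, ntpd restarted all network interfaces in an attempt to reconnect. The same problem is described on serverfault and on the Red Hat website.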

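Solution

Because there is no standalone NTP server, the CTSS service that ships with CRS can be used to synchronize the cluster clocks.

For the detailed steps, see blog posts 1 and 2.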
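As a rough outline of what those steps usually involve: to my understanding, CTSS stays in observer mode as long as it detects an NTP configuration on the node, so the usual approach is to stop ntpd, disable it at boot, and move /etc/ntp.conf out of the way on every node. The sketch below only prints the commands (a dry run) rather than executing them. ctss_switch_plan is a hypothetical helper, the service and chkconfig commands assume a SysV-init style Linux, and the steps should be checked against the Oracle installation guide for the release in use before anything is run as root.

```shell
#!/usr/bin/env bash
# ctss_switch_plan [NTP_CONF]
# Print (do not run) the commands that would hand clock sync to CTSS.
# NTP_CONF defaults to /etc/ntp.conf; the commands are assumptions
# for a SysV-init system and must be reviewed before use.
ctss_switch_plan() {
    local conf="${1:-/etc/ntp.conf}"
    if [ -e "$conf" ]; then
        echo "service ntpd stop"
        echo "chkconfig ntpd off"
        echo "mv $conf $conf.org"
    else
        echo "# $conf not found, no NTP configuration to move"
    fi
    # crsctl check ctss reports whether CTSS is in observer or active mode
    echo "crsctl check ctss"
}
```

Usage would look like: ctss_switch_plan on each node, reviewing the printed commands before running them by hand.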

Disabled Privacy Extensions

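This Linux kernel message is related to IPv6. Because none of the Hyper-V virtual NICs have IPv6 enabled, IPv6 can be switched off in the Linux configuration as well, to avoid anomalies.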
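One common way to do this is through the kernel's disable_ipv6 sysctl settings. The sketch below only prints the two settings so they can be reviewed first; ipv6_disable_conf is a hypothetical helper, and whether disabling IPv6 is safe for every service on these hosts is an assumption that should be verified.

```shell
#!/usr/bin/env bash
# ipv6_disable_conf
# Print the sysctl settings that disable IPv6 on all current interfaces
# and on interfaces created later.
ipv6_disable_conf() {
    echo "net.ipv6.conf.all.disable_ipv6 = 1"
    echo "net.ipv6.conf.default.disable_ipv6 = 1"
}
```

Usage would look like: as root, ipv6_disable_conf >> /etc/sysctl.conf and then sysctl -p to load the settings.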
