Oracle_12c_RAC_service_died问题分析处理
接上篇,通过分析listener日志发现rac1数据库无法连接时出现了
listener_20160628.log:28-JUN-2016 07:55:47 * service_died * LsnrAgt * 12537日志。原因是rac2在此前突发断电28-JUN-2016 07:55:30 * service_updateTue Jun 28 16:47:37 2016。同时,问题是在节点2断开的情况下节点1应该自动接管监听,对于前端应用是无感知的。
另外,分析日志还发现在运行的半年过程中还有几次serice_died错误。本文分析记录是什么造成了这个service_died错误。
查看CRS日志
直接通过find /u1 -name alert获得CRS日志的位置'/u1/app/grid/diag/crs/test-rac1/crs/alert'。
使用以下方式获得问题当天的日志。
[root@test-rac1 alert]# grep -n '2016-06-28' log.xml | head -1
121600:<msg time='2016-06-28T07:55:46.994+08:00' org_id='oracle' comp_id='crs'
[root@test-rac1 alert]# grep -n '2016-06-27' log.xml | head -1
[root@test-rac1 alert]# grep -n '2016-06-29' log.xml | head -1
123264:<msg time='2016-06-29T04:33:38.602+08:00' org_id='oracle' comp_id='crs'
[root@test-rac1 alert]# sed -n '121600,123264p' log.xml > log_20160628.xml
得到错误信息
<msg time='2016-06-28T07:55:46.994+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:46.991 [CRSD(2962)]CRS-2771: Maximum restart attempts reached for resource 'ora.scan1.vip'; will not restart.
</txt>
</msg>
<msg time='2016-06-28T07:55:47.450+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:47.449 [CRSD(2962)]CRS-2771: Maximum restart attempts reached for resource 'ora.test-rac1.vip'; will not restart.
</txt>
</msg>
<msg time='2016-06-28T07:55:47.974+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:47.974 [ORAROOTAGENT(3053)]CRS-5017: The resource action "ora.net1.network start" encountered the following error:
2016-06-28 07:55:47.974+CRS-5008: Invalid attribute value: eth0 for the network interface
. For details refer to "(:CLSN00107:)" in "/u1/app/grid/diag/crs/test-rac1/crs/trace/crsd_orarootagent_root.trc".
</txt>
</msg>
<msg time='2016-06-28T07:55:47.992+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:47.990 [CRSD(2962)]CRS-2878: Failed to restart resource 'ora.net1.network'
</txt>
</msg>
<msg time='2016-06-28T07:55:48.005+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:48.005 [CRSD(2962)]CRS-2769: Unable to failover resource 'ora.net1.network'.
</txt>
</msg>
<msg time='2016-06-28T07:55:49.119+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:49.115 [ORAAGENT(3044)]CRS-5016: Process "/u1/app/12.1.0/grid/bin/lsnrctl" spawned by agent "ORAAGENT" for action "check" failed: details at "(:CLSN00010:)" in "/u1/app/grid/diag/crs/test-rac1/crs/trace/crsd_oraagent_grid.trc"
</txt>
</msg>
<msg time='2016-06-28T07:55:49.132+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:49.131 [ORAROOTAGENT(3053)]CRS-5017: The resource action "ora.net1.network start" encountered the following error:
2016-06-28 07:55:49.131+CRS-5008: Invalid attribute value: eth0 for the network interface
. For details refer to "(:CLSN00107:)" in "/u1/app/grid/diag/crs/test-rac1/crs/trace/crsd_orarootagent_root.trc".
</txt>
</msg>
<msg time='2016-06-28T07:55:49.144+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:49.144 [CRSD(2962)]CRS-2878: Failed to restart resource 'ora.LISTENER_SCAN1.lsnr'
</txt>
</msg>
查看orarootagent的trace日志
位置:/u1/app/grid/diag/crs/test-rac1/crs/trace
按照监听的错误时间,以这个时间查看trace中的错误信息。
listener_20160628.log:28-JUN-2016 07:55:47 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 07:56:56 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 07:57:09 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:13:54 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:15:05 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:28:04 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:36:42 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:41:56 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:42:08 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:56:58 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:57:10 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 11:02:46 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 11:04:53 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 11:05:06 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 14:18:03 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 16:24:42 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 16:26:56 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 16:30:21 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 16:30:26 * service_died * LsnrAgt * 12537
2016-06-28 07:55:46.806663 :CLSDYNAM:450848512: [ora.scan1.vip]{2:11414:10026} [check] Failed to check 192.19.88.76 on eth0
2016-06-28 07:55:46.806700 :CLSDYNAM:450848512: [ora.scan1.vip]{2:11414:10026} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-06-28 07:55:46.806741 :CLSDYNAM:450848512: [ora.scan1.vip]{2:11414:10026} [check] VipAgent::checkIp returned false
2016-06-28 07:55:46.811719 : AGFW:463456000: {2:11414:10026} ora.scan1.vip 1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:55:46.823646 : AGFW:463456000: {0:5:245} Generating new Tint for unplanned state change. Original Tint: {2:11414:10026}
2016-06-28 07:55:46.823776 : AGFW:463456000: {0:5:245} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3749652
2016-06-28 07:55:46.933964 :CLSDYNAM:675165952: [ora.net1.network]{1:3322:2} [check] Network Res Check Action returned 1 return ONLINE
2016-06-28 07:55:47.182374 : AGFW:463456000: {1:3322:2} Agent received the message: AGENT_HB[Engine] ID 12293:223268
2016-06-28 07:55:47.436435 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:2} [check] Failed to check 192.19.88.82 on eth0
2016-06-28 07:55:47.436484 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:2} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-06-28 07:55:47.436518 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:2} [check] VipAgent::checkIp returned false
2016-06-28 07:55:47.440127 : USRTHRD:450848512: {1:3322:2} Thread:[SendFail2SrvThread:] start { acquire thndMX:f8023680
2016-06-28 07:55:47.440156 : USRTHRD:450848512: {1:3322:2} Thread:[SendFail2SrvThread:] start pThnd:f8000a30
2016-06-28 07:55:47.440246 : USRTHRD:450848512: {1:3322:2} Thread:[SendFail2SrvThread:] start 2 release thndMX:f8023680 }
2016-06-28 07:55:47.440817 : AGFW:463456000: {1:3322:2} ora.test-rac1.vip 1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:55:47.440980 : AGFW:463456000: {0:5:246} Generating new Tint for unplanned state change. Original Tint: {1:3322:2}
2016-06-28 07:55:47.441095 : AGFW:463456000: {0:5:246} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3749662
2016-06-28 07:55:47.445142 : USRTHRD:465557248: {1:3322:2} VipAgent::sendFail2Srv {
2016-06-28 07:55:47.461505 : USRTHRD:465557248: {1:3322:2} VipAgent::sendFail2Srv }
2016-06-28 07:55:47.461662 : USRTHRD:465557248: {1:3322:2} Thread:[SendFail2SrvThread:] isRunning is reset to false here
2016-06-28 07:55:47.941795 :CLSDYNAM:450848512: [ora.net1.network]{1:3322:2} [check] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:55:47.943998 : AGFW:463456000: {1:3322:2} ora.net1.network test-rac1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:55:47.944024 : AGFW:463456000: {1:3322:2} Switching online monitor to offline one
2016-06-28 07:55:47.944108 : AGFW:463456000: {1:3322:2} Starting offline monitor
2016-06-28 07:55:47.944197 : AGFW:463456000: {1:3322:2} Started implicit monitor for [ora.net1.network test-rac1 1] interval=60000 delay=60000
2016-06-28 07:55:47.944248 : AGFW:463456000: {0:5:247} Generating new Tint for unplanned state change. Original Tint: {1:3322:2}
2016-06-28 07:55:47.944357 : AGFW:463456000: {0:5:247} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3749668
2016-06-28 07:55:47.955510 : AGFW:463456000: {0:5:247} Agent received the message: RESOURCE_START[ora.net1.network test-rac1 1] ID 4098:223304
2016-06-28 07:55:47.955543 : AGFW:463456000: {0:5:247} Preparing START command for: ora.net1.network test-rac1 1
2016-06-28 07:55:47.955561 : AGFW:463456000: {0:5:247} ora.net1.network test-rac1 1 state changed from: OFFLINE to: STARTING
2016-06-28 07:55:47.957213 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] (:CLSN00107:) clsn_agent::start {
2016-06-28 07:55:47.958107 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] NetworkAgent::init enter {
2016-06-28 07:55:47.958500 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] VendorType=0
2016-06-28 07:55:47.958556 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] Checking if eth0 Interface is fine
2016-06-28 07:55:47.958730 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:55:47.973654 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] Agent::commonStart Exception UserErrorException
2016-06-28 07:55:47.973987 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] clsnUtils::error Exception type=2 string=
CRS-5017: The resource action "ora.net1.network start" encountered the following error:
CRS-5008: Invalid attribute value: eth0 for the network interface
2016-06-28 07:55:47.974495 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] (:CLSN00107:) clsn_agent::start }
2016-06-28 07:55:47.974525 : AGFW:450848512: {0:5:247} Command: start for resource: ora.net1.network test-rac1 1 completed with status: FAIL
2016-06-28 07:55:47.974822 : AGFW:463456000: {0:5:247} Agent sending reply for: RESOURCE_START[ora.net1.network test-rac1 1] ID 4098:223304
2016-06-28 07:55:47.975059 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] NetworkAgent::init enter {
2016-06-28 07:55:47.975252 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] VendorType=0
2016-06-28 07:55:47.975296 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] Checking if eth0 Interface is fine
2016-06-28 07:55:47.975524 : AGFW:463456000: {0:5:247} Agent sending reply for: RESOURCE_START[ora.net1.network test-rac1 1] ID 4098:223304
2016-06-28 07:55:47.975551 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:55:47.976004 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] exception in init
2016-06-28 07:55:47.976497 : AGFW:463456000: {0:5:247} ora.net1.network test-rac1 1 state changed from: STARTING to: OFFLINE
2016-06-28 07:55:47.976523 : AGFW:463456000: {0:5:247} Switching online monitor to offline one
2016-06-28 07:55:47.976587 : AGFW:463456000: {0:5:247} Starting offline monitor
2016-06-28 07:55:47.976700 : AGFW:463456000: {0:5:247} Started implicit monitor for [ora.net1.network test-rac1 1] interval=60000 delay=60000
2016-06-28 07:55:47.976840 : AGFW:463456000: {0:5:247} Agent sending last reply for: RESOURCE_START[ora.net1.network test-rac1 1] ID 4098:223304
2016-06-28 07:55:47.982037 : AGFW:463456000: {0:5:247} Agent received the message: RESOURCE_CLEAN[ora.net1.network test-rac1 1] ID 4100:223311
2016-06-28 07:55:47.982071 : AGFW:463456000: {0:5:247} Preparing CLEAN command for: ora.net1.network test-rac1 1
2016-06-28 07:55:47.982089 : AGFW:463456000: {0:5:247} ora.net1.network test-rac1 1 state changed from: OFFLINE to: CLEANING
2016-06-28 07:55:47.982930 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [clean] (:CLSN00106:) clsn_agent::clean {
2016-06-28 07:55:47.982997 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [clean] clean {
2016-06-28 07:55:47.983018 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [clean] clean }
2016-06-28 07:55:47.983060 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [clean] (:CLSN00106:) clsn_agent::clean }
2016-06-28 07:55:47.983079 : AGFW:459253504: {0:5:247} Command: clean for resource: ora.net1.network test-rac1 1 completed with status: SUCCESS
2016-06-28 07:55:47.983607 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] NetworkAgent::init enter {
2016-06-28 07:55:47.983716 : AGFW:463456000: {0:5:247} Agent sending reply for: RESOURCE_CLEAN[ora.net1.network test-rac1 1] ID 4100:223311
2016-06-28 07:55:47.983802 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] VendorType=0
2016-06-28 07:55:47.983844 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] Checking if eth0 Interface is fine
2016-06-28 07:55:47.984009 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:55:47.984601 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] exception in init
2016-06-28 07:55:47.985001 : AGFW:463456000: {0:5:247} ora.net1.network test-rac1 1 state changed from: CLEANING to: OFFLINE
2016-06-28 07:55:47.985026 : AGFW:463456000: {0:5:247} Switching online monitor to offline one
2016-06-28 07:55:47.985086 : AGFW:463456000: {0:5:247} Starting offline monitor
2016-06-28 07:55:47.985143 : AGFW:463456000: {0:5:247} Started implicit monitor for [ora.net1.network test-rac1 1] interval=60000 delay=60000
2016-06-28 07:55:47.985691 : AGFW:463456000: {0:5:247} Agent sending last reply for: RESOURCE_CLEAN[ora.net1.network test-rac1 1] ID 4100:223311
2016-06-28 07:56:55.862677 :CLSDYNAM:675165952: [ora.test-rac2.vip]{1:3322:17228} [check] Failed to check 192.19.88.83 on eth0
2016-06-28 07:56:55.862713 :CLSDYNAM:675165952: [ora.test-rac2.vip]{1:3322:17228} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-06-28 07:56:55.862775 :CLSDYNAM:675165952: [ora.test-rac2.vip]{1:3322:17228} [check] VipAgent::checkIp returned false
2016-06-28 07:56:55.863613 : AGFW:463456000: {1:3322:17228} ora.test-rac2.vip 1 1 state changed from: PARTIAL to: OFFLINE
2016-06-28 07:56:55.863701 : AGFW:463456000: {0:5:251} Generating new Tint for unplanned state change. Original Tint: {1:3322:17228}
2016-06-28 07:56:55.863813 : AGFW:463456000: {0:5:251} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3750208
2016-06-28 07:56:56.196759 :CLSDYNAM:459253504: [ora.net1.network]{1:3322:17228} [check] Network Res Check Action returned 1 return ONLINE
2016-06-28 07:56:56.363286 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:17228} [check] Failed to check 192.19.88.82 on eth0
2016-06-28 07:56:56.363324 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:17228} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-06-28 07:56:56.363408 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:17228} [check] VipAgent::checkIp returned false
2016-06-28 07:56:56.363737 : USRTHRD:450848512: {1:3322:17228} Thread:[SendFail2SrvThread:] start { acquire thndMX:f8034b50
2016-06-28 07:56:56.363760 : USRTHRD:450848512: {1:3322:17228} Thread:[SendFail2SrvThread:] start pThnd:f80235e0
2016-06-28 07:56:56.363855 : USRTHRD:450848512: {1:3322:17228} Thread:[SendFail2SrvThread:] start 2 release thndMX:f8034b50 }
2016-06-28 07:56:56.364491 : AGFW:463456000: {1:3322:17228} ora.test-rac1.vip 1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:56:56.364598 : AGFW:463456000: {0:5:252} Generating new Tint for unplanned state change. Original Tint: {1:3322:17228}
2016-06-28 07:56:56.364658 : USRTHRD:452949760: {1:3322:17228} VipAgent::sendFail2Srv {
2016-06-28 07:56:56.364681 : AGFW:463456000: {0:5:252} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3750217
2016-06-28 07:56:56.387222 : USRTHRD:452949760: {1:3322:17228} VipAgent::sendFail2Srv }
2016-06-28 07:56:56.387282 : USRTHRD:452949760: {1:3322:17228} Thread:[SendFail2SrvThread:] isRunning is reset to false here
2016-06-28 07:56:56.863825 :CLSDYNAM:675165952: [ora.scan1.vip]{1:3322:17228} [check] Failed to check 192.19.88.76 on eth0
2016-06-28 07:56:56.863856 :CLSDYNAM:675165952: [ora.scan1.vip]{1:3322:17228} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-06-28 07:56:56.863883 :CLSDYNAM:675165952: [ora.scan1.vip]{1:3322:17228} [check] VipAgent::checkIp returned false
2016-06-28 07:56:56.864616 : AGFW:463456000: {1:3322:17228} ora.scan1.vip 1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:56:56.864711 : AGFW:463456000: {0:5:253} Generating new Tint for unplanned state change. Original Tint: {1:3322:17228}
2016-06-28 07:56:56.864780 : AGFW:463456000: {0:5:253} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3750220
2016-06-28 07:56:57.197099 :CLSDYNAM:675165952: [ora.net1.network]{1:3322:17228} [check] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:56:57.197197 :CLSDYNAM:675165952: [ora.net1.network]{1:3322:17228} [check] ifName = eth1:1
2016-06-28 07:56:57.197608 : AGFW:463456000: {1:3322:17228} ora.net1.network test-rac1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:56:57.197639 : AGFW:463456000: {1:3322:17228} Switching online monitor to offline one
2016-06-28 07:56:57.197714 : AGFW:463456000: {1:3322:17228} Starting offline monitor
2016-06-28 07:56:57.197804 : AGFW:463456000: {1:3322:17228} Started implicit monitor for [ora.net1.network test-rac1 1] interval=60000 delay=60000
2016-06-28 07:56:57.197841 : AGFW:463456000: {0:5:254} Generating new Tint for unplanned state change. Original Tint: {1:3322:17228}
2016-06-28 07:56:57.197897 : AGFW:463456000: {0:5:254} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3750226
2016-06-28 07:56:57.203615 : AGFW:463456000: {0:5:254} Agent received the message: RESOURCE_START[ora.net1.network test-rac1 1] ID 4098:223726
2016-06-28 07:56:57.203639 : AGFW:463456000: {0:5:254} Preparing START command for: ora.net1.network test-rac1 1
2016-06-28 07:56:57.203651 : AGFW:463456000: {0:5:254} ora.net1.network test-rac1 1 state changed from: OFFLINE to: STARTING
2016-06-28 07:56:57.204319 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] (:CLSN00107:) clsn_agent::start {
2016-06-28 07:56:57.204507 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] NetworkAgent::init enter {
2016-06-28 07:56:57.204735 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] VendorType=0
2016-06-28 07:56:57.204795 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] Checking if eth0 Interface is fine
2016-06-28 07:56:57.205009 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:56:57.205101 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] ifName = eth1:1
2016-06-28 07:56:57.205528 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] Agent::commonStart Exception UserErrorException
2016-06-28 07:56:57.205805 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] clsnUtils::error Exception type=2 string=
CRS-5017: The resource action "ora.net1.network start" encountered the following error:
CRS-5008: Invalid attribute value: eth0 for the network interface
. For details refer to "(:CLSN00107:)" in "/u1/app/grid/diag/crs/test-rac1/crs/trace/crsd_orarootagent_root.trc".
验证测试
手工转移
看手工是否可以转移VIP
- 在节点1启动监控脚本
cat mon_crs_status.sh
while [[ 1 ]]
do
date
ifconfig -a
echo ''
ping -c 2 test-rac1
ping -c 2 test-rac2
ping -c 2 test-rac1-vip
ping -c 2 test-rac2-vip
ping -c 2 test-cluster
crsctl stat res -t
sleep 5
done
- 在RAC1使用以下命令执行
nohup ./mon_crs_status.sh > mon_crs_log &
ps -ef | grep mon_crs_status.sh
- 截取网络包
tcpdump -s 0 -c 100 -w test1.cap
- 停止节点2的VIP
srvctl stop instance -d testdb -i TESTDB2
srvctl stop asm -n test-rac2 -force
srvctl stop listener -n test-rac2
srvctl stop nodeapps -n test-rac2
- 在本地测试rac2的vip是否迁移
tnsping rac_testdb2
tnsping rac_testdb
- 手动迁移VIP
crs_relocate ora.test-rac2.vip
- 重新测试rac2的vip是否迁移
tnsping rac_testdb2
tnsping rac_testdb
- 启动rac2服务
srvctl start nodeapps -n test-rac2
srvctl start asm -n test-rac2
srvctl start instance -d testdb -i TESTDB2
crsctl stat res -t
关闭RAC2虚拟机,查看VIP是否转移
- 在RAC1使用以下命令执行
nohup ./mon_crs_status.sh > mon_crs_log &
ps -ef | grep mon_crs_status.sh
- 截取网络包
tcpdump -s 0 -c 100 -w test2.cap
- 停止rac2服务器
poweroff
- 在本地测试rac2的vip是否迁移
tnsping rac_testdb2
tnsping rac_testdb
- 启动服务器,启动实例
crsctl check crs
crsctl start crs
测试结果(0706中午)
- 手工停RAC服务,集群服务仍可用。rac2-vip不能手动迁移。
[grid@test-rac2 ~]$ crs_relocate ora.test-rac2.vip
CRS-5708: Resource 'ora.test-rac2.vip' is not relocatable (current and target state not running)
CRS-0223: Resource 'ora.test-rac2.vip' has placement error.
[grid@test-rac2 ~]$ srvctl relocate vip -vip test-rac2
PRCR-1090 : Failed to relocate resource ora.test-rac2.vip. It is not running.
- rac2虚拟机关机,集群服务仍可用。则rac2-vip可以自动迁移到rac1节点,但是状态处于INTERMEDIATE状态。
[oracle@test-rac1 scripts]$ crsctl stat res -t
--------------------------------------------------------------------------------
Name Target State Server State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE test-rac1 STABLE
ora.test-rac1.vip
1 ONLINE ONLINE test-rac1 STABLE
ora.test-rac2.vip
1 ONLINE INTERMEDIATE test-rac1 FAILED OVER,STABLE
ora.scan1.vip
1 ONLINE ONLINE test-rac1 STABLE
--------------------------------------------------------------------------------
Wed Jul 6 12:50:53 CST 2016
***********************start*************************************
eth0 Link encap:Ethernet HWaddr 00:15:5D:75:0B:15
inet addr:192.19.88.70 Bcast:192.19.88.255 Mask:255.255.255.0
inet6 addr: fe80::215:5dff:fe75:b15/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:18106573 errors:0 dropped:0 overruns:0 frame:0
TX packets:16997728 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5394898582 (5.0 GiB) TX bytes:14791039101 (13.7 GiB)
eth0:1 Link encap:Ethernet HWaddr 00:15:5D:75:0B:15
inet addr:192.19.88.76 Bcast:192.19.88.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
eth0:2 Link encap:Ethernet HWaddr 00:15:5D:75:0B:15
inet addr:192.19.88.83 Bcast:192.19.88.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
eth0:3 Link encap:Ethernet HWaddr 00:15:5D:75:0B:15
inet addr:192.19.88.82 Bcast:192.19.88.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
eth1 Link encap:Ethernet HWaddr 00:15:5D:75:0B:16
inet addr:192.168.2.5 Bcast:192.168.2.255 Mask:255.255.255.0
inet6 addr: fe80::215:5dff:fe75:b16/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:109282486 errors:0 dropped:0 overruns:0 frame:0
TX packets:68084084 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:82205348347 (76.5 GiB) TX bytes:41966966280 (39.0 GiB)
eth1:1 Link encap:Ethernet HWaddr 00:15:5D:75:0B:16
inet addr:169.254.31.89 Bcast:169.254.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
结论:无论手动关节点还是直接关闭虚拟机,集群服务不受影响。单节点转移有问题,需要查找配置能够使vip漂移后使其状态修复为Online。等软件测试通过后再做硬件掉电测试。
结合操作系统日志进行分析
既然从上面的实验可以看出停单个节点情况下数据库服务是持续的,RAC切换功能可用。进一步结合Linux服务器的/var/log/message进行分析。
网卡重启
从操作系统日志看到网卡删除与数据库监听停止强相关。
操作系统日志:
Jun 28 07:55:47 test-rac1 kernel: lo: Disabled Privacy Extensions
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #2 eth1, fe80::215:5dff:fe75:b16#123, interface stats: received=0, sent=0, dropped=0, active_time=415431 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #4 eth0, fe80::215:5dff:fe75:b15#123, interface stats: received=0, sent=0, dropped=0, active_time=415431 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #6 eth0, 192.19.88.70#123, interface stats: received=0, sent=0, dropped=0, active_time=415431 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #7 eth1, 192.168.2.5#123, interface stats: received=461, sent=461, dropped=0, active_time=415431 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #8 eth1:1, 169.254.31.89#123, interface stats: received=0, sent=0, dropped=0, active_time=415389 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #9 eth0:1, 192.19.88.82#123, interface stats: received=0, sent=0, dropped=0, active_time=415350 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #11 eth0:3, 192.19.88.76#123, interface stats: received=0, sent=0, dropped=0, active_time=415108 secs
Jun 28 07:55:51 test-rac1 ntpd[2054]: Listening on interface #13 eth0, fe80::215:5dff:fe75:b15#123 Enabled
监听日志:
listener_20160628.log:28-JUN-2016 07:55:47 * service_died * LsnrAgt * 12537
操作系统日志:
Jun 28 16:24:28 test-rac1 kernel: lo: Disabled Privacy Extensions
Jun 28 16:24:33 test-rac1 ntpd[2054]: Deleting interface #133 eth1, 192.168.2.5#123, interface stats: received=0, sent=8, dropped=0, active_time=7583 secs
Jun 28 16:24:33 test-rac1 ntpd[2054]: Deleting interface #135 eth0:1, 192.19.88.82#123, interface stats: received=0, sent=0, dropped=0, active_time=7281 secs
Jun 28 16:24:33 test-rac1 ntpd[2054]: Deleting interface #136 eth0:2, 192.19.88.76#123, interface stats: received=0, sent=0, dropped=0, active_time=7227 secs
Jun 28 16:24:33 test-rac1 ntpd[2054]: Deleting interface #137 eth0:3, 192.19.88.83#123, interface stats: received=0, sent=0, dropped=0, active_time=7178 secs
Jun 28 16:24:37 test-rac1 ntpd[2054]: Listening on interface #138 eth1, 192.168.2.5#123 Enabled
监听日志:
listener_20160628.log:28-JUN-2016 16:24:42 * service_died * LsnrAgt * 12537
ntp服务
从上面的日志看,ntpd进程删除了网络接口。初步分析是由于数据库的NTP服务器设置为192.168.2.4虚拟机,而此虚拟机在物理机2上,所以物理机2一旦试图启动,ntpd进程试图连接NTP服务器,但是实际物理机启动失败,ntpd进程会重启所有网卡试图重连。此问题可见serverfault,redhat官网。
解决方案
因为没有独立的NTP服务器,可以使用CRS自带的ctss服务进行集群的时钟同步。
具体操作可以参考博客1,2
Disabled Privacy Extensions
这个Linux内核操作与IPV6相关,因为整个HyperV的虚拟网卡都没有启用ipv6,可以在Linux操作系统上关闭ipv6配置,以免引起异常。
Oracle_12c_RAC_service_died问题分析处理的更多相关文章
- alias导致virtualenv异常的分析和解法
title: alias导致virtualenv异常的分析和解法 toc: true comments: true date: 2016-06-27 23:40:56 tags: [OS X, ZSH ...
- 火焰图分析openresty性能瓶颈
注:本文操作基于CentOS 系统 准备工作 用wget从https://sourceware.org/systemtap/ftp/releases/下载最新版的systemtap.tar.gz压缩包 ...
- 一起来玩echarts系列(一)------箱线图的分析与绘制
一.箱线图 Box-plot 箱线图一般被用作显示数据分散情况.具体是计算一组数据的中位数.25%分位数.75%分位数.上边界.下边界,来将数据从大到小排列,直观展示数据整体的分布情况. 大部分正常数 ...
- 应用工具 .NET Portability Analyzer 分析迁移dotnet core
大多数开发人员更喜欢一次性编写好业务逻辑代码,以后再重用这些代码.与构建不同的应用以面向多个平台相比,这种方法更加容易.如果您创建与 .NET Core 兼容的.NET 标准库,那么现在比以往任何时候 ...
- UWP中新加的数据绑定方式x:Bind分析总结
UWP中新加的数据绑定方式x:Bind分析总结 0x00 UWP中的x:Bind 由之前有过WPF开发经验,所以在学习UWP的时候直接省略了XAML.数据绑定等几个看着十分眼熟的主题.学习过程中倒是也 ...
- 查看w3wp进程占用的内存及.NET内存泄露,死锁分析
一 基础知识 在分析之前,先上一张图: 从上面可以看到,这个w3wp进程占用了376M内存,启动了54个线程. 在使用windbg查看之前,看到的进程含有 *32 字样,意思是在64位机器上已32位方 ...
- ZIP压缩算法详细分析及解压实例解释
最近自己实现了一个ZIP压缩数据的解压程序,觉得有必要把ZIP压缩格式进行一下详细总结,数据压缩是一门通信原理和计算机科学都会涉及到的学科,在通信原理中,一般称为信源编码,在计算机科学里,一般称为数据 ...
- ABP源码分析一:整体项目结构及目录
ABP是一套非常优秀的web应用程序架构,适合用来搭建集中式架构的web应用程序. 整个Abp的Infrastructure是以Abp这个package为核心模块(core)+15个模块(module ...
- HashMap与TreeMap源码分析
1. 引言 在红黑树--算法导论(15)中学习了红黑树的原理.本来打算自己来试着实现一下,然而在看了JDK(1.8.0)TreeMap的源码后恍然发现原来它就是利用红黑树实现的(很惭愧学了Ja ...
随机推荐
- 【资讯】天啦鲁,这十余款创客设计居然由FPGA搞定 [转]
按理说‘高大上’的FPGA,多出现在航天航空(如火星探测器).通信(如基站.数据中心).测试测量等高端应用场景.但麦迪却也发现,近期,在很多创客的作品内部都有FPGA的影子.这或许也从侧面看出,打从总 ...
- HDU 3033 分组背包变形(每种至少一个)
I love sneakers! Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others) ...
- reds Virtual Memory
Virtual Memory technical specification This document details the internals of the Redis Virtual Memo ...
- vmware虚拟机网络自动断开的问题
最近搭建一个集群环境,因此用vmware安装了几台虚拟机,系统是centos7.2. 但是发现网络总是不经意间自动断开,重启网络(service network restart)恢复. 虚拟机网络类型 ...
- Sort Characters By Frequency
Given a string, sort it in decreasing order based on the frequency of characters. Example 1: Input: ...
- 使用Windows Form 制作一个简易资源管理器
自制一个简易资源管理器----TreeView控件 第一步.新建project,进行基本设置:(Set as StartUp Project:View/Toolbox/TreeView) 第二步.开始 ...
- HTML编辑器
终于有时间静下来总结一下最近的工作. 第一个就是html编辑器: 首先是编辑器的选择,之前用的是ewebeditor,功能很强大,出于粘贴word内容得安装插件的原因,暂时放弃. ewebeditor ...
- 一步一步搭框架(asp.netmvc+easyui+sqlserver)-02
一步一步搭框架(asp.netmvc+easyui+sqlserver)-02 我们期望简洁带前台代码,如下: <table id="dataGrid" class=&quo ...
- HTML 行内元素和块级元素的理解及其相互转换
块级元素:div, p(段落), form(表单), ul(无序列表), li(列表项), ol(有序列表), dl(定义列表), hr(水平分割线), menu(菜单列表), table(表格).. ...
- js相关参考资料
[图片等比例适配:]http://www.cnblogs.com/zengxiangzhan/archive/2009/09/12/1565323.html