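Over the past two weeks, two customers' Oracle Exadata machines suffered physical disk failures: one on a hard disk, the other on a flash disk. Coincidentally, in day-to-day operations both customers only looked at the fault indicator lights on the physical servers. In an Exadata environment that is far from sufficient. In the case shared here, a storage node's fault light never came on, yet one of its disks had already developed a large number of bad blocks, which ultimately caused the rebalance to fail.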

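A hardware inspection of an Oracle Exadata environment must cover several dimensions:

1. The messages log on every node
2. The alert log of each storage node (cell)
3. The alerthistory in cellcli
4. The ASM alert log
5. The ILOM log
6. Only last, the server indicator lights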
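As a rough illustration of checking logs rather than lights, the sketch below scans lines of log text for the error signatures that appear in the log excerpts quoted later in this write-up. The category names and the `scan_lines` helper are hypothetical, and the pattern list is only a starting point, not a complete catalogue of Exadata error messages.

```python
import re

# Signature strings taken from the logs quoted in this write-up.
# The category names on the left are illustrative labels, not Oracle terms.
SIGNATURES = {
    "asm_read_failed": re.compile(r"WARNING: Read Failed\."),
    "cell_read_error": re.compile(r"Read Error on (Cell|Grid) Disk"),
    "kernel_medium_error": re.compile(r"Sense Key : Medium Error"),
    "proactive_failure": re.compile(r"proactive failure"),
    "poor_health": re.compile(r"poor health"),
}


def scan_lines(lines):
    """Return {category: hit_count} for every signature that matched at least once."""
    hits = {}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name] = hits.get(name, 0) + 1
    return hits


sample = [
    "WARNING: Read Failed. group:3 disk:80 AU:38398 offset:0 size:1048576",
    "Read Error on Cell Disk CD_07_sjxzceladm01 (/dev/sdh) at device offset 8078426636288 bytes",
    "sd 0:2:7:0: [sdh] tag#7 Sense Key : Medium Error [current]",
    "NOTE: Refresh completed on diskgroup RECOC1. No voting file found.",
]
print(scan_lines(sample))
```

The same function can be fed the lines of any of the logs in the checklist; which files to read, and where they live on a given system, is left to the operator.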

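Fault Overview

On June 19, 2025, on-duty staff inspecting the machine room found an amber light on an Exadata storage node and notified the database engineers. Investigation confirmed a fault on storage node 6 (CD_05_HTZHTZHADM06), which affected normal use of some of the disks serving business workloads.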

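Fault Symptoms

  • A physical disk on storage node HTZHTZHADM06 was abnormal and its indicator light was in an alarm state.
  • Queries with the CellCLI tool showed abnormal status on the related physical disk, celldisk, and griddisks, with some disks in proactive failure status.
  • The ASM log showed business I/O being diverted to other online partner disks. During the rebalance, bad-block counts also rose on disks in other nodes, and the rebalance could not complete for a long time.
  • At the database level, some ASM disks were not dropped automatically, and the rebalance stayed in WAIT or RUN state for a long time without finishing.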

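Root Cause Analysis

  1. Physical disk failure

    A physical disk on storage node 6 (/dev/sdf, 252:5) suffered a hardware fault. CellCLI reported its status as proactive failure, with errorCount accumulated to 46.
  2. Disk group redundancy compromised

    Dropping the failed disk triggered a disk group rebalance. The rebalance then hit bad blocks on a disk in another node and failed.
  3. Automatic repair blocked

    Because some disks were in poor health, the asmdeactivationoutcome of the related griddisks read "Cannot deactivate because partner disk ... has poor health", which blocked automatic repair and redundancy restoration.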

Troubleshooting Process

Identify the failed disk

1. Use list physicaldisk to check for any disk whose status contains failure

CellCLI> list physicaldisk attributes all
252:0 22 /dev/sda HardDisk 252 0 0_0 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R6WSUN 8.91015625T 0 normal
252:1 23 /dev/sdb HardDisk 252 0 0_1 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R74S6N 8.91015625T 1 normal
252:2 28 /dev/sdc HardDisk 252 0 0_2 "HGST H1231A823SUN010T" A680 2024-09-12T10:24:58+08:00 sas R52JHK 8.91015625T 2 normal
252:3 20 /dev/sdd HardDisk 252 0 0_3 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R75DLN 8.91015625T 3 normal
252:4 26 /dev/sde HardDisk 252 0 0_4 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R63KJN 8.91015625T 4 normal
252:5 27 /dev/sdf HardDisk 252 0 0_5 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R714UN 8.91015625T 5 normal
252:6 25 /dev/sdg HardDisk 252 0 0_6 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R6SLBN 8.91015625T 6 normal
252:7 24 /dev/sdh HardDisk 252 1106 0_7 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R7D8GN 8.91015625T 7 normal
252:8 19 /dev/sdi HardDisk 252 0 0_8 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R6W8AN 8.91015625T 8 normal
252:9 18 /dev/sdj HardDisk 252 0 0_9 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R6Y44N 8.91015625T 9 normal
252:10 17 /dev/sdk HardDisk 252 0 0_10 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R7D4RN 8.91015625T 10 normal
252:11 16 /dev/sdl HardDisk 252 0 0_11 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R7G62N 8.91015625T 11 normal
FLASH_10_1 /dev/nvme2n1 FlashDisk 10_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2018-06-27T12:44:13+08:00 PHLE805600PD6P4BGN-1 2.910957656800746917724609375T "PCI Slot: 10; FDOM: 1" normal
FLASH_10_2 /dev/nvme3n1 FlashDisk 10_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2018-06-27T12:44:13+08:00 PHLE805600PD6P4BGN-2 2.910957656800746917724609375T "PCI Slot: 10; FDOM: 2" normal
FLASH_4_1 /dev/nvme4n1 FlashDisk 4_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2018-06-27T12:44:13+08:00 PHLE8055008T6P4BGN-1 2.910957656800746917724609375T "PCI Slot: 4; FDOM: 1" normal
FLASH_4_2 /dev/nvme5n1 FlashDisk 4_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2018-06-27T12:44:13+08:00 PHLE8055008T6P4BGN-2 2.910957656800746917724609375T "PCI Slot: 4; FDOM: 2" normal
FLASH_5_1 /dev/nvme6n1 FlashDisk 5_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2018-06-27T12:44:13+08:00 PHLE805600P76P4BGN-1 2.910957656800746917724609375T "PCI Slot: 5; FDOM: 1" normal
FLASH_5_2 /dev/nvme7n1 FlashDisk 5_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2019-04-21T10:13:44+08:00 PHLE805600P76P4BGN-2 2.910957656800746917724609375T "PCI Slot: 5; FDOM: 2" normal
FLASH_6_1 /dev/nvme0n1 FlashDisk 6_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2018-06-27T12:44:13+08:00 PHLE8056009A6P4BGN-1 2.910957656800746917724609375T "PCI Slot: 6; FDOM: 1" normal
FLASH_6_2 /dev/nvme1n1 FlashDisk 6_0 "Oracle Flash Accelerator F640 PCIe Card" QDV1RF35 2019-04-22T09:36:12+08:00 PHLE8056009A6P4BGN-2 2.910957656800746917724609375T "PCI Slot: 6; FDOM: 2" normal
M2_SYS_0 /dev/sdm M2Disk "INTEL SSDSCKJB150G7" N2010121 2018-06-27T12:44:19+08:00 PHDW802004L8150A 139.73558807373046875G "M.2 Slot: 0" normal
M2_SYS_1 /dev/sdn M2Disk "INTEL SSDSCKJB150G7" N2010121 2018-06-27T12:44:19+08:00 PHDW802004ZT150A 139.73558807373046875G "M.2 Slot: 1" normal
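Every row above reports normal, yet 252:7 (/dev/sdh) carries a non-zero error counter, which is easy to miss by eye. The following sketch flags such rows. It assumes the column order seen in this particular listing for HardDisk rows (name, deviceId, deviceName, diskType, enclosureDeviceId, errOtherCount, ...), which is inferred from the sample output and the detail view further down; `suspicious_hard_disks` is a hypothetical helper, not part of CellCLI.

```python
import shlex


def suspicious_hard_disks(listing):
    """Yield (name, device, err_other_count, status) for HardDisk rows worth a closer look."""
    for raw in listing.splitlines():
        fields = shlex.split(raw)  # keeps quoted values such as the model string together
        if len(fields) < 15 or fields[3] != "HardDisk":
            continue  # skip prompts, blank lines, flash and M.2 rows
        name, device, err_other = fields[0], fields[2], int(fields[5])
        # Assumption: everything after slotNumber is the status, which may be
        # more than one word (for example "proactive failure").
        status = " ".join(fields[14:])
        if err_other > 0 or status != "normal":
            yield name, device, err_other, status


sample = '''252:6 25 /dev/sdg HardDisk 252 0 0_6 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R6SLBN 8.91015625T 6 normal
252:7 24 /dev/sdh HardDisk 252 1106 0_7 "HGST H1231A823SUN010T" A680 2018-06-27T12:43:40+08:00 sas R7D8GN 8.91015625T 7 normal'''

for row in suspicious_hard_disks(sample):
    print(row)
```

On the two sample rows, only 252:7 is reported, because its error counter is 1106 even though its status is normal.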

2. Check the status with list celldisk

CD_00_HTZHTZHADM01 2018-06-27T17:33:49+08:00 /dev/sda /dev/sda HardDisk 0 0 e432c76d-f0d4-46b5-8ef4-c6aa2f7efee0 R6WSUN 8.9094085693359375T normal
CD_01_HTZHTZHADM01 2018-06-27T17:33:49+08:00 /dev/sdb /dev/sdb HardDisk 0 0 25f32ef1-dd5a-4cf7-8440-1e74abe8e858 R74S6N 8.9094085693359375T normal
CD_02_HTZHTZHADM01 2024-09-12T10:25:05+08:00 /dev/sdc /dev/sdc HardDisk 0 0 76bcd3bd-9e46-4f7e-bece-046d4e83276a R52JHK 8.9094085693359375T normal
CD_03_HTZHTZHADM01 2018-06-27T17:33:49+08:00 /dev/sdd /dev/sdd HardDisk 0 0 0b1c0522-e657-4d70-b6ea-85811ddf5913 R75DLN 8.9094085693359375T normal
CD_04_HTZHTZHADM01 2018-06-27T17:33:49+08:00 /dev/sde /dev/sde HardDisk 0 0 c169081a-1ba8-435f-ba7c-a605b3049c78 R63KJN 8.9094085693359375T normal
CD_05_HTZHTZHADM01 2018-06-27T17:33:49+08:00 /dev/sdf /dev/sdf HardDisk 0 0 4b56c0f2-a6a0-43a0-bbeb-314a70626b09 R714UN 8.9094085693359375T normal
CD_06_HTZHTZHADM01 2018-06-27T17:33:50+08:00 /dev/sdg /dev/sdg HardDisk 0 0 0161eb49-0dca-4ae5-a247-a0b810491270 R6SLBN 8.9094085693359375T normal
CD_07_HTZHTZHADM01 2018-06-27T17:33:50+08:00 /dev/sdh /dev/sdh HardDisk 11413920 0 d7ef18a7-0aae-4136-8a98-2ec3978c498b R7D8GN 8.9094085693359375T normal
CD_08_HTZHTZHADM01 2018-06-27T17:33:50+08:00 /dev/sdi /dev/sdi HardDisk 0 0 312cc384-6cb4-4724-a88d-e43a06003e80 R6W8AN 8.9094085693359375T normal
CD_09_HTZHTZHADM01 2018-06-27T17:33:50+08:00 /dev/sdj /dev/sdj HardDisk 0 0 700e6561-de2f-4577-8cfe-d8d09435d291 R6Y44N 8.9094085693359375T normal
CD_10_HTZHTZHADM01 2018-06-27T17:33:50+08:00 /dev/sdk /dev/sdk HardDisk 0 0 1ba9303f-16a7-4c75-bd06-9e772307b63a R7D4RN 8.9094085693359375T normal
CD_11_HTZHTZHADM01 2018-06-27T17:33:50+08:00 /dev/sdl /dev/sdl HardDisk 0 0 0a429ee8-790b-4087-bfa5-74a94a79238a R7G62N 8.9094085693359375T normal
FD_00_HTZHTZHADM01 2018-06-27T17:33:51+08:00 /dev/md310 /dev/md310 FlashDisk 0 0 4d5272bf-4aba-48c4-8e98-b5c651ff19ed PHLE805600PD6P4BGN-2,PHLE805600PD6P4BGN-1 5.8218994140625T normal
FD_01_HTZHTZHADM01 2018-06-27T17:33:51+08:00 /dev/md304 /dev/md304 FlashDisk 0 0 6eea3fb4-deaf-436f-aa61-46ecf4bc5f2b PHLE8055008T6P4BGN-2,PHLE8055008T6P4BGN-1 5.8218994140625T normal
FD_02_HTZHTZHADM01 2018-06-27T17:33:52+08:00 /dev/md305 /dev/md305 FlashDisk 0 0 e10fe5b7-2d5c-48d4-a1a6-37612ddee585 PHLE805600P76P4BGN-2,PHLE805600P76P4BGN-1 5.8218994140625T normal
FD_03_HTZHTZHADM01 2018-06-27T17:33:53+08:00 /dev/md306 /dev/md306 FlashDisk 0 0 38181a27-b808-4238-a98c-04325146db87 PHLE8056009A6P4BGN-2,PHLE8056009A6P4BGN-1 5.8218994140625T normal

3. Check the status with list griddisk

CellCLI> list griddisk attributes all
DATAC1_CD_00_HTZHTZHADM01 DATAC1 DATAC1_CD_00_HTZHTZHADM01 HTZHTZHADM01 FD_03_HTZHTZHADM01 default CD_00_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:56+08:00 HardDisk 0 8666a91f-f41b-4ad9-91af-72e5e1dd3840 7.1279296875T active
DATAC1_CD_01_HTZHTZHADM01 DATAC1 DATAC1_CD_01_HTZHTZHADM01 HTZHTZHADM01 FD_01_HTZHTZHADM01 default CD_01_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:56+08:00 HardDisk 0 98e70d44-755a-416e-b221-1339bfe6accf 7.1279296875T active
DATAC1_CD_02_HTZHTZHADM01 DATAC1 DATAC1_CD_02_HTZHTZHADM01 HTZHTZHADM01 FD_03_HTZHTZHADM01 default CD_02_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2024-09-12T10:25:05+08:00 HardDisk 0 20be2fbd-de47-455f-aceb-6f33451227e9 7.1279296875T active
DATAC1_CD_03_HTZHTZHADM01 DATAC1 DATAC1_CD_03_HTZHTZHADM01 HTZHTZHADM01 FD_01_HTZHTZHADM01 default CD_03_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:56+08:00 HardDisk 0 c5dd1a7b-3ab4-4afd-ba95-6b52d7e0c990 7.1279296875T active
DATAC1_CD_04_HTZHTZHADM01 DATAC1 DATAC1_CD_04_HTZHTZHADM01 HTZHTZHADM01 FD_00_HTZHTZHADM01 default CD_04_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:55+08:00 HardDisk 0 66daaae1-f9bd-409f-a587-056a49bb0737 7.1279296875T active
DATAC1_CD_05_HTZHTZHADM01 DATAC1 DATAC1_CD_05_HTZHTZHADM01 HTZHTZHADM01 FD_02_HTZHTZHADM01 default CD_05_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:56+08:00 HardDisk 0 475cab0a-eef4-4c5b-b3be-1ff23b155f51 7.1279296875T active
DATAC1_CD_06_HTZHTZHADM01 DATAC1 DATAC1_CD_06_HTZHTZHADM01 HTZHTZHADM01 FD_03_HTZHTZHADM01 default CD_06_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:57+08:00 HardDisk 0 f75ea3a4-7ee2-4f85-9fee-dada8e11c6cf 7.1279296875T active
DATAC1_CD_07_HTZHTZHADM01 DATAC1 DATAC1_CD_07_HTZHTZHADM01 HTZHTZHADM01 FD_01_HTZHTZHADM01 default CD_07_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:55+08:00 HardDisk 4 d30809fc-57df-442d-bea9-b22d12c72281 7.1279296875T active
DATAC1_CD_08_HTZHTZHADM01 DATAC1 DATAC1_CD_08_HTZHTZHADM01 HTZHTZHADM01 FD_00_HTZHTZHADM01 default CD_08_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:56+08:00 HardDisk 0 9cc88c99-31dc-4731-be5f-1f4e467efb01 7.1279296875T active
DATAC1_CD_09_HTZHTZHADM01 DATAC1 DATAC1_CD_09_HTZHTZHADM01 HTZHTZHADM01 FD_02_HTZHTZHADM01 default CD_09_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:55+08:00 HardDisk 0 1656d696-640e-459b-a15f-c6c3c2ec504a 7.1279296875T active
DATAC1_CD_10_HTZHTZHADM01 DATAC1 DATAC1_CD_10_HTZHTZHADM01 HTZHTZHADM01 FD_00_HTZHTZHADM01 default CD_10_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:55+08:00 HardDisk 0 881d0d07-1a4a-4492-8320-407bc8b2e5f0 7.1279296875T active
DATAC1_CD_11_HTZHTZHADM01 DATAC1 DATAC1_CD_11_HTZHTZHADM01 HTZHTZHADM01 FD_02_HTZHTZHADM01 default CD_11_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup DATAC1" 2018-06-27T17:38:57+08:00 HardDisk 0 13fddb80-857b-40c5-9b41-68efced3dc40 7.1279296875T active
RECOC1_CD_00_HTZHTZHADM01 RECOC1 RECOC1_CD_00_HTZHTZHADM01 HTZHTZHADM01 FD_03_HTZHTZHADM01 default CD_00_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:19+08:00 HardDisk 0 a4ad0311-da18-4419-80dc-8b75227329d2 1.78143310546875T active
RECOC1_CD_01_HTZHTZHADM01 RECOC1 RECOC1_CD_01_HTZHTZHADM01 HTZHTZHADM01 FD_01_HTZHTZHADM01 default CD_01_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:19+08:00 HardDisk 0 ce81a19a-3651-4dc1-9dc7-e84b41754a76 1.78143310546875T active
RECOC1_CD_02_HTZHTZHADM01 RECOC1 RECOC1_CD_02_HTZHTZHADM01 HTZHTZHADM01 FD_03_HTZHTZHADM01 default CD_02_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2024-09-12T10:25:05+08:00 HardDisk 0 789705ed-e76f-43ab-ac1f-2544bebb1ea3 1.78143310546875T active
RECOC1_CD_03_HTZHTZHADM01 RECOC1 RECOC1_CD_03_HTZHTZHADM01 HTZHTZHADM01 FD_01_HTZHTZHADM01 default CD_03_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:21+08:00 HardDisk 0 37098fd1-2708-4a4b-86a7-d4a8c14ddf88 1.78143310546875T active
RECOC1_CD_04_HTZHTZHADM01 RECOC1 RECOC1_CD_04_HTZHTZHADM01 HTZHTZHADM01 FD_00_HTZHTZHADM01 default CD_04_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:20+08:00 HardDisk 0 e18f3551-0af6-44b3-b577-7f08bde773c3 1.78143310546875T active
RECOC1_CD_05_HTZHTZHADM01 RECOC1 RECOC1_CD_05_HTZHTZHADM01 HTZHTZHADM01 FD_02_HTZHTZHADM01 default CD_05_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:20+08:00 HardDisk 0 cbb48299-dbac-4230-b784-508c954a461b 1.78143310546875T active
RECOC1_CD_06_HTZHTZHADM01 RECOC1 RECOC1_CD_06_HTZHTZHADM01 HTZHTZHADM01 FD_03_HTZHTZHADM01 default CD_06_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:21+08:00 HardDisk 0 0010e607-713f-4055-8924-2fc23a31e9a1 1.78143310546875T active
RECOC1_CD_07_HTZHTZHADM01 RECOC1 RECOC1_CD_07_HTZHTZHADM01 HTZHTZHADM01 FD_01_HTZHTZHADM01 default CD_07_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:20+08:00 HardDisk 23825674 a2560429-fb29-4022-a85a-98e0baeacf28 1.78143310546875T active
RECOC1_CD_08_HTZHTZHADM01 RECOC1 RECOC1_CD_08_HTZHTZHADM01 HTZHTZHADM01 FD_00_HTZHTZHADM01 default CD_08_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:20+08:00 HardDisk 0 70997c8e-4bf0-4820-97a9-32c48be234ff 1.78143310546875T active
RECOC1_CD_09_HTZHTZHADM01 RECOC1 RECOC1_CD_09_HTZHTZHADM01 HTZHTZHADM01 FD_02_HTZHTZHADM01 default CD_09_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:19+08:00 HardDisk 0 309f073f-262f-4042-b9c7-b935c6f1a143 1.78143310546875T active
RECOC1_CD_10_HTZHTZHADM01 RECOC1 RECOC1_CD_10_HTZHTZHADM01 HTZHTZHADM01 FD_00_HTZHTZHADM01 default CD_10_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:19+08:00 HardDisk 0 fbc56c09-c95f-4230-a971-31b2a18fc10c 1.78143310546875T active
RECOC1_CD_11_HTZHTZHADM01 RECOC1 RECOC1_CD_11_HTZHTZHADM01 HTZHTZHADM01 FD_02_HTZHTZHADM01 default CD_11_HTZHTZHADM01 "Cluster cluster-clu1 diskgroup RECOC1" 2018-06-27T17:39:19+08:00 HardDisk 0 96061cc1-ad24-4a8e-afe7-f974d0930a8f 1.78143310546875T active
CellCLI>

4. Locate the failed LUN

name: CD_05_HTZHTZHADM06
comment:
creationTime: 2018-06-27T17:33:50+08:00
deviceName: /dev/sdf
devicePartition: /dev/sdf
diskType: HardDisk
errorCount: 46
freeSpace: 0
id: 86b90b17-fbe4-4622-821c-f3e96881ff09
physicalDisk: R74YWN
size: 8.9094085693359375T
status: proactive failure

Check the griddisk information with list griddisk where celldisk=CD_05_HTZHTZHADM06:

CellCLI> list griddisk where celldisk=CD_05_HTZHTZHADM06
DATAC1_CD_05_HTZHTZHADM06 proactive failure
RECOC1_CD_05_HTZHTZHADM06 proactive failure
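The detail view above is a plain list of key: value pairs, so it is simple to turn into a dictionary and test programmatically. This is a minimal sketch, assuming output pasted as text in the shape shown here; `parse_detail` and `needs_attention` are hypothetical helper names, and the threshold (any non-zero errorCount or any status other than normal) is an illustrative choice.

```python
def parse_detail(text):
    """Parse CellCLI 'detail' output (one 'key: value' per line) into a dict."""
    record = {}
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        # Keys are single words; this skips prompt lines that happen to contain ':'.
        if sep and key and " " not in key:
            record[key] = value.strip()
    return record


def needs_attention(record):
    """True when the disk has logged errors or is not in 'normal' status."""
    return int(record.get("errorCount", 0)) > 0 or record.get("status") != "normal"


sample = """name: CD_05_HTZHTZHADM06
comment:
creationTime: 2018-06-27T17:33:50+08:00
deviceName: /dev/sdf
diskType: HardDisk
errorCount: 46
physicalDisk: R74YWN
status: proactive failure"""

disk = parse_detail(sample)
print(disk["name"], disk["errorCount"], disk["status"], needs_attention(disk))
```

Splitting on the first colon only keeps timestamp values such as creationTime intact.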

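5. Confirm whether the ASM disks associated with the failed disk have been dropped automatically.

Log in to a database node as the grid user and connect to the ASM instance with sqlplus / as sysasm.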

SQL> set lines 180 pages 999
col path format a50
select group_number,path,header_status,mount_status,mode_status,name from v$ASM_DISK where path like '%CD_05_HTZHTZHADM06%';
SQL> SQL>
GROUP_NUMBER PATH HEADER_STATU MOUNT_S MODE_ST NAME
------------ -------------------------------------------------- ------------ ------- ------- ------------------------------
0 o/192.168.10.19;192.168.10.20/DATAC1_CD_05_sjxzcel FORMER CLOSED ONLINE
adm06
3 o/192.168.10.19;192.168.10.20/RECOC1_CD_05_sjxzcel MEMBER CACHED ONLINE RECOC1_CD_05_HTZHTZHADM06
adm06

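The drop was not complete: the DATAC1 disk shows FORMER/CLOSED, but the RECOC1 disk is still MEMBER/CACHED.

6. Check the rebalance status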

SQL>
INST_ID GROUP_NUMBER OPERA PASS STAT POWER ACTUAL SOFAR EST_WORK EST_RATE EST_MINUTES ERROR_CODE CON_ID
------- ------------ ----- ---- ----- ----- ------ -------- -------- -------- ----------- ---------- ------
1 3 REBAL COMPACT WAIT 12 12 0 0 0 0 0
3 3 REBAL REBALANCE RUN 12 121789 1240900 0 0 0 0
3 3 REBAL REBUILD DONE 12 12 0 0 0 0 0
3 3 REBAL RESYNC DONE 12 12 0 0 0 0 0
1 3 REBAL COMPACT WAIT 12 12 0 0 0 0 0
2 3 REBAL REBALANCE WAIT 12 12 0 0 0 0 0
2 3 REBAL REBUILD WAIT 12 12 0 0 0 0 0
2 3 REBAL RESYNC WAIT 12 12 0 0 0 0 0
3 3 REBAL COMPACT WAIT 12 12 0 0 0 0 0
3 3 REBAL REBALANCE WAIT 12 12 0 0 0 0 0
3 3 REBAL REBUILD WAIT 12 12 0 0 0 0 0
3 3 REBAL RESYNC WAIT 12 12 0 0 0 0 0
1 3 REBAL COMPACT WAIT 12 12 0 0 0 0 0
1 3 REBAL REBALANCE WAIT 12 12 0 0 0 0 0
1 3 REBAL REBUILD WAIT 12 12 0 0 0 0 0
1 3 REBAL RESYNC WAIT 12 12 0 0 0 0 0
2 3 REBAL COMPACT WAIT 12 12 0 0 0 0 0
2 3 REBAL REBALANCE WAIT 12 12 0 0 0 0 0
2 3 REBAL REBUILD WAIT 12 12 0 0 0 0 0
2 3 REBAL RESYNC WAIT 12 12 0 0 0 0 0

20 rows selected.

The status stayed like this and never progressed.
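With twenty near-identical rows, it helps to collapse the listing into a count per pass and state. The sketch below does that for rows shaped like the ones above. It assumes the first five columns are INST_ID, GROUP_NUMBER, OPERATION, PASS, and STATE, as in the header shown (the columns resemble those of gv$asm_operation, although the query itself is not given in the source); `summarize_rebalance` is a hypothetical helper.

```python
from collections import Counter


def summarize_rebalance(rows):
    """Count (PASS, STATE) pairs from rows shaped like the listing above."""
    summary = Counter()
    for row in rows:
        fields = row.split()
        # Assumed column order: INST_ID GROUP_NUMBER OPERATION PASS STATE ...
        if len(fields) >= 5 and fields[2] == "REBAL":
            summary[(fields[3], fields[4])] += 1
    return summary


sample = [
    "3 3 REBAL REBALANCE RUN 12 121789 1240900 0 0 0 0",
    "3 3 REBAL REBUILD DONE 12 12 0 0 0 0 0",
    "2 3 REBAL REBALANCE WAIT 12 12 0 0 0 0 0",
    "2 3 REBAL RESYNC WAIT 12 12 0 0 0 0 0",
]
for (pass_name, state), count in sorted(summarize_rebalance(sample).items()):
    print(pass_name, state, count)
```

Running the same summary on two listings taken some minutes apart, and comparing the progress figures on the single RUN row, is one way to tell a slow rebalance from a stalled one.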

Check the alert log

Log from one node

----------------Alert----------------
2025-06-16T16:51:26.645923+08:00
NOTE: updating disk modes to 0x5 from 0x7 for disk 19 (DATAC1_CD_05_HTZHTZHADM06) in group 1 (DATAC1): lflags 0x0
NOTE: disk 19 (DATAC1_CD_05_HTZHTZHADM06) in group 1 (DATAC1) is offline for reads
NOTE: updating disk modes to 0x1 from 0x5 for disk 19 (DATAC1_CD_05_HTZHTZHADM06) in group 1 (DATAC1): lflags 0x0
NOTE: disk 19 (DATAC1_CD_05_HTZHTZHADM06) in group 1 (DATAC1) is offline for writes
NOTE: disk 19 (DATAC1_CD_05_HTZHTZHADM06) in group 1 (DATAC1) is offline for writes
SUCCESS: disk DATAC1_CD_05_HTZHTZHADM06 (19.1897451371) dropped from diskgroup DATAC1
----------------ASM disk failure-------------------------------
WARNING: I/O on unhealthy ASM disk (DATAC1_CD_05_HTZHTZHADM06) in group DATAC1 /0x3a82859c will be diverted to its online partner disks
2025-06-19T18:08:35.323341+08:00
SQL> /* Exadata Auto Mgmt: Proactive DROP ASM Disk */
alter diskgroup RECOC1 drop
disk RECOC1_CD_05_HTZHTZHADM06
NOTE: GroupBlock outside rolling migration privileged region
NOTE: initiating offline for alter one membership refresh for group=3
2025-06-19T18:08:37.501287+08:00
2025-06-19T18:08:25.583485+08:00
SQL> /* Exadata Auto Mgmt: Proactive DROP ASM Disk */
alter diskgroup RECOC1 drop
disk RECOC1_CD_05_HTZHTZHADM06
2025-06-19T18:08:26.836233+08:00
NOTE: Attempting voting file refresh on diskgroup RECOC1
NOTE: Refresh completed on diskgroup RECOC1. No voting file found.

Log from another node

2025-06-23T21:54:19.843872+08:00
WARNING: Read Failed. group:3 disk:80 AU:38398 offset:1048576 size:1048576
path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01
incarnation:0x7118d6cb asynchronous result:'I/O error'
subsys:OSS krq:0x7f19ef8f3000 osderr1:0xc9 osderr2:0x0
Exadata error:'Generic I/O error'
IO elapsed time: 7057 usec Time waited on I/O: 6048 usec
WARNING: Read Failed. group:3 disk:80 AU:38398 offset:0 size:1048576
path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01
incarnation:0x7118d6cb asynchronous result:'I/O error'
subsys:OSS krq:0x7f19ef98288 bufp:0x7f19ea383000 osderr1:0xc9 osderr2:0x0
Exadata error:'Generic I/O error'
IO elapsed time: 15155 usec Time waited on I/O: 14146 usec
NOTE: Suppressing further IO Read errors on group:3 disk:80
WARNING: Read Failed. group:3 disk:80 AU:38374 offset:0 size:1048576
path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01
incarnation:0x7118d6cb asynchronous result:'I/O error'
subsys:OSS krq:0x7f19ef969348 bufp:0x7f19ea073000 osderr1:0xc9 osderr2:0x0
Exadata error:'Generic I/O error'
IO elapsed time: 7068 usec Time waited on I/O: 2021 usec
2025-06-23T21:54:50.567089+08:00
WARNING: Read Failed. group:3 disk:80 AU:38398 offset:1048576 size:1048576
path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01
incarnation:0x7118d6cb asynchronous result:'I/O error'
subsys:OSS krq:0x7f19ef8f3000 osderr1:0xc9 osderr2:0x0
Exadata error:'Generic I/O error'
IO elapsed time: 40362 usec Time waited on I/O: 6050 usec
WARNING: Read Failed. group:3 disk:80 AU:38398 offset:0 size:1048576
path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01
incarnation:0x7118d6cb asynchronous result:'I/O error'
subsys:OSS krq:0x7f19ef969348 bufp:0x7f19ea073000 osderr1:0xc9 osderr2:0x0
Exadata error:'Generic I/O error'
IO elapsed time: 10082 usec Time waited on I/O: 1009 usec
NOTE: Suppressing further IO Read errors on group:3 disk:80
WARNING: Read Failed. group:3 disk:80 AU:57491 offset:0 size:1048576

The Exadata error:'Generic I/O error' and Read Failed messages make the problem plain: I/O errors occurred while reading AUs during the rebalance.
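To see which grid disk and which allocation units are affected, the Read Failed warnings can be grouped by disk path. The sketch below is a minimal parser for the message layout shown in this alert log excerpt; it assumes each warning is followed by its path: line, and `failed_aus_by_path` is a hypothetical helper name.

```python
import re
from collections import defaultdict

READ_FAILED = re.compile(
    r"WARNING: Read Failed\. group:(\d+) disk:(\d+) AU:(\d+) offset:(\d+) size:(\d+)"
)
PATH = re.compile(r"path:(\S+)")


def failed_aus_by_path(lines):
    """Map each disk path to the set of AU numbers that failed to read."""
    result = defaultdict(set)
    pending_au = None
    for line in lines:
        m = READ_FAILED.search(line)
        if m:
            pending_au = int(m.group(3))  # remember the AU until its path line arrives
            continue
        p = PATH.search(line)
        if p and pending_au is not None:
            result[p.group(1)].add(pending_au)
            pending_au = None
    return dict(result)


sample = [
    "WARNING: Read Failed. group:3 disk:80 AU:38398 offset:1048576 size:1048576",
    "path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01",
    "WARNING: Read Failed. group:3 disk:80 AU:38398 offset:0 size:1048576",
    "path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01",
    "WARNING: Read Failed. group:3 disk:80 AU:38374 offset:0 size:1048576",
    "path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01",
]
for path, aus in failed_aus_by_path(sample).items():
    print(path, sorted(aus))
```

In this excerpt every failure lands on RECOC1_CD_07_HTZHTZHADM01, a grid disk on storage node 1 rather than on the node whose disk was being dropped.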

Check the cell log on the node with the I/O failures

Read Error on Cell Disk CD_07_sjxzceladm01 (/dev/sdh) at device offset 8078426636288 bytes with size 1048576 bytes membuf 0x1c89a00000, bioreq 0x616d522c (errno: No data available [61])
Read Error on Grid Disk RECOC1_CD_07_sjxzceladm01 at grid disk offset 241134731264 bytes with size 1048576 bytes from database +ASM
2025-06-23T17:23:05.781390+08:00
Read Error on Cell Disk CD_07_sjxzceladm01 (/dev/sdh) at device offset 7844585799680 bytes with size 1048576 bytes membuf 0x1e70f00000, bioreq 0x6120b2b4 (errno: No data available [61])
Read Error on Grid Disk RECOC1_CD_07_sjxzceladm01 at grid disk offset 7293894656 bytes with size 1048576 bytes from database +ASM
2025-06-23T17:23:05.790166+08:00

The errno: No data available [61] in these log lines likewise makes it clear that the I/O is failing.
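The cell-side messages carry the physical device and the byte offset of each failed read, which shows where on the disk the bad regions sit. A minimal sketch of extracting them, assuming the message layout shown in this excerpt (`cell_read_errors` is a hypothetical helper):

```python
import re

CELL_READ_ERROR = re.compile(
    r"Read Error on Cell Disk (\S+) \((/dev/\w+)\) "
    r"at device offset (\d+) bytes with size (\d+) bytes"
)


def cell_read_errors(lines):
    """Return a list of (celldisk, device, offset_bytes, size_bytes) tuples."""
    out = []
    for line in lines:
        m = CELL_READ_ERROR.search(line)
        if m:
            out.append((m.group(1), m.group(2), int(m.group(3)), int(m.group(4))))
    return out


sample = [
    "Read Error on Cell Disk CD_07_sjxzceladm01 (/dev/sdh) at device offset 8078426636288 bytes with size 1048576 bytes membuf 0x1c89a00000, bioreq 0x616d522c (errno: No data available [61])",
    "Read Error on Grid Disk RECOC1_CD_07_sjxzceladm01 at grid disk offset 241134731264 bytes with size 1048576 bytes from database +ASM",
    "Read Error on Cell Disk CD_07_sjxzceladm01 (/dev/sdh) at device offset 7844585799680 bytes with size 1048576 bytes membuf 0x1e70f00000, bioreq 0x6120b2b4 (errno: No data available [61])",
]

TIB = 1024 ** 4  # bytes per TiB
for celldisk, device, offset, size in cell_read_errors(sample):
    print(f"{celldisk} {device}: {size} bytes at {offset / TIB:.3f} TiB into the device")
```

Only the Cell Disk lines are matched here; the Grid Disk lines describe the same failures relative to the grid disk and are skipped to avoid double counting.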

Operating system log

Jun 24 00:17:43 sjxzceladm01 kernel: [29925171.789307] sd 0:2:7:0: [sdh] tag#7 BRCM Debug mfi stat 0x2d, data len requested/completed 0x100000/0x0
Jun 24 00:17:43 sjxzceladm01 kernel: [29925171.790573] sd 0:2:7:0: [sdh] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 24 00:17:43 sjxzceladm01 kernel: [29925171.790576] sd 0:2:7:0: [sdh] tag#7 Sense Key : Medium Error [current]
Jun 24 00:17:43 sjxzceladm01 kernel: [29925171.790578] sd 0:2:7:0: [sdh] tag#7 Add. Sense: Unrecovered read error

The operating system is reporting read errors as well.
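The same check can be made against the messages log. The sketch below pulls out the block devices that logged a SCSI Medium Error, assuming kernel lines in the format shown above; `medium_error_devices` is a hypothetical helper name.

```python
import re

# Matches kernel SCSI disk messages such as:
#   ... kernel: [29925171.790576] sd 0:2:7:0: [sdh] tag#7 Sense Key : Medium Error [current]
KERNEL_SD = re.compile(r"kernel: \[\s*[\d.]+\] sd (\S+): \[(\w+)\] tag#\d+ (.*)")


def medium_error_devices(lines):
    """Return the set of sd device names that logged 'Sense Key : Medium Error'."""
    devices = set()
    for line in lines:
        m = KERNEL_SD.search(line)
        if m and m.group(3).startswith("Sense Key : Medium Error"):
            devices.add(m.group(2))
    return devices


sample = [
    "Jun 24 00:17:43 sjxzceladm01 kernel: [29925171.790573] sd 0:2:7:0: [sdh] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE",
    "Jun 24 00:17:43 sjxzceladm01 kernel: [29925171.790576] sd 0:2:7:0: [sdh] tag#7 Sense Key : Medium Error [current]",
    "Jun 24 00:17:43 sjxzceladm01 kernel: [29925171.790578] sd 0:2:7:0: [sdh] tag#7 Add. Sense: Unrecovered read error",
]
print(medium_error_devices(sample))
```

Here the device is sdh, the same disk named in the cell log, so all three layers (ASM, cell, and kernel) point at one physical drive.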

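Fault Resolution

Force-drop the disk

Use the drop command alter physicaldisk 252:5 drop for replacement to remove the failed disk on storage node 6.

Replace the physical disk

Replace the disk. Each disk has a latch: release the latch to remove the old disk, push the new disk into the slot, and lock the latch. After the replacement, the warning LED on the disk goes off and the green light comes on.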

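Confirm the status of the replaced disk

After the physical disk is replaced, the system automatically creates the LUN, celldisk, and griddisks. If it is a system disk, that is, one containing system partitions, the RAID is also rebuilt automatically.

On the storage server, run the following commands at the cellcli prompt to check the status, creation time, and name of the lun, physicaldisk, celldisk, and griddisk, and confirm that the post-replacement information is correct.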

CellCLI> list lun 0_5 detail
name: 0_5
cellDisk: CD_05_HTZHTZHADM06
deviceName: /dev/sdf
diskType: HardDisk
id: 0_5
isSystemLun: FALSE
lunSize: 8.90940952301025390625T
lunUID: 0_5
physicalDrives: 252:5
raidLevel: 0
lunWriteCacheMode: "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
status: normal
CellCLI> list physicaldisk 252:5 detail
name: 252:5
deviceId: 29
deviceName: /dev/sdf
diskType: HardDisk
enclosureDeviceId: 252
errOtherCount: 0
luns: 0_5
makeModel: "HGST H1231A823SUN010T"
physicalFirmware: A680
physicalInsertTime: 2025-06-24T02:30:45+08:00
physicalInterface: sas
physicalSerial: R94HEN
physicalSize: 8.91015625T
slotNumber: 5
status: normal
CellCLI> list celldisk where lun=0_5 detail
name: CD_05_HTZHTZHADM06
comment:
creationTime: 2025-06-24T02:30:53+08:00
deviceName: /dev/sdf
devicePartition: /dev/sdf
diskType: HardDisk
errorCount: 0
freeSpace: 0
id: 1a41238e-cbb6-4d66-8427-5e7dc2e2729f
physicalDisk: R94HEN
size: 8.9094085693359375T
status: normal
CellCLI> list griddisk where celldisk=CD_05_HTZHTZHADM06 detail
name: DATAC1_CD_05_HTZHTZHADM06
asmDiskGroupName: DATAC1
asmDiskName: DATAC1_CD_05_HTZHTZHADM06
asmFailGroupName: HTZHTZHADM06
availableTo:
cachedBy: FD_01_HTZHTZHADM06
cachingPolicy: default
cellDisk: CD_05_HTZHTZHADM06
comment: "Cluster cluster-clu1 diskgroup DATAC1"
creationTime: 2025-06-24T02:30:53+08:00
diskType: HardDisk
errorCount: 0
id: ac2ecc45-619e-464c-bf66-5c0fd7e8c608
size: 7.1279296875T
status: active

name: RECOC1_CD_05_HTZHTZHADM06
asmDiskGroupName: RECOC1
asmDiskName: RECOC1_CD_05_HTZHTZHADM06
asmFailGroupName: HTZHTZHADM06
availableTo:
cachedBy: FD_01_HTZHTZHADM06
cachingPolicy: default
cellDisk: CD_05_HTZHTZHADM06
comment: "Cluster cluster-clu1 diskgroup RECOC1"
creationTime: 2025-06-24T02:30:53+08:00
diskType: HardDisk
errorCount: 0
id: 5ad626b7-9167-4f64-b07d-cfb767ca8d3d
size: 1.78143310546875T
status: active
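One quick sanity check on this output is that the celldisk and griddisks were really recreated on the new drive, meaning their creationTime is not earlier than the drive's physicalInsertTime. The sketch below compares the timestamps copied from the output above; `recreated_after_insert` is a hypothetical helper, and the rule it applies is an illustrative check rather than an Oracle-documented procedure.

```python
from datetime import datetime


def recreated_after_insert(physical_insert_time, creation_times):
    """True if every creationTime is at or after the disk's physicalInsertTime."""
    inserted = datetime.fromisoformat(physical_insert_time)
    return all(datetime.fromisoformat(t) >= inserted for t in creation_times.values())


# Values copied from the CellCLI output above.
creation_times = {
    "CD_05_HTZHTZHADM06": "2025-06-24T02:30:53+08:00",
    "DATAC1_CD_05_HTZHTZHADM06": "2025-06-24T02:30:53+08:00",
    "RECOC1_CD_05_HTZHTZHADM06": "2025-06-24T02:30:53+08:00",
}
print(recreated_after_insert("2025-06-24T02:30:45+08:00", creation_times))
```

A creationTime that still showed the original 2018 date would suggest the object was not rebuilt on the replacement drive.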

Then, on the database server's ASM instance, check that the griddisks have been correctly added back to the ASM disk groups:

SQL> set linesize 180 pages 999
col path format a50
select group_number,path,header_status,mount_status,name from v$ASM_DISK where path like '%CD_05_HTZHTZHADM06%';
SQL> SQL>
GROUP_NUMBER PATH HEADER_STATU MOUNT_S NAME
------------ -------------------------------------------------- ------------ ------- ------------------------------
1 o/192.168.10.19;192.168.10.20/DATAC1_CD_05_sjxzcel MEMBER CACHED DATAC1_CD_05_HTZHTZHADM06
adm06
3 o/192.168.10.19;192.168.10.20/RECOC1_CD_05_sjxzcel MEMBER CACHED RECOC1_CD_05_HTZHTZHADM06
adm06
