Signs of impending disk failure
Stage 1: SMART pre-failure warnings
One of the drives developed an unreadable sector, and the smartd daemon sent me an email with the subject "SMART error (CurrentPendingSector) detected":
The following warning/error was logged by the smartd daemon:
Device: /dev/sdb, 1 Currently unreadable (pending) sectors
To determine the extent of the problem, execute a long selftest:
# smartctl -t long /dev/sdb
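Once the long selftest has finished (it can take several hours), the result and the pending-sector count can be reviewed with standard smartctl options, for example:
[root@hal ~]# smartctl -l selftest /dev/sdb
[root@hal ~]# smartctl -A /dev/sdb | grep -i pending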
Some of the symptoms during the pre-failure period are:
- "Wait for IO" state (%wa in top) goes though the roof; the disk subsystem is effectively halted while the disk tries to recover from the read-error.
- The system becomes very sluggish at times.
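One way to observe this is with iostat from the sysstat package (a rough sketch; the 5-second interval is arbitrary):
[root@hal ~]# iostat -x 5
Persistently high await and %util values for the failing disk, together with the %wa spikes in top, confirm that the disk is the bottleneck.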
As always, check that your backups are up-to-date and properly readable.
Stage 2: mdadm Fail event
After a few more read errors, the md layer kicked the problematic partition out of the array, and the RAID monitoring software sent another email:
A Fail event had been detected on md device /dev/md1.
It could be related to component device /dev/sdb2.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid6] [raid5] [raid4] [raid1]
md0 : active raid1 sda1[0] sdb1[1] sdc1[2] sdd1[3]
128384 blocks [4/4] [UUUU]
md1 : active raid5 sdd2[3] sdc2[2] sdb2[4](F) sda2[0]
1464765696 blocks level 5, 256k chunk, algorithm 2 [4/3] [U_UU]
unused devices: &lt;none&gt;
You can verify this information using mdadm:
[root@hal ~]# mdadm --detail /dev/md1
/dev/md1:
Version : 0.90
Creation Time : Sun Sep 2 18:56:50 2007
Raid Level : raid5
Array Size : 1464765696 (1396.91 GiB 1499.92 GB)
Used Dev Size : 488255232 (465.64 GiB 499.97 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Tue Feb 2 12:16:34 2010
State : clean, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 1
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 256K
UUID : f8a183b6:6748af53:6c1c8c11:87458cf7
Events : 0.964132
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
1 0 0 1 removed
2 8 34 2 active sync /dev/sdc2
3 8 50 3 active sync /dev/sdd2
4 8 18 - faulty spare /dev/sdb2
The RAID array is running in degraded mode - the failed disk needs to be replaced as soon as possible. Again, check your backups!
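These notification emails come from mdadm's monitor mode. A minimal sketch of the relevant configuration, assuming the distribution reads /etc/mdadm.conf and runs the monitor as a service (the address is a placeholder):
# /etc/mdadm.conf (excerpt)
MAILADDR root@example.com
The monitor can also be started by hand with "mdadm --monitor --scan --daemonise".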
Prepare for disk replacement
The faulty disk, /dev/sdb, was partitioned and used in two RAID arrays (/dev/md0 and /dev/md1). We need to mark all occurrences of this disk as faulty before physically replacing it. The md layer has already failed /dev/sdb2 in /dev/md1, so only /dev/sdb1 in /dev/md0 remains:
[root@hal ~]# mdadm --fail /dev/md0 /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
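You may also want to remove the failed members from the arrays explicitly before pulling the disk; a sketch, using the device names from above:
[root@hal ~]# mdadm --remove /dev/md0 /dev/sdb1
[root@hal ~]# mdadm --remove /dev/md1 /dev/sdb2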
The faulty disk is no longer part of any RAID array; erase all data before sending it in for replacement:
[root@hal ~]# dd if=/dev/zero of=/dev/sdb1
[root@hal ~]# dd if=/dev/zero of=/dev/sdb2
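Before opening the case, it can also help to record the serial number of the failing drive so you pull the right unit; for example (assuming smartmontools is installed):
[root@hal ~]# smartctl -i /dev/sdb | grep -i serial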
Recovering from disk failure
Partitioning
Replace the broken hard disk with a new one, create the proper partitions, and set the bootable flag on partition 1:
[root@hal ~]# fdisk -l /dev/sdb
Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 * 1 16 128488+ fd Linux raid autodetect
/dev/sdb2 17 121601 976631512+ fd Linux raid autodetect
Note that the replacement disk is 1 TB, while the original disk was 500 GB. You can safely create a partition that is larger than the original one (the array will only use as much space as its smallest member); in fact, this may be useful when doing a "rolling upgrade" to a set of larger disks.
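If you only want to duplicate the layout of a surviving member onto the replacement disk (rather than creating larger partitions by hand), sfdisk can copy the partition table; a sketch, assuming /dev/sda is a healthy member and /dev/sdb is the new disk:
[root@hal ~]# sfdisk -d /dev/sda | sfdisk /dev/sdb
Double-check the device names first; this overwrites the partition table on /dev/sdb.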
Add to RAID arrays
Add the new disks/partitions to the RAID arrays:
[root@hal ~]# mdadm --add /dev/md0 /dev/sdb1
mdadm: added /dev/sdb1
[root@hal ~]# mdadm --add /dev/md1 /dev/sdb2
mdadm: added /dev/sdb2
Array reconstruction should now begin automatically:
[root@hal ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md0 : active raid1 sdb1[1] sda1[0] sdc1[2] sdd1[3]
128384 blocks [4/4] [UUUU]
md1 : active raid5 sdb2[4] sdd2[3] sdc2[2] sda2[0]
1464765696 blocks level 5, 256k chunk, algorithm 2 [4/3] [U_UU]
[>....................] recovery = 0.0% (84068/488255232) finish=193.4min speed=42034K/sec
unused devices: &lt;none&gt;
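If the rebuild runs slower than expected, the kernel's resync speed limits (in KiB/s) can be raised; a sketch with an arbitrary value:
[root@hal ~]# cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
[root@hal ~]# echo 50000 > /proc/sys/dev/raid/speed_limit_min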
Check boot-loader
The new disk has a bootable partition, but the GRUB boot code is not yet installed in its master boot record (note the missing "code offset" in the output for /dev/sdb):
[root@hal ~]# file -s /dev/sda
/dev/sda: x86 boot sector; partition 1: ID=0xfd, active, starthead 1, startsector 63, 256977 sectors;
partition 2: ID=0xfd, starthead 0, startsector 257040, 976511025 sectors, code offset 0x48
[root@hal ~]# file -s /dev/sdb
/dev/sdb: x86 boot sector; partition 1: ID=0xfd, active, starthead 1, startsector 63, 256977 sectors;
partition 2: ID=0xfd, starthead 0, startsector 257040, 1953263025 sectors
Install GRUB on the new disk, /dev/sdb (GRUB uses different names for your disks: /dev/sda => hd0, /dev/sdb => hd1, etc.):
[root@hal ~]# grub
<...snip...>
grub> root (hd1,0)
root (hd1,0)
Filesystem type is ext2fs, partition type 0xfd
grub> setup (hd1)
setup (hd1)
Checking if "/boot/grub/stage1" exists... no
Checking if "/grub/stage1" exists... yes
Checking if "/grub/stage2" exists... yes
Checking if "/grub/e2fs_stage1_5" exists... yes
Running "embed /grub/e2fs_stage1_5 (hd1)"... 15 sectors are embedded.
succeeded
Running "install /grub/stage1 (hd1) (hd1)1+15 p (hd1,0)/grub/stage2 /grub/grub.conf"... succeeded
Done.
grub> quit
quit
Verify the result (the "code offset 0x48" now shows up for /dev/sdb as well, indicating that boot code is present):
[root@hal ~]# file -s /dev/sdb
/dev/sdb: x86 boot sector; partition 1: ID=0xfd, active, starthead 1, startsector 63, 256977 sectors;
partition 2: ID=0xfd, starthead 0, startsector 257040, 1953263025 sectors, code offset 0x48
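On many distributions the same result can be achieved non-interactively; a sketch, assuming GRUB legacy's grub-install is available:
[root@hal ~]# grub-install /dev/sdb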
Wait for reconstruction to complete
During array reconstruction, performance may suffer. Check RAID status for an estimated time of completion:
[root@hal ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md0 : active raid1 sdb1[1] sda1[0] sdc1[2] sdd1[3]
128384 blocks [4/4] [UUUU]
md1 : active raid5 sdb2[4] sdd2[3] sdc2[2] sda2[0]
1464765696 blocks level 5, 256k chunk, algorithm 2 [4/3] [U_UU]
[===>.................] recovery = 18.5% (90480896/488255232) finish=147.6min speed=44912K/sec
unused devices: &lt;none&gt;
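To follow the rebuild without retyping the command, or to block until it has finished (handy in scripts), something like the following works; a sketch:
[root@hal ~]# watch -n 60 cat /proc/mdstat
[root@hal ~]# mdadm --wait /dev/md1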