Software RAID Failure Recovery

A hard disk in my software RAID looked like it was failing, so I replaced the disk and rebuilt the RAID.

I figured the disk had failed, or was about to, because this message arrived by email:

Jun 30 10:55:21 ns1 smartd[3428]: Device: /dev/hdc, FAILED SMART self-check. BACK UP DATA NOW!

The RAID itself was still healthy, as shown below, but I replaced the disk early as a precaution.

[root@ns1 ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hdc2[1] hda2[0]
      2096384 blocks [2/2] [UU]
 
md2 : active raid1 hdc3[1] hda3[0]
      75762432 blocks [2/2] [UU]
 
md0 : active raid1 hdc1[1] hda1[0]
      264960 blocks [2/2] [UU]
 
unused devices: <none>
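
The [UU] check above can also be scripted. The following is a minimal sketch, assuming the mdstat layout shown in this post; the function name degraded_arrays is made up for illustration. It reads mdstat-formatted text on stdin and prints the name of every array whose status field contains "_".

```shell
#!/bin/sh
# List md arrays whose member-status field (for example [U_]) contains "_",
# meaning at least one member is missing or failed.
# Reads mdstat-formatted text on stdin so it can be tried on a saved copy.
degraded_arrays() {
    awk '
        # Remember the array name from the header line ("md1 : active ...").
        /^md[0-9]+ :/ { name = $1 }
        # The status line ends with a field such as [UU] or [U_].
        /blocks/ && $NF ~ /^\[[U_]+\]$/ {
            if ($NF ~ /_/) print name
        }
    '
}

# Usage on a live system (not executed here):
#   degraded_arrays < /proc/mdstat
```

No output means every array reports all members as up.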

Preparation

Before replacing the disk, I noted down the current configuration.

[root@ns1 ~]# fdisk -l /dev/hda
 
Disk /dev/hda: 80.0 GB, 80000000000 bytes
255 heads, 63 sectors/track, 9726 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
 
   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *           1          33      265041   fd  Linux raid autodetect
/dev/hda2              34         294     2096482+  fd  Linux raid autodetect
/dev/hda3             295        9726    75762540   fd  Linux raid autodetect
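
As a sanity check on these numbers, the Blocks column can be recomputed from the cylinder ranges. This is a small sketch, assuming the DOS-compatible CHS layout that the output suggests (255 heads, 63 sectors per track, with the first partition starting one track in); blocks_for_cylinders is a hypothetical helper name.

```shell
#!/bin/sh
# Compute the 1 KiB "Blocks" value for a partition spanning cylinders
# START..END in a 255-head, 63-sectors-per-track geometry.
# Assumption: a partition starting at cylinder 1 begins one track
# (63 sectors) in, because the first track holds the MBR.
blocks_for_cylinders() {
    start=$1
    end=$2
    sectors_per_cyl=$((255 * 63))          # 16065, as in the "Units" line
    sectors=$(( (end - start + 1) * sectors_per_cyl ))
    if [ "$start" -eq 1 ]; then
        sectors=$((sectors - 63))
    fi
    blocks=$((sectors / 2))                # two 512-byte sectors per block
    if [ $((sectors % 2)) -eq 1 ]; then
        echo "${blocks}+"                  # "+" marks a leftover half block
    else
        echo "$blocks"
    fi
}

blocks_for_cylinders 1 33
blocks_for_cylinders 34 294
blocks_for_cylinders 295 9726
```

Under those assumptions the three calls reproduce the 265041, 2096482+ and 75762540 values listed above, so recreating the same cylinder ranges on the new disk should give partitions of the same size.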
 
[root@ns1 ~]# smartctl -a /dev/hdc
(Output omitted because it is long.)

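Also, since these are IDE disks, I checked that the failing one is connected as the master on the secondary controller, and looked up how the jumpers on the replacement disk should be set.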

Removing /dev/hdc from the RAID devices

[root@ns1 ~]# mdadm --manage /dev/md0 --fail /dev/hdc1
mdadm: set /dev/hdc1 faulty in /dev/md0
[root@ns1 ~]# mdadm --manage /dev/md1 --fail /dev/hdc2
mdadm: set /dev/hdc2 faulty in /dev/md1
[root@ns1 ~]# mdadm --manage /dev/md2 --fail /dev/hdc3
mdadm: set /dev/hdc3 faulty in /dev/md2
 
After that, mdstat looks like this.
[root@ns1 ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hdc2[2](F) hda2[0]
      2096384 blocks [2/1] [U_]
 
md2 : active raid1 hdc3[2](F) hda3[0]
      75762432 blocks [2/1] [U_]
 
md0 : active raid1 hdc1[2](F) hda1[0]
      264960 blocks [2/1] [U_]
 
unused devices: <none>
 
Then I ran the following.
[root@ns1 ~]# mdadm --manage /dev/md0 --remove /dev/hdc1
mdadm: hot removed /dev/hdc1
[root@ns1 ~]# mdadm --manage /dev/md1 --remove /dev/hdc2
mdadm: hot removed /dev/hdc2
[root@ns1 ~]# mdadm --manage /dev/md2 --remove /dev/hdc3
mdadm: hot removed /dev/hdc3
 
/dev/hdc has now been removed from the RAID devices.
[root@ns1 ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hda2[0]
      2096384 blocks [2/1] [U_]
 
md2 : active raid1 hda3[0]
      75762432 blocks [2/1] [U_]
 
md0 : active raid1 hda1[0]
      264960 blocks [2/1] [U_]
 
unused devices: <none>
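
The fail and remove steps follow the same pattern for each array, so they can be generated rather than typed. Here is a rough sketch that only prints the commands; the function name print_detach_commands and the "array:partition" argument format are my own invention for illustration, while the mdadm options are the ones used above.

```shell
#!/bin/sh
# Print the mdadm commands that mark each partition as failed and then
# remove it from its array. Arguments are "array:partition" pairs.
# Nothing is executed: review the output first, then run it as root.
print_detach_commands() {
    # First mark every member as faulty ...
    for pair in "$@"; do
        array=${pair%%:*}
        part=${pair#*:}
        echo "mdadm --manage $array --fail $part"
    done
    # ... then hot-remove them, in the same order as in the transcript.
    for pair in "$@"; do
        array=${pair%%:*}
        part=${pair#*:}
        echo "mdadm --manage $array --remove $part"
    done
}

# Example matching this post (prints six commands):
#   print_detach_commands /dev/md0:/dev/hdc1 /dev/md1:/dev/hdc2 /dev/md2:/dev/hdc3
```

Printing instead of executing keeps a typo in a device name from failing the wrong disk.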

Replacing the disk

[root@ns1 ~]# sync;sync;sync;shutdown -h now
Once the machine had powered off, I swapped the disk.

Partitioning

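Out of the box, the new disk looked like this. (The stray /dev/ argument in the command below is a typo on my part; it is what triggers the last_lba() message, since mode 41ed is a directory. The rest of the output is for /dev/hdc.)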
[root@ns1 ~]# fdisk /dev/ -l /dev/hdc
last_lba(): I don't know how to handle files with mode 41ed
 
Disk /dev/hdc: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
 
Disk /dev/hdc doesn't contain a valid partition table
 
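So I partitioned it the same way as /dev/hda, ending the last partition at cylinder 9726 to match even though the new disk has 9729 cylinders.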
 
[root@ns1 ~]# fdisk /dev/hdc
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-9729, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-9729, default 9729): 33
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 2
First cylinder (34-9729, default 34):
Using default value 34
Last cylinder or +size or +sizeM or +sizeK (34-9729, default 9729): 294
 
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 3
First cylinder (295-9729, default 295):
Using default value 295
Last cylinder or +size or +sizeM or +sizeK (295-9729, default 9729): 9726
 
Command (m for help): t
Partition number (1-4): 1
Hex code (type L to list codes): fd
Changed system type of partition 1 to fd (Linux raid autodetect)
Command (m for help): t
Partition number (1-4): 2
Hex code (type L to list codes): fd
Changed system type of partition 2 to fd (Linux raid autodetect)
Command (m for help): t
Partition number (1-4): 3
Hex code (type L to list codes): fd
Changed system type of partition 3 to fd (Linux raid autodetect)
 
Command (m for help): p
 
Disk /dev/hdc: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
 
   Device Boot      Start         End      Blocks   Id  System
/dev/hdc1               1          33      265041   fd  Linux raid autodetect
/dev/hdc2              34         294     2096482+  fd  Linux raid autodetect
/dev/hdc3             295        9726    75762540   fd  Linux raid autodetect
 
Command (m for help): a
Partition number (1-4): 1
Command (m for help): p
 
Disk /dev/hdc: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
 
   Device Boot      Start         End      Blocks   Id  System
/dev/hdc1   *           1          33      265041   fd  Linux raid autodetect
/dev/hdc2              34         294     2096482+  fd  Linux raid autodetect
/dev/hdc3             295        9726    75762540   fd  Linux raid autodetect
 
Command (m for help): w
The partition table has been altered!
 
Calling ioctl() to re-read partition table.
Syncing disks.
 
Then I verified the result with the following.
[root@ns1 ~]# fdisk -l /dev/hdc
 
Disk /dev/hdc: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
 
   Device Boot      Start         End      Blocks   Id  System
/dev/hdc1   *           1          33      265041   fd  Linux raid autodetect
/dev/hdc2              34         294     2096482+  fd  Linux raid autodetect
/dev/hdc3             295        9726    75762540   fd  Linux raid autodetect
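
Comparing the two fdisk listings by eye works for three partitions, but it can also be done mechanically. The sketch below assumes the old cylinder-based fdisk -l format shown in this post; partition_rows is a hypothetical helper that strips the device name so the layouts of /dev/hda and /dev/hdc can be compared directly.

```shell
#!/bin/sh
# Reduce `fdisk -l` output to rows of
#   partition-number boot-flag start end blocks id
# so two disks can be compared regardless of their device names.
partition_rows() {
    awk '
        $1 ~ /^\/dev\// {
            n = $1
            sub(/^.*[^0-9]/, "", n)      # keep only the trailing partition number
            if ($2 == "*") { boot = "boot"; i = 3 } else { boot = "-"; i = 2 }
            print n, boot, $i, $(i + 1), $(i + 2), $(i + 3)
        }
    '
}

# Usage (not executed here), with the listings saved to files beforehand:
#   fdisk -l /dev/hda > hda.txt
#   fdisk -l /dev/hdc > hdc.txt
#   [ "$(partition_rows < hda.txt)" = "$(partition_rows < hdc.txt)" ] && echo "layouts match"
```

The boot flag is kept in the comparison on purpose, because the new disk needs it on partition 1 to be bootable on its own.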

Resync

Adding each partition back to its RAID device starts a resync.

[root@ns1 ~]# mdadm --manage /dev/md0 --add /dev/hdc1
mdadm: hot added /dev/hdc1
[root@ns1 ~]# mdadm --manage /dev/md1 --add /dev/hdc2
mdadm: hot added /dev/hdc2
[root@ns1 ~]# mdadm --manage /dev/md2 --add /dev/hdc3
mdadm: hot added /dev/hdc3
 
You can watch the resync progress in mdstat.
[root@ns1 ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hdc2[2] hda2[0]
      2096384 blocks [2/1] [U_]
      [=====>...............]  recovery = 25.7% (540800/2096384) finish=0.9min speed=27040K/sec
md2 : active raid1 hdc3[2] hda3[0]
      75762432 blocks [2/1] [U_]
        resync=DELAYED
md0 : active raid1 hdc1[1] hda1[0]
      264960 blocks [2/2] [UU]
 
unused devices: <none>
 
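Even if you --add several partitions in quick succession, the resyncs do not seem to run at the same time: md2 shows resync=DELAYED while md1 is rebuilding.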
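
To keep an eye on the rebuild without rereading the whole file, the progress lines can be pulled out of mdstat. This is only a sketch that assumes the mdstat format shown above; resync_status is a made-up name. It prints the percentage for an array that is rebuilding and DELAYED for one that is waiting its turn.

```shell
#!/bin/sh
# Print "array percent" for every array that is currently rebuilding and
# "array DELAYED" for arrays queued behind another rebuild.
# Reads mdstat-formatted text on stdin.
resync_status() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /(recovery|resync) *= *[0-9.]+%/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9.]+%$/) print name, $i
        }
        /resync=DELAYED/ { print name, "DELAYED" }
    '
}

# Usage on a live system (not executed here):
#   resync_status < /proc/mdstat
#   while [ -n "$(resync_status < /proc/mdstat)" ]; do sleep 30; done
```

The loop in the usage comment simply waits until no array reports a rebuild or a delayed resync any more.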
 
Once everything has finished, it looks like this.
 
[tatsu@ns1 ~]$  cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hdc2[1] hda2[0]
      2096384 blocks [2/2] [UU]
 
md2 : active raid1 hdc3[1] hda3[0]
      75762432 blocks [2/2] [UU]
 
md0 : active raid1 hdc1[1] hda1[0]
      264960 blocks [2/2] [UU]
 
unused devices: <none>

That completes the recovery of the RAID devices, but the boot loader still has to be installed on the new disk so that the system can boot from it as well.

Installing the boot loader

 
[root@ns1 ~]# grub
    GNU GRUB  version 0.95  (640K lower / 3072K upper memory)
 
 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename.]
 
grub> device (hd0) /dev/hdc
 
grub> root (hd0,0)
 Filesystem type is ext2fs, partition type 0xfd
 
grub> install /grub/stage1 (hd0) /grub/stage2 p /grub/grub.conf
grub> quit
 
Apparently this is how you check it.
[root@ns1 ~]# dd if=/dev/hdc bs=512k count=1 | strings
ZRrI
D|f1
GRUB
Geom
Hard Disk
Read
 Error
ZRrI
D|f1
GRUB
Geom
Hard Disk
Read
 Error
qH*L
sH*L
1+0 records in
1+0 records out
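
The dd check above can be narrowed to the boot sector itself, since the MBR is only the first 512 bytes of the disk. The sketch below assumes a grep that supports -a (treat binary input as text, as GNU grep does); has_grub_mbr is a hypothetical name. It only confirms that the string GRUB is present in the first sector, not that the disk will actually boot.

```shell
#!/bin/sh
# Succeed if the first sector (512 bytes) of the given device or image file
# contains the string "GRUB", which shows up in the strings output above.
# Assumption: grep supports -a (treat binary data as text).
has_grub_mbr() {
    dd if="$1" bs=512 count=1 2>/dev/null | LC_ALL=C grep -a -q GRUB
}

# Usage as root (not executed here):
#   has_grub_mbr /dev/hdc && echo "GRUB found in the MBR of /dev/hdc"
```

Because the function takes any readable path, it can be tried on a saved image of the boot sector before pointing it at a real disk.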

That completes the recovery 🙂

The nice thing about software RAID is that you can boot the OS and resume services right away while the resync is still running, so downtime stays short.

I relied on the following page while doing this.
It was a great help. Thank you very much.

http://centossrv.com/ (RAID構成ハードディスク交換, "Replacing a hard disk in a RAID configuration")
