Wednesday, January 22, 2014

HOWTO : Replace a failed disk drive in a FreeBSD ZFS pool

In this blog post, we will repair a broken ZFS pool on a FreeBSD server. The machine runs FreeBSD 9.2, but as long as your machine runs a ZFS-enabled version of FreeBSD, all the commands in this article should work.

A little background on this machine: it's been in production for about four years now. It was originally installed with four 750 GB disk drives in a raidz2 pool. The OS has been upgraded several times, and so have the disk drives (because that's what fails, of course, hence this post). It's a ZFS-only machine built by following the ZFS-only FreeBSD installation wiki, with GPT-formatted disks.

I learned that my ZFS pool had an issue via the daily status emails, because ZFS alerts were enabled. They're not by default, but enabling them is simple: just add a single line to /etc/periodic.conf

sudo vi /etc/periodic.conf

# /etc/periodic.conf
#
# $Id: periodic.conf,v 1.1 2012/03/07 23:36:42 drobilla Exp $
#
# Changes in this file override the ones in
# /etc/defaults/periodic.conf
#
# David Robillard, March 7th, 2012.

daily_status_zfs_enable="YES" # Check ZFS

# EOF


Alright, so this blog post is all about disks, and their names. You see, the ZFS pool refers to its drives by their GPT labels, but the OS doesn't display these labels when it boots. The idea, then, is to match the two via the drives' serial numbers.

So the first thing we need to do is find out which disk is broken in the ZFS pool and record its ZFS name. In this example, it's gpt/disk10, as we can see from this output :
sudo zpool status -xv

  pool: zroot
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 617G in 12h48m with 0 errors on Sun Jan 27 22:50:59 2013
config:

NAME                     STATE     READ WRITE CKSUM
zroot                    DEGRADED     0     0     0
  raidz2-0               DEGRADED     0     0     0
    gpt/disk8            ONLINE       0     0     0
    gpt/disk10           DEGRADED     0     0     0
    gpt/disk12           ONLINE       0     0     0
    gpt/disk14           ONLINE       0     0     0

Great, but knowing that gpt/disk10 is in error doesn't help us much in deciding which physical disk to replace, right?! So we need to locate the /dev/gpt/disk serial numbers. But how do we know which disks are in the machine? Simple: we check the /var/run/dmesg.boot file, which contains all the boot messages. We also know that our disks are SATA disks, so a quick grep(1) will show us the way :

grep ATA /var/run/dmesg.boot

ada0: <WDC WD1001FALS-00Y6A0 05.01D05> ATA-8 SATA 2.x device
ada0: 150.000MB/s transfers (SATA 1.x, UDMA5, PIO 8192bytes)
ada1: <ST31000524AS JC4B> ATA-8 SATA 3.x device
ada1: 150.000MB/s transfers (SATA 1.x, UDMA5, PIO 8192bytes)
ada2: <WDC WD7502AAEX-00Y9A0 05.01D05> ATA-8 SATA 3.x device
ada2: 150.000MB/s transfers (SATA 1.x, UDMA5, PIO 8192bytes)
ada3: <ST3750330AS SD1A> ATA-8 SATA 1.x device
ada3: 150.000MB/s transfers (SATA 1.x, UDMA5, PIO 8192bytes)

So this tells us that our disks are named ada-something. With that in hand, we can get the serial numbers for both the ada devices and the gpt/disk labels, then compare the two to match them.

Using diskinfo(8), we can get the serial numbers; they show up as the "Disk ident." value. Let's start with the ada disks. We know we have ada0 to ada3 from our grep on /var/run/dmesg.boot. A quick loop will check all four disks...

for i in 0 1 2 3;
do sudo diskinfo -v ada$i
done
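If you want just the serial rather than the full diskinfo(8) report, you can filter for the "Disk ident." line with awk(1). A sketch; the printf here stands in for the relevant line of `sudo diskinfo -v ada0` output, using the serial number from this post:

```shell
# Keep only the "Disk ident." value from diskinfo -v style output.
# The printf fakes the diskinfo line; on the real machine you would run:
#   sudo diskinfo -v ada0 | awk '/Disk ident/ {print $1}'
printf '\t3QK085G6\t# Disk ident.\n' | awk '/Disk ident/ {print $1}'
# prints: 3QK085G6
```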

Open an empty file (or spreadsheet) and record each ada# with its serial number. We need to fill in the blanks in this table :

----------------------------------------------------------------
ada    Serial Number    GPT      Model                 RAID card slot #
----------------------------------------------------------------
ada0                    disk10                         0
ada1                    disk8                          1
ada2                    disk12                         2
ada3                    disk14                         3
----------------------------------------------------------------

Locate each gpt/disk# serial number (again the "Disk ident." value) and enter those into the same table (or spreadsheet) started previously.

Now let's do another quick loop to get the serial numbers of disk8 to disk14 from the zpool status command.

for i in 8 10 12 14;
do sudo diskinfo -v /dev/gpt/disk$i
done

Make sure to record the serial numbers and match them with those of the ada disks and ta! da! We made it : we know exactly which physical disk to replace!
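If you'd rather let the shell do the matching, join(1) can pair the two lists on the serial number. A minimal sketch, using two of the serials from this post; on the real machine you would feed it the diskinfo output instead of these printf lines:

```shell
# Pair ada devices with gpt labels by joining the two lists on the
# serial number (column 2). join(1) needs its inputs sorted on that key.
printf '%s\n' 'ada0 3QK085G6' 'ada1 5VPC99S7' | sort -k2 > /tmp/ada.$$
printf '%s\n' 'disk10 3QK085G6' 'disk8 5VPC99S7' | sort -k2 > /tmp/gpt.$$
join -1 2 -2 2 /tmp/ada.$$ /tmp/gpt.$$
# prints: 3QK085G6 ada0 disk10
#         5VPC99S7 ada1 disk8
rm -f /tmp/ada.$$ /tmp/gpt.$$
```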

At this point, we should have this table :

----------------------------------------------------------------
ada    Serial Number    GPT      Model                 RAID card slot #
----------------------------------------------------------------
ada0   3QK085G6         disk10   ST3750330AS SD1A      0
ada1   5VPC99S7         disk8    ST31000524AS          1
ada2   WD-WCAW32722468  disk12   WDC WD7502AAEX-00Y9A  2
ada3   3QK08L5V         disk14   ST3750330AS SD1A      3
----------------------------------------------------------------

Since we know the failed disk drive is ada0 (a.k.a. disk10) with serial number 3QK085G6, we record ada0's partition layout before pulling it out.

gpart show ada0
=>        34  1465149101  ada0  GPT  (698G)
          34         128     1  freebsd-boot  (64k)
         162     2097152     2  freebsd-swap  (1.0G)
     2097314  1463051821     3  freebsd-zfs  (697G)

There's a freebsd-swap partition. We can assume that since disk10 is broken, swap10 is on the same drive; that's usually how we build things. A quick check never hurts, but not this time, because I'm lazy: this machine has four swap partitions, one per disk, so chances are high that we'll be just fine.

Handle the freebsd-swap partition by removing swap10 from fstab(5); simply comment it out.

sudo vi /etc/fstab
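The commented-out entry would look something like this. The exact device path and options depend on how the machine was installed, so treat this as a sketch, not the real file:

```
# /etc/fstab (excerpt) -- swap10 disabled while its disk is replaced
#/dev/gpt/swap10   none   swap   sw   0   0
```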

If you have users, inform them that you need to shut down the server and arrange a maintenance window. Once the window arrives, make sure to schedule a downtime in your monitoring system and shut down the server. Remove the disk with serial number 3QK085G6.

Upon reboot, of course, it turned out the machine was configured to boot from the drive that was just removed. A quick pass through the BIOS to change that, and the server was able to boot.

Log in and check the boot messages to see if the new disk is detected :

grep ^ada /var/run/dmesg.boot

ada0 at ata4 bus 0 scbus4 target 0 lun 0
ada0: <WDC WD1001FALS-00Y6A0 05.01D05> ATA-8 SATA 2.x device
ada0: 150.000MB/s transfers (SATA 1.x, UDMA5, PIO 8192bytes)
ada0: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad8
ada1 at ata5 bus 0 scbus5 target 0 lun 0
ada1: <ST31000524AS JC4B> ATA-8 SATA 3.x device
ada1: 150.000MB/s transfers (SATA 1.x, UDMA5, PIO 8192bytes)
ada1: 953869MB (1953525168 512 byte sectors: 16H 63S/T 16383C)
ada1: Previously was known as ad10
ada2 at ata6 bus 0 scbus6 target 0 lun 0
ada2: <WDC WD7502AAEX-00Y9A0 05.01D05> ATA-8 SATA 3.x device
ada2: 150.000MB/s transfers (SATA 1.x, UDMA5, PIO 8192bytes)
ada2: 715404MB (1465149168 512 byte sectors: 16H 63S/T 16383C)
ada2: Previously was known as ad12
ada3 at ata7 bus 0 scbus7 target 0 lun 0
ada3: <ST3750330AS SD1A> ATA-8 SATA 1.x device
ada3: 150.000MB/s transfers (SATA 1.x, UDMA5, PIO 8192bytes)
ada3: 715404MB (1465149168 512 byte sectors: 16H 63S/T 16383C)
ada3: Previously was known as ad14

Has the ada0 drive changed name? It has if the disk you installed is different from the one that was removed.

Check the zpool status :

zpool status -vx

  pool: zroot
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 617G in 12h48m with 0 errors on Sun Jan 27 22:50:59 2013
config:

NAME                     STATE     READ WRITE CKSUM
zroot                    DEGRADED     0     0     0
 raidz2-0               DEGRADED     0     0     0
   gpt/disk8            ONLINE       0     0     0
   2974201805316735291  UNAVAIL      0     0     0  was /dev/gpt/disk10
   gpt/disk12           ONLINE       0     0     0
   gpt/disk14           ONLINE       0     0     0

Check the current partition state of the new disk.

gpart show ada0
=>        63  1953525105  ada0  MBR  (931G)
          63      206785        - free -  (101M)
      206848  1953314816     1  ntfs  (931G)
  1953521664        3504        - free -  (1.7M)

Of course, this new drive came with an NTFS partition. NTFS is not a bad file system; it's just that ZFS is better IMHO :) Let's clear that.

sudo gpart destroy -F ada0

Make sure it's destroyed.

gpart show ada0
gpart: No such geom: ada0.

Good, now partition the new drive.

sudo gpart create -s gpt ada0
sudo gpart add -b 34 -s 128 -t freebsd-boot ada0
sudo gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
sudo gpart add -s 2097152 -t freebsd-swap -l swap10 ada0
sudo gpart add -t freebsd-zfs -l disk10 ada0

Take a look at what we just created.

gpart show ada0
=>        34  1953525101  ada0  GPT  (931G)
          34         128     1  freebsd-boot  (64k)
         162     2097152     2  freebsd-swap  (1.0G)
     2097314  1951427821     3  freebsd-zfs  (930G)

The next step is to actually tell ZFS that it has a new drive to work with. Be ready to wait, because this can take quite a while.

Update the zpool.

sudo zpool replace zroot /dev/gpt/disk10

This will trigger the disk replacement (resilvering in ZFS terms).

NOTE: Make sure you wait until the resilvering is finished before you reboot!

Check the replacement's status :

zpool status -xv

  pool: zroot
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Jan 20 19:36:53 2014
        420G scanned out of 2.28T at 324/s, (scan is slow, no estimated time)
        105G resilvered, 17.99% done
config:

NAME                       STATE     READ WRITE CKSUM
zroot                      DEGRADED     0     0     0
 raidz2-0                 DEGRADED     0     0     0
   gpt/disk8              ONLINE       0     0     0
   replacing-1            DEGRADED     0     0    16
     2974201805316735291  UNAVAIL      0     0     0  was /dev/gpt/disk10/old
     gpt/disk10           ONLINE       0     0     0  (resilvering)
   gpt/disk12             ONLINE       0     0     0
   gpt/disk14             ONLINE       0     0     0  (resilvering)

While we're waiting, let's go back to fstab(5) and re-enable swap10 by uncommenting the line we disabled earlier.

sudo vim /etc/fstab

Periodically check the status of the zpool. It might take a while, as we can see from my own server's output :

zpool status -xv

  pool: zroot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 582G in 12h4m with 0 errors on Tue Jan 21 07:41:11 2014
config:

NAME            STATE     READ WRITE CKSUM
zroot           ONLINE       0     0     0
 raidz2-0      ONLINE       0     0     0
   gpt/disk8   ONLINE       0     0     0
   gpt/disk10  ONLINE       0     0     0
   gpt/disk12  ONLINE       0     0     0
   gpt/disk14  ONLINE       0     0     18

12 hours! Not bad for an old Pentium 4 with 1 GB of memory running a 32-bit version of ZFS :)

But wait?! We also see that the vdev gpt/disk14 has had checksum errors (18 of them, to be precise). WTF? This means gpt/disk14 is probably close to retirement. Looking at our table, we see it's one of the old 750 GB drives, so the data fits reality.

It's not dead yet, so we'll give it a chance: clear the errors and see what happens. Make a note to double-check this vdev in a couple of days.

sudo zpool clear zroot /dev/gpt/disk14
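To keep an eye on that vdev over the next few days, awk(1) can pull just the CKSUM column for gpt/disk14 out of the status listing. A sketch; the printf lines mimic the zpool status output above, and on the real box you would pipe `zpool status zroot` instead:

```shell
# Print the CKSUM column (5th field) for the gpt/disk14 vdev.
# The printf stands in for real `zpool status zroot` output.
printf '%s\n' \
  '    gpt/disk12  ONLINE       0     0     0' \
  '    gpt/disk14  ONLINE       0     0     18' \
  | awk '$1 == "gpt/disk14" {print $5}'
# prints: 18
```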

Then, when we check our pool, it's back to normal operation.

zpool status -xv
all pools are healthy

Make sure you have the ZFS report with each periodic run.

sudo vim /etc/periodic.conf 

# /etc/periodic.conf
#
# $Id: periodic.conf,v 1.1 2012/03/07 23:36:42 drobilla Exp $
#
# Changes in this file override the ones in
# /etc/defaults/periodic.conf
#
# David Robillard, March 7th, 2012.

daily_status_zfs_enable="YES" # Check ZFS

# EOF

There you go. The machine is back to normal, and your daily email will include the ZFS status.

HTH,

David

4 comments:

  1. How would you add any other drive beside the boot drive for a raidz2?

  2. Hey JoKer,

    The name « zroot » can be any other that you decide. You will see a lot of ZFS examples that use the name « tank » for some reason. Then, if you don't boot off from the ZFS pool, you can skip the creation of both freebsd-boot and freebsd-swap partitions. Then, depending on your boot device, the /boot/loader.conf file and /etc/rc.conf files might be different.

    Does that help?

    David

  3. Thank you for the write up. I found this very helpful.

    Another command that I found useful for my configuration was zdb, which gave me the guid of my drive since I'm not using standard adaX drives and whoever configured it originally was inconsistent with naming.

    1. Hey Noah,

      Interesting, thanks for sharing!

      David
