I rebooted knew yesterday for upgrades. When it came back, the main storage zpool was degraded:
[dan@knew:~] $ zpool status system
  pool: system
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0 days 16:23:38 with 0 errors on Tue Nov 19 20:47:29 2019
config:

        NAME                     STATE     READ WRITE CKSUM
        system                   DEGRADED     0     0     0
          raidz2-0               ONLINE       0     0     0
            da3p3                ONLINE       0     0     0
            da10p3               ONLINE       0     0     0
            da9p3                ONLINE       0     0     0
            da2p3                ONLINE       0     0     0
            da13p3               ONLINE       0     0     0
            da15p3               ONLINE       0     0     0
            da11p3               ONLINE       0     0     0
            da14p3               ONLINE       0     0     0
            da8p3                ONLINE       0     0     0
            da7p3                ONLINE       0     0     0
          raidz2-1               DEGRADED     0     0     0
            da5p1                ONLINE       0     0     0
            da6p1                ONLINE       0     0     0
            da18p1               ONLINE       0     0     0
            da12p1               ONLINE       0     0     0
            da4p1                ONLINE       0     0     0
            da1p1                ONLINE       0     0     0
            da17p1               ONLINE       0     0     0
            da16p1               ONLINE       0     0     0
            da0p1                ONLINE       0     0     0
            1933292688604201684  UNAVAIL      0     0     0  was /dev/da18p1

errors: No known data errors
Is the drive alive?
The drive is not listed at all in /var/run/dmesg.boot.
I keep a list of the expected drives in /etc/periodic.conf, for use by a Nagios check:
[dan@knew:~] $ /usr/sbin/sysrc -nf /etc/periodic.conf daily_status_smart_devices
/dev/da22 /dev/da21 /dev/da20 /dev/da19 /dev/da18 /dev/da17 /dev/da16 /dev/da15
/dev/da14 /dev/da13 /dev/da12 /dev/da11 /dev/da10 /dev/da9  /dev/da8  /dev/da7
/dev/da6  /dev/da5  /dev/da4  /dev/da3  /dev/da2  /dev/da1  /dev/da0  /dev/ada1
/dev/nvd1 /dev/ada0 /dev/nvd0
[dan@knew:~] $
That output (which is formatted into columns for easier reading) has eight devices on each of the first three lines and three on the last. That is 27 drives.
This is what the system can see:
[dan@knew:~] $ /sbin/sysctl -n kern.disks
da21 da20 da19 da18 da17 da16 da15 da14 da13 da12 da11 da10 da9 da8 da7 da6 da5 da4 da3 da2 da1 da0 ada1 ada0 nvd1 nvd0
That is only 26 drives; one is missing. There is no da22 in the second list, which confirms that a da device did not show up. Because da numbers are assigned in probe order, it is the highest-numbered name (da22) that vanishes, not necessarily the drive that used to be called da22, which is why I still had to track down the affected bay.
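To automate that comparison, something along these lines would work; the temporary file names are arbitrary, and it assumes both lists are space-separated, as shown above:

# Strip the /dev/ prefix from the expected list so the names match kern.disks.
/usr/sbin/sysrc -nf /etc/periodic.conf daily_status_smart_devices | tr -s ' ' '\n' | sed 's|^/dev/||' | sort > /tmp/expected.txt

# The drives the kernel can currently see.
/sbin/sysctl -n kern.disks | tr -s ' ' '\n' | sort > /tmp/present.txt

# Show devices which are expected but not present; in this case, da22.
comm -23 /tmp/expected.txt /tmp/present.txt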
Could this be a loose connection?
At first, I thought this was a recurrence of a 'CAM status: SCSI Status Error' problem I encountered about a year ago. The same issue arose on a new server just last month. In both situations, the problem was solved by not using that particular drive bay. My conclusion was that those drive bays each had a connector issue.
When this zpool status issue arose yesterday, I immediately suspected another connector issue. However, last month the problem was on r720-01; this time it is knew having trouble.
Reseat and reboot
Using sesutil combined with other available information, I worked out that it was drive bay 16 having trouble. I created a drive bay map and, along the way, discovered that sesutil alone could have found the missing bay, as sketched below.
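For reference, sesutil can do most of the work; the exact slot names and numbering depend on the enclosure, and da18 here is only an example device:

# List every enclosure slot and the device (if any) the enclosure reports in it;
# a slot with no device stands out in this listing.
sesutil map

# Blink the locate LED for the slot holding a known device, to orient yourself in the chassis.
sesutil locate da18 on
sesutil locate da18 off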
I don’t know why the drive was not found. I reseated it and booted into single user mode; of note, the drive was spinning when I removed it. Examination of /var/run/dmesg.boot failed to find its serial number. I removed the drive and placed it into the drive bay above, and the console output confirmed the drive was found. I moved the drive back to its original position, and console messages again confirmed the presence of that drive.
I issued a reboot and went into single user mode again. This time I saw the serial number in the dmesg output.
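Checking for a drive by serial number is just a grep; the serial below is a made-up placeholder, and smartctl comes from the smartmontools package:

# Search the boot messages for the serial number printed on the drive label (placeholder value).
grep -i 'PLACEHOLDER-SERIAL' /var/run/dmesg.boot

# Once the device node exists, confirm the serial number directly.
smartctl -i /dev/da18 | grep -i 'serial number'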
This time I did a shutdown -r now from single user mode (this shouldn’t have made a difference; I think reboot and shutdown while in single user mode amount to the same thing).
Intact zpool
I let the system boot normally, then ssh’d in and found an intact zpool, which was resilvering.
After resilver
It took about 7 minutes to resilver:
  pool: system
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 16.5G in 0 days 00:07:17 with 0 errors on Mon Nov 25 17:36:01 2019
config:

        NAME          STATE     READ WRITE CKSUM
        system        ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            da3p3     ONLINE       0     0     0
            da10p3    ONLINE       0     0     0
            da9p3     ONLINE       0     0     0
            da2p3     ONLINE       0     0     0
            da13p3    ONLINE       0     0     0
            da15p3    ONLINE       0     0     0
            da11p3    ONLINE       0     0     0
            da14p3    ONLINE       0     0     0
            da8p3     ONLINE       0     0     0
            da7p3     ONLINE       0     0     0
          raidz2-1    ONLINE       0     0     0
            da5p1     ONLINE       0     0     0
            da6p1     ONLINE       0     0     0
            da19p1    ONLINE       0     0     0
            da12p1    ONLINE       0     0     0
            da4p1     ONLINE       0     0     0
            da1p1     ONLINE       0     0     0
            da17p1    ONLINE       0     0     0
            da16p1    ONLINE       0     0     0
            da0p1     ONLINE       0     0     0
            da18p1    ONLINE       0     0    28

errors: No known data errors
zpool clear
Let’s clear out that checksum column; those errors happened because the drive was offline.
[dan@knew:~] $ sudo zpool clear system
[dan@knew:~] $ zpool status system
  pool: system
 state: ONLINE
  scan: resilvered 16.5G in 0 days 00:07:17 with 0 errors on Mon Nov 25 17:36:01 2019
config:

        NAME          STATE     READ WRITE CKSUM
        system        ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            da3p3     ONLINE       0     0     0
            da10p3    ONLINE       0     0     0
            da9p3     ONLINE       0     0     0
            da2p3     ONLINE       0     0     0
            da13p3    ONLINE       0     0     0
            da15p3    ONLINE       0     0     0
            da11p3    ONLINE       0     0     0
            da14p3    ONLINE       0     0     0
            da8p3     ONLINE       0     0     0
            da7p3     ONLINE       0     0     0
          raidz2-1    ONLINE       0     0     0
            da5p1     ONLINE       0     0     0
            da6p1     ONLINE       0     0     0
            da19p1    ONLINE       0     0     0
            da12p1    ONLINE       0     0     0
            da4p1     ONLINE       0     0     0
            da1p1     ONLINE       0     0     0
            da17p1    ONLINE       0     0     0
            da16p1    ONLINE       0     0     0
            da0p1     ONLINE       0     0     0
            da18p1    ONLINE       0     0     0

errors: No known data errors
[dan@knew:~] $
All done.
But wait, there’s more
The important part about this issue: the system remained functional. One drive was missing, but no operations were affected. The system still had redundancy; it could have lost one more drive and continued to function.
It could even have lost three more drives, provided only one of them was in this raidz2 vdev and the other two were in the other raidz2 vdev (the other half of the stripe).
This is one of the hugely appealing aspects of ZFS.