I rebooted knew yesterday for upgrades. When it came back, the main storage zpool was degraded:
[dan@knew:~] $ zpool status system
  pool: system
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: scrub repaired 0 in 0 days 16:23:38 with 0 errors on Tue Nov 19 20:47:29 2019
config:

        NAME                     STATE     READ WRITE CKSUM
        system                   DEGRADED     0     0     0
          raidz2-0               ONLINE       0     0     0
            da3p3                ONLINE       0     0     0
            da10p3               ONLINE       0     0     0
            da9p3                ONLINE       0     0     0
            da2p3                ONLINE       0     0     0
            da13p3               ONLINE       0     0     0
            da15p3               ONLINE       0     0     0
            da11p3               ONLINE       0     0     0
            da14p3               ONLINE       0     0     0
            da8p3                ONLINE       0     0     0
            da7p3                ONLINE       0     0     0
          raidz2-1               DEGRADED     0     0     0
            da5p1                ONLINE       0     0     0
            da6p1                ONLINE       0     0     0
            da18p1               ONLINE       0     0     0
            da12p1               ONLINE       0     0     0
            da4p1                ONLINE       0     0     0
            da1p1                ONLINE       0     0     0
            da17p1               ONLINE       0     0     0
            da16p1               ONLINE       0     0     0
            da0p1                ONLINE       0     0     0
            1933292688604201684  UNAVAIL      0     0     0  was /dev/da18p1

errors: No known data errors
Is the drive alive?
The drive is not listed at all in /var/run/dmesg.boot.
I keep a list of the expected drives in /etc/periodic.conf, for use by a Nagios check:
[dan@knew:~] $ /usr/sbin/sysrc -nf /etc/periodic.conf daily_status_smart_devices
/dev/da22 /dev/da21 /dev/da20 /dev/da19 /dev/da18 /dev/da17 /dev/da16 /dev/da15
/dev/da14 /dev/da13 /dev/da12 /dev/da11 /dev/da10 /dev/da9  /dev/da8  /dev/da7
/dev/da6  /dev/da5  /dev/da4  /dev/da3  /dev/da2  /dev/da1  /dev/da0  /dev/ada1
/dev/nvd1 /dev/ada0 /dev/nvd0
[dan@knew:~] $
That output (which is formatted into columns for easier reading) has eight devices on each of the first three lines and three on the last. That is 27 drives.
This is what the system can see:
[dan@knew:~] $ /sbin/sysctl -n kern.disks
da21 da20 da19 da18 da17 da16 da15 da14 da13 da12 da11 da10 da9 da8 da7 da6 da5 da4 da3 da2 da1 da0 ada1 ada0 nvd1 nvd0
That is only 26 drives; one is missing. There is no da22 in the second list, which confirms that a da device did not show up. Because da numbers are assigned in probe order, it is the highest-numbered name (da22) that vanishes, not necessarily the drive that used to be called da22, which is why I still had to track down the affected bay.
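To automate that comparison, something along these lines would work; the temporary file names are arbitrary, and it assumes both lists are space-separated, as shown above:

# Strip the /dev/ prefix from the expected list so the names match kern.disks.
/usr/sbin/sysrc -nf /etc/periodic.conf daily_status_smart_devices | tr -s ' ' '\n' | sed 's|^/dev/||' | sort > /tmp/expected.txt

# The drives the kernel can currently see.
/sbin/sysctl -n kern.disks | tr -s ' ' '\n' | sort > /tmp/present.txt

# Show devices which are expected but not present; in this case, da22.
comm -23 /tmp/expected.txt /tmp/present.txt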
Could this be a loose connection?
At first, I thought this was a recurrence of a 'CAM status: SCSI Status Error' problem I encountered about a year ago. The same issue arose on a new server just last month. In both situations, the problem was solved by not using that particular drive bay. My conclusion was that those drive bays each had a connector issue.
When this zpool status issue arose yesterday, I immediately suspected another connector issue. However, last month the problem was on r720-01; this time it is knew having trouble.
Reseat and reboot
Using sesutil combined with other available information, I worked out that it was drive bay 16 having trouble. I created a drive bay map and, along the way, discovered that sesutil alone could have found the missing bay, as sketched below.
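For reference, sesutil can do most of the work; the exact slot names and numbering depend on the enclosure, and da18 here is only an example device:

# List every enclosure slot and the device (if any) the enclosure reports in it;
# a slot with no device stands out in this listing.
sesutil map

# Blink the locate LED for the slot holding a known device, to orient yourself in the chassis.
sesutil locate da18 on
sesutil locate da18 off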
I don’t know why the drive was not found. I reseated it and booted into single user mode; of note, the drive was spinning when I removed it. Examination of /var/run/dmesg.boot failed to find its serial number. I removed the drive and placed it into the drive bay above, and the console output confirmed the drive was found. I moved the drive back to its original position, and console messages again confirmed the presence of that drive.
I issued a reboot and went into single user mode again. This time I saw the serial number in the dmesg output.
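Checking for a drive by serial number is just a grep; the serial below is a made-up placeholder, and smartctl comes from the smartmontools package:

# Search the boot messages for the serial number printed on the drive label (placeholder value).
grep -i 'PLACEHOLDER-SERIAL' /var/run/dmesg.boot

# Once the device node exists, confirm the serial number directly.
smartctl -i /dev/da18 | grep -i 'serial number'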
This time I did a shutdown -r now from single user mode (this shouldn’t have made a difference; I think reboot and shutdown while in single user mode amount to the same thing).
Intact zpool
I let the system boot normally, then ssh’d in and found an intact zpool, which was resilvering.
After resilver
It took about 7 minutes to resilver:
  pool: system
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 16.5G in 0 days 00:07:17 with 0 errors on Mon Nov 25 17:36:01 2019
config:

        NAME          STATE     READ WRITE CKSUM
        system        ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            da3p3     ONLINE       0     0     0
            da10p3    ONLINE       0     0     0
            da9p3     ONLINE       0     0     0
            da2p3     ONLINE       0     0     0
            da13p3    ONLINE       0     0     0
            da15p3    ONLINE       0     0     0
            da11p3    ONLINE       0     0     0
            da14p3    ONLINE       0     0     0
            da8p3     ONLINE       0     0     0
            da7p3     ONLINE       0     0     0
          raidz2-1    ONLINE       0     0     0
            da5p1     ONLINE       0     0     0
            da6p1     ONLINE       0     0     0
            da19p1    ONLINE       0     0     0
            da12p1    ONLINE       0     0     0
            da4p1     ONLINE       0     0     0
            da1p1     ONLINE       0     0     0
            da17p1    ONLINE       0     0     0
            da16p1    ONLINE       0     0     0
            da0p1     ONLINE       0     0     0
            da18p1    ONLINE       0     0    28

errors: No known data errors
zpool clear
Let’s clear out that checksum column; those errors happened because the drive was offline.
[dan@knew:~] $ sudo zpool clear system
[dan@knew:~] $ zpool status system
  pool: system
 state: ONLINE
  scan: resilvered 16.5G in 0 days 00:07:17 with 0 errors on Mon Nov 25 17:36:01 2019
config:

        NAME          STATE     READ WRITE CKSUM
        system        ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            da3p3     ONLINE       0     0     0
            da10p3    ONLINE       0     0     0
            da9p3     ONLINE       0     0     0
            da2p3     ONLINE       0     0     0
            da13p3    ONLINE       0     0     0
            da15p3    ONLINE       0     0     0
            da11p3    ONLINE       0     0     0
            da14p3    ONLINE       0     0     0
            da8p3     ONLINE       0     0     0
            da7p3     ONLINE       0     0     0
          raidz2-1    ONLINE       0     0     0
            da5p1     ONLINE       0     0     0
            da6p1     ONLINE       0     0     0
            da19p1    ONLINE       0     0     0
            da12p1    ONLINE       0     0     0
            da4p1     ONLINE       0     0     0
            da1p1     ONLINE       0     0     0
            da17p1    ONLINE       0     0     0
            da16p1    ONLINE       0     0     0
            da0p1     ONLINE       0     0     0
            da18p1    ONLINE       0     0     0

errors: No known data errors
[dan@knew:~] $
All done.
But wait, there’s more
The important part about this issue: the system remained functional. One drive was missing, but no operations were affected. The system still had redundancy; it could have lost one more drive and continued to function.
It could even have lost three more drives, provided only one of them was in this raidz2 vdev and the other two were in the other raidz2 vdev (the other half of the stripe).
This is one of the hugely appealing aspects of ZFS.