Today I was updating some servers. One of them was rebooted three times. On the third time, one of the drives went missing. This is how I tracked down which drive, and which slot in the server, and fixed it. I’m writing it mostly so I can remember how to flash the light on the drive. There is also an off-by-one issue to avoid.
In this post:
- FreeBSD 12.1
- sesutil
- ZFS
- Supermicro 846 chassis – hot swap drive bays
The first Nagios alert
The server had undergone a bunch of changes:
- pkg upgrade on each jail and the host
- freebsd-update fetch install in the host
- each jail was also updated, similar to the host OS
Usually, this would involve only one reboot, but I had forgotten to do the host, then I forgot to do the pkg upgrade on the host. When I am rebooting the host to upgrade the OS, I usually pkg upgrade all the things.
The first Nagios alert came in short after the reboot. A degraded ZFS array really gets my attention. These are the zpools on that host:
[dan@knew:~] $ zpool list NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT nvd 222G 20.3G 202G - - 38% 9% 1.00x ONLINE - system 90.5T 53.6T 36.9T - - 13% 59% 1.00x ONLINE - tank_fast01 928G 216G 712G - - 2% 23% 1.00x ONLINE - zroot 19.9G 11.6G 8.27G - - 58% 58% 1.00x ONLINE - [dan@knew:~] $
This is an extract from zpool status:
pool: system state: DEGRADED status: One or more devices could not be used because the label is missing or invalid. Sufficient replicas exist for the pool to continue functioning in a degraded state. action: Replace the device using 'zpool replace'. see: http://illumos.org/msg/ZFS-8000-4J scan: scrub repaired 0 in 0 days 14:51:29 with 0 errors on Tue Jun 9 19:58:50 2020 config: NAME STATE READ WRITE CKSUM system DEGRADED 0 0 0 raidz2-0 ONLINE 0 0 0 da3p3 ONLINE 0 0 0 da10p3 ONLINE 0 0 0 da9p3 ONLINE 0 0 0 da2p3 ONLINE 0 0 0 da13p3 ONLINE 0 0 0 da15p3 ONLINE 0 0 0 da11p3 ONLINE 0 0 0 da14p3 ONLINE 0 0 0 da8p3 ONLINE 0 0 0 da7p3 ONLINE 0 0 0 raidz2-1 DEGRADED 0 0 0 da5p1 ONLINE 0 0 0 da6p1 ONLINE 0 0 0 da18p1 ONLINE 0 0 0 da12p1 ONLINE 0 0 0 da4p1 ONLINE 0 0 0 da1p1 ONLINE 0 0 0 da17p1 ONLINE 0 0 0 da16p1 ONLINE 0 0 0 da0p1 ONLINE 0 0 0 1933292688604201684 FAULTED 0 0 0 was /dev/da18p1 errors: No known data errors
The relevant lines are:
2 – DEGRADED – that’s the key point.
3-4 – status – Things aren’t so bad, we can continue to use this pool.
6 – action – How do we fix this? ZFS is very helpful.
12 – DEGRADED – the zpool named system is degraded.
24 – DEGRADED – the vdev named raidz2-1 is degraded.
34 – FAULTED – the item which default was last used as /dev/da18p1
Looking for da18
I keep gpart, zpool, dmesg, and filesytem information for each host. In this instance, I did not refer to this server’s information. If I had, it would have confirmed the following information. I keep that data around just in case it is needed.
There were three reboots as I did some work today.
See how da18 changes from slot 16 to slot 23? See also Slot 17. This comes into play later.
[dan@knew:~] $ grep da18 /var/log/messages | egrep ' at |Slot' Jun 10 14:37:20 knew kernel: da18 at mps0 bus 0 scbus0 target 28 lun 0 Jun 10 14:37:20 knew kernel: uhub2: ses0: da18,pass22 in 'Slot 17', SAS Slot: 1 phys at slot 16 Jun 10 15:58:31 knew kernel: da18 at mps0 bus 0 scbus0 target 28 lun 0 Jun 10 15:58:31 knew kernel: ses0: da18,pass22 in 'Slot 17', SAS Slot: 1 phys at slot 16 Jun 10 16:50:55 knew kernel: da18 at mps0 bus 0 scbus0 target 29 lun 0 Jun 10 16:50:55 knew kernel: ses0: da18,pass22 in 'Slot 24', SAS Slot: 1 phys at slot 23 [dan@knew:~] $
That’s because it’s a different drive entirely:
[dan@knew:~] $ grep da18 /var/log/messages | grep Serial Jun 10 14:37:20 knew kernel: da18: Serial Number 57NGK1ZGF57D Jun 10 15:58:31 knew kernel: da18: Serial Number 57NGK1ZGF57D Jun 10 16:50:55 knew kernel: da18: Serial Number 653EK93PFS9A [dan@knew:~] $
If a drive stops working, then on the next reboot da19 would be da18, etc.
For some reason, 57NGK1ZGF57D in slot 1, or Slot 17, depending on what you read, is acting up.
[dan@knew:~] $ grep Slot /var/log/messages | head -3 Jun 10 14:37:20 knew kernel: ses0: da0,pass1 in 'Slot 01', SAS Slot: 1 phys at slot 0 Jun 10 14:37:20 knew kernel: ses0: da4,pass5 in 'Slot 02', SAS Slot: 1 phys at slot 1 Jun 10 14:37:20 knew kernel: ses0: da9,pass10 in 'Slot 03', SAS Slot: 1 phys at slot 2 [dan@knew:~] $
That will become important in the next step.
sesutil is your ZFS tool
This is the part I could not remember: sesutil.
The following output contains very useful information.
First, in hindsight, let me show you how to know this is 1-based, not zero-based. Let’s look at da0:
[dan@knew:~] $ grep da0 /var/log/messages | grep ses Jun 10 14:37:20 knew kernel: ses0: da0,pass1 in 'Slot 01', SAS Slot: 1 phys at slot 0 Jun 10 15:58:31 knew kernel: ses0: da0,pass1 in 'Slot 01', SAS Slot: 1 phys at slot 0 Jun 10 16:50:55 knew kernel: ses0: da0,pass1 in 'Slot 01', SAS Slot: 1 phys at slot 0 [dan@knew:~] $
da0 is in slot 01 and at slot 0.
Yet, in the following, we find da0 in Element 1, which is described as Slot 01 in lines 8-11.
Line 3 identifies my LSI card.
[dan@knew:~] $ sudo sesutil map ses0: Enclosure Name: LSI SAS2X36 0e12 Enclosure ID: 50030480014e78bf Element 0, Type: Array Device Slot Status: Unsupported (0x00 0x00 0x00 0x00) Description: Drive Slots Element 1, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 01 Device Names: da0,pass1 Element 2, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 02 Device Names: da4,pass5 Element 3, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 03 Device Names: da9,pass10 Element 4, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 04 Device Names: da8,pass9 Element 5, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 05 Device Names: da16,pass20 Element 6, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 06 Device Names: da19,pass23 Element 7, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 07 Device Names: da1,pass2 Element 8, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 08 Device Names: da5,pass6 Element 9, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 09 Device Names: da10,pass12 Element 10, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 10 Device Names: da13,pass15 Element 11, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 11 Device Names: da17,pass21 Element 12, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 12 Device Names: da20,pass24 Element 13, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 13 Device Names: da2,pass3 Element 14, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 14 Device Names: da6,pass7 Element 15, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 15 Device Names: da11,pass13 Element 16, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 16 Device Names: da14,pass17 Element 17, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 17 Element 18, Type: Array Device Slot Status: Not Installed (0x05 0x00 0x00 0x00) Description: Slot 18 Element 19, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 19 Device Names: da3,pass4 Element 20, Type: Array Device Slot Status: OK (0x01 0x00 0x00 0x00) Description: Slot 20 Device Names: da7,pass8 [ some information ommitted] [dan@knew:~] $
On lines 72-74 you can find Slot 17, which is the one in question for the missing da18.
Full output is saved in this gist.
Locate that drive!
This is the command I could not recall:
sesutil locate -u /dev/ses0 17 on
The ses0 comes from line 2 above. 17 comes from line 74.
This turns on the locate LED for the drive.
I know now that I wrote about this before, about 7 months ago. The following are keywords for my future search: blink locate led drive. However, I see I did not document the locate option I just used above.
To the basement
Once in front of the rack, I saw the blinking light. I pulled the drive. No new no messages appeared in /var/log/messages. That is good. If messages appeared, it would mean a drive known to the system had been removed.
This vdev is a raidz2 (see line 24 of the zpool status output). That means we could remove one drive and the pool could still function.
I confirmed I had the correct drive by comparing the serial number against the information I found in /var/log/messages. I plugged the drive back in. Looking in /var/log/messages, I saw this:
Jun 10 18:12:37 knew kernel: ses0: da22,pass28 in 'Slot 17', SAS Slot: 1 phys at slot 16 Jun 10 18:12:37 knew kernel: ses0: phy 0: SATA device Jun 10 18:12:37 knew kernel: ses0: phy 0: parent 50030480014e78bf addr 50030480014e789c Jun 10 18:12:37 knew kernel: da22 at mps0 bus 0 scbus0 target 28 lun 0 Jun 10 18:12:37 knew kernel: da22:Fixed Direct Access SPC-4 SCSI device Jun 10 18:12:37 knew kernel: da22: Serial Number 57NGK1ZGF57D Jun 10 18:12:37 knew kernel: da22: 600.000MB/s transfers Jun 10 18:12:37 knew kernel: da22: Command Queueing enabled Jun 10 18:12:37 knew kernel: da22: 4769307MB (9767541168 512 byte sectors)
The drive is now known to FreeBSD.
Back to the zpool
The next goal is adding that drive back into the pool.
$ sudo zpool online system 1933292688604201684
EDIT 2020-06-11: the original version of this post had that command as: sudo zpool replace system 1933292688604201684 /dev/da22p1 – that is incorrect. I was working off the system logs, not my scrollback (which was lost). I apologize for the error.
The number is a vdev_guid and is found on line 34 of the zpool status.
With that, the resilvering started.
Timing
The replace command occurred at:
Jun 10 18:15:32 knew sudo[6922]: dan : TTY=pts/0 ; PWD=/usr/home/dan ; USER=root ; COMMAND=/sbin/zpool replace system 1933292688604201684 /dev/da22p1
This is a yet another reason to use sudo – it logs nicely.
The first part of the status is now:
[dan@knew:~] $ zpool status system pool: system state: ONLINE scan: resilvered 254M in 0 days 00:00:14 with 0 errors on Wed Jun 10 18:18:22 2020
I know that by 18:22 I had noticed the successful resilver.
Also found in the logs, which was looged to the replace command
Jun 10 18:18:01 knew ZFS[8273]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$5892227802261634203 Jun 10 18:18:01 knew ZFS[8275]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$9332658639709199239 Jun 10 18:18:01 knew ZFS[8276]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$250004220145174872 Jun 10 18:18:01 knew ZFS[8277]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$6216472763074854678 Jun 10 18:18:01 knew ZFS[8281]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$12795310201775582855 Jun 10 18:18:01 knew ZFS[8282]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$13315402097660581553 Jun 10 18:18:01 knew ZFS[8284]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$5840140110512920130 Jun 10 18:18:02 knew ZFS[8285]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$13603535286907309607 Jun 10 18:18:02 knew ZFS[8289]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$4677401754715191854 Jun 10 18:18:02 knew ZFS[8303]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$1933292688604201684
None of that math adds up. If the replace command was issued at 18:15:32 and the resilvering ended at 18:18:22, how does 0 days 00:00:14 relate to that?
The key points are:
- with 0 errors – no errors were found when resilvering
- state: ONLINE – we are back at full strength
Here are the current smartctl values for this drive. I am sure this is not the drive, but the hardware (connector, drive cage, something like that):
[dan@knew:~] $ sudo smartctl -a /dev/da22 smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.1-RELEASE-p6 amd64] (local build) Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: Toshiba X300 Device Model: TOSHIBA HDWE150 Serial Number: 57NGK1ZGF57D LU WWN Device Id: 5 000039 7cbe83a65 Firmware Version: FP2A User Capacity: 5,000,981,078,016 bytes [5.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 7200 rpm Form Factor: 3.5 inches Device is: In smartctl database [for details use: -P show] ATA Version is: ATA8-ACS (minor revision not indicated) SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Wed Jun 10 22:47:32 2020 UTC SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 120) seconds. Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 552) minutes. SCT capabilities: (0x003d) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported. SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000b 100 100 050 Pre-fail Always - 0 2 Throughput_Performance 0x0005 100 100 050 Pre-fail Offline - 0 3 Spin_Up_Time 0x0027 100 100 001 Pre-fail Always - 8451 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 34 5 Reallocated_Sector_Ct 0x0033 100 100 050 Pre-fail Always - 0 7 Seek_Error_Rate 0x000b 100 100 050 Pre-fail Always - 0 8 Seek_Time_Performance 0x0005 100 100 050 Pre-fail Offline - 0 9 Power_On_Hours 0x0032 066 066 000 Old_age Always - 13953 10 Spin_Retry_Count 0x0033 100 100 030 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 34 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 46 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 31 193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 163 194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 36 (Min/Max 15/50) 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x0032 200 253 000 Old_age Always - 0 220 Disk_Shift 0x0002 100 100 000 Old_age Always - 0 222 Loaded_Hours 0x0032 066 066 000 Old_age Always - 13926 223 Load_Retry_Count 0x0032 100 100 000 Old_age Always - 0 224 Load_Friction 0x0022 100 100 000 Old_age Always - 0 226 Load-in_Time 0x0026 100 100 000 Old_age Always - 558 240 Head_Flying_Hours 0x0001 100 100 001 Pre-fail Offline - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Extended offline Completed without error 00% 2286 - # 2 Short offline Completed without error 00% 2180 - # 3 Short offline Completed without error 00% 2012 - # 4 Extended offline Completed without error 00% 1871 - # 5 Short offline Completed without error 00% 1844 - # 6 Short offline Completed without error 00% 1676 - # 7 Extended offline Completed without error 00% 1544 - # 8 Short offline Completed without error 00% 1508 - # 9 Short offline Completed without error 00% 1340 - #10 Short offline Completed without error 00% 1172 - #11 Extended offline Completed without error 00% 78 - #12 Extended offline Completed without error 00% 38 - #13 Extended offline Completed without error 00% 8 - #14 Short offline Completed without error 00% 0 - #15 Short offline Completed without error 00% 0 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. [dan@knew:~] $
Finishing it off
My thanks to Allan Jude for letting me know about the sesutil commands.
Thanks to mav for the code written years ago.
Thanks for bapt and Allan for hooking that code up into sesutil.
And thanks to trasz for recent extensions with some nice output.
I will keep an eye on that drive