Jun 10, 2020

Today I was updating some servers. One of them was rebooted three times. On the third time, one of the drives went missing. This is how I tracked down which drive it was, which slot it occupied in the server, and how I fixed it. I’m writing it mostly so I can remember how to flash the light on the drive. There is also an off-by-one issue to avoid.

In this post:

  • FreeBSD 12.1
  • sesutil
  • ZFS
  • Supermicro 846 chassis – hot swap drive bays

The first Nagios alert

The server had undergone a bunch of changes:

  • pkg upgrade on each jail and the host
  • freebsd-update fetch install in the host
  • each jail was also updated, similar to the host OS

Usually this would involve only one reboot, but I forgot to update the host OS, and then I forgot to run the pkg upgrade on the host. When I reboot the host to upgrade the OS, I usually pkg upgrade all the things at the same time.
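For future me, the whole pass boils down to something like this sketch. This is a hypothetical script, not what I actually ran; how you update jails depends on your setup (I’m assuming pkg’s -j flag here, with JAILS supplied by hand or from jls(8)):

```shell
#!/bin/sh
# Sketch of the update pass described above.
# Set DRY_RUN=1 to print the commands instead of running them.
run() { if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi; }

update_host_and_jails() {
    run freebsd-update fetch install    # host OS patches
    run pkg upgrade -y                  # host packages
    for jail in $JAILS; do              # jail names, e.g. from jls(8)
        run pkg -j "$jail" upgrade -y   # packages inside each jail
    done
}

# Example: DRY_RUN=1 JAILS="www db" update_host_and_jails
```

Doing it all in one pass is the point: one reboot instead of three.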

The first Nagios alert came in shortly after the reboot. A degraded ZFS array really gets my attention. These are the zpools on that host:

[dan@knew:~] $ zpool list
NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
nvd           222G  20.3G   202G        -         -    38%     9%  1.00x  ONLINE  -
system       90.5T  53.6T  36.9T        -         -    13%    59%  1.00x  ONLINE  -
tank_fast01   928G   216G   712G        -         -     2%    23%  1.00x  ONLINE  -
zroot        19.9G  11.6G  8.27G        -         -    58%    58%  1.00x  ONLINE  -
[dan@knew:~] $ 

This is an extract from zpool status:

  pool: system
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
	invalid.  Sufficient replicas exist for the pool to continue
	functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0 days 14:51:29 with 0 errors on Tue Jun  9 19:58:50 2020
config:

	NAME                     STATE     READ WRITE CKSUM
	system                   DEGRADED     0     0     0
	  raidz2-0               ONLINE       0     0     0
	    da3p3                ONLINE       0     0     0
	    da10p3               ONLINE       0     0     0
	    da9p3                ONLINE       0     0     0
	    da2p3                ONLINE       0     0     0
	    da13p3               ONLINE       0     0     0
	    da15p3               ONLINE       0     0     0
	    da11p3               ONLINE       0     0     0
	    da14p3               ONLINE       0     0     0
	    da8p3                ONLINE       0     0     0
	    da7p3                ONLINE       0     0     0
	  raidz2-1               DEGRADED     0     0     0
	    da5p1                ONLINE       0     0     0
	    da6p1                ONLINE       0     0     0
	    da18p1               ONLINE       0     0     0
	    da12p1               ONLINE       0     0     0
	    da4p1                ONLINE       0     0     0
	    da1p1                ONLINE       0     0     0
	    da17p1               ONLINE       0     0     0
	    da16p1               ONLINE       0     0     0
	    da0p1                ONLINE       0     0     0
	    1933292688604201684  FAULTED      0     0     0  was /dev/da18p1

errors: No known data errors

The relevant lines are:

2 – DEGRADED – that’s the key point.

3-4 – status – Things aren’t so bad, we can continue to use this pool.

6 – action – How do we fix this? ZFS is very helpful.

12 – DEGRADED – the zpool named system is degraded.

24 – DEGRADED – the vdev named raidz2-1 is degraded.

34 – FAULTED – the device which was last used as /dev/da18p1
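With twenty drives in a pool, a quick filter helps spot the unhealthy members without scanning the whole listing. A sketch, run against a fragment of the zpool status output above:

```shell
# Keep only the lines of zpool status that report trouble.
# (Input here is a fragment of the output shown above; in real life,
# pipe `zpool status system` in instead.)
status='	system                   DEGRADED     0     0     0
	    da5p1                ONLINE       0     0     0
	    1933292688604201684  FAULTED      0     0     0  was /dev/da18p1'
printf '%s\n' "$status" | egrep 'DEGRADED|FAULTED|UNAVAIL'
```

Two lines come back: the degraded pool and the faulted member.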

Looking for da18

I keep gpart, zpool, dmesg, and filesystem information for each host. In this instance, I did not refer to this server’s information; if I had, it would have confirmed what follows. I keep that data around just in case it is needed.

There were three reboots as I did some work today.

See how da18 changes from slot 16 to slot 23? See also Slot 17. This comes into play later.

[dan@knew:~] $ grep da18 /var/log/messages | egrep ' at |Slot'  
Jun 10 14:37:20 knew kernel: da18 at mps0 bus 0 scbus0 target 28 lun 0
Jun 10 14:37:20 knew kernel: ses0: da18,pass22 in 'Slot 17', SAS Slot: 1 phys at slot 16

Jun 10 15:58:31 knew kernel: da18 at mps0 bus 0 scbus0 target 28 lun 0
Jun 10 15:58:31 knew kernel: ses0: da18,pass22 in 'Slot 17', SAS Slot: 1 phys at slot 16

Jun 10 16:50:55 knew kernel: da18 at mps0 bus 0 scbus0 target 29 lun 0
Jun 10 16:50:55 knew kernel: ses0: da18,pass22 in 'Slot 24', SAS Slot: 1 phys at slot 23
[dan@knew:~] $

That’s because it’s a different drive entirely:

[dan@knew:~] $ grep da18 /var/log/messages | grep Serial
Jun 10 14:37:20 knew kernel: da18: Serial Number 57NGK1ZGF57D
Jun 10 15:58:31 knew kernel: da18: Serial Number 57NGK1ZGF57D
Jun 10 16:50:55 knew kernel: da18: Serial Number 653EK93PFS9A
[dan@knew:~] $

If a drive stops working, the device names shift on the next reboot: what was da19 becomes da18, and so on.

For some reason, 57NGK1ZGF57D in slot 16, or Slot 17, depending on what you read, is acting up.

[dan@knew:~] $ grep Slot /var/log/messages | head -3 
Jun 10 14:37:20 knew kernel: ses0: da0,pass1 in 'Slot 01', SAS Slot: 1 phys at slot 0
Jun 10 14:37:20 knew kernel: ses0: da4,pass5 in 'Slot 02', SAS Slot: 1 phys at slot 1
Jun 10 14:37:20 knew kernel: ses0: da9,pass10 in 'Slot 03', SAS Slot: 1 phys at slot 2
[dan@knew:~] $ 

That will become important in the next step.
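The two numbers can be pulled apart mechanically. A sketch, run against one of the log lines above; the 'Slot NN' label is the 1-based number that sesutil uses, while the trailing 'at slot N' is the 0-based phys index:

```shell
# Extract both slot numbers from a ses0 attach line (sample from above).
line="Jun 10 14:37:20 knew kernel: ses0: da18,pass22 in 'Slot 17', SAS Slot: 1 phys at slot 16"
label=$(printf '%s\n' "$line" | sed -n "s/.*'Slot \([0-9]*\)'.*/\1/p")   # 1-based
phys=$(printf '%s\n' "$line" | sed -n 's/.*at slot \([0-9]*\)$/\1/p')    # 0-based
echo "Slot label: $label, phys slot: $phys"
```

For da18 this prints label 17 and phys slot 16 – the off-by-one pair to keep straight.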

sesutil is your ZFS tool

This is the part I could not remember: sesutil.

The following output contains very useful information.

First, in hindsight, let me show you how to tell that this numbering is 1-based, not 0-based. Let’s look at da0:

[dan@knew:~] $ grep da0 /var/log/messages | grep ses
Jun 10 14:37:20 knew kernel: ses0: da0,pass1 in 'Slot 01', SAS Slot: 1 phys at slot 0
Jun 10 15:58:31 knew kernel: ses0: da0,pass1 in 'Slot 01', SAS Slot: 1 phys at slot 0
Jun 10 16:50:55 knew kernel: ses0: da0,pass1 in 'Slot 01', SAS Slot: 1 phys at slot 0
[dan@knew:~] $ 

da0 is in slot 01 and at slot 0.

Yet, in the following, we find da0 in Element 1, which is described as Slot 01 in lines 8-11.

Line 3 identifies my LSI card.

[dan@knew:~] $ sudo sesutil map
ses0:
	Enclosure Name: LSI SAS2X36 0e12
	Enclosure ID: 50030480014e78bf
	Element 0, Type: Array Device Slot
		Status: Unsupported (0x00 0x00 0x00 0x00)
		Description: Drive Slots
	Element 1, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 01
		Device Names: da0,pass1
	Element 2, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 02
		Device Names: da4,pass5
	Element 3, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 03
		Device Names: da9,pass10
	Element 4, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 04
		Device Names: da8,pass9
	Element 5, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 05
		Device Names: da16,pass20
	Element 6, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 06
		Device Names: da19,pass23
	Element 7, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 07
		Device Names: da1,pass2
	Element 8, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 08
		Device Names: da5,pass6
	Element 9, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 09
		Device Names: da10,pass12
	Element 10, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 10
		Device Names: da13,pass15
	Element 11, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 11
		Device Names: da17,pass21
	Element 12, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 12
		Device Names: da20,pass24
	Element 13, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 13
		Device Names: da2,pass3
	Element 14, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 14
		Device Names: da6,pass7
	Element 15, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 15
		Device Names: da11,pass13
	Element 16, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 16
		Device Names: da14,pass17
	Element 17, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 17
	Element 18, Type: Array Device Slot
		Status: Not Installed (0x05 0x00 0x00 0x00)
		Description: Slot 18
	Element 19, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 19
		Device Names: da3,pass4
	Element 20, Type: Array Device Slot
		Status: OK (0x01 0x00 0x00 0x00)
		Description: Slot 20
		Device Names: da7,pass8

[some output omitted]

[dan@knew:~] $ 

On lines 72-74 of the full output you can find Slot 17 – note it has no Device Names entry. That is the slot in question for the missing da18.

Full output is saved in this gist.
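Rather than scanning the map by eye, a little awk can answer "which slot holds daN?" directly. A sketch against a fragment of the output above:

```shell
# Find the enclosure slot holding a given device in `sesutil map` output.
# (Input here is a fragment of the map above; in real life, pipe
# `sudo sesutil map` in instead.)
map='	Element 17, Type: Array Device Slot
		Description: Slot 17
	Element 19, Type: Array Device Slot
		Description: Slot 19
		Device Names: da3,pass4'
printf '%s\n' "$map" | awk -v dev=da3 '
    /Description: Slot/ { slot = $3 }
    /Device Names:/ && index($3, dev ",") { print "Slot " slot }'
```

For da3 this prints Slot 19. The trailing comma in the match keeps da3 from also matching da30.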

Locate that drive!

This is the command I could not recall:

sesutil locate -u /dev/ses0 17 on

The ses0 comes from line 2 above. 17 comes from line 74.

This turns on the locate LED for the drive.

I know now that I wrote about this before, about 7 months ago. The following are keywords for my future search: blink locate led drive. However, I see I did not document the locate option I just used above.

To the basement

Once in front of the rack, I saw the blinking light. I pulled the drive. No new messages appeared in /var/log/messages. That is good: if messages had appeared, it would have meant a drive known to the system had been removed – in other words, the wrong drive.

This vdev is a raidz2 (see line 24 of the zpool status output), which can withstand two drive failures. With one drive already faulted, pulling one more would still leave the pool functioning.

I confirmed I had the correct drive by comparing the serial number against the information I found in /var/log/messages. I plugged the drive back in. Looking in /var/log/messages, I saw this:

Jun 10 18:12:37 knew kernel: ses0: da22,pass28 in 'Slot 17', SAS Slot: 1 phys at slot 16
Jun 10 18:12:37 knew kernel: ses0:  phy 0: SATA device
Jun 10 18:12:37 knew kernel: ses0:  phy 0: parent 50030480014e78bf addr 50030480014e789c
Jun 10 18:12:37 knew kernel: da22 at mps0 bus 0 scbus0 target 28 lun 0
Jun 10 18:12:37 knew kernel: da22:  Fixed Direct Access SPC-4 SCSI device
Jun 10 18:12:37 knew kernel: da22: Serial Number 57NGK1ZGF57D
Jun 10 18:12:37 knew kernel: da22: 600.000MB/s transfers
Jun 10 18:12:37 knew kernel: da22: Command Queueing enabled
Jun 10 18:12:37 knew kernel: da22: 4769307MB (9767541168 512 byte sectors)

The drive is now known to FreeBSD.

Back to the zpool

The next goal is adding that drive back into the pool.

$ sudo zpool online system 1933292688604201684

EDIT 2020-06-11: the original version of this post had that command as: sudo zpool replace system 1933292688604201684 /dev/da22p1 – that is incorrect. I was working off the system logs, not my scrollback (which was lost). I apologize for the error.

The number is a vdev_guid and is found on line 34 of the zpool status.

With that, the resilvering started.
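The guid can also be pulled out of zpool status mechanically, which avoids retyping a 19-digit number. A sketch using a fragment of the output above:

```shell
# Grab the vdev_guid of the FAULTED member from zpool status output.
# (Input here is a fragment of the output shown above.)
faulted='	    da0p1                ONLINE       0     0     0
	    1933292688604201684  FAULTED      0     0     0  was /dev/da18p1'
guid=$(printf '%s\n' "$faulted" | awk '$2 == "FAULTED" { print $1 }')
echo "$guid"
# then: sudo zpool online system "$guid"
```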

Timing

The replace command occurred at:

Jun 10 18:15:32 knew sudo[6922]:      dan : TTY=pts/0 ; PWD=/usr/home/dan ; USER=root ; COMMAND=/sbin/zpool replace system 1933292688604201684 /dev/da22p1

This is yet another reason to use sudo – it logs nicely.

The first part of the status is now:

[dan@knew:~] $ zpool status system
  pool: system
 state: ONLINE
  scan: resilvered 254M in 0 days 00:00:14 with 0 errors on Wed Jun 10 18:18:22 2020

I know that by 18:22 I had noticed the successful resilver.

Also found in the logs, following the replace command:

Jun 10 18:18:01 knew ZFS[8273]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$5892227802261634203
Jun 10 18:18:01 knew ZFS[8275]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$9332658639709199239
Jun 10 18:18:01 knew ZFS[8276]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$250004220145174872
Jun 10 18:18:01 knew ZFS[8277]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$6216472763074854678
Jun 10 18:18:01 knew ZFS[8281]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$12795310201775582855
Jun 10 18:18:01 knew ZFS[8282]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$13315402097660581553
Jun 10 18:18:01 knew ZFS[8284]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$5840140110512920130
Jun 10 18:18:02 knew ZFS[8285]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$13603535286907309607
Jun 10 18:18:02 knew ZFS[8289]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$4677401754715191854
Jun 10 18:18:02 knew ZFS[8303]: vdev state changed, pool_guid=$15378250086669402288 vdev_guid=$1933292688604201684

None of that math adds up. If the replace command was issued at 18:15:32 and the resilvering ended at 18:18:22, how does 0 days 00:00:14 relate to that? Presumably the 14 seconds covers only the resilver proper, and the scan and setup account for the rest.

The key points are:

  • with 0 errors – no errors were found when resilvering
  • state: ONLINE – we are back at full strength

Here are the current smartctl values for this drive. I suspect the fault lies not with the drive itself, but with the hardware around it (connector, drive cage, something like that):

[dan@knew:~] $ sudo smartctl -a /dev/da22
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.1-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba X300
Device Model:     TOSHIBA HDWE150
Serial Number:    57NGK1ZGF57D
LU WWN Device Id: 5 000039 7cbe83a65
Firmware Version: FP2A
User Capacity:    5,000,981,078,016 bytes [5.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jun 10 22:47:32 2020 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
					was suspended by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  120) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 552) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       8451
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       34
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   066   066   000    Old_age   Always       -       13953
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       34
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       46
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       31
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       163
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       36 (Min/Max 15/50)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   066   066   000    Old_age   Always       -       13926
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       558
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      2286         -
# 2  Short offline       Completed without error       00%      2180         -
# 3  Short offline       Completed without error       00%      2012         -
# 4  Extended offline    Completed without error       00%      1871         -
# 5  Short offline       Completed without error       00%      1844         -
# 6  Short offline       Completed without error       00%      1676         -
# 7  Extended offline    Completed without error       00%      1544         -
# 8  Short offline       Completed without error       00%      1508         -
# 9  Short offline       Completed without error       00%      1340         -
#10  Short offline       Completed without error       00%      1172         -
#11  Extended offline    Completed without error       00%        78         -
#12  Extended offline    Completed without error       00%        38         -
#13  Extended offline    Completed without error       00%         8         -
#14  Short offline       Completed without error       00%         0         -
#15  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[dan@knew:~] $ 
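The attributes worth watching on a drive under suspicion are the reallocation and pending-sector counters. A quick filter (sketch, run against a few rows copied from the table above):

```shell
# Pull the raw values of the sector-health attributes from smartctl output.
# (Input here is a few rows from the table above; in real life:
#   sudo smartctl -a /dev/da22 | awk '$2 ~ /Realloc|Pending|Uncorrect/ ...)
rows='  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0'
printf '%s\n' "$rows" | awk '{ print $2 "=" $NF }'
```

All three raw values are zero here, which supports the idea that the drive itself is fine. Non-zero and climbing values would point at the drive instead.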

Finishing it off

My thanks to Allan Jude for letting me know about the sesutil commands.

Thanks to mav for the code written years ago.

Thanks to bapt and Allan for hooking that code up into sesutil.

And thanks to trasz for recent extensions with some nice output.

I will keep an eye on that drive.
