Oct 23, 2018

On Monday night at about 5:30 PM, I noticed the email: ** PROBLEM alert – knew/zpool is CRITICAL **

What sickened me was that the email was sent at 3:58 PM.

That makes me think I should add a Pushover.net alert; I could do that at my mail server.
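As a sketch of what that might look like (the Pushover API endpoint is real, but the token, user key, and mail-filter hookup are assumptions), the mail server could hand the alert subject to a script like this:

```shell
#!/bin/sh
# push-alert.sh -- push a Nagios alert subject to Pushover.
# Meant to be invoked from a mail filter (procmail, Sieve, etc.)
# with the message subject as $1. TOKEN and USER are placeholders.

TOKEN="app-token-here"      # your Pushover application token
USER="user-key-here"        # your Pushover user key
SUBJECT="$1"

# Only push the serious stuff.
case "$SUBJECT" in
  *CRITICAL*)
    curl -s -F "token=${TOKEN}" \
            -F "user=${USER}" \
            -F "message=${SUBJECT}" \
            https://api.pushover.net/1/messages.json
    ;;
esac
```

The case pattern means routine recovery notices stay in email; only CRITICAL subjects hit the phone.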

Logging into the server, I saw:

[dan@knew:~] $ zpool list
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
system  90.5T  44.3T  46.2T        -         -    11%    48%  1.00x  DEGRADED  -
zroot   19.9G  14.4G  5.43G        -         -    51%    72%  1.00x  ONLINE  -
[dan@knew:~] $ 
[dan@knew:~] $ zpool status system
  pool: system
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
	repaired.
  scan: scrub repaired 0 in 17h26m with 0 errors on Wed Oct 17 20:59:06 2018
config:

	NAME        STATE     READ WRITE CKSUM
	system      DEGRADED     0     0     0
	  raidz2-0  ONLINE       0     0     0
	    da3p3   ONLINE       0     0     0
	    da10p3  ONLINE       0     0     0
	    da15p3  ONLINE       0     0     0
	    da4p3   ONLINE       0     0     0
	    da13p3  ONLINE       0     0     0
	    da12p3  ONLINE       0     0     0
	    da9p3   ONLINE       0     0     0
	    da14p3  ONLINE       0     0     0
	    da11p3  ONLINE       0     0     0
	    da5p3   ONLINE       0     0     0
	  raidz2-1  DEGRADED     0     0     0
	    da6p1   ONLINE       0     0     0
	    da7p1   ONLINE       0     0     0
	    da16p1  ONLINE       0     0     0
	    da8p1   ONLINE       0     0     0
	    da0p1   ONLINE       0     0     0
	    da1p1   ONLINE       0     0     0
	    da17p1  FAULTED      6    51     0  too many errors
	    da18p1  ONLINE       0     0     0
	    da2p1   ONLINE       0     0     0
	    da19p1  ONLINE       0     0     0

errors: No known data errors

da17 is a known problem, or rather, devices around that number are. See my FreeBSD Forums post, started in 2017, yes, more than a year ago.

Looking at the drive's SMART data, there is nothing alarming there.

[dan@knew:~] $ sudo smartctl -a /dev/da17
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba 3.5" MD04ACA... Enterprise HDD
Device Model:     TOSHIBA MD04ACA500
Serial Number:    653AK2MXFS9A
LU WWN Device Id: 5 000039 65bb8029e
Firmware Version: FP2A
User Capacity:    5,000,981,078,016 bytes [5.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Oct 22 22:06:18 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  120) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 540) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       556
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   038   038   000    Old_age   Always       -       25165
 10 Spin_Retry_Count        0x0033   101   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       780
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       72
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       676
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       36 (Min/Max 18/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   038   038   000    Old_age   Always       -       24996
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       198
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     20993         -
# 2  Extended offline    Completed without error       00%         9         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

These entries from /var/log/messages are very familiar.

Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 01 59 9f b7 60 00 00 00 08 00 00 length 4096 SMID 796 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 888 terminated ioc 804b loginfo 31(da17:mps2:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 01 59 9f b7 60 00 00 00 08 00 00 
Oct 22 19:52:59 knew kernel: 110d00 scsi 0 state c xfer 0
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): CAM status: CCB request completed with an error
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): Retrying command
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): CAM status: CCB request completed with an error
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): Retrying command
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): CAM status: SCSI Status Error
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): SCSI status: Check Condition
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): Error 6, Retries exhausted
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): Invalidating pack
Oct 22 19:53:00 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=5840140110512920130
Oct 22 19:53:00 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=5840140110512920130

The main difference is the "Error 6, Retries exhausted" message; shortly thereafter, the device was removed from the vdev. I could not find recent examples of that error; the most recent I did find was from 2017.
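These events are easy to spot without reading the whole log; a one-liner (the message strings are taken from the log excerpt above):

```shell
# Pull out the telltale da17 lines from the kernel log:
# "Retries exhausted" (the command gave up) and
# "Invalidating pack" (the device was dropped).
grep -E 'da17.*(Retries exhausted|Invalidating pack)' /var/log/messages
```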

Try removing the drive, powering off, reseating the cable

At 5:55 PM (21:55 UTC), I shut down the server.

This server has 20 x 5TB drives, 5 rows of 4. By checking the serial number from above, I found da17 in the top row, 2nd from the left. I reseated that cable and powered up.

The problem recurred, and I repeated the whole process. This server cannot be opened while in the rack, so taking it out of the rack, unscrewing the cover, etc., takes time and effort.

At 7:03 PM (23:03 UTC) the status was good, and I headed out to dinner.

They’re back!

This appeared in /var/log/messages:

Oct 23 01:20:14 knew kernel: (da17:mps2:0:2:0): Invalidating pack

At about 00:00 UTC (8 PM) I logged into the box and found:

[dan@knew:~] $ zpool status system
  pool: system
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
	the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 397M in 0h0m with 0 errors on Mon Oct 22 22:46:50 2018
config:

	NAME                     STATE     READ WRITE CKSUM
	system                   DEGRADED     0     0     0
	  raidz2-0               ONLINE       0     0     0
	    da3p3                ONLINE       0     0     0
	    da10p3               ONLINE       0     0     0
	    da15p3               ONLINE       0     0     0
	    da4p3                ONLINE       0     0     0
	    da13p3               ONLINE       0     0     0
	    da12p3               ONLINE       0     0     0
	    da9p3                ONLINE       0     0     0
	    da14p3               ONLINE       0     0     0
	    da11p3               ONLINE       0     0     0
	    da5p3                ONLINE       0     0     0
	  raidz2-1               DEGRADED     0     0     0
	    da6p1                ONLINE       0     0     0
	    da7p1                ONLINE       0     0     0
	    da16p1               ONLINE       0     0     0
	    da8p1                ONLINE       0     0     0
	    da0p1                ONLINE       0     0     0
	    da1p1                ONLINE       0     0     0
	    5840140110512920130  UNAVAIL      0     0     0  was /dev/da17p1
	    da18p1               ONLINE       0     0     0
	    da2p1                ONLINE       0     0     0
	    da19p1               ONLINE       0     0     0

errors: No known data errors

The drive was having trouble again.

I tried a few things:

[dan@knew:~] $ sudo zpool replace system 5840140110512920130 /dev/da17p1
invalid vdev specification
use '-f' to override the following errors:
/dev/da17p1 is part of active pool 'system'

Ahh, you can’t replace a drive with itself.

Let’s offline that drive first.

[dan@knew:~] $ sudo zpool offline system da17p1
[dan@knew:~] $ zpool status system
  pool: system
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
  scan: resilvered 397M in 0h0m with 0 errors on Mon Oct 22 22:46:50 2018
config:

	NAME                     STATE     READ WRITE CKSUM
	system                   DEGRADED     0     0     0
	  raidz2-0               ONLINE       0     0     0
	    da3p3                ONLINE       0     0     0
	    da10p3               ONLINE       0     0     0
	    da15p3               ONLINE       0     0     0
	    da4p3                ONLINE       0     0     0
	    da13p3               ONLINE       0     0     0
	    da12p3               ONLINE       0     0     0
	    da9p3                ONLINE       0     0     0
	    da14p3               ONLINE       0     0     0
	    da11p3               ONLINE       0     0     0
	    da5p3                ONLINE       0     0     0
	  raidz2-1               DEGRADED     0     0     0
	    da6p1                ONLINE       0     0     0
	    da7p1                ONLINE       0     0     0
	    da16p1               ONLINE       0     0     0
	    da8p1                ONLINE       0     0     0
	    da0p1                ONLINE       0     0     0
	    da1p1                ONLINE       0     0     0
	    5840140110512920130  OFFLINE      0     0     0  was /dev/da17p1
	    da18p1               ONLINE       0     0     0
	    da2p1                ONLINE       0     0     0
	    da19p1               ONLINE       0     0     0

errors: No known data errors

Now I put it online:

[dan@knew:~] $ sudo zpool online system da17p1
[dan@knew:~] $ zpool status system
  pool: system
 state: ONLINE
  scan: resilvered 397M in 0h0m with 0 errors on Mon Oct 22 22:46:50 2018
config:

	NAME        STATE     READ WRITE CKSUM
	system      ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    da3p3   ONLINE       0     0     0
	    da10p3  ONLINE       0     0     0
	    da15p3  ONLINE       0     0     0
	    da4p3   ONLINE       0     0     0
	    da13p3  ONLINE       0     0     0
	    da12p3  ONLINE       0     0     0
	    da9p3   ONLINE       0     0     0
	    da14p3  ONLINE       0     0     0
	    da11p3  ONLINE       0     0     0
	    da5p3   ONLINE       0     0     0
	  raidz2-1  ONLINE       0     0     0
	    da6p1   ONLINE       0     0     0
	    da7p1   ONLINE       0     0     0
	    da16p1  ONLINE       0     0     0
	    da8p1   ONLINE       0     0     0
	    da0p1   ONLINE       0     0     0
	    da1p1   ONLINE       0     0     0
	    da17p1  ONLINE       0     0     0
	    da18p1  ONLINE       0     0     0
	    da2p1   ONLINE       0     0     0
	    da19p1  ONLINE       0     0     0

errors: No known data errors

By 00:16 UTC, it was fixed again.

I think there was no resilvering because no new data had been written during that time. That seems unlikely, though.

But wait, there’s more!

According to the logs, the system degraded again at 01:20 UTC (9:20 PM):

Oct 23 01:20:14 knew kernel: (da17:mps2:0:2:0): Invalidating pack

By the time I noticed and logged in, it was 02:51 UTC (10:51 PM). As much as I did not want to leave this for the morning, I did. I ignored the issue and went to sleep.

Tuesday lunch

On Tuesday, I replaced the cable between the HBA and the backplane for that drive.

[dan@knew:~] $ zpool status system
  pool: system
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 11.2G in 0h7m with 0 errors on Tue Oct 23 16:46:18 2018
config:

	NAME        STATE     READ WRITE CKSUM
	system      ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    da3p3   ONLINE       0     0     0
	    da10p3  ONLINE       0     0     0
	    da15p3  ONLINE       0     0     0
	    da4p3   ONLINE       0     0     0
	    da13p3  ONLINE       0     0     0
	    da12p3  ONLINE       0     0     0
	    da9p3   ONLINE       0     0     0
	    da14p3  ONLINE       0     0     0
	    da11p3  ONLINE       0     0     0
	    da5p3   ONLINE       0     0     0
	  raidz2-1  ONLINE       0     0     0
	    da6p1   ONLINE       0     0     0
	    da7p1   ONLINE       0     0     0
	    da16p1  ONLINE       0     0     0
	    da8p1   ONLINE       0     0     0
	    da0p1   ONLINE       0     0     0
	    da1p1   ONLINE       0     0     0
	    da17p1  ONLINE       0     0    18
	    da18p1  ONLINE       0     0     0
	    da2p1   ONLINE       0     0     0
	    da19p1  ONLINE       0     0     0

errors: No known data errors
[dan@knew:~] $ 

11.2G seems to be a reasonable resilver amount, given that some incremental backups occurred overnight.

After a zpool clear system:

[dan@knew:~] $ zpool status system
  pool: system
 state: ONLINE
  scan: resilvered 11.2G in 0h7m with 0 errors on Tue Oct 23 16:46:18 2018
config:

	NAME        STATE     READ WRITE CKSUM
	system      ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    da3p3   ONLINE       0     0     0
	    da10p3  ONLINE       0     0     0
	    da15p3  ONLINE       0     0     0
	    da4p3   ONLINE       0     0     0
	    da13p3  ONLINE       0     0     0
	    da12p3  ONLINE       0     0     0
	    da9p3   ONLINE       0     0     0
	    da14p3  ONLINE       0     0     0
	    da11p3  ONLINE       0     0     0
	    da5p3   ONLINE       0     0     0
	  raidz2-1  ONLINE       0     0     0
	    da6p1   ONLINE       0     0     0
	    da7p1   ONLINE       0     0     0
	    da16p1  ONLINE       0     0     0
	    da8p1   ONLINE       0     0     0
	    da0p1   ONLINE       0     0     0
	    da1p1   ONLINE       0     0     0
	    da17p1  ONLINE       0     0     0
	    da18p1  ONLINE       0     0     0
	    da2p1   ONLINE       0     0     0
	    da19p1  ONLINE       0     0     0

errors: No known data errors
[dan@knew:~] $ 

What next?

It is now 21:15 UTC, the system has been back up for about 5 hours, and things are looking good. It is too early to declare that it was the cable; let’s wait a week or two.
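In the meantime, a cron job that checks pool health every few minutes would close the notification gap that started all this. A minimal sketch (zpool status -x really does print "all pools are healthy" when everything is fine; the mail delivery is just one choice of alert):

```shell
#!/bin/sh
# zpool-watch.sh -- run from cron; complain if any pool is not healthy.
# 'zpool status -x' prints only pools with problems, or the single line
# 'all pools are healthy' when there is nothing to report.
status=$(zpool status -x)
if [ "$status" != "all pools are healthy" ]; then
    echo "$status" | mail -s "zpool problem on $(hostname)" root
fi
```

A crontab entry such as `*/5 * * * * /usr/local/sbin/zpool-watch.sh` would have flagged Monday's fault within five minutes instead of ninety.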
