On Monday night at about 5:30 PM, I noticed the email: ** PROBLEM alert – knew/zpool is CRITICAL **
What sickened me was that the email was sent at 3:58 PM.
That makes me think I should add a Pushover.net alert… I could do that on my mail server.
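For example, a small script on the mail server could push the alert straight to my phone. This is only a sketch, not something I have deployed; the token and user key are placeholders, and the endpoint and form fields are from Pushover's messages API:

#!/bin/sh
# Hypothetical sketch: forward a Nagios-style alert to Pushover.
# APP_TOKEN and USER_KEY are placeholders for real Pushover credentials.
APP_TOKEN="replace-with-application-token"
USER_KEY="replace-with-user-key"
MESSAGE="PROBLEM alert - knew/zpool is CRITICAL"

curl -s \
  -F "token=${APP_TOKEN}" \
  -F "user=${USER_KEY}" \
  -F "title=knew" \
  -F "message=${MESSAGE}" \
  https://api.pushover.net/1/messages.json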
Logging into the server, I saw:
[dan@knew:~] $ zpool list
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH    ALTROOT
system  90.5T  44.3T  46.2T        -         -    11%    48%  1.00x  DEGRADED  -
zroot   19.9G  14.4G  5.43G        -         -    51%    72%  1.00x  ONLINE    -
[dan@knew:~] $
[dan@knew:~] $ zpool status system
  pool: system
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0 in 17h26m with 0 errors on Wed Oct 17 20:59:06 2018
config:

        NAME        STATE     READ WRITE CKSUM
        system      DEGRADED     0     0     0
          raidz2-0  ONLINE       0     0     0
            da3p3   ONLINE       0     0     0
            da10p3  ONLINE       0     0     0
            da15p3  ONLINE       0     0     0
            da4p3   ONLINE       0     0     0
            da13p3  ONLINE       0     0     0
            da12p3  ONLINE       0     0     0
            da9p3   ONLINE       0     0     0
            da14p3  ONLINE       0     0     0
            da11p3  ONLINE       0     0     0
            da5p3   ONLINE       0     0     0
          raidz2-1  DEGRADED     0     0     0
            da6p1   ONLINE       0     0     0
            da7p1   ONLINE       0     0     0
            da16p1  ONLINE       0     0     0
            da8p1   ONLINE       0     0     0
            da0p1   ONLINE       0     0     0
            da1p1   ONLINE       0     0     0
            da17p1  FAULTED      6    51     0  too many errors
            da18p1  ONLINE       0     0     0
            da2p1   ONLINE       0     0     0
            da19p1  ONLINE       0     0     0

errors: No known data errors
da17 is a known problem, or rather, the devices around that number are. See my FreeBSD Forums post, started in 2017, yes, more than a year ago.
Looking at the drive with smartctl, I see nothing alarming.
[dan@knew:~] $ sudo smartctl -a /dev/da17
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba 3.5" MD04ACA... Enterprise HDD
Device Model:     TOSHIBA MD04ACA500
Serial Number:    653AK2MXFS9A
LU WWN Device Id: 5 000039 65bb8029e
Firmware Version: FP2A
User Capacity:    5,000,981,078,016 bytes [5.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Oct 22 22:06:18 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 540) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       556
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       81
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   038   038   000    Old_age   Always       -       25165
 10 Spin_Retry_Count        0x0033   101   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       81
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       780
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       72
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       676
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       36 (Min/Max 18/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   038   038   000    Old_age   Always       -       24996
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       198
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         20993     -
# 2  Extended offline    Completed without error       00%             9     -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
These entries from /var/log/messages are very familiar.
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 01 59 9f b7 60 00 00 00 08 00 00 length 4096 SMID 796 terminated ioc 804b loginfo 31110d00 scsi 0 state c xfer 0
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 888 terminated ioc 804b loginfo 31(da17:mps2:0:2:0): WRITE(16). CDB: 8a 00 00 00 00 01 59 9f b7 60 00 00 00 08 00 00
Oct 22 19:52:59 knew kernel: 110d00 scsi 0 state c xfer 0
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): CAM status: CCB request completed with an error
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): Retrying command
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): CAM status: CCB request completed with an error
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): Retrying command
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): CAM status: SCSI Status Error
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): SCSI status: Check Condition
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): Error 6, Retries exhausted
Oct 22 19:52:59 knew kernel: (da17:mps2:0:2:0): Invalidating pack
Oct 22 19:53:00 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=5840140110512920130
Oct 22 19:53:00 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=5840140110512920130
The main difference is the "Error 6, Retries exhausted"; shortly thereafter, the device was removed from the vdev. I could not find recent examples of that error; the most recent I did find was from 2017.
Try removing the drive, powering off, reseating the cable
At 5:55 PM (21:55 UTC), I shut down the server.
This server has 20 x 5TB drives, 5 rows of 4. By checking the serial number from above, I found da17 in the top row, second from the left. I reseated that cable and powered up.
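For what it’s worth, FreeBSD’s sesutil(8) can sometimes take the guesswork out of mapping a da device to a physical bay, assuming the backplane exposes a SES enclosure (which this chassis may or may not do); a sketch:

sudo sesutil map                  # list enclosure slots and the da devices behind them
sudo sesutil locate da17 on       # blink the locate LED on da17's bay
sudo sesutil locate da17 off      # turn it back off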
The problem recurred later, and I had to redo this work. This server cannot be opened while it is in the rack, so taking it out of the rack, unscrewing the cover, and so on takes time and effort.
At 7:03 PM (23:03 UTC) the status was good, and I headed out to dinner.
They’re back!
This appeared in /var/log/messages:
Oct 23 01:20:14 knew kernel: (da17:mps2:0:2:0): Invalidating pack
At about 00:00 UTC (8 PM) I logged into the box and found:
[dan@knew:~] $ zpool status system
  pool: system
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 397M in 0h0m with 0 errors on Mon Oct 22 22:46:50 2018
config:

        NAME                     STATE     READ WRITE CKSUM
        system                   DEGRADED     0     0     0
          raidz2-0               ONLINE       0     0     0
            da3p3                ONLINE       0     0     0
            da10p3               ONLINE       0     0     0
            da15p3               ONLINE       0     0     0
            da4p3                ONLINE       0     0     0
            da13p3               ONLINE       0     0     0
            da12p3               ONLINE       0     0     0
            da9p3                ONLINE       0     0     0
            da14p3               ONLINE       0     0     0
            da11p3               ONLINE       0     0     0
            da5p3                ONLINE       0     0     0
          raidz2-1               DEGRADED     0     0     0
            da6p1                ONLINE       0     0     0
            da7p1                ONLINE       0     0     0
            da16p1               ONLINE       0     0     0
            da8p1                ONLINE       0     0     0
            da0p1                ONLINE       0     0     0
            da1p1                ONLINE       0     0     0
            5840140110512920130  UNAVAIL      0     0     0  was /dev/da17p1
            da18p1               ONLINE       0     0     0
            da2p1                ONLINE       0     0     0
            da19p1               ONLINE       0     0     0

errors: No known data errors
The drive was having trouble again.
I tried a few things:
[dan@knew:~] $ sudo zpool replace system 5840140110512920130 /dev/da17p1
invalid vdev specification
use '-f' to override the following errors:
/dev/da17p1 is part of active pool 'system'
Ahh, you can’t replace a drive with itself.
Let’s offline that drive first.
[dan@knew:~] $ sudo zpool offline system da17p1
[dan@knew:~] $ zpool status system
  pool: system
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 397M in 0h0m with 0 errors on Mon Oct 22 22:46:50 2018
config:

        NAME                     STATE     READ WRITE CKSUM
        system                   DEGRADED     0     0     0
          raidz2-0               ONLINE       0     0     0
            da3p3                ONLINE       0     0     0
            da10p3               ONLINE       0     0     0
            da15p3               ONLINE       0     0     0
            da4p3                ONLINE       0     0     0
            da13p3               ONLINE       0     0     0
            da12p3               ONLINE       0     0     0
            da9p3                ONLINE       0     0     0
            da14p3               ONLINE       0     0     0
            da11p3               ONLINE       0     0     0
            da5p3                ONLINE       0     0     0
          raidz2-1               DEGRADED     0     0     0
            da6p1                ONLINE       0     0     0
            da7p1                ONLINE       0     0     0
            da16p1               ONLINE       0     0     0
            da8p1                ONLINE       0     0     0
            da0p1                ONLINE       0     0     0
            da1p1                ONLINE       0     0     0
            5840140110512920130  OFFLINE      0     0     0  was /dev/da17p1
            da18p1               ONLINE       0     0     0
            da2p1                ONLINE       0     0     0
            da19p1               ONLINE       0     0     0

errors: No known data errors
Now I put it online:
[dan@knew:~] $ sudo zpool online system da17p1
[dan@knew:~] $ zpool status system
  pool: system
 state: ONLINE
  scan: resilvered 397M in 0h0m with 0 errors on Mon Oct 22 22:46:50 2018
config:

        NAME        STATE     READ WRITE CKSUM
        system      ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da3p3   ONLINE       0     0     0
            da10p3  ONLINE       0     0     0
            da15p3  ONLINE       0     0     0
            da4p3   ONLINE       0     0     0
            da13p3  ONLINE       0     0     0
            da12p3  ONLINE       0     0     0
            da9p3   ONLINE       0     0     0
            da14p3  ONLINE       0     0     0
            da11p3  ONLINE       0     0     0
            da5p3   ONLINE       0     0     0
          raidz2-1  ONLINE       0     0     0
            da6p1   ONLINE       0     0     0
            da7p1   ONLINE       0     0     0
            da16p1  ONLINE       0     0     0
            da8p1   ONLINE       0     0     0
            da0p1   ONLINE       0     0     0
            da1p1   ONLINE       0     0     0
            da17p1  ONLINE       0     0     0
            da18p1  ONLINE       0     0     0
            da2p1   ONLINE       0     0     0
            da19p1  ONLINE       0     0     0

errors: No known data errors
By 00:16 UTC, it was fixed again.
I think there was no resilvering because no new data had been written during that time, but that seems unlikely.
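One way to check whether a resilver really happened would be the pool history; zpool history with -i includes internally logged events such as scan starts and finishes. A sketch:

sudo zpool history -il system | tail -20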
But wait, there’s more!
According to the logs, the pool degraded again at 01:20 UTC (9:20 PM):
Oct 23 01:20:14 knew kernel: (da17:mps2:0:2:0): Invalidating pack
By the time I noticed, and logged in, it was 02:51 UTC (10:51 PM). As much as I did not want to leave this for the morning, I did. I ignored the issue and went to sleep.
Tuesday lunch
On Tuesday, I replaced the cable between the HBA and the backplane for that drive.
[dan@knew:~] $ zpool status system
  pool: system
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 11.2G in 0h7m with 0 errors on Tue Oct 23 16:46:18 2018
config:

        NAME        STATE     READ WRITE CKSUM
        system      ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da3p3   ONLINE       0     0     0
            da10p3  ONLINE       0     0     0
            da15p3  ONLINE       0     0     0
            da4p3   ONLINE       0     0     0
            da13p3  ONLINE       0     0     0
            da12p3  ONLINE       0     0     0
            da9p3   ONLINE       0     0     0
            da14p3  ONLINE       0     0     0
            da11p3  ONLINE       0     0     0
            da5p3   ONLINE       0     0     0
          raidz2-1  ONLINE       0     0     0
            da6p1   ONLINE       0     0     0
            da7p1   ONLINE       0     0     0
            da16p1  ONLINE       0     0     0
            da8p1   ONLINE       0     0     0
            da0p1   ONLINE       0     0     0
            da1p1   ONLINE       0     0     0
            da17p1  ONLINE       0     0    18
            da18p1  ONLINE       0     0     0
            da2p1   ONLINE       0     0     0
            da19p1  ONLINE       0     0     0

errors: No known data errors
[dan@knew:~] $
11.2G seems to be a reasonable resilver amount, given some incremental backups occurred overnight.
After a zpool clear system:
[dan@knew:~] $ zpool status system
  pool: system
 state: ONLINE
  scan: resilvered 11.2G in 0h7m with 0 errors on Tue Oct 23 16:46:18 2018
config:

        NAME        STATE     READ WRITE CKSUM
        system      ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            da3p3   ONLINE       0     0     0
            da10p3  ONLINE       0     0     0
            da15p3  ONLINE       0     0     0
            da4p3   ONLINE       0     0     0
            da13p3  ONLINE       0     0     0
            da12p3  ONLINE       0     0     0
            da9p3   ONLINE       0     0     0
            da14p3  ONLINE       0     0     0
            da11p3  ONLINE       0     0     0
            da5p3   ONLINE       0     0     0
          raidz2-1  ONLINE       0     0     0
            da6p1   ONLINE       0     0     0
            da7p1   ONLINE       0     0     0
            da16p1  ONLINE       0     0     0
            da8p1   ONLINE       0     0     0
            da0p1   ONLINE       0     0     0
            da1p1   ONLINE       0     0     0
            da17p1  ONLINE       0     0     0
            da18p1  ONLINE       0     0     0
            da2p1   ONLINE       0     0     0
            da19p1  ONLINE       0     0     0

errors: No known data errors
[dan@knew:~] $
What next?
It is now 21:15 UTC; the system has been back up for about 5 hours and things are looking good. It is too early to declare that the cable was the problem. Let’s wait a week or two.
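In the meantime, rather than waiting for the next Nagios email, a small cron job could nag sooner. This is only a sketch, assuming local mail(1) delivery works on this host; the recipient address is a placeholder:

#!/bin/sh
# Hypothetical watchdog: 'zpool status -x' prints "all pools are healthy"
# when there is nothing to report; anything else gets mailed out.
STATUS=$(zpool status -x)
if [ "${STATUS}" != "all pools are healthy" ]; then
    echo "${STATUS}" | mail -s "zpool problem on $(hostname)" dan@example.org
fi

Run from cron every five minutes or so, something like that would have closed the 3:58 PM to 5:30 PM gap considerably.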