Nov 06 2018

This server, knew, has had an intermittent problem producing CAM status: SCSI Status Error messages; there is a FreeBSD Forums post about it. On Sunday, the problem returned, and this time it degraded the zpool.

I collected the information in this gist and I will list the relevant portions below.

I had enabled smartd testing and I received this email late on Sunday:

To: dan@langille.org
Subject: SMART error (FailedReadSmartData) detected on host: knew
Message-Id: <20181105004622.8F4BA50B3B@knew.int.unixathome.org>
Date: Mon,  5 Nov 2018 00:46:22 +0000 (UTC)
From: Charlie Root <root@knew.int.unixathome.org>
X-Gm-Original-To: dan@langille.org

This message was generated by the smartd daemon running on:

   host name:  knew
   DNS domain: int.unixathome.org

The following warning/error was logged by the smartd daemon:

Device: /dev/da19, failed to read SMART values

Device info:
[ATA      TOSHIBA MD04ACA5 FP2A], lu id: 0x500003965bb0025d, S/N: 6539K3OJFS9A, 5.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.
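Emails like this come from smartd's scheduled self-tests. For reference, a minimal sketch of the kind of configuration that enables this on FreeBSD; the schedule and address here are examples, not my actual config:

```
# /etc/rc.conf
smartd_enable="YES"

# /usr/local/etc/smartd.conf -- example schedule, not my actual config:
# monitor all drives (-a), enable offline testing (-o) and attribute
# autosave (-S), run a short self-test daily at 02:00 and a long test
# Saturdays at 03:00, and mail on trouble.
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m dan@example.org
```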

Despite the promise of more information in the logs, all I found was:

Nov  5 00:46:23 knew smartd[76482]: Device: /dev/da20, SMART Failure: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE
Nov  5 00:46:23 knew smartd[76482]: Device: /dev/da20, Self-Test Log error count increased from 0 to 1

Let’s look at da20:

$ grep da20 /var/run/dmesg.boot 
da20 at mps2 bus 0 scbus2 target 12 lun 0
da20:  Fixed Direct Access SPC-4 SCSI device
da20: Serial Number 653BK12FFS9A
da20: 600.000MB/s transfers
da20: Command Queueing enabled
da20: 4769307MB (9767541168 512 byte sectors)

According to /var/run/dmesg.boot, da20 has serial number 653BK12FFS9A but the email mentioned 6539K3OJFS9A. They are not the same. Do I have two failed drives?
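One way to untangle which device has which serial is to pull the serial out of smartctl -i output for each daN device. The extraction step can be sketched like this; the sample text below is stand-in output (on the live system you would pipe `sudo smartctl -i /dev/da20` into the same awk):

```shell
#!/bin/sh
# Extract the serial number from `smartctl -i`-style output.
# The sample text is a stand-in; pipe the real command output in instead.
smart_info='Device Model:     TOSHIBA MD04ACA500
Serial Number:    653BK12FFS9A
User Capacity:    5,000,981,078,016 bytes [5.00 TB]'

serial=$(echo "$smart_info" | awk -F': *' '/^Serial Number/ { print $2 }')
echo "$serial"
```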

Checking the zpool, I found only one failed drive:

[dan@knew:~] $ zpool status system
  pool: system
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
	repaired.
  scan: resilvered 615G in 4h42m with 0 errors on Thu Nov  1 21:19:46 2018
config:

	NAME                        STATE     READ WRITE CKSUM
	system                      DEGRADED     0     0     0
	  raidz2-0                  ONLINE       0     0     0
	    gpt/X643KHBFF57D.r5.c4  ONLINE       0     0     0
	    gpt/4728K24SF57D.r3.c2  ONLINE       0     0     0
	    gpt/37KVK1JRF57D.r2.c1  ONLINE       0     0     0
	    gpt/37D4KBJPF57D.r5.c3  ONLINE       0     0     0
	    gpt/5782KL6VF57D.r2.c2  ONLINE       0     0     0
	    gpt/6525K2DGFS9A.r2.c4  ONLINE       0     0     0
	    gpt/579HKDZYF57D.r3.c3  ONLINE       0     0     0
	    gpt/579HKDZXF57D.r2.c3  ONLINE       0     0     0
	    gpt/5782KL6MF57D.r3.c1  ONLINE       0     0     0
	    gpt/X6IEKELNF57D.r4.c4  ONLINE       0     0     0
	  raidz2-1                  DEGRADED     0     0     0
	    gpt/653BK12JFS9A.r4.c2  ONLINE       0     0     0
	    gpt/579IK5RMF57D.r4.c3  ONLINE       0     0     0
	    gpt/653EK93PFS9A.r1.c4  ONLINE       0     0     0
	    gpt/653DK7WPFS9A.r3.c4  ONLINE       0     0     0
	    gpt/653DK7WCFS9A.r4.c1  ONLINE       0     0     0
	    gpt/653EK93QFS9A.r5.c2  ONLINE       0     0     0
	    gpt/653AK2MXFS9A.r1.c2  ONLINE       0     0     0
	    gpt/6539K3OJFS9A.r1.c1  FAULTED      1    21     0  too many errors
	    gpt/653IK1IBFS9A.r5.c1  ONLINE       0     0     0
	    gpt/653BK12FFS9A.r1.c3  ONLINE       0     0     0

errors: No known data errors
[dan@knew:~] $ 

I had recently assigned labels to my drives (see also this tweet for the export/import process). That means I can instantly see which serial number is at which location in the drive array. Case in point: the drive with serial number 6539K3OJFS9A is located at row 1, column 1. That serial number matches the email.
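The label scheme is SERIAL.rROW.cCOL, so the location can be pulled apart mechanically. A small sketch, using the faulted drive's label from the zpool status output above:

```shell
#!/bin/sh
# Split a gpt label of the form SERIAL.rROW.cCOL into its parts.
label="6539K3OJFS9A.r1.c1"   # label of the faulted drive, from zpool status

serial=${label%%.*}          # everything before the first dot
rest=${label#*.}             # "r1.c1"
row=${rest%%.*}; row=${row#r}
col=${rest#*.};  col=${col#c}

echo "serial=$serial row=$row col=$col"
```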

So why does da20 appear in the logs?

I looked in the logs and found about 1300 lines relating to da19 and CAM status: SCSI Status Error. Of note were these final messages:

Nov  5 00:17:36 knew kernel: (da19:mps2:0:10:0): SCSI status: Check Condition
Nov  5 00:17:36 knew kernel: (da19:mps2:0:10:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Nov  5 00:17:36 knew kernel: (da19:mps2:0:10:0): Error 5, Retries exhausted
Nov  5 00:17:36 knew kernel: GEOM_PART: da19 was automatically resized.
Nov  5 00:17:36 knew kernel: Use `gpart commit da19` to save changes or `gpart undo da19` to revert them.
Nov  5 00:17:36 knew kernel: GEOM_PART: integrity check failed (da19, GPT)

The next entries in the logs were:

Nov  5 00:46:23 knew smartd[76482]: Device: /dev/da20, SMART Failure: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE
Nov  5 00:46:23 knew smartd[76482]: Device: /dev/da20, Self-Test Log error count increased from 0 to 1

I also had an email from smartd about the above.

Those gpart resize messages are interesting. Let’s look at the drive:

$ gpart show da19
gpart: No such geom: da19

Where is da19?

The device itself is still there, even though GEOM has lost its partition table:

[dan@knew:~] $ sudo smartctl -l selftest /dev/da19
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     25529         -
# 2  Short offline       Interrupted (host reset)      30%     25526         -
# 3  Extended offline    Completed without error       00%     25296         -
# 4  Short offline       Completed without error       00%     20994         -
# 5  Extended offline    Completed without error       00%         9         -

[dan@knew:~] $ 

I had run a short test; it passed.
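That pass/fail can be read straight off the table: smartctl lists tests newest-first, so line "# 1" is the most recent. A quick sketch of extracting it, with sample lines copied from the output above:

```shell
#!/bin/sh
# Check whether the most recent SMART self-test (line "# 1") passed.
# Sample lines copied from the `smartctl -l selftest` output above.
selftest='# 1  Short offline       Completed without error       00%     25529         -
# 2  Short offline       Interrupted (host reset)      30%     25526         -'

result=$(echo "$selftest" | awk '/^# 1 / && /Completed without error/ { print "PASS" }')
echo "${result:-FAIL}"
```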

So why has da19 disappeared?

I was chatting online with people who know more about this than I do. The theory: the SAS fabric is shared, and da20 went haywire and knocked da19 off the bus. It's like being at a party where someone starts shooting blood from their ear and screaming at the top of their lungs: it will disrupt the other conversations.

There was talk of doing a taste so that GEOM might pick up the drive again. This command was mentioned, but I did not try it:

# true > /dev/da19

Those folks keep talking about writing a basic C utility for base, called gtaste, that would do this for you:

$ sudo gtaste da19

One day, the above command might exist.

The first attempt to fix the zpool

The working theory:

  • da19 (6539K3OJFS9A) is OK.
  • da20 (653BK12FFS9A) needs to be replaced.

The device names mentioned above will change below; I reconnected the drives to different cables as they were moved around.

This server has hot-swap drive bays, but no spare bays. I powered off the server and pulled it from the rack. I moved the faulty drive into the interior of the case, where it sits loose; you can see what that looks like in this post. The new drive was installed in the now-empty drive bay (row 1, column 3).

After powering up the server, I renamed the label on the failing drive, now present as da16:

$ sudo gpart modify -l 653BK12FFS9A.inside.case da16
$ gpart show -l da16
=>        34  9767541101  da16  GPT  (4.5T)
          34           6        - free -  (3.0K)
          40  9766000000     1  653BK12FFS9A.inside.case  (4.5T)
  9766000040     1541095        - free -  (752M)

I partitioned the new drive:

$ sudo gpart create -s gpt da15
$ sudo gpart add -t freebsd-zfs -a 4K -s 9766000000 -l 57NGK1ZGF57D.r1.c3 da15
$ gpart show da0 da15
=>        34  9767541101  da0  GPT  (4.5T)
          34           6       - free -  (3.0K)
          40  9766000000    1  freebsd-zfs  (4.5T)
  9766000040     1541095       - free -  (752M)

=>        40  9767541088  da15  GPT  (4.5T)
          40  9766000000     1  freebsd-zfs  (4.5T)
  9766000040     1541088        - free -  (752M)

[dan@knew:~] $ 

The replacement command was issued:

$ sudo zpool replace system gpt/653BK12FFS9A.r1.c3 gpt/57NGK1ZGF57D.r1.c3
[dan@knew:~] $ zpool status system
  pool: system
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Nov  5 16:00:57 2018
	827M scanned out of 45.2T at 207M/s, 63h43m to go
        5.01M resilvered, 0.00% done
config:

	NAME                          STATE     READ WRITE CKSUM
	system                        ONLINE       0     0     0
	  raidz2-0                    ONLINE       0     0     0
	    gpt/X643KHBFF57D.r5.c4    ONLINE       0     0     0
	    gpt/4728K24SF57D.r3.c2    ONLINE       0     0     0
	    gpt/37KVK1JRF57D.r2.c1    ONLINE       0     0     0
	    gpt/37D4KBJPF57D.r5.c3    ONLINE       0     0     0
	    gpt/5782KL6VF57D.r2.c2    ONLINE       0     0     0
	    gpt/6525K2DGFS9A.r2.c4    ONLINE       0     0     0
	    gpt/579HKDZYF57D.r3.c3    ONLINE       0     0     0
	    gpt/579HKDZXF57D.r2.c3    ONLINE       0     0     0
	    gpt/5782KL6MF57D.r3.c1    ONLINE       0     0     0
	    gpt/X6IEKELNF57D.r4.c4    ONLINE       0     0     0
	  raidz2-1                    ONLINE       0     0     0
	    gpt/653BK12JFS9A.r4.c2    ONLINE       0     0     0
	    gpt/579IK5RMF57D.r4.c3    ONLINE       0     0     0
	    gpt/653EK93PFS9A.r1.c4    ONLINE       0     0     0
	    gpt/653DK7WPFS9A.r3.c4    ONLINE       0     0     0
	    gpt/653DK7WCFS9A.r4.c1    ONLINE       0     0     0
	    gpt/653EK93QFS9A.r5.c2    ONLINE       0     0     0
	    gpt/653AK2MXFS9A.r1.c2    ONLINE       0     0     0
	    gpt/6539K3OJFS9A.r1.c1    ONLINE       0     0     0
	    gpt/653IK1IBFS9A.r5.c1    ONLINE       0     0     0
	    replacing-9               ONLINE       0     0     0
	      gpt/653BK12FFS9A.r1.c3  ONLINE       0     0     0
	      gpt/57NGK1ZGF57D.r1.c3  ONLINE       0     0     0

errors: No known data errors
[dan@knew:~] $ 
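The 63h43m estimate is simply the remaining data divided by the scan rate. A quick back-of-the-envelope check of the numbers above (assuming zpool's T and M mean TiB and MiB):

```shell
#!/bin/sh
# Sanity-check zpool's ETA: (total - scanned) / rate, using the figures
# from the status output: 45.2T total, 827M scanned, 207M/s scan rate.
hours=$(awk 'BEGIN {
    remaining_mib = 45.2 * 1024 * 1024 - 827    # TiB -> MiB, minus what is done
    printf "%.1f", remaining_mib / 207 / 3600   # MiB over MiB/s, in hours
}')
echo "$hours hours to go"   # zpool reported 63h43m
```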

And with that command came this message:

Nov 5 16:00:56 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=1933292688604201684

I was also seeing these messages about the failed drive:

Nov  5 15:58:11 knew smartd[1068]: Device: /dev/da16 [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Nov  5 15:58:11 knew smartd[1068]: Device: /dev/da16 [SAT], Failed SMART usage Attribute: 240 Head_Flying_Hours.

The full smartctl output for this drive is in the gist.

Now, I wait.

I did not wait long

About two minutes later, more CAM status: SCSI Status Error messages started appearing, and the new drive faulted (details in this gist).

The current status was:

[dan@knew:~] $ zpool status system
  pool: system
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Nov  5 16:00:57 2018
	560G scanned out of 45.2T at 360M/s, 36h9m to go
        26.0G resilvered, 1.21% done
config:

	NAME                          STATE     READ WRITE CKSUM
	system                        DEGRADED     0     0     0
	  raidz2-0                    ONLINE       0     0     0
	    gpt/X643KHBFF57D.r5.c4    ONLINE       0     0     0
	    gpt/4728K24SF57D.r3.c2    ONLINE       0     0     0
	    gpt/37KVK1JRF57D.r2.c1    ONLINE       0     0     0
	    gpt/37D4KBJPF57D.r5.c3    ONLINE       0     0     0
	    gpt/5782KL6VF57D.r2.c2    ONLINE       0     0     0
	    gpt/6525K2DGFS9A.r2.c4    ONLINE       0     0     0
	    gpt/579HKDZYF57D.r3.c3    ONLINE       0     0     0
	    gpt/579HKDZXF57D.r2.c3    ONLINE       0     0     0
	    gpt/5782KL6MF57D.r3.c1    ONLINE       0     0     0
	    gpt/X6IEKELNF57D.r4.c4    ONLINE       0     0     0
	  raidz2-1                    DEGRADED     0     0     0
	    gpt/653BK12JFS9A.r4.c2    ONLINE       0     0     0
	    gpt/579IK5RMF57D.r4.c3    ONLINE       0     0     0
	    gpt/653EK93PFS9A.r1.c4    ONLINE       0     0     0
	    gpt/653DK7WPFS9A.r3.c4    ONLINE       0     0     0
	    gpt/653DK7WCFS9A.r4.c1    ONLINE       0     0     0
	    gpt/653EK93QFS9A.r5.c2    ONLINE       0     0     0
	    gpt/653AK2MXFS9A.r1.c2    ONLINE       0     0     0
	    gpt/6539K3OJFS9A.r1.c1    ONLINE       0     0     0
	    gpt/653IK1IBFS9A.r5.c1    ONLINE       0     0     0
	    replacing-9               DEGRADED     0     0     0
	      gpt/653BK12FFS9A.r1.c3  ONLINE       0     0     0
	      gpt/57NGK1ZGF57D.r1.c3  FAULTED      6   111     0  too many errors

errors: No known data errors
[dan@knew:~] $ 

The new theory: that drive bay was faulty.

The second fix

With this fix, the new drive was assumed good, but the connection was bad. The system was powered off, the new drive was moved to the interior of the chassis, much like the failing drive. After power up, the resilvering automatically resumed. I did not anticipate that automatic start, but it is a feature of ZFS.

After powering up, the status looked like this:

[dan@knew:~] $ zpool status system
  pool: system
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Nov  5 16:00:57 2018
	1.28T scanned out of 45.2T at 699M/s, 18h18m to go
        26.0G resilvered, 2.83% done
config:

	NAME                          STATE     READ WRITE CKSUM
	system                        ONLINE       0     0     0
	  raidz2-0                    ONLINE       0     0     0
	    gpt/X643KHBFF57D.r5.c4    ONLINE       0     0     0
	    gpt/4728K24SF57D.r3.c2    ONLINE       0     0     0
	    gpt/37KVK1JRF57D.r2.c1    ONLINE       0     0     0
	    gpt/37D4KBJPF57D.r5.c3    ONLINE       0     0     0
	    gpt/5782KL6VF57D.r2.c2    ONLINE       0     0     0
	    gpt/6525K2DGFS9A.r2.c4    ONLINE       0     0     0
	    gpt/579HKDZYF57D.r3.c3    ONLINE       0     0     0
	    gpt/579HKDZXF57D.r2.c3    ONLINE       0     0     0
	    gpt/5782KL6MF57D.r3.c1    ONLINE       0     0     0
	    gpt/X6IEKELNF57D.r4.c4    ONLINE       0     0     0
	  raidz2-1                    ONLINE       0     0     0
	    gpt/653BK12JFS9A.r4.c2    ONLINE       0     0     0
	    gpt/579IK5RMF57D.r4.c3    ONLINE       0     0     0
	    gpt/653EK93PFS9A.r1.c4    ONLINE       0     0     0
	    gpt/653DK7WPFS9A.r3.c4    ONLINE       0     0     0
	    gpt/653DK7WCFS9A.r4.c1    ONLINE       0     0     0
	    gpt/653EK93QFS9A.r5.c2    ONLINE       0     0     0
	    gpt/653AK2MXFS9A.r1.c2    ONLINE       0     0     0
	    gpt/6539K3OJFS9A.r1.c1    ONLINE       0     0     0
	    gpt/653IK1IBFS9A.r5.c1    ONLINE       0     0     0
	    replacing-9               ONLINE       0     0     2
	      da15p1                  ONLINE       0     0     0
	      gpt/57NGK1ZGF57D.r1.c3  ONLINE       0     0     0

errors: No known data errors

NOTE the two checksum errors. That will come into play later.

The resilvering completed successfully at about 2300 UTC that night.

Removing the faulty drive from the zpool

Usually, ZFS will automatically remove the replaced device from the zpool, however that did not happen this time:

[dan@knew:~] $ zpool status system
  pool: system
 state: ONLINE
  scan: resilvered 630G in 6h59m with 0 errors on Mon Nov  5 23:00:14 2018
config:

	NAME                          STATE     READ WRITE CKSUM
	system                        ONLINE       0     0     0
	  raidz2-0                    ONLINE       0     0     0
	    gpt/X643KHBFF57D.r5.c4    ONLINE       0     0     0
	    gpt/4728K24SF57D.r3.c2    ONLINE       0     0     0
	    gpt/37KVK1JRF57D.r2.c1    ONLINE       0     0     0
	    gpt/37D4KBJPF57D.r5.c3    ONLINE       0     0     0
	    gpt/5782KL6VF57D.r2.c2    ONLINE       0     0     0
	    gpt/6525K2DGFS9A.r2.c4    ONLINE       0     0     0
	    gpt/579HKDZYF57D.r3.c3    ONLINE       0     0     0
	    gpt/579HKDZXF57D.r2.c3    ONLINE       0     0     0
	    gpt/5782KL6MF57D.r3.c1    ONLINE       0     0     0
	    gpt/X6IEKELNF57D.r4.c4    ONLINE       0     0     0
	  raidz2-1                    ONLINE       0     0     0
	    gpt/653BK12JFS9A.r4.c2    ONLINE       0     0     0
	    gpt/579IK5RMF57D.r4.c3    ONLINE       0     0     0
	    gpt/653EK93PFS9A.r1.c4    ONLINE       0     0     0
	    gpt/653DK7WPFS9A.r3.c4    ONLINE       0     0     0
	    gpt/653DK7WCFS9A.r4.c1    ONLINE       0     0     0
	    gpt/653EK93QFS9A.r5.c2    ONLINE       0     0     0
	    gpt/653AK2MXFS9A.r1.c2    ONLINE       0     0     0
	    gpt/6539K3OJFS9A.r1.c1    ONLINE       0     0     0
	    gpt/653IK1IBFS9A.r5.c1    ONLINE       0     0     0
	    replacing-9               ONLINE       0     0     2
	      da15p1                  ONLINE       0     0     0
	      gpt/57NGK1ZGF57D.r1.c3  ONLINE       0     0     0

errors: No known data errors
[dan@knew:~] $ 

Asking around, I learned that this situation occurs whenever something goes wrong during the resilver. In this case, there were two CKSUM (checksum) errors. They were corrected and all is good. Now I can remove the old device from the zpool with the detach command.

You can see da15 mentioned in the zpool status output. When this replacement process started, da15 was the new drive; now it refers to the faulty drive to be removed. The original device is listed first, under the replacing-9 line. This device renumbering confused me at first, but it makes sense: I moved the new drive into the chassis interior and connected it to the same SFF-8087-to-SATA breakout cable, where it happened to land on a lower cable port (I think that's how it works).
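Since the original device is always listed first under the replacing vdev, the device to detach can be picked out of the status output mechanically. A sketch, run against the captured lines (on the live system you would pipe `zpool status system` in instead):

```shell
#!/bin/sh
# Find the first child of a "replacing-" vdev -- that is the old device.
# Sample lines captured from the zpool status output above.
status='    replacing-9               ONLINE       0     0     2
      da15p1                  ONLINE       0     0     0
      gpt/57NGK1ZGF57D.r1.c3  ONLINE       0     0     0'

old_dev=$(echo "$status" | awk 'found { print $1; exit } $1 ~ /^replacing-/ { found = 1 }')
echo "$old_dev"
```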

After confirming via smartctl -i /dev/da15 that this was the drive to be removed, I issued this command:

[dan@knew:~] $ sudo zpool detach system da15p1

Now the status looks like this:

[dan@knew:~] $ zpool status system
  pool: system
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Nov  6 14:48:09 2018
	34.2T scanned out of 45.3T at 578M/s, 5h33m to go
        76.1G resilvered, 75.60% done
config:

	NAME                        STATE     READ WRITE CKSUM
	system                      ONLINE       0     0     0
	  raidz2-0                  ONLINE       0     0     0
	    gpt/X643KHBFF57D.r5.c4  ONLINE       0     0     0
	    gpt/4728K24SF57D.r3.c2  ONLINE       0     0     0
	    gpt/37KVK1JRF57D.r2.c1  ONLINE       0     0     0
	    gpt/37D4KBJPF57D.r5.c3  ONLINE       0     0     0
	    gpt/5782KL6VF57D.r2.c2  ONLINE       0     0     0
	    gpt/6525K2DGFS9A.r2.c4  ONLINE       0     0     0
	    gpt/579HKDZYF57D.r3.c3  ONLINE       0     0     0
	    gpt/579HKDZXF57D.r2.c3  ONLINE       0     0     0
	    gpt/5782KL6MF57D.r3.c1  ONLINE       0     0     0
	    gpt/X6IEKELNF57D.r4.c4  ONLINE       0     0     0
	  raidz2-1                  ONLINE       0     0     0
	    gpt/653BK12JFS9A.r4.c2  ONLINE       0     0     0
	    gpt/579IK5RMF57D.r4.c3  ONLINE       0     0     0
	    gpt/653EK93PFS9A.r1.c4  ONLINE       0     0     0
	    gpt/653DK7WPFS9A.r3.c4  ONLINE       0     0     0
	    gpt/653DK7WCFS9A.r4.c1  ONLINE       0     0     0
	    gpt/653EK93QFS9A.r5.c2  ONLINE       0     0     0
	    gpt/653AK2MXFS9A.r1.c2  ONLINE       0     0     0
	    gpt/6539K3OJFS9A.r1.c1  ONLINE       0     0     0
	    gpt/653IK1IBFS9A.r5.c1  ONLINE       0     0     0
	    gpt/57NGK1ZGF57D.r1.c3  ONLINE       0     0     0

errors: No known data errors

I didn’t expect the detach to initiate a resilver.

These were the log messages generated by that change:

Nov  6 14:48:06 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=5892227802261634203
Nov  6 14:48:06 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=9332658639709199239
Nov  6 14:48:06 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=250004220145174872
Nov  6 14:48:06 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=6216472763074854678
Nov  6 14:48:06 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=12795310201775582855
Nov  6 14:48:06 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=13315402097660581553
Nov  6 14:48:06 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=5840140110512920130
Nov  6 14:48:06 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=13603535286907309607
Nov  6 14:48:07 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=4677401754715191854
Nov  6 14:48:07 knew ZFS: vdev state changed, pool_guid=15378250086669402288 vdev_guid=1933292688604201684

Eight hours later

Eight hours later, the status shows that the resilver has completed.

[dan@knew:~] $ zpool status system
  pool: system
 state: ONLINE
  scan: resilvered 668G in 7h38m with 0 errors on Tue Nov  6 22:26:43 2018
config:

	NAME                        STATE     READ WRITE CKSUM
	system                      ONLINE       0     0     0
	  raidz2-0                  ONLINE       0     0     0
	    gpt/X643KHBFF57D.r5.c4  ONLINE       0     0     0
	    gpt/4728K24SF57D.r3.c2  ONLINE       0     0     0
	    gpt/37KVK1JRF57D.r2.c1  ONLINE       0     0     0
	    gpt/37D4KBJPF57D.r5.c3  ONLINE       0     0     0
	    gpt/5782KL6VF57D.r2.c2  ONLINE       0     0     0
	    gpt/6525K2DGFS9A.r2.c4  ONLINE       0     0     0
	    gpt/579HKDZYF57D.r3.c3  ONLINE       0     0     0
	    gpt/579HKDZXF57D.r2.c3  ONLINE       0     0     0
	    gpt/5782KL6MF57D.r3.c1  ONLINE       0     0     0
	    gpt/X6IEKELNF57D.r4.c4  ONLINE       0     0     0
	  raidz2-1                  ONLINE       0     0     0
	    gpt/653BK12JFS9A.r4.c2  ONLINE       0     0     0
	    gpt/579IK5RMF57D.r4.c3  ONLINE       0     0     0
	    gpt/653EK93PFS9A.r1.c4  ONLINE       0     0     0
	    gpt/653DK7WPFS9A.r3.c4  ONLINE       0     0     0
	    gpt/653DK7WCFS9A.r4.c1  ONLINE       0     0     0
	    gpt/653EK93QFS9A.r5.c2  ONLINE       0     0     0
	    gpt/653AK2MXFS9A.r1.c2  ONLINE       0     0     0
	    gpt/6539K3OJFS9A.r1.c1  ONLINE       0     0     0
	    gpt/653IK1IBFS9A.r5.c1  ONLINE       0     0     0
	    gpt/57NGK1ZGF57D.r1.c3  ONLINE       0     0     0

errors: No known data errors

Now let’s see how long this lasts. It has been 29 hours since the last CAM-related error.

I want a new chassis

Given that I have a bad drive bay and no empty spares, I want a new chassis with extra drive bays so I can avoid opening the case.

I’m considering the SuperMicro 846.
