data04: DEGRADED

This morning I noticed this in the logs after running a pkg upgrade. I was mainly updating openvpn, but that operation also removed fail2ban (because I went from python312 to python314). I noticed it missing on one host:

Can't exec "/usr/local/bin/fail2ban-client": No such file or directory at /usr/local/etc/snmp/fail2ban line 116.

I ran this grep to verify fail2ban had been removed from another host:

[12:31 r730-01 dvl ~] % grep fail /var/log/messages
Jun 30 15:38:58 r730-01 upsmon[2715]: Poll UPS [ups04@gw01.int.unixathome.org] failed - Driver not connected
Jun 30 15:39:03 r730-01 upsmon[2715]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Connection refused
Jun 30 15:39:28 r730-01 upsmon[2715]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out
Jun 30 15:40:18 r730-01 upsmon[2715]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out
Jun 30 15:40:43 r730-01 upsmon[2715]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out
Jun 30 15:40:53 r730-01 upsmon[2715]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Connection refused
Jun 30 21:07:06 r730-01 upsmon[2715]: Poll UPS [ups04@gw01.int.unixathome.org] failed - Driver not connected
Jun 30 21:07:06 r730-01 kernel: Jun 30 21:07:06 r730-01 upsmon[2715]: Poll UPS [ups04@gw01.int.unixathome.org] failed - Driver not connected
Jun 30 21:07:23 r730-01 upsmon[2715]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out
Jun 30 21:07:23 r730-01 kernel: Jun 30 21:07:23 r730-01 upsmon[2715]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out
Jun 30 21:07:50 r730-01 upsmon[2715]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out
Jun 30 21:08:44 r730-01 upsmon[2715]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Operation timed out
Jun 30 21:08:51 r730-01 upsmon[2715]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Connection refused
Jun 30 21:08:58 r730-01 upsmon[2715]: UPS [ups04@gw01.int.unixathome.org]: connect failed: Connection failure: Connection refused
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: nvme3: failing queued i/o
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o

Well, this host doesn’t run fail2ban, but those messages are “interesting”.

How much is nvme mentioned in the logs?

[12:31 r730-01 dvl ~] % grep nvme3 /var/log/messages
Jun 29 18:02:51 r730-01 kernel: nvme3:  mem 0x92000000-0x92003fff at device 0.0 numa-domain 0 on pci7
Jun 29 18:02:51 r730-01 kernel: nda3 at nvme3 bus 0 scbus19 target 0 lun 1
Jun 30 21:11:44 r730-01 kernel: nvme3:  mem 0x92000000-0x92003fff at device 0.0 numa-domain 0 on pci7
Jun 30 21:11:44 r730-01 kernel: nda3 at nvme3 bus 0 scbus19 target 0 lun 1
Jul  2 04:21:30 r730-01 kernel: nvme3: Resetting controller due to a timeout.
Jul  2 04:21:30 r730-01 kernel: nvme3: event="start"
Jul  2 04:21:30 r730-01 kernel: nvme3: Waiting for reset to complete
Jul  2 04:21:50 r730-01 kernel: nvme3: Waiting for reset to complete
Jul  2 04:21:50 r730-01 kernel: nvme3: controller ready did not become 0 within 20500 ms
Jul  2 04:21:50 r730-01 kernel: nvme3: event="timed_out"
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:1 cid:126 nsid:1 lba:5816680880 len:8
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:1 cid:126 cdw0:0
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:1 cid:125 nsid:1 lba:5819691576 len:8
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:1 cid:125 cdw0:0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=5ab381b0 1 7 0 0 0
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:1 cid:119 nsid:1 lba:7155944008 len:8
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 5, Retries exhausted
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=5ae17238 1 7 0 0 0
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:1 cid:119 cdw0:0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 5, Retries exhausted
Jul  2 04:21:50 r730-01 kernel: nvme3: failing queued i/o
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:2 cid:0 nsid:1 lba:3948824896 len:8
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:2 cid:0 cdw0:0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=aa870a48 1 7 0 0 0
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 5, Retries exhausted
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=eb5e4940 0 7 0 0 0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 5, Retries exhausted
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:2 cid:121 nsid:1 lba:5969482320 len:8
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:2 cid:121 cdw0:0
Jul  2 04:21:50 r730-01 kernel: nda3 at nvme3 bus 0 scbus19 target 0 lun 1
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:3 cid:125 nsid:1 lba:2128907408 len:1360
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:3 cid:125 cdw0:0
Jul  2 04:21:50 r730-01 kernel: nvme3: WRITE (01) sqid:3 cid:116 nsid:1 lba:4236260504 len:16
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:3 cid:116 cdw0:0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=63cf1250 1 7 0 0 0
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:3 cid:121 nsid:1 lba:3846619816 len:8
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=7ee48c90 0 54f 0 0 0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): WRITE (01). NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=fc803498 0 f 0 0 0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:3 cid:121 cdw0:0
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=e546c2a8 0 7 0 0 0
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:4 cid:112 nsid:1 lba:923353176 len:8
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:4 cid:112 cdw0:0
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:4 cid:115 nsid:1 lba:7403218720 len:8
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=37094058 0 7 0 0 0
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:4 cid:115 cdw0:0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=b9442720 1 7 0 0 0
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:6 cid:123 nsid:1 lba:4092436328 len:8
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:6 cid:123 cdw0:0
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:6 cid:117 nsid:1 lba:4119853392 len:8
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=f3ed9f68 0 7 0 0 0
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:6 cid:117 cdw0:0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=f58ff950 0 7 0 0 0
Jul  2 04:21:50 r730-01 kernel: nvme3: FLUSH (00) sqid:11 cid:123 nsid:1
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:11 cid:123 cdw0:0
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): FLUSH (00). NCB: opc=0 fuse=0 nsid=1 prp1=0 prp2=0 cdw=0 0 0 0 0 0
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:14 cid:119 nsid:1 lba:2128908768 len:2048
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:14 cid:119 cdw0:0
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:14 cid:121 nsid:1 lba:2128910816 len:2048
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:14 cid:121 cdw0:0
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:14 cid:123 nsid:1 lba:2128912864 len:1368
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:14 cid:123 cdw0:0
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:14 cid:122 nsid:1 lba:2128914232 len:2048
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=7ee491e0 0 7ff 0 0 0
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:14 cid:122 cdw0:0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=7ee499e0 0 7ff 0 0 0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=7ee4a1e0 0 557 0 0 0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:14 cid:126 nsid:1 lba:2128916280 len:2048
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:14 cid:126 cdw0:0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=7ee4a738 0 7ff 0 0 0
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:14 cid:125 nsid:1 lba:2128918328 len:1368
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:14 cid:125 cdw0:0
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:15 cid:121 nsid:1 lba:7603680584 len:8
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:15 cid:121 cdw0:0
Jul  2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul  2 04:21:50 r730-01 kernel: nvme3: READ (02) sqid:16 cid:126 nsid:1 lba:4406184968 len:8
Jul  2 04:21:50 r730-01 kernel: nvme3: ABORTED_BY_REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:16 cid:126 cdw0:0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=7ee4af38 0 7ff 0 0 0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=7ee4b738 0 557 0 0 0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=c536f548 1 7 0 0 0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): READ (02). NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=6a10c08 1 7 0 0 0
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): CAM status: NVME Status Error
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): NVMe status: ABORTED_BY_REQUEST (00/07) DNR
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Error 6, Periph was invalidated
Jul  2 04:21:50 r730-01 kernel: nvme3: Failed controller, stopping watchdog timeout.
Jul  2 04:21:50 r730-01 kernel: (nda3:nvme3:0:0:1): Periph destroyed
Jul  2 04:21:50 r730-01 kernel: nvme3: Failed controller, stopping watchdog timeout.
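
With that much output, a per-message count is easier to read than the raw lines. This is a minimal sketch, not something from the original session: summarize_dev is a hypothetical helper name, and it assumes the syslog layout shown above (a "kernel: " prefix, with per-command details starting at sqid:, crd: or ". NCB:").

```sh
#!/bin/sh
# Count distinct kernel messages for one device.
# Reads log text on stdin; the device name (e.g. nvme3) is the first argument.
summarize_dev() {
    grep -F "$1" |
        # 1) drop the "date host kernel: " prefix
        # 2) drop per-command details so similar lines group together
        sed -E -e 's/^.* kernel: //' \
               -e 's/ (sqid|crd):.*$//' \
               -e 's/\. NCB:.*$//' |
        sort | uniq -c | sort -rn
}

# usage (assumed log path, as used above):
#   summarize_dev nvme3 < /var/log/messages
```

The result is one line per message type with its count, most frequent first, which makes it quicker to spot when "Retries exhausted" turns into "Periph was invalidated".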

Well, that’s a lot. How’s the status?

[12:31 r730-01 dvl ~] % zpool status
  pool: data01
 state: ONLINE
  scan: scrub repaired 0B in 00:00:07 with 0 errors on Thu Jul  2 03:48:55 2026
config:

	NAME                  STATE     READ WRITE CKSUM
	data01                ONLINE       0     0     0
	  raidz2-0            ONLINE       0     0     0
	    gpt/Y7P0A022TEVE  ONLINE       0     0     0
	    gpt/Y7P0A02ATEVE  ONLINE       0     0     0
	    gpt/Y7P0A02DTEVE  ONLINE       0     0     0
	    gpt/Y7P0A02GTEVE  ONLINE       0     0     0
	    gpt/Y7P0A02LTEVE  ONLINE       0     0     0
	    gpt/Y7P0A02MTEVE  ONLINE       0     0     0
	    gpt/Y7P0A02QTEVE  ONLINE       0     0     0
	    gpt/Y7P0A033TEVE  ONLINE       0     0     0

errors: No known data errors

  pool: data02
 state: ONLINE
  scan: scrub repaired 0B in 00:03:59 with 0 errors on Thu Jul  2 03:52:59 2026
config:

	NAME                     STATE     READ WRITE CKSUM
	data02                   ONLINE       0     0     0
	  mirror-0               ONLINE       0     0     0
	    gpt/S6WSNJ0T208743F  ONLINE       0     0     0
	    gpt/S6WSNJ0T207774T  ONLINE       0     0     0

errors: No known data errors

  pool: data03
 state: ONLINE
  scan: scrub repaired 0B in 01:16:19 with 0 errors on Thu Jul  2 05:05:31 2026
config:

	NAME                     STATE     READ WRITE CKSUM
	data03                   ONLINE       0     0     0
	  mirror-0               ONLINE       0     0     0
	    gpt/WD_22492H800867  ONLINE       0     0     0
	    gpt/WD_230151801284  ONLINE       0     0     0
	  mirror-1               ONLINE       0     0     0
	    gpt/WD_230151801478  ONLINE       0     0     0
	    gpt/WD_230151800473  ONLINE       0     0     0

errors: No known data errors

  pool: data04
 state: DEGRADED
status: One or more devices have been removed.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using zpool online' or replace the device with
	'zpool replace'.
  scan: scrub repaired 0B in 01:11:17 with 0 errors on Thu Jul  2 05:00:37 2026
config:

	NAME                     STATE     READ WRITE CKSUM
	data04                   DEGRADED     0     0     0
	  raidz2-0               DEGRADED     0     0     0
	    gpt/S7KGNU0Y722875X  ONLINE       0     0     0
	    gpt/S7KGNU0Y915666E  ONLINE       0     0     0
	    gpt/S7KGNU0Y912937J  ONLINE       0     0     0
	    gpt/S7KGNU0Y912955D  REMOVED      0     0     0
	    gpt/S7U8NJ0Y716854P  ONLINE       0     0     0
	    gpt/S7U8NJ0Y716801F  ONLINE       0     0     0
	    gpt/S757NS0Y700758M  ONLINE       0     0     0
	    gpt/S757NS0Y700760R  ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
	The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:53 with 0 errors on Thu Jul  2 03:50:16 2026
config:

	NAME                               STATE     READ WRITE CKSUM
	zroot                              ONLINE       0     0     0
	  mirror-0                         ONLINE       0     0     0
	    gpt/zfs0_20170718AA0000185556  ONLINE       0     0     0
	    gpt/zfs1_20170719AA1178164201  ONLINE       0     0     0

errors: No known data errors
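
With five pools, the one that matters is easy to miss in the full listing. A minimal sketch for showing only the pools that are not ONLINE, assuming the tab-separated output of zpool list -H -o name,health (unhealthy_pools is a hypothetical helper name):

```sh
#!/bin/sh
# Print every pool whose health is anything other than ONLINE.
# Expects two columns on stdin: pool name, health.
unhealthy_pools() {
    awk '$2 != "ONLINE" { print $1 ": " $2 }'
}

# usage:
#   zpool list -H -o name,health | unhealthy_pools
```

zpool status -x is the built-in way to restrict the status output to pools with problems; the helper above is only a compact form for scripts.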

Then I checked Nagios – it had found the same issue. I hadn’t checked Nagios before today. Oh oh.

Let’s look at that device:

[12:47 r730-01 dvl ~] % sudo nvmecontrol identify nvme3
nvmecontrol: Identify request failed
[12:47 r730-01 dvl ~] %

I went to LibreNMS to see if there was any trending information about that drive. It was not found. I suspect when it dropped out, LibreNMS also dropped it. If that’s the case, that’s not helpful.

Let’s try a reboot.

After a reboot, that drive (S7KGNU0Y912955D) was not found. My next idea: open up the case and reseat that device.

I’m hoping that device is not dead. It went into service 7 months ago and prices have jumped more than slightly lately.

When I checked another device:

[13:16 r730-01 dvl ~] % sudo smartctl -a /dev/nvme4
smartctl 7.5 2025-04-30 r5714 [FreeBSD 15.0-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 EVO Plus 4TB
...
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    76,836,586 [39.3 TB]
Data Units Written:                 47,333,046 [24.2 TB]
Host Read Commands:                 2,617,490,888
Host Write Commands:                1,200,934,619
Controller Busy Time:               8,225
Power Cycles:                       23
Power On Hours:                     6,562
...

That usage level is not outrageous. All units in this zpool should be more-or-less identically used.
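
For reference, the bracketed TB figures follow from the raw counters: NVMe reports Data Units in thousands of 512-byte blocks, so one data unit is 512,000 bytes. A small sketch of that arithmetic (data_units_to_tb is a hypothetical helper name; the inputs are the nvme4 values above):

```sh
#!/bin/sh
# Convert an NVMe "Data Units" counter to decimal terabytes.
# One data unit = 1000 * 512 bytes = 512,000 bytes.
data_units_to_tb() {
    awk -v units="$1" 'BEGIN { printf "%.1f\n", units * 512000 / 1e12 }'
}

data_units_to_tb 76836586   # Data Units Read
data_units_to_tb 47333046   # Data Units Written
```

This should print 39.3 and 24.2, matching what smartctl shows in brackets for that drive.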

Drive is not dead

I powered off the host and pulled out the ASUS Hyper M.2 X16 Gen 4 card. I moved the NVMe card in question to a portable carrier and hooked that up to my MacBook. It was identified as a “Samsung SSD 990 PRO 4TB” – that tells me it’s not completely dead.

bsdimp suggested I hook that up to a FreeBSD box.

While monitoring /var/log/messages, I did just that. Nothing. :(

I tried another USB port; nothing. I then tried a USB port on the back of the host:

Jul  2 15:02:59 r730-03 kernel: da8 at umass-sim1 bus 1 scbus19 target 0 lun 0
Jul  2 15:02:59 r730-03 kernel: da8: <Samsung SSD 990 PRO 4TB 1.00> Fixed Direct Access SPC-4 SCSI device
Jul  2 15:02:59 r730-03 kernel: da8: Serial Number 01293805127E
Jul  2 15:02:59 r730-03 kernel: da8: 40.000MB/s transfers
Jul  2 15:02:59 r730-03 kernel: da8: 3815447MB (7814037168 512 byte sectors)
Jul  2 15:02:59 r730-03 kernel: da8: quirks=0x2<NO_6_BYTE>

That gives me hope. As does this:

[15:03 r730-03 dvl ~] % sudo smartctl -a /dev/da8
smartctl 7.5 2025-04-30 r5714 [FreeBSD 15.0-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-25, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 PRO 4TB
Serial Number:                      S7KGNU0Y912955D
Firmware Version:                   4B2QJXD7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,000,787,030,016 [4.00 TB]
Namespace 1 Utilization:            3,057,326,026,752 [3.05 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 4951a0eec6
Local Time is:                      Thu Jul  2 15:03:56 2026 UTC
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Log0_FISE_MI
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    2700
 4 -   0.0050W       -        -    4  4  4  4      500   21800

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        34 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    56,743,286 [29.0 TB]
Data Units Written:                 11,391,387 [5.83 TB]
Host Read Commands:                 2,383,774,544
Host Write Commands:                425,045,375
Controller Busy Time:               802
Power Cycles:                       21
Power On Hours:                     5,614
Unsafe Shutdowns:                   10
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               34 Celsius
Temperature Sensor 2:               36 Celsius

Warning: NVMe Get Log truncated to 0x200 bytes, 0x200 bytes zero filled
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Warning: NVMe Get Log truncated to 0x200 bytes, 0x034 bytes zero filled
Self-test Log (NVMe Log 0x06, NSID 0xffffffff)
Self-test status: No self-test in progress
No Self-tests Logged
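
The fields that matter most in that output are the overall health result, Critical Warning, and Media and Data Integrity Errors. A minimal sketch of a pass/fail check over smartctl -a output, assuming the exact field labels shown above (smart_ok is a hypothetical helper name, and this is not a complete health assessment):

```sh
#!/bin/sh
# Exit 0 if the key NVMe health fields look clean, non-zero otherwise.
# Reads `smartctl -a` output on stdin.
smart_ok() {
    awk -F': *' '
        BEGIN { bad = 0 }
        /^SMART overall-health/             { if ($2 != "PASSED") bad = 1 }
        /^Critical Warning:/                { if ($2 != "0x00")   bad = 1 }
        /^Media and Data Integrity Errors:/ { if ($2 + 0 != 0)    bad = 1 }
        END { exit bad }
    '
}

# usage:
#   sudo smartctl -a /dev/da8 | smart_ok && echo "looks healthy"
```

For this drive all three checks pass, which fits the idea that the drive itself is fine and the problem was elsewhere.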

Back into the box

I disconnected that mobile carrier from the FreeBSD USB port. I installed the drive back onto the PCIe card, swapping it with another device. It was in the slot farthest from the fan. Now it’s one slot closer to the fan.

I booted up the host. And I see:

[15:29 r730-01 dvl ~] % zpool status data04
  pool: data04
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jul  2 15:28:49 2026
	1.73T / 9.30T scanned, 10.9G / 7.58T issued at 1.81G/s
	1.84G resilvered, 0.14% done, 01:11:23 to go
config:

	NAME                     STATE     READ WRITE CKSUM
	data04                   ONLINE       0     0     0
	  raidz2-0               ONLINE       0     0     0
	    gpt/S7KGNU0Y722875X  ONLINE       0     0     0
	    gpt/S7KGNU0Y915666E  ONLINE       0     0     0
	    gpt/S7KGNU0Y912937J  ONLINE       0     0     0
	    gpt/S7KGNU0Y912955D  ONLINE       0     0     2  (resilvering)
	    gpt/S7U8NJ0Y716854P  ONLINE       0     0     0
	    gpt/S7U8NJ0Y716801F  ONLINE       0     0     0
	    gpt/S757NS0Y700758M  ONLINE       0     0     0
	    gpt/S757NS0Y700760R  ONLINE       0     0     0

errors: No known data errors

This is as good as can be expected. :)

About 10 minutes later:

[15:29 r730-01 dvl ~] % zpool status data04
  pool: data04
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 1.84G in 00:00:20 with 0 errors on Thu Jul  2 15:29:09 2026
config:

	NAME                     STATE     READ WRITE CKSUM
	data04                   ONLINE       0     0     0
	  raidz2-0               ONLINE       0     0     0
	    gpt/S7KGNU0Y722875X  ONLINE       0     0     0
	    gpt/S7KGNU0Y915666E  ONLINE       0     0     0
	    gpt/S7KGNU0Y912937J  ONLINE       0     0     0
	    gpt/S7KGNU0Y912955D  ONLINE       0     0     2
	    gpt/S7U8NJ0Y716854P  ONLINE       0     0     0
	    gpt/S7U8NJ0Y716801F  ONLINE       0     0     0
	    gpt/S757NS0Y700758M  ONLINE       0     0     0
	    gpt/S757NS0Y700760R  ONLINE       0     0     0

errors: No known data errors
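
The only non-zero counter is the CKSUM 2 on the returned drive. With more devices, picking those rows out by eye gets tedious. A minimal sketch that lists only the rows with a non-zero READ, WRITE or CKSUM counter, assuming the column layout shown above (devices_with_errors is a hypothetical helper name):

```sh
#!/bin/sh
# List vdevs whose READ, WRITE or CKSUM counter is not "0".
# Reads `zpool status` output on stdin.
devices_with_errors() {
    awk '
        # device rows look like: NAME STATE READ WRITE CKSUM [note]
        NF >= 5 && $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|REMOVED|UNAVAIL)$/ {
            # compare as strings so abbreviated counts such as 1.2K also count
            if ($3 != "0" || $4 != "0" || $5 != "0")
                print $1, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}

# usage:
#   zpool status data04 | devices_with_errors
```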

All good. Let’s do a scrub before I clear out those errors.

[15:39 r730-01 dvl ~] % sudo zpool scrub data04
[15:40 r730-01 dvl ~] % zpool status data04    
  pool: data04
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Thu Jul  2 15:40:05 2026
	182G / 9.30T scanned at 45.6G/s, 0B / 9.30T issued
	0B repaired, 0.00% done, no estimated completion time
config:

	NAME                     STATE     READ WRITE CKSUM
	data04                   ONLINE       0     0     0
	  raidz2-0               ONLINE       0     0     0
	    gpt/S7KGNU0Y722875X  ONLINE       0     0     0
	    gpt/S7KGNU0Y915666E  ONLINE       0     0     0
	    gpt/S7KGNU0Y912937J  ONLINE       0     0     0
	    gpt/S7KGNU0Y912955D  ONLINE       0     0     2
	    gpt/S7U8NJ0Y716854P  ONLINE       0     0     0
	    gpt/S7U8NJ0Y716801F  ONLINE       0     0     0
	    gpt/S757NS0Y700758M  ONLINE       0     0     0
	    gpt/S757NS0Y700760R  ONLINE       0     0     0

errors: No known data errors
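
Scrub first, clear afterwards: that order can be scripted so the counters are only cleared once the scrub has finished cleanly. A minimal sketch, assuming the scan: line format shown in the earlier status output (scrub_clean is a hypothetical helper name; zpool wait requires a reasonably recent OpenZFS):

```sh
#!/bin/sh
# Succeed only if the scan line reports a finished scrub with 0 errors.
# Reads `zpool status` output on stdin.
scrub_clean() {
    grep -Eq '^ *scan: scrub repaired .* with 0 errors '
}

# usage (not run here):
#   sudo zpool wait -t scrub data04      # block until the scrub finishes
#   zpool status data04 | scrub_clean && sudo zpool clear data04
```

While the scrub is still running, the scan: line reads "scrub in progress", so the check fails and nothing is cleared.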

Logs

Let’s find that device.

[15:41 r730-01 dvl ~] % grep S7KGNU0Y912955D /var/run/dmesg.boot 
nda2: <Samsung SSD 990 PRO 4TB 4B2QJXD7 S7KGNU0Y912955D>
nda2: Serial Number S7KGNU0Y912955D

It is now nda2. Before the reseat it was nda3 on nvme3, the controller that logged these messages:
Jul 2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul 2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul 2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul 2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul 2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul 2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o
Jul 2 04:21:50 r730-01 kernel: nvme3: failing outstanding i/o