booting without mrsas driver – lots of errors

Today I noticed that all the drives in the R730 host had 150MB/s speeds mentioned in the boot messages. They are all SSDs on an SAS bus. They should be at 600.000MB/s.

The HBA in question is a AVAGO MegaRAID SAS FreeBSD (aka PERC H730P Mini). It can use either the mfi driver (default) or the mrsas driver (by setting hw.mfi.mrsas_enable="1" in /boot/loader.conf).

I’m now using that value (hw.mfi.mrsas_enable="1") on all hosts with that type of HBA.

This is an SSD in a PCIe slot:

ada1 at ahcich13 bus 0 scbus17 target 0 lun 0
ada1: <INTEL SSDSC2BB480G6K G201CS01> ACS-2 ATA SATA 3.x device
ada1: Serial Number [redacted]
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada1: Command Queueing enabled
ada1: 457862MB (937703088 512 byte sectors)

This is a WD Blue SSD recently purchased:

da11 at mrsas0 bus 1 scbus1 target 11 lun 0
da11: <ATA WDC  WDS400T2B0A 20WD> Fixed Direct Access SPC-4 SCSI device
da11: Serial Number [redacted]        
da11: 150.000MB/s transfers
da11: 3815447MB (7814037168 512 byte sectors)

This came to my attention while I was adding new SSDs into the host. I pulled an existing drive and reinserted it to see what happened. The console also mentioned 150.000MB/s transfers for that drive. So it wasn’t just my new drives. I think it’s important to note this was a running system. Here is that remove/insert:

Feb 24 18:15:37 r730-01 kernel: (da7:mrsas0:1:7:0): Invalidating pack
Feb 24 18:15:37 r730-01 kernel: da7 at mrsas0 bus 1 scbus1 target 7 lun 0
Feb 24 18:15:37 r730-01 kernel: da7: <TOSHIBA PX05SMB080Y AS0B>  s/n Y7P0A02DTEVE detached
Feb 24 18:15:37 r730-01 kernel: (da7:mrsas0:1:7:0): Periph destroyed
Feb 24 18:15:37 r730-01 ZFS[47352]: vdev state changed, pool_guid=801222743406015344 vdev_guid=2468393772196895325
Feb 24 18:15:37 r730-01 ZFS[47356]: vdev is removed, pool_guid=801222743406015344 vdev_guid=2468393772196895325

Feb 24 18:15:43 r730-01 kernel: mrsas0: System PD deleted target ID: 0x7
Feb 24 18:15:54 r730-01 kernel: mrsas0: System PD created target ID: 0x7
Feb 24 18:15:54 r730-01 kernel: da7 at mrsas0 bus 1 scbus1 target 7 lun 0
Feb 24 18:15:54 r730-01 kernel: da7: <TOSHIBA PX05SMB080Y AS0B> Fixed Direct Access SPC-4 SCSI device
Feb 24 18:15:54 r730-01 kernel: da7: Serial Number Y7P0A02DTEVE
Feb 24 18:15:54 r730-01 kernel: da7: 150.000MB/s transfers
Feb 24 18:15:54 r730-01 kernel: da7: 763097MB (1562824368 512 byte sectors)

A diskinfo test showed it was getting 460 MB/s, which far exceeds the mrsas value.

I thought: what speeds go I get from the mfi driver? I commented out the hw.mfi.mrsas_enable="1" in /boot/loader.conf and booted into single user mode.

Single user mode didn’t help because I could not see/find the messages I wanted. I disabled jails via sysrc jail_enable=”NO” and restarted the host.

After getting connected via ssh, I found there were no such messages created by the mfi driver.

Looking back in the logs now, this is what I found:

Feb 24 19:31:07 r730-01 kernel: mfi0: <Invader> port 0x3000-0x30ff mem 0x92600000-0x9260ffff,0x92500000-0x925fffff irq 26 at device 0.0 numa-domain 0 on pci3
Feb 24 19:31:07 r730-01 kernel: mfi0: Using MSI
Feb 24 19:31:07 r730-01 kernel: mfi0: Megaraid SAS driver Ver 4.23 
Feb 24 19:31:07 r730-01 kernel: mfi0: FW MaxCmds = 928, limiting to 128
Feb 24 19:31:07 r730-01 kernel: mfi0: MaxCmd = 928, Drv MaxCmd = 128, MaxSgl = 70, state = 0xb73c03a0
Feb 24 19:31:07 r730-01 kernel: mfi0: 18285 (730582070s/0x0020/info) - Shutdown command received from host
Feb 24 19:31:07 r730-01 kernel: mfi0: 18286 (boot + 10s/0x0020/info) - Firmware initialization started (PCI ID 005d/1000/1f47/1028)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18287 (boot + 10s/0x0020/info) - Firmware version 4.300.00-8368
Feb 24 19:31:07 r730-01 kernel: mfi0: 18288 (boot + 11s/0x0008/info) - Battery Present
Feb 24 19:31:07 r730-01 kernel: mfi0: 18289 (boot + 11s/0x0020/info) - Package version 25.5.9.0001
Feb 24 19:31:07 r730-01 kernel: mfi0: 18290 (boot + 11s/0x0020/info) - Board Revision A03
Feb 24 19:31:07 r730-01 kernel: mfi0: 18291 (boot + 15s/0x0008/info) - Battery temperature is normal
Feb 24 19:31:07 r730-01 kernel: mfi0: 18292 (boot + 15s/0x0008/info) - Current capacity of the battery is above threshold
Feb 24 19:31:07 r730-01 kernel: mfi0: 18293 (boot + 27s/0x0004/info) - Enclosure PD 20(c None/p1) communication restored
Feb 24 19:31:07 r730-01 kernel: mfi0: 18294 (boot + 27s/0x0002/info) - Inserted: Encl PD 20
Feb 24 19:31:07 r730-01 kernel: mfisyspd0 numa-domain 0 on mfi0
Feb 24 19:31:07 r730-01 kernel: mfisyspd0: 763097MB (1562824368 sectors) SYSPD volume (deviceid: 0)
Feb 24 19:31:07 r730-01 kernel: mfisyspd0:  SYSPD volume attached
Feb 24 19:31:07 r730-01 kernel: mfisyspd1 numa-domain 0 on mfi0
Feb 24 19:31:07 r730-01 kernel: mfisyspd1: 763097MB (1562824368 sectors) SYSPD volume (deviceid: 1)
Feb 24 19:31:07 r730-01 kernel: mfisyspd1:  SYSPD volume attached
Feb 24 19:31:07 r730-01 kernel: mfisyspd2 numa-domain 0 on mfi0
Feb 24 19:31:07 r730-01 kernel: mfisyspd2: 763097MB (1562824368 sectors) SYSPD volume (deviceid: 2)
Feb 24 19:31:07 r730-01 kernel: mfisyspd2:  SYSPD volume attached
Feb 24 19:31:07 r730-01 kernel: mfisyspd3 numa-domain 0 on mfi0
Feb 24 19:31:07 r730-01 kernel: mfisyspd3: 763097MB (1562824368 sectors) SYSPD volume (deviceid: 3)
Feb 24 19:31:07 r730-01 kernel: mfisyspd3:  SYSPD volume attached
Feb 24 19:31:07 r730-01 kernel: mfisyspd4 numa-domain 0 on mfi0
Feb 24 19:31:07 r730-01 kernel: mfisyspd4: 763097MB (1562824368 sectors) SYSPD volume (deviceid: 4)
Feb 24 19:31:07 r730-01 kernel: mfisyspd4:  SYSPD volume attached
Feb 24 19:31:07 r730-01 kernel: mfisyspd5 numa-domain 0 on mfi0
Feb 24 19:31:07 r730-01 kernel: mfisyspd5: 763097MB (1562824368 sectors) SYSPD volume (deviceid: 5)
Feb 24 19:31:07 r730-01 kernel: mfisyspd5:  SYSPD volume attached
Feb 24 19:31:07 r730-01 kernel: mfisyspd6 numa-domain 0 on mfi0
Feb 24 19:31:07 r730-01 kernel: mfisyspd6: 763097MB (1562824368 sectors) SYSPD volume (deviceid: 6)
Feb 24 19:31:07 r730-01 kernel: mfisyspd6:  SYSPD volume attached
Feb 24 19:31:07 r730-01 kernel: mfisyspd7 numa-domain 0 on mfi0
Feb 24 19:31:07 r730-01 kernel: mfisyspd7: 763097MB (1562824368 sectors) SYSPD volume (deviceid: 7)
Feb 24 19:31:07 r730-01 kernel: mfisyspd7:  SYSPD volume attached
Feb 24 19:31:07 r730-01 kernel: mfisyspd8 numa-domain 0 on mfi0
Feb 24 19:31:07 r730-01 kernel: mfisyspd8: 3815447MB (7814037168 sectors) SYSPD volume (deviceid: 8)
Feb 24 19:31:07 r730-01 kernel: mfisyspd8:  SYSPD volume attached
Feb 24 19:31:07 r730-01 kernel: mfisyspd9 numa-domain 0 on mfi0
Feb 24 19:31:07 r730-01 kernel: mfisyspd9: 3815447MB (7814037168 sectors) SYSPD volume (deviceid: 9)
Feb 24 19:31:07 r730-01 kernel: mfisyspd9:  SYSPD volume attached
Feb 24 19:31:07 r730-01 kernel: mfisyspd10 numa-domain 0 on mfi0
Feb 24 19:31:07 r730-01 kernel: mfisyspd10: 3815447MB (7814037168 sectors) SYSPD volume (deviceid: 10)
Feb 24 19:31:07 r730-01 kernel: mfisyspd10:  SYSPD volume attached
Feb 24 19:31:07 r730-01 kernel: mfisyspd11 numa-domain 0 on mfi0
Feb 24 19:31:07 r730-01 kernel: mfisyspd11: 3815447MB (7814037168 sectors) SYSPD volume (deviceid: 11)
Feb 24 19:31:07 r730-01 kernel: mfisyspd11:  SYSPD volume attached
Feb 24 19:31:07 r730-01 kernel: mfisyspd12 numa-domain 0 on mfi0
Feb 24 19:31:07 r730-01 kernel: mfisyspd12: 1831420MB (3750748848 sectors) SYSPD volume (deviceid: 14)
Feb 24 19:31:07 r730-01 kernel: mfisyspd12:  SYSPD volume attached
Feb 24 19:31:07 r730-01 kernel: mfisyspd13 numa-domain 0 on mfi0
Feb 24 19:31:07 r730-01 kernel: mfisyspd13: 1831420MB (3750748848 sectors) SYSPD volume (deviceid: 15)
Feb 24 19:31:07 r730-01 kernel: mfisyspd13:  SYSPD volume attached
Feb 24 19:31:07 r730-01 kernel: mfi0: 18295 (boot + 27s/0x0002/info) - Inserted: PD 20(c None/p1) Info: enclPd=20, scsiType=d, portMap=00, sasAddr=500056b32a7798fd,0000000000000000
Feb 24 19:31:07 r730-01 kernel: mfi0: 18296 (boot + 27s/0x0002/info) - Inserted: PD 00(e0x20/s0)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18297 (boot + 27s/0x0002/info) - Inserted: PD 00(e0x20/s0) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=58ce38ee200f5bfa,0000000000000000
Feb 24 19:31:07 r730-01 kernel: mfi0: 18298 (boot + 27s/0x0002/info) - Inserted: PD 01(e0x20/s1)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18299 (boot + 27s/0x0002/info) - Inserted: PD 01(e0x20/s1) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=58ce38ee200f5nvme0: Allocated 200MB host memory buffer
Feb 24 19:31:07 r730-01 kernel: mfi0: 18300 (boot + 27s/0x0002/info) - Inserted: PD 02(e0x20/s2)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18301 (boot + 27s/0x0002/info) - Inserted: PD 02(e0x20/s2) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=58ce38ee200f5d26,0000000000000000
Feb 24 19:31:07 r730-01 kernel: mfi0: 18302 (boot + 27s/0x0002/info) - Inserted: PD 03(e0x20/s3)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18303 (boot + 27s/0x0002/info) - Inserted: PD 03(e0x20/s3) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=58ce38ee200f5nvd0: 238475MB (488397168 512 byte sectors)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18304 (boot + 27s/0x0002/info) - Inserted: PD 04(e0x20/s4)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18305 (boot + 27s/0x0002/info) - Inserted: PD 04(e0x20/s4) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=58ce38ee200f5c76,0000000000000000
Feb 24 19:31:07 r730-01 kernel: mfi0: 18306 (boot + 27s/0x0002/info) - Inserted: PD 05(e0x20/s5)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18307 (boot + 27s/0x0002/info) - Inserted: PD 05(e0x20/s5) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=58ce38ee200f5b1a,0000000000000000
Feb 24 19:31:07 r730-01 kernel: mfi0: 18308 (boot + 27s/0x0002/info) - Inserted: PD 06(e0x20/s6)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18309 (boot + 27s/0x0002/info) - Inserted: PD 06(e0x20/s6) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=58ce38ee200f5bbe,0000000000000000
Feb 24 19:31:07 r730-01 kernel: mfi0: 18310 (boot + 27s/0x0002/info) - Inserted: PD 07(e0x20/s7)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18311 (boot + 27s/0x0002/info) - Inserted: PD 07(e0x20/s7) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=58ce38ee200f5b86,0000000000000000
Feb 24 19:31:07 r730-01 kernel: mfi0: 18312 (boot + 27s/0x0002/info) - Inserted: PD 08(e0x20/s8)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18313 (boot + 27s/0x0002/info) - Inserted: PD 08(e0x20/s8) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b32a7798c8,0000000000000000
Feb 24 19:31:07 r730-01 kernel: mfi0: 18314 (boot + 27s/0x0002/WARN) - PD 08(e0x20/s8) is not a certified drive
Feb 24 19:31:07 r730-01 kernel: mfi0: 18315 (boot + 27s/0x0002/info) - Inserted: PD 09(e0x20/s9)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18316 (boot + 27s/0x0002/info) - Inserted: PD 09(e0x20/s9) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b32a7798c9,0000000000000000
Feb 24 19:31:07 r730-01 kernel: mfi0: 18317 (boot + 27s/0x0002/WARN) - PD 09(e0x20/s9) is not a certified drive
Feb 24 19:31:07 r730-01 kernel: mfi0: 18318 (boot + 27s/0x0002/info) - Inserted: PD 0a(e0x20/s10)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18319 (boot + 27s/0x0002/info) - Inserted: PD 0a(e0x20/s10) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b32a7798ca,0000000000000000
Feb 24 19:31:07 r730-01 kernel: mfi0: 18320 (boot + 27s/0x0002/WARN) - PD 0a(e0x20/s10) is not a certified drive
Feb 24 19:31:07 r730-01 kernel: mfi0: 18321 (boot + 27s/0x0002/info) - Inserted: PD 0b(e0x20/s11)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18322 (boot + 27s/0x0002/info) - Inserted: PD 0b(e0x20/s11) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b32a7798cb,0000000000000000
Feb 24 19:31:07 r730-01 kernel: mfi0: 18323 (boot + 27s/0x0002/WARN) - PD 0b(e0x20/s11) is not a certified drive
Feb 24 19:31:07 r730-01 kernel: mfi0: 18324 (boot + 27s/0x0002/info) - Inserted: PD 0e(e0x20/s14)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18325 (boot + 27s/0x0002/info) - Inserted: PD 0e(e0x20/s14) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b32a7798ce,0000000000000000
Feb 24 19:31:07 r730-01 kernel: mfi0: 18326 (boot + 27s/0x0002/info) - Inserted: PD 0f(e0x20/s15)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18327 (boot + 27s/0x0002/info) - Inserted: PD 0f(e0x20/s15) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=500056b32a7798cf,0000000000000000
Feb 24 19:31:07 r730-01 kernel: mfi0: 18328 (boot + 27s/0x0020/info) - Controller operating temperature within normal range, full operation restored
Feb 24 19:31:07 r730-01 kernel: mfi0: 18329 (730582125s/0x0020/info) - Time established as 02/24/23 19:28:45; (49 seconds since power on)
Feb 24 19:31:07 r730-01 kernel: mfi0: 18330 (730582233s/0x0020/info) - Host driver is loaded and operational
Feb 24 19:31:07 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed061eb0, status=0x3c, scsi_status=0

More ugently, these errors started right away:

Feb 24 19:31:07 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed061eb0, status=0x3c, scsi_status=0
Feb 24 19:31:07 r730-01 kernel: mfi0: sense error 127, sense_key 15, asc 72, ascq 139
Feb 24 19:31:07 r730-01 kernel: mfisyspd5: hard error cmd=read 574923824-574924375
Feb 24 19:31:07 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed061058, status=0x3c, scsi_status=0
Feb 24 19:31:07 r730-01 kernel: mfi0: sense error 21, sense_key 12, asc 198, ascq 5
Feb 24 19:31:07 r730-01 kernel: mfisyspd5: hard error cmd=read 574924376-574924927
Feb 24 19:31:07 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed060db0, status=0x3c, scsi_status=0
Feb 24 19:31:07 r730-01 kernel: mfi0: sense error 67, sense_key 0, asc 0, ascq 0
Feb 24 19:31:07 r730-01 kernel: mfisyspd5: hard error cmd=read 574924928-574925479
Feb 24 19:31:07 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed0610e0, status=0x3c, scsi_status=0
Feb 24 19:31:07 r730-01 kernel: mfi0: sense error 96, sense_key 0, asc 0, ascq 244
Feb 24 19:31:07 r730-01 kernel: mfisyspd6: hard error cmd=read 574923888-574924439
Feb 24 19:31:07 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed060e38, status=0x3c, scsi_status=0
Feb 24 19:31:07 r730-01 kernel: mfi0: sense error 0, sense_key 0, asc 246, ascq 255
Feb 24 19:31:07 r730-01 kernel: mfisyspd6: hard error cmd=read 574924440-574924991
Feb 24 19:31:07 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed05e5d8, status=0x3c, scsi_status=0
Feb 24 19:31:07 r730-01 kernel: mfi0: sense error 64, sense_key 6, asc 96, ascq 0
Feb 24 19:31:07 r730-01 kernel: mfisyspd6: hard error cmd=read 574924992-574925543
Feb 24 19:31:07 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed062048, status=0x3c, scsi_status=0
Feb 24 19:31:07 r730-01 kernel: mfi0: sense error 72, sense_key 4, asc 236, ascq 72
Feb 24 19:31:07 r730-01 kernel: mfisyspd0: hard error cmd=read 574923888-574924439
Feb 24 19:31:07 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed0619e8, status=0x3c, scsi_status=0
Feb 24 19:31:07 r730-01 kernel: mfi0: sense error 112, sense_key 13, asc 192, ascq 72
Feb 24 19:31:07 r730-01 kernel: mfisyspd6: hard error cmd=read 1341126416-1341126967
Feb 24 19:31:07 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed05f298, status=0x3c, scsi_status=0
Feb 24 19:31:07 r730-01 kernel: mfi0: sense error 36, sense_key 8, asc 131, ascq 236
Feb 24 19:31:07 r730-01 kernel: mfisyspd6: hard error cmd=read 1341127520-1341128071
Feb 24 19:31:07 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed05f320, status=0x3c, scsi_status=0
Feb 24 19:31:07 r730-01 kernel: mfi0: sense error 0, sense_key 8, asc 116, ascq 32
Feb 24 19:31:07 r730-01 kernel: mfisyspd5: hard error cmd=read 1341127536-1341128087

.. note the gap in time here... the full logs are not shown.

Feb 24 19:36:56 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed060ec0, status=0x3c, scsi_status=0
Feb 24 19:36:56 r730-01 kernel: mfi0: sense error 11, sense_key 4, asc 204, ascq 204
Feb 24 19:36:56 r730-01 kernel: mfisyspd2: hard error cmd=read 4423416-4423967
Feb 24 19:36:56 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed060068, status=0x3c, scsi_status=0
Feb 24 19:36:56 r730-01 kernel: mfi0: sense error 72, sense_key 4, asc 36, ascq 48
Feb 24 19:36:56 r730-01 kernel: mfisyspd6: hard error cmd=read 4424304-4424855
Feb 24 19:36:56 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed061410, status=0x3c, scsi_status=0
Feb 24 19:36:56 r730-01 kernel: mfi0: sense error 72, sense_key 4, asc 236, ascq 40
Feb 24 19:36:56 r730-01 kernel: mfisyspd1: hard error cmd=read 4424320-4424871
Feb 24 19:36:56 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed05ff58, status=0x3c, scsi_status=0
Feb 24 19:36:56 r730-01 kernel: mfi0: sense error 64, sense_key 5, asc 0, ascq 0
Feb 24 19:36:56 r730-01 kernel: mfisyspd5: hard error cmd=read 4424344-4424895
Feb 24 19:36:56 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed05f298, status=0x3c, scsi_status=0
Feb 24 19:36:56 r730-01 kernel: mfi0: sense error 36, sense_key 8, asc 131, ascq 236
Feb 24 19:36:56 r730-01 kernel: mfisyspd7: hard error cmd=read 4424336-4424887
Feb 24 19:36:56 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed0621e0, status=0x3c, scsi_status=0
Feb 24 19:36:56 r730-01 kernel: mfi0: sense error 53, sense_key 15, asc 72, ascq 195
Feb 24 19:36:56 r730-01 kernel: mfisyspd3: hard error cmd=read 4424896-4425447
Feb 24 19:36:56 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed060d28, status=0x3c, scsi_status=0
Feb 24 19:36:56 r730-01 kernel: mfi0: sense error 13, sense_key 3, asc 133, ascq 192
Feb 24 19:36:56 r730-01 kernel: mfisyspd3: hard error cmd=read 4426392-4426943
Feb 24 19:36:56 r730-01 kernel: mfi0: I/O error, cmd=0xfffffe02ed05fed0, status=0x3c, scsi_status=0
Feb 24 19:36:56 r730-01 kernel: mfi0: sense error 72, sense_key 12, asc 0, ascq 0
Feb 24 19:36:56 r730-01 kernel: mfisyspd3: hard error cmd=read 4426944-4427495
...

For many pages…

At 19:32:52, I looked at zpool status:

[r730-01 dvl ~] % zpool status data01                                                                                                                                                                                                                                                                              
  pool: data01
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 147M in 00:00:02 with 0 errors on Fri Feb 24 19:30:55 2023
config:

	NAME                  STATE     READ WRITE CKSUM
	data01                ONLINE       0     0     0
	  raidz2-0            ONLINE       2     0     0
	    gpt/Y7P0A022TEVE  ONLINE       0     0     2
	    gpt/Y7P0A02ATEVE  ONLINE     351     0     1
	    gpt/Y7P0A02DTEVE  ONLINE       0     0   129
	    gpt/Y7P0A02GTEVE  ONLINE     378     0     0
	    gpt/Y7P0A02LTEVE  ONLINE     163     0     1
	    gpt/Y7P0A02MTEVE  ONLINE       0     0     2
	    gpt/Y7P0A02QTEVE  ONLINE       0     0     2
	    gpt/Y7P0A033TEVE  ONLINE       0     0     2

errors: No known data errors

At the time, I thought was was because I pulled a drive from this zpool. I still think that.

At 19:32:57, 5 seconds later, I start a scrub:

[r730-01 dvl ~] % sudo zpool scrub data01

The errors

I tailed the log and saw man more vdev related messages. The 7000 lines or so are at https://gist.github.com/dlangille/9d783e40b3461de5c93d6379b5d5b767

It’s about this time that I noticed the errors you see above; they were scrolling past in my IPMI session I had used during single-user mode.

zpool status

At 19:38:01:

[r730-01 dvl ~] % zpool status
  pool: data01
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Feb 24 19:33:38 2023
	4.26T scanned at 16.4G/s, 40.1G issued at 154M/s, 4.26T total
	674M repaired, 0.92% done, 07:57:18 to go
config:

	NAME                  STATE     READ WRITE CKSUM
	data01                ONLINE       0     0     0
	  raidz2-0            ONLINE   83.2K     0     0
	    gpt/Y7P0A022TEVE  ONLINE   81.1K     0     2  (repairing)
	    gpt/Y7P0A02ATEVE  ONLINE   81.1K     0     1  (repairing)
	    gpt/Y7P0A02DTEVE  ONLINE   83.7K     0   129  (repairing)
	    gpt/Y7P0A02GTEVE  ONLINE   84.3K     0     0  (repairing)
	    gpt/Y7P0A02LTEVE  ONLINE   79.2K     0     1  (repairing)
	    gpt/Y7P0A02MTEVE  ONLINE   86.0K     0     2  (repairing)
	    gpt/Y7P0A02QTEVE  ONLINE   80.9K     0     2  (repairing)
	    gpt/Y7P0A033TEVE  ONLINE   80.5K     0     2  (repairing)

errors: No known data errors

At 19:39:42 I stopped the scrub to reboot. I know scrubs will resume, but this is what I wanted to do.

[r730-01 dvl ~] % sudo zpool scrub -s data01

Before rebooting, I enabled mrsas again (hw.mfi.mrsas_enable="1").

At 19:43:53, after the reboot

[r730-01 dvl ~] % sudo zpool scrub data01                                                                                                                                                                                                                                                                           
[r730-01 dvl ~] % zpool status data01
  pool: data01
 state: ONLINE
  scan: scrub in progress since Fri Feb 24 19:44:05 2023
	102G scanned at 12.8G/s, 3.59M issued at 459K/s, 4.26T total
	0B repaired, 0.00% done, no estimated completion time
config:

	NAME                  STATE     READ WRITE CKSUM
	data01                ONLINE       0     0     0
	  raidz2-0            ONLINE       0     0     0
	    gpt/Y7P0A022TEVE  ONLINE       0     0     0
	    gpt/Y7P0A02ATEVE  ONLINE       0     0     0
	    gpt/Y7P0A02DTEVE  ONLINE       0     0     0
	    gpt/Y7P0A02GTEVE  ONLINE       0     0     0
	    gpt/Y7P0A02LTEVE  ONLINE       0     0     0
	    gpt/Y7P0A02MTEVE  ONLINE       0     0     0
	    gpt/Y7P0A02QTEVE  ONLINE       0     0     0
	    gpt/Y7P0A033TEVE  ONLINE       0     0     0

errors: No known data errors

Interesting. Now no errors.

Later, at 19:45:11:

[r730-01 dvl ~] % zpool status data01                                                                                                                                                                                                                                                                              
  pool: data01
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Feb 24 19:44:05 2023
	4.26T scanned at 11.7G/s, 249G issued at 683M/s, 4.26T total
	81.3M repaired, 5.71% done, 01:42:36 to go
config:

	NAME                  STATE     READ WRITE CKSUM
	data01                ONLINE       0     0     0
	  raidz2-0            ONLINE       0     0     0
	    gpt/Y7P0A022TEVE  ONLINE       0     0 2.30K  (repairing)
	    gpt/Y7P0A02ATEVE  ONLINE       0     0 2.75K  (repairing)
	    gpt/Y7P0A02DTEVE  ONLINE       0     0 2.57K  (repairing)
	    gpt/Y7P0A02GTEVE  ONLINE       0     0 2.81K  (repairing)
	    gpt/Y7P0A02LTEVE  ONLINE       0     0 2.24K  (repairing)
	    gpt/Y7P0A02MTEVE  ONLINE       0     0 2.70K  (repairing)
	    gpt/Y7P0A02QTEVE  ONLINE       0     0 2.08K  (repairing)
	    gpt/Y7P0A033TEVE  ONLINE       0     0 2.16K  (repairing)

By 20:21:55, there’s is less than a minute to go:

[r730-01 dvl ~] % zpool status data01                                                                                                                                                                                                                                                                               
  pool: data01
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Fri Feb 24 19:44:05 2023
	4.26T scanned at 1.78G/s, 4.18T issued at 1.75G/s, 4.26T total
	81.3M repaired, 98.21% done, 00:00:44 to go
config:

	NAME                  STATE     READ WRITE CKSUM
	data01                ONLINE       0     0     0
	  raidz2-0            ONLINE       0     0     0
	    gpt/Y7P0A022TEVE  ONLINE       0     0 2.30K  (repairing)
	    gpt/Y7P0A02ATEVE  ONLINE       0     0 2.75K  (repairing)
	    gpt/Y7P0A02DTEVE  ONLINE       0     0 2.57K  (repairing)
	    gpt/Y7P0A02GTEVE  ONLINE       0     0 2.81K  (repairing)
	    gpt/Y7P0A02LTEVE  ONLINE       0     0 2.24K  (repairing)
	    gpt/Y7P0A02MTEVE  ONLINE       0     0 2.70K  (repairing)
	    gpt/Y7P0A02QTEVE  ONLINE       0     0 2.08K  (repairing)
	    gpt/Y7P0A033TEVE  ONLINE       0     0 2.16K  (repairing)

errors: No known data errors

By 20:24:48, the scrub was done.

[r730-01 dvl ~] % zpool status data01                                                                                                                                                                                                                                                                               
  pool: data01
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 81.3M in 00:41:29 with 0 errors on Fri Feb 24 20:25:34 2023
config:

	NAME                  STATE     READ WRITE CKSUM
	data01                ONLINE       0     0     0
	  raidz2-0            ONLINE       0     0     0
	    gpt/Y7P0A022TEVE  ONLINE       0     0 2.30K
	    gpt/Y7P0A02ATEVE  ONLINE       0     0 2.75K
	    gpt/Y7P0A02DTEVE  ONLINE       0     0 2.57K
	    gpt/Y7P0A02GTEVE  ONLINE       0     0 2.81K
	    gpt/Y7P0A02LTEVE  ONLINE       0     0 2.24K
	    gpt/Y7P0A02MTEVE  ONLINE       0     0 2.70K
	    gpt/Y7P0A02QTEVE  ONLINE       0     0 2.08K
	    gpt/Y7P0A033TEVE  ONLINE       0     0 2.16K

errors: No known data errors
[r730-01 dvl ~] %

As far as I can tell, this wasn’t hardware. The zpool is fine.

Please tell me your insights into this.