Yesterday, I discovered I had removed the wrong drive from a zpool.
In this post:
- FreeBSD 14.2
Today, the zpool replace command has completed.
Next, I carefully chose the right drive to pull from the drive bays.
Status
This is the zpool status, just before it completed:
[20:27 r730-03 dvl ~] % zpool status data01
  pool: data01
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 21 21:10:09 2025
        22.1T / 22.1T scanned, 7.36T / 7.37T issued at 95.7M/s
        7.40T resilvered, 99.83% done, 00:02:15 to go
config:

        NAME                     STATE     READ WRITE CKSUM
        data01                   ONLINE       0     0     0
          mirror-0               ONLINE       0     0     0
            gpt/SEAG_ZJV4HFPE    ONLINE       0     0     0
            replacing-1          ONLINE       0     0     0
              gpt/SG_ZHZ16KEX    ONLINE       0     0     0
              gpt/SEAG_ZHZ16KEX  ONLINE       0     0     0  (resilvering)
          mirror-1               ONLINE       0     0     0
            gpt/SG_ZHZ03BAT      ONLINE       0     0     0
            gpt/HGST_8CJW1G4E    ONLINE       0     0     0
          mirror-2               ONLINE       0     0     0
            gpt/SG_ZL2NJBT2      ONLINE       0     0     0
            gpt/HGST_5PGGTH3D    ONLINE       0     0     0

errors: No known data errors
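Incidentally, instead of re-running zpool status to watch the progress, the OpenZFS shipped with FreeBSD 14 can block until the resilver finishes. I did not use it here; this is just a sketch:

% zpool wait -t resilver data01     # returns once the resilver completes
% zpool status data01               # then confirm the pool state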
After it completed:
[20:27 r730-03 dvl ~] % zpool status data01
  pool: data01
 state: ONLINE
  scan: resilvered 7.41T in 23:19:38 with 0 errors on Tue Apr 22 20:29:47 2025
config:

        NAME                   STATE     READ WRITE CKSUM
        data01                 ONLINE       0     0     0
          mirror-0             ONLINE       0     0     0
            gpt/SEAG_ZJV4HFPE  ONLINE       0     0     0
            gpt/SEAG_ZHZ16KEX  ONLINE       0     0     0
          mirror-1             ONLINE       0     0     0
            gpt/SG_ZHZ03BAT    ONLINE       0     0     0
            gpt/HGST_8CJW1G4E  ONLINE       0     0     0
          mirror-2             ONLINE       0     0     0
            gpt/SG_ZL2NJBT2    ONLINE       0     0     0
            gpt/HGST_5PGGTH3D  ONLINE       0     0     0

errors: No known data errors
I see gpt/SG_ZHZ16KEX is no longer in the zpool.
Not trusting myself, I checked this way:
[20:33 r730-03 dvl ~] % zpool status data01 | grep gpt/SG_ZHZ16KEX
[20:33 r730-03 dvl ~] %
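For an even quicker sanity check, zpool status -x reports only pools that have problems. This is a sketch, not from my history:

% zpool status -x data01
pool 'data01' is healthy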
Checking again
Now, remember, the drives had the wrong labels, and I fixed that. However, I was sure I still had the device numbers mixed up.
I ran this command and searched the output for the device that had just been removed from the zpool:
[20:49 r730-03 dvl ~] % glabel list
...
Geom name: da1p1
Providers:
1. Name: gpt/SG_ZHZ16KEX
   Mediasize: 12000138547200 (11T)
   Sectorsize: 512
...
OK, that connects da1 with the gpt/SG_ZHZ16KEX label.
That da1 device also matches up with the following smartd entries:
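When looking for just one label, glabel status is less noisy than glabel list; a sketch:

% glabel status | grep ZHZ16KEX      # shows the label, its status, and the backing provider (da1p1 here)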
[20:51 r730-03 dvl ~] % tail /var/log/messages
Apr 22 19:17:22 r730-03 smartd[16597]: Device: /dev/da1 [SAT], 16 Currently unreadable (pending) sectors
Apr 22 19:17:22 r730-03 smartd[16597]: Device: /dev/da1 [SAT], 8 Offline uncorrectable sectors
Apr 22 19:47:21 r730-03 smartd[16597]: Device: /dev/da1 [SAT], 16 Currently unreadable (pending) sectors
Apr 22 19:47:21 r730-03 smartd[16597]: Device: /dev/da1 [SAT], 8 Offline uncorrectable sectors
Apr 22 20:17:21 r730-03 smartd[16597]: Device: /dev/da1 [SAT], 16 Currently unreadable (pending) sectors
Apr 22 20:17:21 r730-03 smartd[16597]: Device: /dev/da1 [SAT], 8 Offline uncorrectable sectors
Apr 22 20:47:22 r730-03 smartd[16597]: Device: /dev/da1 [SAT], 16 Currently unreadable (pending) sectors
Apr 22 20:47:22 r730-03 smartd[16597]: Device: /dev/da1 [SAT], 8 Offline uncorrectable sectors
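Those same counters can be read straight from the drive with smartctl from sysutils/smartmontools. A sketch, assuming the drive is still attached as /dev/da1:

% sudo smartctl -A /dev/da1 | egrep 'Current_Pending_Sector|Offline_Uncorrectable'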
Turning the lights on
Let’s try my old method for identifying the drive:
[12:17 r730-03 dvl ~] % sudo dd if=/dev/zero of=/dev/gpt/SG_ZHZ16KEX bs=4M
^C400+0 records in
399+0 records out
1673527296 bytes transferred in 9.567196 secs (174923489 bytes/sec)
[12:18 r730-03 dvl ~] %
I saw the drive whose activity LED was on constantly… That was the drive I removed.
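Writing zeros to the drive is only acceptable because this one is on its way out for an RMA. A read-only variant keeps the activity LED lit just as well and is safe on a drive you intend to keep; a sketch:

% sudo dd if=/dev/gpt/SG_ZHZ16KEX of=/dev/null bs=4M    # read-only; interrupt with ^C when done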
Looking in /var/log/messages, I found:
Apr 23 12:17:21 r730-03 smartd[16597]: Device: /dev/da1 [SAT], 16 Currently unreadable (pending) sectors
Apr 23 12:17:21 r730-03 smartd[16597]: Device: /dev/da1 [SAT], 8 Offline uncorrectable sectors
Apr 23 12:18:11 r730-03 kernel: mrsas0: System PD deleted target ID: 0x2
Apr 23 12:18:11 r730-03 kernel: da1 at mrsas0 bus 1 scbus1 target 2 lun 0
Apr 23 12:18:11 r730-03 kernel: da1: s/n 8CJVT8YE detached
Apr 23 12:18:11 r730-03 kernel: (da1:mrsas0:1:2:0): Periph destroyed
All consistent. Of note, this was drive bay 2 (or, target 2, as seen above).
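If the backplane exposes a SES enclosure, sesutil can blink the locate LED for a specific bay instead of relying on the activity light. I have not confirmed it works behind this mrsas controller, so treat this as a sketch:

% sudo sesutil locate da1 on     # turn on the locate LED for da1's bay
% sudo sesutil locate all off    # turn all locate LEDs off afterwards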
More importantly, the zpool status is still fine:
[12:20 r730-03 dvl ~] % zpool status
  pool: data01
 state: ONLINE
  scan: resilvered 7.41T in 23:19:38 with 0 errors on Tue Apr 22 20:29:47 2025
config:

        NAME                   STATE     READ WRITE CKSUM
        data01                 ONLINE       0     0     0
          mirror-0             ONLINE       0     0     0
            gpt/SEAG_ZJV4HFPE  ONLINE       0     0     0
            gpt/SEAG_ZHZ16KEX  ONLINE       0     0     0
          mirror-1             ONLINE       0     0     0
            gpt/SG_ZHZ03BAT    ONLINE       0     0     0
            gpt/HGST_8CJW1G4E  ONLINE       0     0     0
          mirror-2             ONLINE       0     0     0
            gpt/SG_ZL2NJBT2    ONLINE       0     0     0
            gpt/HGST_5PGGTH3D  ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:01:09 with 0 errors on Thu Apr 17 04:49:19 2025
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0

errors: No known data errors
Checking one last thing:
[12:21 r730-03 dvl ~] % grep da1 /var/run/dmesg.boot | grep '^da1:' | sort -u
da1: 11444224MB (23437770752 512 byte sectors)
da1: 150.000MB/s transfers
da1: Fixed Direct Access SPC-4 SCSI device
da1: Serial Number 8CJVT8YE
[12:21 r730-03 dvl ~] %
The serial number there matches the drive I just pulled.
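For next time: while a drive is still attached, its serial number can be read directly from the live system rather than from dmesg.boot. A sketch using tools in the base system:

% diskinfo -s /dev/da1              # prints the disk ident (usually the serial number)
% geom disk list da1 | grep ident   # same information, different tool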
Phew.
That will stop the most nagging part of this issue: the smartd reports of the troublesome sectors.
Next tasks: box up the drive (already done), buy some postage, and put it into the mail system for the RMA.