So there I was… ready to remove the drive from the system. This was the drive which was giving errors and which had already been replaced.
In this post:
- FreeBSD 14.2
Let’s look at the drive I just wiped … I was running this command:
[20:11 r730-03 dvl ~] % sudo dd if=/dev/zero of=/dev/gpt/HGST_8CJVT8YE bs=4M
Let’s run it again and see which drive LED lights up.
Yep, there it is. CTL-C, LED goes off.
Run it again, put my finger on the cage release button, CTL-C it, press the button, got the drive.
And:
Apr 21 20:12:59 r730-03 kernel: da2 at mrsas0 bus 1 scbus1 target 3 lun 0
Apr 21 20:12:59 r730-03 kernel: da2:Fixed Direct Access SPC-4 SCSI device
Apr 21 20:12:59 r730-03 kernel: da2: Serial Number ZHZ16KEX
Apr 21 20:12:59 r730-03 kernel: da2: 150.000MB/s transfers
Apr 21 20:12:59 r730-03 kernel: da2: 11444224MB (23437770752 512 byte sectors)
Umm, that’s not the drive. The drive giving the errors is:
Apr 21 20:17:22 r730-03 smartd[16597]: Device: /dev/da1 [SAT], 16 Currently unreadable (pending) sectors
Apr 21 20:17:22 r730-03 smartd[16597]: Device: /dev/da1 [SAT], 8 Offline uncorrectable sectors
A quick check: zpool status is fine:
[20:10 r730-03 dvl ~] % zpool status
  pool: data01
 state: ONLINE
  scan: scrub repaired 0B in 19:52:50 with 0 errors on Thu Apr 17 07:09:29 2025
config:

        NAME                   STATE     READ WRITE CKSUM
        data01                 ONLINE       0     0     0
          mirror-0             ONLINE       0     0     0
            gpt/SEAG_ZJV4HFPE  ONLINE       0     0     0
            gpt/SG_ZHZ16KEX    ONLINE       0     0     0
          mirror-1             ONLINE       0     0     0
            gpt/SG_ZHZ03BAT    ONLINE       0     0     0
            gpt/HGST_8CJW1G4E  ONLINE       0     0     0
          mirror-2             ONLINE       0     0     0
            gpt/SG_ZL2NJBT2    ONLINE       0     0     0
            gpt/HGST_5PGGTH3D  ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:01:09 with 0 errors on Thu Apr 17 04:49:19 2025
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0

errors: No known data errors
No problems there.
I put the drive back in, because: clearly, I still need it.
I see:
Apr 21 20:12:59 r730-03 kernel: da2 at mrsas0 bus 1 scbus1 target 3 lun 0
Apr 21 20:12:59 r730-03 kernel: da2:Fixed Direct Access SPC-4 SCSI device
Apr 21 20:12:59 r730-03 kernel: da2: Serial Number ZHZ16KEX
Apr 21 20:12:59 r730-03 kernel: da2: 150.000MB/s transfers
Apr 21 20:12:59 r730-03 kernel: da2: 11444224MB (23437770752 512 byte sectors)
Looking in the logs, that serial still matches, OK, nothing seems wrong yet. Except for the labels.
[20:11 r730-03 dvl ~] % grep da2 /var/run/dmesg.boot
da2 at mrsas0 bus 1 scbus1 target 3 lun 0
da2:Fixed Direct Access SPC-4 SCSI device
da2: Serial Number ZHZ16KEX
da2: 150.000MB/s transfers
da2: 11444224MB (23437770752 512 byte sectors)
da2 at mrsas0 bus 1 scbus1 target 3 lun 0
da2: Fixed Direct Access SPC-4 SCSI device
da2: Serial Number ZHZ16KEX
da2: 150.000MB/s transfers
da2: 11444224MB (23437770752 512 byte sectors)
Yeah, OK, everything is fine.
What?
I’m writing this up because now is no time for confusion. I’ve got to get this right.
The drive I pulled has label HGST_8CJVT8YE: I used that during the dd command above.
The drive I pulled has a serial number of ZHZ16KEX, which does not match that label.
The problem drive identity
The problem drive is da1.
Looking for that, I find:
[20:48 r730-03 dvl ~] % grep '^da1: ' /var/run/dmesg.boot | sort | uniq
da1: 11444224MB (23437770752 512 byte sectors)
da1: 150.000MB/s transfers
da1:Fixed Direct Access SPC-4 SCSI device
da1: Serial Number 8CJVT8YE
Looking at all the labels for gpart:
[20:13 r730-03 dvl ~] % gpart show -l
=>        40  937703008  ada0  GPT  (447G)
          40       1024     1  gptboot1  (512K)
        1064        984        - free -  (492K)
        2048   67108864     2  swap1  (32G)
    67110912  870590464     3  zfs1  (415G)
   937701376       1672        - free -  (836K)

=>        40  937703008  ada1  GPT  (447G)
          40       1024     1  gptboot0  (512K)
        1064        984        - free -  (492K)
        2048   67108864     2  swap0  (32G)
    67110912  870590464     3  zfs0  (415G)
   937701376       1672        - free -  (836K)

=>           40  23437770672  da0  GPT  (11T)
             40  23437770600    1  HGST_5PGGTH3D  (11T)
    23437770640           72       - free -  (36K)

=>           40  23437770672  da4  GPT  (11T)
             40  23437770600    1  HGST_8CJW1G4E  (11T)
    23437770640           72       - free -  (36K)

=>           40  23437770672  da1  GPT  (11T)
             40  23437770600    1  SG_ZHZ16KEX  (11T)
    23437770640           72       - free -  (36K)

=>           34  23437770685  da5  GPT  (11T)
             34            6       - free -  (3.0K)
             40  23437770600    1  SG_ZL2NJBT2  (11T)
    23437770640           79       - free -  (40K)

=>           40  23437770672  da3  GPT  (11T)
             40  23437770600    1  SG_ZHZ03BAT  (11T)
    23437770640           72       - free -  (36K)

=>           40  23437770672  da7  GPT  (11T)
             40  23437770600    1  SEAG_ZJV4HFPE  (11T)
    23437770640           72       - free -  (36K)

=>           40  23437770672  da2  GPT  (11T)
             40  23437770600    1  HGST_8CJVT8YE  (11T)
    23437770640           72       - free -  (36K)
Well, damn. The label for da2 contains the serial number for da1.
That needs to be fixed.
All the serial numbers
Here are all the serial numbers:
[20:41 r730-03 dvl ~] % grep 'Serial Number' /var/run/dmesg.boot | sort | uniq
ada0: Serial Number BTWA602402P7480FGN
ada1: Serial Number BTWA604405H2480FGN
cd0: Serial Number KZDHB6D4311
da0: Serial Number 5PGGTH3D
da1: Serial Number 8CJVT8YE
da2: Serial Number ZHZ16KEX
da3: Serial Number ZHZ03BAT
da4: Serial Number 8CJW1G4E
da5: Serial Number ZL2NJBT2
da6: Serial Number 012345678901
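The comparison I am doing by eye here (labels from gpart show -l against serial numbers from dmesg.boot) could be sketched as a small script. This is a minimal sketch, not something I ran on the server: the data is pasted in from the two listings above, the function name check_labels is made up for illustration, and it assumes every label ends in an underscore followed by the serial number.

```sh
#!/bin/sh
# Sketch: flag GPT labels whose serial suffix does not match the drive's serial.
# Assumption: every label follows the PREFIX_SERIAL convention used in this post.

# "device serial" pairs, copied from the dmesg.boot listing above
serials='da0 5PGGTH3D
da1 8CJVT8YE
da2 ZHZ16KEX
da3 ZHZ03BAT
da4 8CJW1G4E
da5 ZL2NJBT2
da6 012345678901'

# "device label" pairs, copied from the gpart show -l listing above
labels='da0 HGST_5PGGTH3D
da4 HGST_8CJW1G4E
da1 SG_ZHZ16KEX
da5 SG_ZL2NJBT2
da3 SG_ZHZ03BAT
da7 SEAG_ZJV4HFPE
da2 HGST_8CJVT8YE'

# check_labels is a hypothetical helper: $1 = serial pairs, $2 = label pairs
check_labels() {
    printf '%s\n---\n%s\n' "$1" "$2" | awk '
        $1 == "---" { second = 1; next }          # separator between the two lists
        !second     { serial[$1] = $2; next }     # first list: remember each serial
        {
            suffix = $2
            sub(/^.*_/, "", suffix)               # keep the text after the last underscore
            if (!($1 in serial))
                print $1 ": label " $2 " but no serial recorded"
            else if (suffix != serial[$1])
                print $1 ": label " $2 " but serial " serial[$1]
        }'
}

mismatches=$(check_labels "$serials" "$labels")
printf '%s\n' "$mismatches"
```

With the data above, this flags da1 and da2 as having swapped labels, and notes that da7 has no serial number in that listing.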
Fixing da1
Let’s fix the wrong label for da1, the problem drive, the one to be removed:
[20:48 r730-03 dvl ~] % sudo gpart modify -i 1 -l HGST_8CJVT8YE da1
da1p1 modified
[20:49 r730-03 dvl ~] % gpart show -l da1
=>           40  23437770672  da1  GPT  (11T)
             40  23437770600    1  HGST_8CJVT8YE  (11T)
    23437770640           72       - free -  (36K)

[20:49 r730-03 dvl ~] %
There, now it matches:
[20:49 r730-03 dvl ~] % grep 8CJVT8YE /var/run/dmesg.boot | sort | uniq
da1: Serial Number 8CJVT8YE
[20:50 r730-03 dvl ~] %
Fix the others
First, I’ll fix da2 upon which we found the wrong serial number. What’s the right one?
[20:50 r730-03 dvl ~] % grep '^da2: ' /var/run/dmesg.boot | sort | uniq
da2: 11444224MB (23437770752 512 byte sectors)
da2: 150.000MB/s transfers
da2:Fixed Direct Access SPC-4 SCSI device
da2: Serial Number ZHZ16KEX
Here’s the fix:
[20:53 r730-03 dvl ~] % sudo gpart modify -i 1 -l SEAG_ZHZ16KEX da2
da2p1 modified
[20:53 r730-03 dvl ~] % gpart show -l da2
=>           40  23437770672  da2  GPT  (11T)
             40  23437770600    1  SEAG_ZHZ16KEX  (11T)
    23437770640           72       - free -  (36K)

[20:53 r730-03 dvl ~] %
Checking the others
This is the current status:
[20:54 r730-03 dvl ~] % gpart show -l
=>        40  937703008  ada0  GPT  (447G)
          40       1024     1  gptboot1  (512K)
        1064        984        - free -  (492K)
        2048   67108864     2  swap1  (32G)
    67110912  870590464     3  zfs1  (415G)
   937701376       1672        - free -  (836K)

=>        40  937703008  ada1  GPT  (447G)
          40       1024     1  gptboot0  (512K)
        1064        984        - free -  (492K)
        2048   67108864     2  swap0  (32G)
    67110912  870590464     3  zfs0  (415G)
   937701376       1672        - free -  (836K)

=>           40  23437770672  da0  GPT  (11T)
             40  23437770600    1  HGST_5PGGTH3D  (11T)
    23437770640           72       - free -  (36K)

=>           40  23437770672  da4  GPT  (11T)
             40  23437770600    1  HGST_8CJW1G4E  (11T)
    23437770640           72       - free -  (36K)

=>           40  23437770672  da1  GPT  (11T)
             40  23437770600    1  HGST_8CJVT8YE  (11T)
    23437770640           72       - free -  (36K)

=>           34  23437770685  da5  GPT  (11T)
             34            6       - free -  (3.0K)
             40  23437770600    1  SG_ZL2NJBT2  (11T)
    23437770640           79       - free -  (40K)

=>           40  23437770672  da3  GPT  (11T)
             40  23437770600    1  SG_ZHZ03BAT  (11T)
    23437770640           72       - free -  (36K)

=>           40  23437770672  da7  GPT  (11T)
             40  23437770600    1  SEAG_ZJV4HFPE  (11T)
    23437770640           72       - free -  (36K)

=>           40  23437770672  da2  GPT  (11T)
             40  23437770600    1  SEAG_ZHZ16KEX  (11T)
    23437770640           72       - free -  (36K)
I’ve compared the above list against the serial numbers. All good.
zpool status
Next, let’s figure out if this all makes sense too, just in case. I have manually added the device numbers.
[20:57 r730-03 dvl ~] % zpool status data01
  pool: data01
 state: ONLINE
  scan: scrub repaired 0B in 19:52:50 with 0 errors on Thu Apr 17 07:09:29 2025
config:

        NAME                   STATE     READ WRITE CKSUM
        data01                 ONLINE       0     0     0
          mirror-0             ONLINE       0     0     0
            gpt/SEAG_ZJV4HFPE  ONLINE       0     0     0  da7
            gpt/SG_ZHZ16KEX    ONLINE       0     0     0  da1
          mirror-1             ONLINE       0     0     0
            gpt/SG_ZHZ03BAT    ONLINE       0     0     0  da3
            gpt/HGST_8CJW1G4E  ONLINE       0     0     0  da4
          mirror-2             ONLINE       0     0     0
            gpt/SG_ZL2NJBT2    ONLINE       0     0     0  da5
            gpt/HGST_5PGGTH3D  ONLINE       0     0     0  da0

errors: No known data errors
That list of devices does not include the drive I removed: da2.
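Adding the device names by hand is error-prone, which is rather the theme of this post. Here is a rough sketch of doing the same annotation with awk. It is an illustration only: the function name annotate is hypothetical, the label-to-device pairs are pasted from the first gpart show -l listing (the one taken before the relabel, because zpool status still shows the old label names), and the pool text is the config section from zpool status.

```sh
#!/bin/sh
# Sketch: append the device name to each gpt/... line of zpool status output.

# "label device" pairs, from the gpart show -l listing taken before the relabel
map='HGST_5PGGTH3D da0
HGST_8CJW1G4E da4
SG_ZHZ16KEX da1
SG_ZL2NJBT2 da5
SG_ZHZ03BAT da3
SEAG_ZJV4HFPE da7
HGST_8CJVT8YE da2'

# config section of zpool status data01
status='        NAME                   STATE     READ WRITE CKSUM
        data01                 ONLINE       0     0     0
          mirror-0             ONLINE       0     0     0
            gpt/SEAG_ZJV4HFPE  ONLINE       0     0     0
            gpt/SG_ZHZ16KEX    ONLINE       0     0     0
          mirror-1             ONLINE       0     0     0
            gpt/SG_ZHZ03BAT    ONLINE       0     0     0
            gpt/HGST_8CJW1G4E  ONLINE       0     0     0
          mirror-2             ONLINE       0     0     0
            gpt/SG_ZL2NJBT2    ONLINE       0     0     0
            gpt/HGST_5PGGTH3D  ONLINE       0     0     0'

# annotate is a hypothetical helper: $1 = label/device pairs, $2 = status text
annotate() {
    printf '%s\n---\n%s\n' "$1" "$2" | awk '
        !second && $1 == "---" { second = 1; next }            # end of the map
        !second                { dev["gpt/" $1] = $2; next }   # remember label -> device
        ($1 in dev)            { seen[$1] = 1; print $0 "  " dev[$1]; next }
                               { print }
        END {
            # labelled drives that never showed up in the pool
            for (l in dev) if (!(l in seen)) print "not in pool: " dev[l] " (" l ")"
        }'
}

annotated=$(annotate "$map" "$status")
printf '%s\n' "$annotated"
```

The final line of output names any labelled drive that is not a member of the pool, which for this data is da2.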
The plan
Let’s do another replace. Let’s replace da1 (gpt/SG_ZHZ16KEX) with da2 (gpt/SEAG_ZHZ16KEX).
I plan to issue this command: sudo zpool replace data01 gpt/SG_ZHZ16KEX gpt/SEAG_ZHZ16KEX
Checking the man page, I confirm, it is OLDDEV NEWDEV.
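Given how this evening has gone, a pre-flight check on the argument order would not hurt. The following is a sketch under assumptions, not part of what I actually ran: the helper names pool_members and preflight are made up, the pool text is pasted from the zpool status output above, and the script only prints the command rather than running it. The idea is simply that OLDDEV must already be in the pool and NEWDEV must not be.

```sh
#!/bin/sh
# Sketch: sanity-check OLDDEV and NEWDEV before running zpool replace.

# config section of zpool status data01, pasted from above
status='        NAME                   STATE     READ WRITE CKSUM
        data01                 ONLINE       0     0     0
          mirror-0             ONLINE       0     0     0
            gpt/SEAG_ZJV4HFPE  ONLINE       0     0     0
            gpt/SG_ZHZ16KEX    ONLINE       0     0     0
          mirror-1             ONLINE       0     0     0
            gpt/SG_ZHZ03BAT    ONLINE       0     0     0
            gpt/HGST_8CJW1G4E  ONLINE       0     0     0
          mirror-2             ONLINE       0     0     0
            gpt/SG_ZL2NJBT2    ONLINE       0     0     0
            gpt/HGST_5PGGTH3D  ONLINE       0     0     0'

# hypothetical helper: print the gpt/... vdev names found on stdin
pool_members() {
    awk '$1 ~ /^gpt\// { print $1 }'
}

# hypothetical helper: $1 = pool, $2 = OLDDEV, $3 = NEWDEV, $4 = status text
preflight() {
    members=$(printf '%s\n' "$4" | pool_members)
    if ! printf '%s\n' "$members" | grep -qxF -- "$2"; then
        echo "refuse: $2 is not in $1"
        return 1
    fi
    if printf '%s\n' "$members" | grep -qxF -- "$3"; then
        echo "refuse: $3 is already in $1"
        return 1
    fi
    # Only print the command; running it stays a manual step.
    echo "ok: zpool replace $1 $2 $3"
}

preflight data01 gpt/SG_ZHZ16KEX gpt/SEAG_ZHZ16KEX "$status"
```

Swapping the two device arguments makes the check refuse, because gpt/SEAG_ZHZ16KEX is not yet a member of data01.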
OK, let’s get that started, in a tmux session.
[21:09 r730-03 dvl ~] % sudo zpool replace data01 gpt/SG_ZHZ16KEX gpt/SEAG_ZHZ16KEX
[21:10 r730-03 dvl ~] % zpool status data01
  pool: data01
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr 21 21:10:09 2025
        56.4G / 22.1T scanned at 1.41G/s, 0B / 22.1T issued
        0B resilvered, 0.00% done, no estimated completion time
config:

        NAME                     STATE     READ WRITE CKSUM
        data01                   ONLINE       0     0     0
          mirror-0               ONLINE       0     0     0
            gpt/SEAG_ZJV4HFPE    ONLINE       0     0     0
            replacing-1          ONLINE       0     0     0
              gpt/SG_ZHZ16KEX    ONLINE       0     0     0
              gpt/SEAG_ZHZ16KEX  ONLINE       0     0     0
          mirror-1               ONLINE       0     0     0
            gpt/SG_ZHZ03BAT      ONLINE       0     0     0
            gpt/HGST_8CJW1G4E    ONLINE       0     0     0
          mirror-2               ONLINE       0     0     0
            gpt/SG_ZL2NJBT2      ONLINE       0     0     0
            gpt/HGST_5PGGTH3D    ONLINE       0     0     0

errors: No known data errors
[21:10 r730-03 dvl ~] %
See! See right there, the serial number is on both devices under replacing-1. :/
Hindsight
What did I do wrong? Let’s look at my blog post where I did the previous replace (search for “I did a replace”). In there, I ran:
sudo zpool replace data01 gpt/HGST_8CJVT8YE gpt/SEAG_ZJV4HFPE
That removed gpt/HGST_8CJVT8YE from the zpool. How did I choose that device? I did this:
[21:17 r730-03 dvl ~] % grep ^da1 /var/run/dmesg.boot | sort | uniq
da1 at mrsas0 bus 1 scbus1 target 2 lun 0
da1: 11444224MB (23437770752 512 byte sectors)
da1: 150.000MB/s transfers
da1:Fixed Direct Access SPC-4 SCSI device
da1: Serial Number 8CJVT8YE
That told me the serial number: 8CJVT8YE – then I looked at zpool status and picked the device. The device with the wrong serial number in the label.
Tracking this down, this situation has been wrong since this blog post from August 2023. If you search that blog post for 8CJVT8YE you will find:
[23:39 r730-03 dvl ~] % grep da2 /var/run/dmesg.boot
da2 at mrsas0 bus 1 scbus1 target 2 lun 0
da2: Fixed Direct Access SPC-4 SCSI device
da2: Serial Number 8CJVT8YE

[14:07 r730-03 dvl ~] % sudo diskinfo -cit da2
da2
        512             # sectorsize
        12000138625024  # mediasize in bytes (11T)
        23437770752     # mediasize in sectors
        4096            # stripesize
        0               # stripeoffset
        1458933         # Cylinders according to firmware.
        255             # Heads according to firmware.
        63              # Sectors according to firmware.
        ATA HGST HUH721212AL    # Disk descr.
        8CJVT8YE        # Disk ident.
...

[21:34 r730-03 dvl ~] % sudo smartctl -x /dev/da2
smartctl 7.4 2023-08-01 r5530 [FreeBSD 13.2-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Ultrastar DC HC520 (He12)
Device Model:     HGST HUH721212ALE600
Serial Number:    8CJVT8YE
...

[14:24 r730-03 dvl ~] % sudo gpart add -t freebsd-zfs -a 4K -s 23437770600 -l HGST_8CJVT8YE da3
da3p1 added
You can see it right there. da2, da2, da2… da3… oh no…
And in the lines under “Creating partitions”, you can see where I mislabeled the other drive with ZHZ16KEX. I did both drives with the wrong labels.
OK, we know who to blame.
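The slip in 2023 was reading the serial number from da2 and then typing da3 on the gpart add line. One way to make that harder is to name the device once and derive the label from that same device. What follows is only a sketch under assumptions: the helper names serial_of and gpart_add_cmd are hypothetical, the serial lookup is a stub filled in from this post so the script runs anywhere, and on a live FreeBSD host the stub might be replaced by something like diskinfo -s (an assumption on my part, not something shown above). It prints the command instead of running it.

```sh
#!/bin/sh
# Sketch: build the gpart add command from ONE device name so the label
# cannot drift away from the drive it is written to.

# Hypothetical stub: device -> serial as things stood in the 2023 post.
# On a live FreeBSD host this could perhaps be: diskinfo -s "$1" (assumption).
serial_of() {
    case "$1" in
        da2) echo 8CJVT8YE ;;
        da3) echo ZHZ16KEX ;;
        *)   return 1 ;;
    esac
}

# Hypothetical helper: $1 = device, $2 = label prefix, $3 = size in sectors
gpart_add_cmd() {
    serial=$(serial_of "$1") || { echo "no serial for $1" >&2; return 1; }
    # The device name appears once as input; the label is derived from it.
    echo "gpart add -t freebsd-zfs -a 4K -s $3 -l ${2}_${serial} $1"
}

cmd=$(gpart_add_cmd da3 SEAG 23437770600)
printf '%s\n' "$cmd"
```

For da3 this prints a command carrying the label SEAG_ZHZ16KEX, which is the label that drive finally received tonight.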