During the 2025-01-22 OpenZFS Production User Call, ‘atomic operations’ was mentioned with respect to
In this post:
- FreeBSD 14.1
- r730-03
Let’s do a test
Speculation about empty snapshots was mentioned during the call. I did a test with 3000 snapshots
First, I create a filesystem for this testing:
[21:29 r730-03 dvl ~] % sudo zfs create data01/snapshots/deleting
Then, use jot to create 3000 snapshots:
[21:32 r730-03 dvl ~] % jot 3000 | xargs -I % -n 1 echo sudo zfs snapshot data01/snapshots/deleting@% | head
sudo zfs snapshot data01/snapshots/deleting@1
sudo zfs snapshot data01/snapshots/deleting@2
sudo zfs snapshot data01/snapshots/deleting@3
sudo zfs snapshot data01/snapshots/deleting@4
sudo zfs snapshot data01/snapshots/deleting@5
sudo zfs snapshot data01/snapshots/deleting@6
sudo zfs snapshot data01/snapshots/deleting@7
sudo zfs snapshot data01/snapshots/deleting@8
sudo zfs snapshot data01/snapshots/deleting@9
sudo zfs snapshot data01/snapshots/deleting@10
xargs: echo: terminated with signal 13; aborting
[21:32 r730-03 dvl ~] % jot 3000 | xargs -I % -n 1 sudo zfs snapshot data01/snapshots/deleting@%
[21:38 r730-03 dvl ~] % zfs list -r -t snapshot data01/snapshots/deleting | wc -l
3001
That creation took 6 minutes.
Let’s delete.
[21:41 r730-03 dvl ~] % time sudo zfs destroy data01/snapshots/deleting@1%3000
sudo zfs destroy data01/snapshots/deleting@1%3000 0.01s user 0.01s system 0% cpu 39.270 total
[21:43 r730-03 dvl ~] %
40 seconds to destroy. That’s impressive.
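Before running a range destroy like the one above, zfs destroy can be asked for a dry run: -n makes no changes and -v lists what would be destroyed. The following is a minimal sketch, not something run in this test. The helper name destroy_range_cmd is hypothetical, and it only prints the command rather than executing it.

```sh
#!/bin/sh
# Hypothetical helper (not from the original post): print, but do not run,
# a dry-run range destroy command.
#   -n  dry run, nothing is destroyed
#   -v  verbose, list the snapshots that would be destroyed
destroy_range_cmd() {
    dataset=$1
    first=$2
    last=$3
    echo "sudo zfs destroy -nv ${dataset}@${first}%${last}"
}

# Print the preview command for the 3000-snapshot test.
destroy_range_cmd data01/snapshots/deleting 1 3000
```

Copy the printed command and run it to see the list; dropping -n turns it into the real destroy.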
Next, more.
Let’s try 60,000 empty snapshots
For my next trick, let’s create 60,000 snapshots:
[21:43 r730-03 dvl ~] % zfs list -r -t snapshot data01/snapshots/deleting | wc -l
no datasets available
0
[21:45 r730-03 dvl ~] % jot 60000 | xargs -I % -n 1 sudo zfs snapshot data01/snapshots/deleting@%
[4:56 r730-03 dvl ~] %
So that took 7 hours to create. Wow. It ran overnight. It is now the 23rd.
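The two creation runs did not scale linearly. A rough calculation, assuming the prompt timestamps mark the start and end of each run (about 6 minutes for 3000 snapshots, and 21:45 to 4:56, or 7 hours 11 minutes, for 60,000), gives the average cost per snapshot. The function name ms_per_snapshot is made up for this illustration.

```sh
#!/bin/sh
# Hypothetical helper: average milliseconds per snapshot,
# given elapsed seconds and the number of snapshots created.
ms_per_snapshot() {
    seconds=$1
    count=$2
    echo $(( seconds * 1000 / count ))
}

# 3000 snapshots in about 6 minutes (360 seconds)
ms_per_snapshot 360 3000

# 60,000 snapshots in about 7 h 11 min (25860 seconds)
ms_per_snapshot 25860 60000
```

That works out to roughly 120 ms per snapshot in the small run and roughly 431 ms in the large one. The timestamps only have minute resolution, so treat these as approximations.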
How long does it take to list them?
[12:43 r730-03 dvl ~] % time zfs list -r -t snapshot data01/snapshots/deleting > ~/tmp/deleting
zfs list -r -t snapshot data01/snapshots/deleting > ~/tmp/deleting 2.54s user 48.47s system 99% cpu 51.042 total
50 seconds. That’s OK.
60,000 deletes starting on the 23rd
I started the delete. Actually, it’s not 60,000 deletes. It’s one destroy, of 60,000 snapshots.
[12:52 r730-03 dvl ~] % time sudo zfs destroy data01/snapshots/deleting@1%60000
After starting the above command, I started btop and ran zfs list several times. Eventually, a zfs list command hung (it did not come back to the command line).
I stopped btop and tried running it again; it did not start and did not come back to the command line.
4 hours later
It’s been running about 4 hours now.
At present, I cannot ssh to the host:
[11:44 pro02 dan ~] % r730-03
kex_exchange_identification: read: Connection reset by peer
Connection reset by 10.55.0.143 port 22
I have tried to ssh into various jails on that host: same result.
There are plenty of Nagios notifications:
swap issues
Connecting to the console, I see lots of swap related messages.
The console is scrolling, so the system is still alive. I’m going to leave it for a bit longer.
NOTE: the zfs destroy command is not responding to Ctrl-T.
23:39
The zfs destroy started at about 12:52. It’s now almost 11 hours later…
The host is responding to pings:
[18:38 pro04 dvl ~] % ping r730-03
PING r730-03.int.unixathome.org (10.55.0.143): 56 data bytes
64 bytes from 10.55.0.143: icmp_seq=0 ttl=63 time=5.167 ms
64 bytes from 10.55.0.143: icmp_seq=1 ttl=63 time=7.554 ms
^C
--- r730-03.int.unixathome.org ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 5.167/6.361/7.554/1.193 ms
[18:41 pro04 dvl ~] %
Still no ssh response.
I’m headed out for dinner, so we’ll check back later.
13:30 – the next day
This morning, all the existing ssh sessions have been disconnected. The host is no longer responding to pings. Attempts to ssh time out. Samba mounts have disconnected.
[12:52 r730-03 dvl ~] % time sudo zfs destroy data01/snapshots/deleting@1%60000
client_loop: send disconnect: Broken pipe
[22:59 pro02 dan ~] % ping r730-03
PING r730-03.int.unixathome.org (10.55.0.143): 56 data bytes
Request timeout for icmp_seq 0
^C
--- r730-03.int.unixathome.org ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss
[8:30 pro02 dan ~] %
[12:21 pro02 dan ~] % r730-03
ssh: connect to host r730-03.int.unixathome.org port 22: Operation timed out
[8:31 pro02 dan ~] %
The console is still scrolling swap_pager messages, for example: swap_pager: indefinite wait buffer: bufobj 0: blkno: 7301: size: 12288
None of the overnight backups succeeded (this host is the destination).
21:05 – the 24th
Still completely unresponsive. The console is still scrolling with swap_pager messages.
23:54
I’ve noticed that the screenshots of the console seem to be cycling, repeating the same sequences.
I’ve put the two shots here (the first is repeated from above).
It is time to terminate the experiment.
After the reboot
After the reboot, the host is back, and everything is green on Nagios.
The bad news is: none of the 60,000 snapshots were deleted. It is an atomic operation and it did not complete.
I’ll go back to the xargs solution.
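A middle ground between one destroy per snapshot and one destroy for everything is to split the range into batches. This is a minimal sketch under stated assumptions, not the xargs command referred to above. The helper snapshot_ranges is hypothetical. The batch size of 3000 is borrowed from the earlier test that finished in about 40 seconds. The snapshots are assumed to have been created in numeric order, as they were here. The echo keeps it a preview, in the same way as the echo used when creating the snapshots.

```sh
#!/bin/sh
# Hypothetical helper (not from the original post): print FIRST%LAST
# ranges covering 1..total in batches of at most `size` snapshots.
snapshot_ranges() {
    total=$1
    size=$2
    first=1
    while [ "$first" -le "$total" ]; do
        last=$((first + size - 1))
        if [ "$last" -gt "$total" ]; then
            last=$total
        fi
        echo "${first}%${last}"
        first=$((last + 1))
    done
}

# Preview only: echo prints each destroy command instead of running it.
# Remove "echo" to actually destroy the snapshots, one batch at a time.
snapshot_ranges 60000 3000 | while read -r range; do
    echo sudo zfs destroy "data01/snapshots/deleting@${range}"
done
```

Each batch is its own destroy, so if one of them stalls or the host is rebooted, the batches that already completed stay deleted. Whether 3000 is a good batch size on a given pool is something to test rather than assume.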