Sometimes you want to go back to a known good state. And you want to do it quickly, without waiting around. In my case, that’s going to be a zfs snapshot.
Today I am preparing to test some code changes so that FreshPorts properly creates a new category. See these recent blog posts for details:
In this post:
- FreeBSD 14.1-RELEASE-p5
- postgresql16-server-16.4
dbclone
The FreeBSD jail dbclone collects database backups from the database servers. It then does a test restore to make sure the restore process works. The commit in question occurred at 2024-11-06 15:17:35 UTC – so I want a back earlier than that.
Looking in the jail, I found:
[10:53 dbclone dvl ~rsyncer/backups/aws-1/postgresql] % ls -l total 3510909 -rw-r--r-- 1 rsyncer rsyncer 3594468260 2024.11.08 02:13 freshports.org.dump -rw-r--r-- 1 rsyncer rsyncer 3963 2024.11.08 02:13 globals.sql
That won’t do. Let’s check snapshots.
[10:53 r730-01 dvl ~] % zfs list | grep dbclone [10:53 r730-01 dvl ~] %
Wrong host….
[10:53 r730-03 dvl ~] % zfs list | grep dbclone data01/dbclone.backups.rsyncer 421G 3.09T 137G /jails/dbclone/usr/home/rsyncer/backups data01/dbclone.postgres 54.7G 8.77T 54.7G /jails/dbclone/var/db/postgres data01/jails/dbclone 25.1G 8.77T 18.6G /jails/dbclone
Let’s go look in snapshots here:
[11:08 r730-03 dvl ~] % cd /jails/dbclone/usr/home/rsyncer/backups/.zfs/snapshot cd: permission denied: /jails/dbclone/usr/home/rsyncer/backups/.zfs/snapshot [11:09 r730-03 dvl ~] % sudo ls -l /jails/dbclone/usr/home/rsyncer/backups/.zfs/snapshot total 51 drwxr-x--- 15 10839 10839 16 Jun 2 12:04 autosnap_2024-11-06_00:00:03_daily drwxr-x--- 15 10839 10839 16 Jun 2 12:04 autosnap_2024-11-07_00:00:01_daily drwxr-x--- 15 10839 10839 16 Jun 2 12:04 autosnap_2024-11-08_00:00:12_daily
Two things come from that:
- I want more snapshots
- autosnap_2024-11-06_00:00:03_daily might do
[11:11 r730-03 dvl ~] % sudo ls -l /jails/dbclone/usr/home/rsyncer/backups/.zfs/snapshot/autosnap_2024-11-06_00:00:03_daily/aws-1/postgresql total 7030162 -rw-r--r-- 1 10839 10839 3598726805 Nov 5 02:14 freshports.org.dump -rw-r--r-- 1 10839 10839 3963 Nov 5 02:14 globals.sql [11:11 r730-03 dvl ~] %
Well, let’s try the later snapshot.
[11:11 r730-03 dvl ~] % sudo ls -l /jails/dbclone/usr/home/rsyncer/backups/.zfs/snapshot/autosnap_2024-11-07_00:00:01_daily/aws-1/postgresql total 7028826 -rw-r--r-- 1 10839 10839 3598138018 Nov 6 02:13 freshports.org.dump -rw-r--r-- 1 10839 10839 3963 Nov 6 02:13 globals.sql
That’ll do.
Copying
Get it to the server:
[11:13 r730-03 dvl ~] % sudo scp /jails/dbclone/usr/home/rsyncer/backups/.zfs/snapshot/autosnap_2024-11-07_00:00:01_daily/aws-1/postgresql/freshports.org.dump dvl@pg03:freshports.org.dump freshports.org.dump 100% 3431MB 173.7MB/s 00:19
Let’s check that file:
[11:15 pg03 dvl ~] % md5 freshports.org.dump MD5 (freshports.org.dump) = 8609e1ecb8b7da74a09d3c56ac850b24 [11:13 r730-03 dvl ~] % sudo md5 /jails/dbclone/usr/home/rsyncer/backups/.zfs/snapshot/autosnap_2024-11-07_00:00:01_daily/aws-1/postgresql/freshports.org.dump MD5 (/jails/dbclone/usr/home/rsyncer/backups/.zfs/snapshot/autosnap_2024-11-07_00:00:01_daily/aws-1/postgresql/freshports.org.dump) = 8609e1ecb8b7da74a09d3c56ac850b24
They match.
Loading into the server
SQL_ASCII is used because I cannot control what committers put into files.
[11:14 pg03 dvl ~] % createdb freshports.dvl-category-testing -E SQL_ASCII -T template0 -O postgres [11:18 pg03 dvl ~] % time pg_restore -j 60 -d freshports.dvl-category-testing -Fc freshports.org.dump pg_restore: error: could not execute query: ERROR: role "rdsadmin" does not exist Command was: ALTER SCHEMA public OWNER TO rdsadmin;
The rdsadmin role is what Amazon RDS uses. It come with the dump.
Now I wait….
After a walk and a coffee, it’s done:
pg_restore: warning: errors ignored on restore: 1 pg_restore -j 60 -d freshports.dvl-category-testing -Fc freshports.org.dump 163.15s user 34.95s system 7% cpu 45:34.95 total
I think that was 45 minutes.
However, I’m going to do this again, on a different server. I want to make use of ZFS snapshots to rollback the database. That will allow me to test, rollback, test again, with a known good starting point. When doing that rollback, I want to stop PostgreSQL. This other server has no databases on it now. Shutting down my main development server would affect FreshPorts dev, test, and stage. Let’s not do that.
Copy over:
[13:14 pg03 dvl ~] % scp freshports.org.dump pg02: freshports.org.dump 100% 3431MB 124.0MB/s 00:27 [13:21 pg03 dvl ~] %
Create and load. On this host, I don’t have rights, so I become the postgres user first:
[13:22 pg02 dvl ~] % sudo su -l postgres $ createdb freshports.dvl-category-testing -E SQL_ASCII -T template0 -O postgres $ time pg_restore -j 60 -d freshports.dvl-category-testing -Fc ~dvl/freshports.org.dump pg_restore: error: could not execute query: ERROR: role "rdsadmin" does not exist Command was: ALTER SCHEMA public OWNER TO rdsadmin;
Then I cleaned up the other server while waiting:
[13:21 pg03 dvl ~] % dropdb freshports.dvl-category-testing [13:24 pg03 dvl ~] % rm freshports.org.dump [13:24 pg03 dvl ~] %
Snapshot changes
While waiting, I looked at snapshots. I’m holding only three days. Let’s see the config
[13:15 r730-03 dvl ~] % sudoedit /usr/local/etc/sanoid/sanoid.conf
Where I found:
[data01/dbclone.backups.rsyncer] use_template = dbclone_backups recursive = yes [template_dbclone_backups] frequently = 0 hourly = 0 daily = 3 monthly = 0 autosnap = yes autoprune = yes
How much space is that using?
[13:15 r730-03 dvl ~] % zfs list -r -t snapshot data01/dbclone.backups.rsyncer NAME USED AVAIL REFER MOUNTPOINT data01/dbclone.backups.rsyncer@autosnap_2024-11-06_00:00:03_daily 94.4G - 137G - data01/dbclone.backups.rsyncer@autosnap_2024-11-07_00:00:01_daily 94.4G - 137G - data01/dbclone.backups.rsyncer@autosnap_2024-11-08_00:00:12_daily 94.4G - 137G - [13:15 r730-03 dvl ~] % zfs list data01/dbclone.backups.rsyncer NAME USED AVAIL REFER MOUNTPOINT data01/dbclone.backups.rsyncer 421G 3.09T 137G /jails/dbclone/usr/home/rsyncer/backups [13:29 r730-03 dvl ~] % zpool list NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT data01 32.7T 23.8T 8.94T - - 24% 72% 1.00x ONLINE - zroot 412G 8.96G 403G - - 16% 2% 1.00x ONLINE -
So that zpool has about 9TB free. There’s another 3TB or so until it gets to the lauded 80% full mark.
Each day of backups seems to take about 100GB. So keeping another 4 days will take up an additional 400GB.
Taking two weekly snapshots will give take another 200GB. Let’s go 5 instead. Let’s also do two monthly backups.
So that 400GB + 500GB + 200GB or about 1.1TB additional space. We can try that and see how it goes.
This is what I have now:
# This should take about 100GB / day - so this approach should be about 1.5TB [template_dbclone_backups] frequently = 0 hourly = 0 daily = 7 weekly = 5 monthly = 2 autosnap = yes autoprune = yes
What about backups?
In Bacula, I have these backups of that file, about 2 weeks.
Select the Client (1-57): 7 freshports.org.dumppath):freshports.org.dump +---------+------------------------------------------------------------+---------------------+---------+-----------+----------+----------------+ | jobid | name | starttime | jobtype | jobstatus | jobfiles | jobbytes | +---------+------------------------------------------------------------+---------------------+---------+-----------+----------+----------------+ | 370458 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-11-08 03:05:01 | B | T | 313 | 70510052692 | | 370430 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-11-07 03:05:01 | B | T | 313 | 70510670150 | | 370402 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-11-06 03:05:01 | B | T | 313 | 70506778230 | | 370374 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-11-05 03:05:01 | B | T | 313 | 70507566372 | | 370346 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-11-04 03:05:01 | B | T | 313 | 70508440245 | | 370316 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-11-03 03:05:02 | B | T | 313 | 70512076117 | | 370288 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-11-02 03:05:01 | B | T | 313 | 70514046375 | | 370260 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-11-01 03:05:01 | B | T | 313 | 70515975713 | | 370232 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-10-31 03:05:01 | B | T | 313 | 70517150624 | | 370204 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-10-30 03:05:02 | B | T | 313 | 70518734908 | | 370176 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-10-29 03:05:02 | B | T | 313 | 70519297782 | | 370148 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-10-28 03:05:01 | B | T | 313 | 70519894667 | | 370120 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-10-27 03:05:01 | B | T | 313 | 70520230768 | | 370092 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-10-26 03:05:01 | B | T | 313 | 70520094047 | | 370064 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-10-25 03:05:01 | B | T | 313 | 70521136255 | | 370036 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-10-24 03:05:01 | B | T | 313 | 70521941315 | | 370008 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-10-23 03:05:02 | B | T | 313 | 70524146942 | | 369980 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-10-22 03:05:02 | B | T | 313 | 70521638676 | | 369952 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-10-21 03:05:01 | B | T | 313 | 70522558679 | | 369924 | /home/rsyncer/backups/aws-1/postgresql/freshports.org.dump | 2024-10-20 03:05:01 | B | T | 313 | 70528981985 | +---------+------------------------------------------------------------+---------------------+---------+-----------+----------+----------------+
Coming back to the restore
45 minutes later, I came back to many GRANT errors, such as:
pg_restore: error: could not execute query: ERROR: role "www" does not exist Command was: GRANT SELECT ON TABLE public.watch_notice TO www; GRANT SELECT,INSERT,UPDATE ON TABLE public.watch_notice TO commits; GRANT SELECT ON TABLE public.watch_notice TO rsyncer; GRANT SELECT ON TABLE public.watch_notice TO freshsource_ro;
I should have loaded the globals.sql file first.
Here we go again.
Copy over the globals.sql file from last night’s backup:
[14:18 pg03 dvl ~rsyncer/backups/database-backup/postgresql] % ls -l globals.sql -rw-r--r-- 1 rsyncer rsyncer 18145 2024.11.08 03:36 globals.sql [14:18 pg03 dvl ~rsyncer/backups/database-backup/postgresql] % scp globals.sql pg02:
Load up:
$ psql template1 < ~dvl/globals.sql SET SET SET CREATE ROLE ALTER ROLE CREATE ROLE ALTER ROLE ... CREATE ROLE ALTER ROLE CREATE ROLE ALTER ROLE ERROR: role "postgres" already exists ALTER ROLE CREATE ROLE ALTER ROLE CREATE ROLE ... ALTER ROLE CREATE ROLE ALTER ROLE CREATE ROLE ALTER ROLE CREATE ROLE ALTER ROLE ERROR: role "rsyncer" already exists ALTER ROLE ERROR: role "snmpd" already exists ALTER ROLE CREATE ROLE ALTER ROLE CREATE ROLE ALTER ROLE CREATE ROLE ALTER ROLE CREATE ROLE ... GRANT ROLE GRANT ROLE
And start again:
$ createdb freshports.dvl-category-testing -E SQL_ASCII -T template0 -O postgres $ time pg_restore -j 60 -d freshports.dvl-category-testing -Fc ~dvl/freshports.org.dump pg_restore: error: could not execute query: ERROR: role "rdsadmin" does not exist Command was: ALTER SCHEMA public OWNER TO rdsadmin;
Not shown: Alexa, 45 minutes
Later, we get:
pg_restore: warning: errors ignored on restore: 1 2800.34 real 164.01 user 34.47 sys
OK, we're done loading up.
Snapshot
First, shutdown PostgreSQL on the database server (jail):
[15:16 pg02 dvl ~] % sudo service postgresql stop
Then, on the jail host:
[15:16 r730-01 dvl ~] % zfs list | grep pg02 data02/jails/pg02 17.9G 565G 13.1G /jails/pg02 data03/pg02 85.9G 5.98T 88K none data03/pg02/postgres 85.9G 5.98T 26.7G /jails/pg02/var/db/postgres data03/pg02/rsyncer 856K 5.98T 112K /jails/pg02/usr/home/rsyncer/backups
We want data03/pg02/postgres. Here goes the snapshot:
[15:16 r730-01 dvl ~] % sudo zfs snapshot data03/pg02/postgres@before-filesystems-category [15:17 r730-01 dvl ~] % sudo zfs hold freshports data03/pg02/postgres@before-filesystems-category [15:18 r730-01 dvl ~] %
And, because I just learned about zfs hold, I place a hold.
Testing the snapshot
Here, I did a simple table change:
freshports.dvl-category-testing=# select * from portcount; count ------- 13 (1 row) freshports.dvl-category-testing=# update portcount set count = 12; UPDATE 1 freshports.dvl-category-testing=# select * from portcount; count ------- 12 (1 row) freshports.dvl-category-testing=#
That's an unused table. I think it should be deleted.
Next, I shutdown the postgresql service, did the rollback, started postgresql, and did the query again:
freshports.dvl-category-testing=# select * from portcount; count ------- 13 (1 row)
That's good for me.