The backups have been wrong for about 97 days

I backup my databases daily. They get dumped to a disk file and copied somewhere else. Then they are backed up into a file system. Once a month they get copied to tape.

Yesterday I upgraded a PostgreSQL database server, and it went well. Overnight, a backup failed. This is the story of how that led to me discovering my backups have been at risk for nearly 100 days.

The error

This morning I found this in my email:

From: (Bacula) dan@langille.org
Subject: Bacula: Backup Fatal Error of dbclone-fd Differential
Sender: bacula@bacula.int.unixathome.org
To: dan@langille.org
Date: Sun, 16 Feb 2020 03:06:07 +0000 (UTC)
16-Feb 03:06 bacula-dir JobId 306682: Start Backup JobId 306682, Job=BackupCatalog.2020-02-16_03.05.05_29
16-Feb 03:06 bacula-dir JobId 306682: There are no more Jobs associated with Volume "DiffAuto-10100". Marking it purged.
16-Feb 03:06 bacula-dir JobId 306682: All records pruned from Volume "DiffAuto-10100"; marking it "Purged"
16-Feb 03:06 bacula-dir JobId 306682: Recycled volume "DiffAuto-10100"
16-Feb 03:06 bacula-dir JobId 306682: Using Device "vDrive-DiffFile-6" to write.
16-Feb 03:06 dbclone-fd JobId 306682: shell command: run ClientRunBeforeJob "/usr/local/bacula/dump_catalog.sh"
16-Feb 03:06 dbclone-fd JobId 306682: ClientRunBeforeJob: pg_dump: error: connection to database "bacula" failed: FATAL:  no pg_hba.conf entry for host "10.55.0.140", user "bacula", database "bacula", SSL off
16-Feb 03:06 dbclone-fd JobId 306682: Error: Runscript: ClientRunBeforeJob returned non-zero status=1. ERR=Child exited with code 1
16-Feb 03:06 bacula-dir JobId 306682: Fatal error: Bad response to ClientRunBeforeJob command: wanted 2000 OK RunBefore
, got 2905 Bad RunBeforeJob command.

In short:

The job failed – Fatal Error
It failed because the script (/usr/local/bacula/dump_catalog.sh) it runs to dump the database failed
The script failed because the database server was not told, via pg_hba.conf, to allow such connections.

This is bad!

A job should not suddenly fail like that. It’s been run fine for months. Something must have changed.

At this point, I know what changed.

pg_hba.conf changed.

When the server was upgraded from PostgreSQL 11.6 to PostgreSQL 11.2, a brand new pg_hba.conf is installed. That files goes into /var/db/postgresql/data12/, and the old one is retained at /var/db/postgresql/data11/.

The issue is making me dread looking at the server.

Proving my theory

When I migrate from one major version of PostgreSQL to another, I create a new server instance on another host. pg02 and pg03 are FreeBSD jails on the server named r720-01. When I moved to PostgreSQL 12 back in November, the new version was installed on pg02, I ran pg_dump on pg02 and dumped the databases from pg03. As each database was migrated to the new server, the application configurations were adjusted.

Except for scripts.

This is the Bacula configuration in /usr/local/etc/bacula-dir.conf:

Catalog {
  Name = MyCatalog
  dbname = bacula; dbaddress = pg02.int.unixathome.org; user = bacula; password = "[redacted]"
}

The Catalog is refers to the database used by Bacula and the two terms are often used interchangeably.

Looking in that database, I find:

[dan@pg02:~] $ psql bacula
psql (12.1)
Type "help" for help.

bacula=# select * from job order by jobid desc limit 1;
 jobid  |                   job                    |       name        | type | level | clientid | jobstatus |      schedtime      |      starttime      |       endtime       |  jobtdate  | volsessionid | volsessiontime | jobfiles | jobbytes | joberrors | jobmissingfiles | poolid | filesetid | purgedfiles | hasbase |     realendtime     | priorjobid | readbytes | hascache | reviewed | comment 
--------+------------------------------------------+-------------------+------+-------+----------+-----------+---------------------+---------------------+---------------------+------------+--------------+----------------+----------+----------+-----------+-----------------+--------+-----------+-------------+---------+---------------------+------------+-----------+----------+----------+---------
 306699 | mailjail_snapshot.2020-02-16_14.00.00_58 | mailjail snapshot | B    | I     |       55 | T         | 2020-02-16 14:00:00 | 2020-02-16 14:00:07 | 2020-02-16 14:05:26 | 1581861926 |         2716 |     1580407349 |        0 |        0 |         0 |               0 |     29 |       291 |           0 |       0 | 2020-02-16 14:05:26 |          0 |         0 |        0 |        0 | 
(1 row)

bacula=#

That is today’s date. Good. This database is in use.

Now, let’s look at that script:

[dan@dbclone:~] $ sudo head /usr/local/bacula/dump_catalog.sh
#!/bin/sh

DIR=/usr/home/rsyncer/backups/bacula-database/postgresql
USER=bacula
HOST=pg03.int.unixathome.org
DB=bacula

/usr/local/bin/pg_dump -Z0 -Fc --host ${HOST} --username ${USER} ${DB} > ${DIR}/${DB}.dump.tmp
DUMP_RESULT=$?
if [ ${DUMP_RESULT} != 0 ]; then
[dan@dbclone:~] $

Yes, as I feared, that is the old database server.

If I look at that database, it is clearly not fresh:

[dan@pg03:~] $ psql bacula
psql (12.2)
Type "help" for help.

bacula=# select * from job order by jobid desc limit 1;
 jobid  |                 job                 |     name     | type | level | clientid | jobstatus |      schedtime      |      starttime      |       endtime       |  jobtdate  | volsessionid | volsessiontime | jobfiles | jobbytes | joberrors | jobmissingfiles | poolid | filesetid | purgedfiles | hasbase |     realendtime     | priorjobid | readbytes | hascache | reviewed | comment 
--------+-------------------------------------+--------------+------+-------+----------+-----------+---------------------+---------------------+---------------------+------------+--------------+----------------+----------+----------+-----------+-----------------+--------+-----------+-------------+---------+---------------------+------------+-----------+----------+----------+---------
 298856 | RestoreFiles.2019-11-11_17.21.23_03 | RestoreFiles | R    | F     |       61 | T         | 2019-11-11 17:21:07 | 2019-11-11 17:21:26 | 2019-11-11 17:21:27 | 1573492887 |          915 |     1572029437 |        2 |      597 |         0 |               0 |      0 |         0 |           0 |       0 | 2019-11-11 17:21:27 |          0 |       597 |        0 |        0 | 
(1 row)

bacula=#

Yes, that is stale. That is three months old. If I’d lost the database server, I’d have have had no fresh backups of my Bacula Catalog.

How are other database backups done?

I thought I had a solution for this, and wrote about what I’m doing for other databases before I realized that was a dead end. Instead of removing what I wrote, I’m leaving it here because it might be useful.

The Bacula Catalog backup is always initiated from a Bacula job so that I know the dump has been taken after the daily backups. I did not want this done from a cron job.

The Bacula database has always been treated differently in this regard. All other databases are backed up using a different approach:

Each database host has a special user, rsyncer.
This user is permitted to dump any database and has read-only access.
The user has a daily cron job which dumps a specified list of databases.
The user then initiates an ssh session to the dbclone server.
That ssh session can only run a script which rsyncs from the database server to the dbclone server.

In effect, the rsyncer user does an ssh session to dbclone, which then does an ssh session back to the server and pulls in the database dumps.

On dbclone, the rsyncer user has several entries in ~/.ssh/authorized_keys, one for each database server. An entry looks like this:

from="pg02.int.vpn.unixathome.org,10.55.0.32",command="/home/rsyncer/bin/rsync-backup-from-pg02.sh" ssh-ed25519 [redacted] rsyncer@pg02.int.unixathome.org

When the ssh session comes in from pg02, it runs the specified command. That script looks like this:

[rsyncer@dbclone ~]$ cat /home/rsyncer/bin/rsync-backup-from-pg02.sh
#!/bin/sh
#
# This file does a backup of each database on the server.
# It relies upon a file on the server to do the actual backup,
# then uses rsync to copy the files from the server to here.
#

# the ~/.ssh/authorized keys entry on the server must look like this:
#
# from="10.55.0.140",command="/usr/local/sbin/rrsync /usr/home/dan/backups/" [ssh key goes here]
#
# invoking rrsync ensures the incoming rsync process is chrooted to /usr/home/dan/backups/
# Thus, BACKUPDIR is relative to that location.

BACKUPDIR=${HOME}/backups/pg02

IDENTITY_FILE_RSYNC=${HOME}/.ssh/id_ed25519
SERVER_TO_RSYNC=pg02.int.unixathome.org

cd ${BACKUPDIR}

#
# use rsync to get local copies
#
/usr/local/bin/rsync -e "/usr/bin/ssh -i ${IDENTITY_FILE_RSYNC}" --recursive -av --stats --progress --exclude 'archive' ${SERVER_TO_RSYNC}:/ ${BACKUPDIR}
[rsyncer@dbclone ~]$

Back on pg02, the ~rsyncer/.ssh/authorized_keys looks like this

[dan@pg02:~] $ sudo cat ~rsyncer/.ssh/authorized_keys
from="dbclone.int.unixathome.org,10.55.0.140",command="/usr/local/sbin/rrsync /usr/home/rsyncer/backups/" ssh-ed25519 [redacted] rsyncer@dbclone.int.unixathome.org

This may sound rather complex, but if you have questions, please ask. In short:

user on database server (pg02) dumps the database file[s]
user ssh’s to the backup server (dbclone) and invokes a script
that script copies the database file[s] to the backup server

It just so happens that the same user (rsyncer) on both hosts is used. I created that user for my own needs. Create whatever user you want but I encourage you not to use your own login for this.

Avoiding it

At first, I thought I could avoid this by changing the Bacula job to run the same script run by rsyncer, but this does not fix the problem.

The problem is having the Catalog server identified in two different places:

Within the Catalog resource of the /usr/local/etc/bacula/bacula-dir.conf file
In the backup script or Bacula job which dumps the Catalog.

I know the Debian backup script for Bacula parsed bacula-dir.conf to extract the database credentials and proceed base upon that.

I could also make sure that I delete, or at least rename, databases which are no longer in use.

Ideally, I’d have one place.

I thought about having a DNS entry to point to the correct database server, but I’d need one per database…

How to monitor this

I could have a script which checks both the script and the configuration file for the same values. They are on different servers.

There is a cronjob which restores each database to test the dump. I could connect to that database and see if there is a recent job in there. That would require changing a generic script to be specific to Bacula.
.
There is already monitoring in place to make sure the database dumps on the backup server are recent. They triggered today when the backup failed.

What is not in place is monitoring of the database used by the script.

I think I have it!

On dbclone, the host which runs the dump_catalog.sh script:

grab the database file name from the script
connect to the bacula database on that host
if the latest backup job is over 24 hours old, trigger an alert

This is the check script:

[dan@dbclone:/usr/local/bacula] $ sudo cat check_catalog.sh
#!/bin/sh

CONF=/usr/local/bacula/config.sh

if [ -r $CONF ]
then
      . $CONF
else
   echo Configurtion file does not exist: $CONF
   exit 3
fi

lastjob=`/usr/local/bin/psql --host ${HOST} --username ${USER} ${DB} --tuples-only --command="SELECT to_char(endtime, 'YYYYMMDD') FROM job ORDER BY jobid DESC LIMIT 1"`
JOB_RESULT=$?
if [ ${JOB_RESULT} != 0 ]; then
  exit ${JOB_RESULT}
fi

today=`date +"%Y%m%d"`

#echo $lastjob
#echo $today

if [ $lastjob -lt $today ]
then
  echo ABORT, this database seems stale: last job is from $lastjob
  exit 2
fi

echo Latest job is from $lastjob
exit 0

A new configuration file has been created, used by both the check script and the backup script.

[dan@dbclone:/usr/local/bacula] $ cat /usr/local/bacula/config.sh
#!/bin/sh

# this is the configuration file for other scripts in this directory

USER=bacula
HOST=pg02.int.unixathome.org
DB=bacula