I’ve been a fan of PostgreSQL since 2000 when I switched to it from MySQL. I wanted stored procedures and functions. I got that with PostgreSQL. I was used to having such features from my years working with other big databases such as DB2, Oracle, and Sybase.
I’ve been moving towards using the custom format of pg_dump. In conjunction with that, I’ve been using pg_restore for testing those dumps. As part of my daily backup strategy, every database is dumped to disk via pg_dump. Those backups are then copied to a test server and restored. Automatically. All done by script without any human intervention.
After modifying those scripts, I started monitoring the logs to verify that all was well. I started seeing errors such as this:
Jun 12 11:31:21 dbclone postgres[35257]: [2-1] ERROR: unexpected message type 0x58 during COPY from stdin Jun 12 11:31:21 dbclone postgres[35257]: [2-2] CONTEXT: COPY event_attachment, line 226 Jun 12 11:31:21 dbclone postgres[35257]: [2-3] STATEMENT: COPY event_attachment (event_attachment_id, attachment_type, event_id, mime_type, filename, title, pages, data, public) FROM stdin; Jun 12 11:31:21 dbclone postgres[35257]: [2-4] Jun 12 11:31:21 dbclone postgres[35257]: [4-1] FATAL: connection to client lost
I wasn’t sure what was going on, or why, but I started checking the .dump files. This is the file pg_restore was working on:
-rw-r--r-- 1 dan dan 2305703936 Jun 12 05:44 pentabarf_pgcon.dump
Let’s check the file on the source system:
-rw-r--r-- 1 dan dan 2305703936 Jun 12 05:44 pentabarf_pgcon.dump
Yes, same size. And same MD5 (I checked).
Then I realized this .dump file originates from within a jail on this system. The jail host merely copies it out of the jail. Let’s check the file within the jail:
-rw-r--r-- 1 dan dan 6425157966 Jun 7 05:26 pentabarf_pgcon.sql
Clearly, the file is truncated. Now I know what’s happened. There are two scripts involved. One dumps the file. The other copies the file. They are overlapping and the copy is starting before pg_dump has completed. Checking the crontabs on the two systems involved supported this hypothesis.
My solution: start pg_dump a few hours earlier. I’ll know the answer tomorrow. If there’s no followup to this post, you’ll know it worked.
Hi Dan
Would it not be better to make touch variable (e.g backup_finished) in the jailed directory, and let the external copy script check for this variable?. If the variable is set, then pg_dump is finished, and you can copy/move the backup file (and remove the touch variable). Then you do not have any overlab (ever)
That’s a good idea. Thanks.
On other systems, the dbclone system ssh’s to the remote box, dumps the db itself, then does an rsync to grab the file. This one system is handled differently, but it should not be. I should bring it into line with everything else.
I now thing a better idea is man 1 lockf