I attended a few of the talks at PGCon 2013 last month. One talk, for which I took several notes and made a few choice tweets. The main one I’m following up on is using the -Fc option on pg_dump. It was during Magnus Haganders’ talk on PostgreSQL Backup Strategies that I posted that tweet. In the two weeks since that talk, I managed to do some testing. Let’s have a look at that.
What’s this -Fc option anyway?
The -F option on pg_dump allows you to specify the format of the output. The default output is plain-text. That’s what you get when you do this:
pg_dump bacula > bacula.sql
and is functionally equivalent to doing:
pg_dump -Fp bacula > bacula.sql
There are other formats (directory, tar), but today I’m concerned with the custom (c) option. By default, this output is also compressed (see the -Z option). What is of most value is the ability to use ` with the custom output.
What’s the big deal about custom format?
The more interesting option on pg_restore was -j. This allows you to run multiple concurrent jobs for the loading of data and building of indexes. This can greatly reduce the amount of time needed to restore a large database if your server has multiple processors.
pg_restore can be much more useful than just doing:
psql bacula < bacula.sql
With pg_restore you can be very selective about what you restore. See the -L and -l options.
How much does pg_dump compress?
I was curious. I wanted to compare the various pg_dump times for each level of compression.
I did a series of test such as this:
$ time pg_dump bacula > bacula.sql real 4m22.654s user 0m17.080s sys 0m7.766s
In the following table, I'm using the 'real' value shown above. In addition, I'm also showing the time it took to restore the dump via pg_store -d testing.
|command||dump time||file size||restore time|
|pg_dump -Z0 -Fc||4m18.742s||14G||36m8.156s|
|pg_dump -Z1 -Fc||5m0.498s||5.4G||36m26.047s|
|pg_dump -Z2 -Fc||5m24.376s||5.2G||36m11.556s|
|pg_dump -Z3 -Fc||6m20.513s||5.1G||36m12.259s|
|pg_dump -Z4 -Fc||6m46.074s||4.9G||36m31.050s|
|pg_dump -Z5 -Fc||8m22.397s||4.7G||34m53.817s|
|pg_dump -Z6 -Fc||11m18.154s||4.6G||35m54.344s|
|pg_dump -Z7 -Fc||14m21.447s||4.5G||35m14.134s|
|pg_dump -Z8 -Fc||25m15.000s||4.5G (45MB smaller)||35m22.793s|
|pg_dump -Z9 -Fc||26m40.550s||4.5G (1.8MB smaller)||35m11.992s|
Looking at the above, the default compression (i.e. without specifying -Z) is about the same as -Z6, but takes only 4 minutes versus 11 minutes.
I will be using just plain -Fc, and I will not be using the -Z option.
As you can see from the above times, compression level has very little performance hit on pg_restore.
What about -j?
Let's try -j with pg_restore:
$ time pg_restore -j 8 -d testing < bacula.dump.11m12.443s pg_restore: [custom archiver] parallel restore from standard input is not supported real 0m0.073s user 0m0.000s sys 0m0.000s
Oh, OK, let's go with this format:
$ time pg_restore -j 8 -d testing bacula.dump.11m12.443s
The original test for this took about 34.5 minutes. I won't consider -j 8 to be much of a savings unless it's under 32 minutes. Anything less just might be noise.
*insert pause here*
Wow! 17 minutes! That's a huge performance increase! I am impressed. I will be using -j every time now.
I'm not just a PGCon attendee. I founded the conference and I'm the organizer.
4 thoughts on “Using compression with PostgreSQL’s pg_dump”
> Looking at the above, the default compression (i.e. without specifying -Z) is about the same as -Z6, but takes only 4 minutes versus 11 minutes.
I think you misread your own table:
pg_dump -Fc 11m12.443s 4.6G 34m27.218s
pg_dump -Z6 -Fc 11m18.154s 4.6G 35m54.344s
gzip (and many other zlib-using programs) simply default to compression level six. That’s why the numbers were so close, I believe.
From the bottom of this page of the gzip manual:
Yes. I agree. I think I was looking at plain pg_dump (no options).
If PostgreSQL was built on a system with the zlib compression library installed, the custom dump format will compress data as it writes it to the output file. This will produce dump file sizes similar to using gzip, but it has the added advantage that tables can be restored selectively.
Source: Official PostgreSQL Site