Reproducing a batch insert for Bacula on PostgreSQL

This work from August 2022 was interesting at the time.

My database server gets slow from time to time. It seems to happen repeatedly. I’m not sure why, but inserting large numbers of rows takes a very long time. This is a typical use-case for Bacula when recording what was backed up.

I started with this, the duplication of the batch table created when Bacula is storing attribute for later insertion into the table.

baculatest=# begin; 
BEGIN
baculatest=# drop table my_batch;
DROP TABLE
baculatest=# explain analyse
baculatest-# CREATE TABLE my_batch AS
baculatest-# SELECT F.FileIndex, 1 AS JobId, P.path, FN.name, F.LStat, F.MD5, F.DeltaSeq 
baculatest-#   FROM file F
baculatest-#        JOIN Path     P  ON (F.pathid     = P.pathid) 
baculatest-#        JOIN Filename FN ON (F.filenameid = FN.filenameid)
baculatest-#  WHERE F.jobid = 219920;
                                                                      QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=422996.35..631574.90 rows=125958 width=235) (actual time=94826.901..95260.394 rows=4749 loops=1)
   Hash Cond: (f.filenameid = fn.filenameid)
   ->  Hash Join  (cost=149927.11..355514.16 rows=125958 width=200) (actual time=2090.849..2519.090 rows=4749 loops=1)
         Hash Cond: (f.pathid = p.pathid)
         ->  Index Scan using file_jobid_idx on file f  (cost=0.57..202911.02 rows=125958 width=101) (actual time=319.150..744.433 rows=4749 loops=1)
               Index Cond: (jobid = 219920)
         ->  Hash  (cost=102547.35..102547.35 rows=3790335 width=107) (actual time=1770.288..1770.288 rows=3790435 loops=1)
               Buckets: 524288  Batches: 1  Memory Usage: 519843kB
               ->  Seq Scan on path p  (cost=0.00..102547.35 rows=3790335 width=107) (actual time=0.009..548.173 rows=3790435 loops=1)
   ->  Hash  (cost=165419.66..165419.66 rows=8611966 width=43) (actual time=92733.117..92733.117 rows=8614267 loops=1)
         Buckets: 1048576  Batches: 1  Memory Usage: 651816kB
         ->  Seq Scan on filename fn  (cost=0.00..165419.66 rows=8611966 width=43) (actual time=0.007..89487.772 rows=8614267 loops=1)
 Planning time: 0.400 ms
 Execution time: 95735.278 ms
(14 rows)

baculatest=# 

From here, I tried a test insert:

baculatest=# begin;
BEGIN
baculatest=# explain analyse
baculatest-# INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5, DeltaSeq) 
baculatest-# SELECT B.FileIndex, B.JobId, P.PathId, FN.FilenameId, B.LStat, B.MD5, B.DeltaSeq 
baculatest-#    FROM my_batch B
baculatest-#        JOIN Path     P  ON (B.Path = P.Path) 
baculatest-#        JOIN Filename FN ON (B.Name = FN.Name);
                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Insert on file  (cost=422995.77..451240.89 rows=360576 width=82) (actual time=2264425.067..2264425.067 rows=0 loops=1)
   ->  Hash Join  (cost=422995.77..451240.89 rows=360576 width=82) (actual time=4736.298..97566.308 rows=264825 loops=1)
         Hash Cond: (b.name = fn.name)
         ->  Hash Join  (cost=149926.54..168706.54 rows=360576 width=110) (actual time=1573.740..93180.433 rows=264825 loops=1)
               Hash Cond: (b.path = p.path)
               ->  Seq Scan on my_batch b  (cost=0.00..11117.76 rows=360576 width=138) (actual time=16.278..91044.256 rows=264825 loops=1)
               ->  Hash  (cost=102547.35..102547.35 rows=3790335 width=107) (actual time=1556.063..1556.063 rows=3790435 loops=1)
                     Buckets: 524288  Batches: 1  Memory Usage: 514378kB
                     ->  Seq Scan on path p  (cost=0.00..102547.35 rows=3790335 width=107) (actual time=0.005..381.064 rows=3790435 loops=1)
         ->  Hash  (cost=165419.66..165419.66 rows=8611966 width=43) (actual time=3144.386..3144.386 rows=8614267 loops=1)
               Buckets: 1048576  Batches: 1  Memory Usage: 635819kB
               ->  Seq Scan on filename fn  (cost=0.00..165419.66 rows=8611966 width=43) (actual time=0.007..786.008 rows=8614267 loops=1)
 Planning time: 0.257 ms
 Execution time: 2264502.071 ms
(14 rows)

baculatest=# 

That took a very long time. The explain analyze details are over here.

Let’s try some stuff.

baculatest=# delete from file where jobid = 1;
DELETE 264825

baculatest=# analyze file;
ANALYZE

baculatest=# analyze path;
ANALYZE

baculatest=# select count(*) from path;
  count  
---------
 3790435
(1 row)

baculatest=# select count(*) from file;

   count   
-----------
 369890493
(1 row)

Yes, that’s 370 million files with 3.7 million paths.

Let’s try another insert:

baculatest=# begin;
BEGIN
baculatest=# explain analyse
baculatest-# INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5, DeltaSeq) 
baculatest-# SELECT B.FileIndex, B.JobId, P.PathId, FN.FilenameId, B.LStat, B.MD5, B.DeltaSeq 
baculatest-#    FROM my_batch B
baculatest-#        JOIN Path     P  ON (B.Path = P.Path) 
baculatest-#        JOIN Filename FN ON (B.Name = FN.Name);
                                                                 QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
 Insert on file  (cost=423005.04..445744.48 rows=264825 width=96) (actual time=1325888.705..1325888.705 rows=0 loops=1)
   ->  Hash Join  (cost=423005.04..445744.48 rows=264825 width=96) (actual time=4858.415..74111.454 rows=264825 loops=1)
         Hash Cond: (b.name = fn.name)
         ->  Hash Join  (cost=149935.81..165723.59 rows=264825 width=118) (actual time=1709.446..69526.403 rows=264825 loops=1)
               Hash Cond: (b.path = p.path)
               ->  Seq Scan on my_batch b  (cost=0.00..10160.25 rows=264825 width=195) (actual time=136.491..67382.356 rows=264825 loops=1)
               ->  Hash  (cost=102551.47..102551.47 rows=3790747 width=107) (actual time=1571.442..1571.442 rows=3790435 loops=1)
                     Buckets: 524288  Batches: 1  Memory Usage: 514378kB
                     ->  Seq Scan on path p  (cost=0.00..102551.47 rows=3790747 width=107) (actual time=0.008..379.669 rows=3790435 loops=1)
         ->  Hash  (cost=165419.66..165419.66 rows=8611966 width=43) (actual time=3145.986..3145.986 rows=8614267 loops=1)
               Buckets: 1048576  Batches: 1  Memory Usage: 635819kB
               ->  Seq Scan on filename fn  (cost=0.00..165419.66 rows=8611966 width=43) (actual time=0.010..782.871 rows=8614267 loops=1)
 Planning time: 475.820 ms
 Execution time: 1325969.915 ms
(14 rows)
Website Pin Facebook Twitter Myspace Friendfeed Technorati del.icio.us Digg Google StumbleUpon Premium Responsive

Leave a Comment

Scroll to Top