All we need is an easy explanation of the problem, so here it is.
I am trying for PITR on a duplicate node. Here are the steps I am following.
- Start backup on source node.
- Copy all the files on destination node using these rsync commands on the destination node:
rsync -av sourcedb:/TPINFO01/wal_archive/ /TPINFO01/wal_archive rsync --delete -av sourcedb:/TPINFO01/datadg/tablespace/ /TPINFO01/datadg/tablespace rsync --delete -av sourcedb:/TPINFO01/datadg/data/ /TPINFO01/datadg/data --exclude 'pg_log' --exclude 'pg_replslot' --exclude 'postgresql.conf' --exclude 'recovery.conf'
- Stop the backup on source node
- Through crontab, keep copying the WAL files from the source DB every min. There is continuous WAL archiving enabled on source DB.
* * * 08 * /home/postgres/rsync_wal.sh > /dev/null 2>&1
#!/bin/sh rsync -av sourcedb:/TPINFO01/wal_archive/ /TPINFO01/wal_archive
- On next day, I create
recovery.confon targetdb as follows:
recovery_target_time = '2021-08-16 02:00:00 JST' recovery_target_action = 'pause' recovery_target_inclusive = 'false' restore_command = 'rsync -a /TPINFO01/wal_archive/%f %p'
- It restores the files but stops with the following error:
2021-08-16 16:33:52.853 JST,,,3608,,611a14e0.e18,1,,2021-08-16 16:33:52 JST,,0,LOG,00000,"database system was interrupted while in recovery at log time 2021-08-16 02:27:59 JST",,"If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.",,,,,,,"" 2021-08-16 16:33:53.023 JST,,,3608,,611a14e0.e18,2,,2021-08-16 16:33:52 JST,,0,LOG,00000,"starting point-in-time recovery to 2021-08-16 02:00:00+09",,,,,,,,,"" 2021-08-16 16:33:53.120 JST,,,3608,,611a14e0.e18,3,,2021-08-16 16:33:52 JST,,0,LOG,00000,"restored log file ""0000000100009928000000A7"" from archive",,,,,,,,,"" 2021-08-16 16:33:53.144 JST,,,3608,,611a14e0.e18,4,,2021-08-16 16:33:52 JST,1/0,0,LOG,00000,"redo starts at 9928/A73B0DA8",,,,,,,,,"" 2021-08-16 16:33:53.145 JST,,,3608,,611a14e0.e18,5,,2021-08-16 16:33:52 JST,1/0,0,LOG,00000,"recovery stopping before commit of transaction 74385349, time 2021-08-16 02:30:01.469553+09",,,,,,,,,"" 2021-08-16 16:33:53.145 JST,,,3608,,611a14e0.e18,6,,2021-08-16 16:33:52 JST,1/0,0,FATAL,XX000,"requested recovery stop point is before consistent recovery point",,,,,,,,,""
What is wrong here? How can the DB on the target node be corrupted when it wasn’t even started? Am I missing some step?
This is what the backup file looks like
START WAL LOCATION: 9928/9D000028 (file 00000001000099280000009D) STOP WAL LOCATION: 9928/9D000168 (file 00000001000099280000009D) CHECKPOINT LOCATION: 9928/9D000098 BACKUP METHOD: pg_start_backup BACKUP FROM: master START TIME: 2021-08-15 21:58:00 JST LABEL: backup START TIMELINE: 1 STOP TIME: 2021-08-15 22:03:44 JST STOP TIMELINE: 1
How to solve :
I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.
As mentioned in the comments by Laurenz Albe, I was missing the WAL file during the
I repeated the steps above, with the only difference that the cronjob to copy the WAL segments from source to target DB was running continuously during the backup as well so as to ensure that no WAL segments were missed.
It worked this way and I could do PITR for my desired time.
It seems like your WAL archiving is not working reliably.
Rather than using
rsync from a cron job to copy WAL segments, which seems to miss segments occasionally, use PostgreSQL’s
archive_command to archive WAL segments. Then PostgreSQL will make sure that all segments are archived.
Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂