PANIC: could not write to file “pg_xlog/xlogtemp”: No space left on device


We had an outage in our Postgres environment related to the error below:

Jul  1 00:36:04 test1 postgres[219259]: [770-1] user=,db=,app=client= PANIC:  could not write to file "pg_xlog/xlogtemp.219259": No space left on device
Jul  1 00:36:04 test1 postgres[219259]: [770-2] user=,db=,app=client= CONTEXT:  writing block 199237 of relation pg_tblspc/16402/PG_9.6_201608131/7358881/41721132
Jul  1 00:36:05 test1 postgres[219252]: [3-1] user=,db=,app=client= LOG:  checkpointer process (PID 219259) was terminated by signal 6: Aborted
Jul  1 00:36:05 test1 postgres[219252]: [4-1] user=,db=,app=client= LOG:  terminating any other active server processes

Postgres crashed and then recovered. Watching the partition where pg_xlog is stored, I can see the drive fill up; it is an 80 GB drive that normally sits at about 10% usage, and this happens around the same time every night. I am trying to find the cause, but there is nothing in the Postgres log file that points to the culprit.

We have Datadog monitoring the DB server, and it shows an error at the time of the issue, but it does not point to what the cause could be.

Any help is appreciated.

One millisecond after that, I can see:

Jul  1 00:36:04 test1 postgres[219259]: [770-1] user=,db=,app=client= PANIC:  could not write to file "pg_xlog/xlogtemp.219259": No space left on device
Jul  1 00:36:04 test1 postgres[219259]: [770-2] user=,db=,app=client= CONTEXT:  writing block 199237 of relation pg_tblspc/16402/PG_9.6_201608131/7358881/41721132
Jul  1 00:36:05 test1 postgres[219252]: [3-1] user=,db=,app=client= LOG:  checkpointer process (PID 219259) was terminated by signal 6: Aborted
Jul  1 00:36:05 test1 postgres[219252]: [4-1] user=,db=,app=client= LOG:  terminating any other active server processes
Jul  1 00:36:05 test1 postgres[110539]: [5-1] user=postgres,db=product,app=psqlclient=[local] WARNING:  terminating connection because of crash of another server process
Jul  1 00:36:05 test1 postgres[110539]: [5-2] user=postgres,db=product,app=psqlclient=[local] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
Jul  1 00:36:05 test1 postgres[110539]: [5-3] user=postgres,db=product,app=psqlclient=[local] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
Jul  1 00:36:05 test1 postgres[110539]: [5-4] user=postgres,db=product,app=psqlclient=[local] CONTEXT:  SQL statement "INSERT INTO AGGREGATES.agg_item_part_count
Jul  1 00:36:05 test1 postgres[110539]: [5-5] #011#011#011SELECT b.item_id,
Jul  1 00:36:05 test1 postgres[110539]: [5-6] #011#011              count(bp.item_id) as item_parts
Jul  1 00:36:05 test1 postgres[110539]: [5-7] #011        #011FROM item.item b
Jul  1 00:36:05 test1 postgres[110539]: [5-8] #011#011LEFT JOIN item.item_part bp USING (item_id)
Jul  1 00:36:05 test1 postgres[110539]: [5-9] #011        #011WHERE b.item_id >= starting_item
Jul  1 00:36:05 test1 postgres[110539]: [5-10] #011#011        GROUP by b.item_id
Jul  1 00:36:05 test1 postgres[110539]: [5-11] #011#011ON CONFLICT (item_id) DO
Jul  1 00:36:05 test1 postgres[110539]: [5-12] #011#011#011UPDATE SET
Jul  1 00:36:05 test1 postgres[110539]: [5-13] #011#011#011item_parts = EXCLUDED.item_parts"
Jul  1 00:36:05 test1 postgres[110539]: [5-14] #011PL/pgSQL function etl.update_bpart_agg() line 39 at SQL statement

How to solve:


Method 1

You only have 1-second resolution on your log timestamps (you should change that: use %m instead of %t in log_line_prefix), so those messages are from the next second. But we don't know how close the PANIC was to the second boundary to start with, and these log messages are exactly what you would expect after such a crash.
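For example, a minimal sketch of that change (the other escapes in the prefix are illustrative; keep whatever prefix you already use):

-- Sketch: %m logs timestamps with millisecond resolution, %t only whole seconds.
ALTER SYSTEM SET log_line_prefix = 'user=%u,db=%d,app=%a,client=%h %m ';
SELECT pg_reload_conf();  -- log_line_prefix only needs a reload, not a restart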

It looks likely that the INSERT … SELECT statement is the culprit filling up the WAL directory. If the SELECT returns a lot of rows, it would make sense that the INSERT would generate a lot of WAL quickly. We don't know why the partition filled up completely, though: maybe you are archiving and the archive_command could not keep up, maybe you have replication slots and the replicas could not keep up, or maybe you have none of those things and it was simply the checkpointer that couldn't keep up.
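A few diagnostic queries may help narrow that down (a sketch, assuming PostgreSQL 9.6 as suggested by the pg_tblspc path in your log):

-- Sketch: check whether WAL is being retained by failed archiving or a lagging/inactive replication slot.
SELECT failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
SELECT pg_current_xlog_location();  -- current WAL write position (the 9.6 function name)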

If your temp files get written to the same partition as your WAL files, then they might have helped fill up the partition.
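A quick way to check that (again a sketch, using standard PostgreSQL settings and views):

-- Sketch: where do temporary files go, and how much space have they used?
SHOW temp_tablespaces;                -- empty means temp files land under the database's default tablespace
SELECT datname, temp_files, temp_bytes FROM pg_stat_database;
ALTER SYSTEM SET log_temp_files = 0;  -- log every temporary file together with its size
SELECT pg_reload_conf();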


The solution above was sourced from stackoverflow.com or stackexchange.com and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.
