Free Space in pg_largeobject Not Being Reused?

Postgres newbie here, on PG 9.4 (please don’t say I need to update PG, thx).

The database has a huge pg_largeobject table (~450 GB, 169.5M rows); lots of DML has left it with ~95 GB / ~22% free space – all per pgstattuple.
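
For reference, since those figures are "per pgstattuple", they come from roughly this call (pgstattuple is a contrib extension):

    -- Overall tuple / free-space statistics for pg_largeobject
    CREATE EXTENSION IF NOT EXISTS pgstattuple;
    SELECT * FROM pgstattuple('pg_largeobject');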

I would expect ALL new inserts and updates to pg_largeobject to create ZERO new space in the table, given the massive amount of free space.

Yet the table keeps growing, including space added to the last physical file in the chain for the table. The file /data/base/31130/302605.429 grew from 574,005,248 to 625,180,672 bytes (about 51 MB) over 4 hours yesterday.

What am I missing? Is there something I need to do to make the vacuumed free space in the table get reused? We are struggling with free disk space, so this is quite important to sort out. Note that VACUUM FULL is not an option either.

Thanks for the help!

UPDATE 1:
pg_freespace('pg_largeobject') returns 56.3M rows. I aggregated by avail with count(*) (a sketch of that query follows the table below) and found avail values all over the place, but the vast majority of pages had avail < 2000. The 3 BIG counts were:

avail  count(*)
1856   48799038
1888    6601315
3968     141702
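
The aggregation above was along these lines (pg_freespace comes from the pg_freespacemap contrib extension):

    -- Distribution of free space per page in pg_largeobject
    CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
    SELECT avail, count(*)
    FROM pg_freespace('pg_largeobject')
    GROUP BY avail
    ORDER BY count(*) DESC;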

UPDATE 2:
I sadly have very little info (nor does anyone at the client) regarding usage. I "think" blob documents are placed in here. I have correlated the vast majority of pg_largeobject loid values to INTEGER values in rows in the documents and claimnotes tables.

The research I did all seemed to mention OIDs as the linkage from pg_largeobject back to actual rows in tables (including tools like vacuumlo, the lo maintenance trigger, etc.). But that doesn’t seem to be the way the application developers handled their blobs. I would love to look for "completely unreferenced" rows so I could forcibly delete them (via vacuumlo?), but being a noob I can’t figure out how to exhaustively look for parent rows.

FOLLOW UP QUESTION 1:
Can someone help me with a query that exhaustively searches every table in the database for rows that might be associated with pg_largeobject rows?
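
As a starting point, a sketch like the one below (assuming the IDs were stored in oid, integer, or bigint columns) would at least enumerate the candidate columns, which could then be checked one by one against pg_largeobject loid values:

    -- Sketch: list user-table columns whose type could plausibly hold a large
    -- object ID (assumes oid/integer/bigint; adjust if other types were used)
    SELECT c.table_schema, c.table_name, c.column_name, c.data_type
    FROM information_schema.columns c
    JOIN information_schema.tables t
      ON t.table_schema = c.table_schema
     AND t.table_name   = c.table_name
    WHERE t.table_type = 'BASE TABLE'
      AND c.table_schema NOT IN ('pg_catalog', 'information_schema')
      AND c.data_type IN ('oid', 'integer', 'bigint')
    ORDER BY 1, 2, 3;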

FOLLOW UP QUESTION 2:

I do know that around 100K to 150K rows per day become dead, and vacuum will delete them to free up the space. Is that correct?!?
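
For what it’s worth, the dead-tuple count and the last vacuum times for pg_largeobject can be checked with the standard statistics views, for example:

    -- Dead tuples and last (auto)vacuum times for pg_largeobject
    SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
    FROM pg_stat_all_tables
    WHERE relname = 'pg_largeobject';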

FOLLOW UP QUESTION 3:

a) Blobs are sharded to 2K chunks and then inserted into pg_largeobject. Can those 2K chunks be placed in any EXISTING 8K page in pg_largeobject that has at least 2K of free space (which about 150K pages do)?

b) And given the 2K chunk size, how did pg_largeobject wind up with 95ish% of the table having just 1856 or 1888 bytes free?? Is that due to row/page overhead taking away some of the 8192 bytes on the page?


Method 1

This might be due to the inefficient packing of large objects.

Large objects are chunked into slices of 2048 bytes (in fact pagesize/4, but normally pagesize=8192) before compression. The slices get compressed individually (if applicable) and stored in pg_largeobject.data.

In an 8192-byte page, only 3 slices of 2048 bytes can fit. The rest of the space is occupied by the other columns, loid and pageno, plus the page header (24 bytes) and the tuple headers (27 bytes each); see Database Page Layout in the documentation for the details.
What remains on each page is free space (about 1900 bytes, if I’m counting right).
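
If the pageinspect contrib extension is available, that layout can be checked directly on a sample page (block 0 is chosen arbitrarily here):

    -- Free space and item sizes on one page of pg_largeobject (pageinspect)
    CREATE EXTENSION IF NOT EXISTS pageinspect;
    SELECT lower, upper, upper - lower AS free_bytes
    FROM page_header(get_raw_page('pg_largeobject', 0));
    SELECT lp, lp_len, t_hoff
    FROM heap_page_items(get_raw_page('pg_largeobject', 0));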

So 22% free space seems about right, even without any MVCC bloat that could be reclaimed by a VACUUM FULL.

When is that free space going to be used? If you rarely insert large object slices smaller than those ~1900 bytes, it’s quite possible that there is never an opportunity to use it.


As for the space that is freed and becomes potentially reusable when large objects are removed (through lo_unlink), it works for pg_largeobject just as with any other table.
The MVCC lifecycle of the tuples is essentially:

  1. the tuples with a given loid are live
  2. when a lo_unlink of the large object occurs: an implicit delete from pg_largeobject where loid=<large object id> causes the tuples to be marked as "old".
  3. later, VACUUM (automatic or manual) sees that the "old" tuples are no longer visible to any live transaction, and marks their space as free space.
  4. later, when new tuples are inserted, they may use that space if they fit.
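
Concretely, that cycle might look like this (the large object ID 12345 is made up for the example):

    -- Hypothetical example of the lifecycle above
    SELECT lo_unlink(12345);   -- step 2: the chunks for loid 12345 become dead
    VACUUM pg_largeobject;     -- step 3: vacuum records their space as free
    -- step 4: chunks of newly created large objects can now reuse that space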

Method 2

I have correlated the vast majority of pg_largeobject loid values to INTEGER values in rows in the documents and claimnotes tables.

Since you can correlate the vast majority with entries in INT columns, there seem to be few unreferenced rows, so it doesn’t look worthwhile to hunt for them. Finding unreferenced rows is exactly what vacuumlo is for, but it relies on the OIDs being stored in columns of the proper types. The easiest way to find unreferenced rows would probably be to hack vacuumlo to also consider INT columns (once you can compile your own code against the server, that hack looks trivial), but what if some OIDs were stored in bigint, or in text, or just scrawled in handwriting on an invoice? Since you already said the vast majority are referenced, I would say the danger here outweighs any possible reward.
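
If you did want to look by hand first, a read-only sketch along those lines is below. The blob_id column names on documents and claimnotes are hypothetical and would have to be replaced with every real referencing column, otherwise the result is meaningless:

    -- Sketch: large objects not referenced from the two known columns
    -- (documents.blob_id and claimnotes.blob_id are made-up column names)
    SELECT m.oid AS loid
    FROM pg_largeobject_metadata m
    WHERE NOT EXISTS (SELECT 1 FROM documents  d WHERE d.blob_id::oid = m.oid)
      AND NOT EXISTS (SELECT 1 FROM claimnotes c WHERE c.blob_id::oid = m.oid)
    LIMIT 100;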

Blobs are sharded to 2K chunks and then inserted into pg_largeobject. Can those 2K chunks be placed in any EXISTING 8K page in pg_largeobject that has at least 2K of free space (which about 150K pages do)?

It should be possible. Maybe all the pages with 3968 bytes free were created by VACUUM after the period when the table expansion occurred. Given that the pages with 3968 bytes free are only roughly 0.2% of all pages, I would say it is clear they are getting reused on average; it is just a question of where you take your snapshot in that dead-vacuum-free-reused cycle.

And given the 2K chunk size, how did pg_largeobject wind up with 95ish% of the table having just 1856 or 1888 bytes free?? Is that due to row/page overhead taking away some of the 8192 bytes on the page?

Yes. It uses a page while the page has enough free space, and then (obviously) stops using it.

Note that the newer TOAST facility made the less ridiculous decision to use chunks of 1996 bytes rather than 2048, so that 4 chunks rather than 3 fit per page once the overhead is taken into account.
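
Back-of-the-envelope, using approximate sizes (24-byte page header, 4-byte line pointer, ~24-byte aligned tuple header, 4-byte varlena header, plus the fixed loid/pageno or chunk_id/chunk_seq columns), that works out roughly as follows:

    -- Approximate packing arithmetic (sizes and alignment are rough estimates)
    -- pg_largeobject row ~ 4 + 24 + 4 (loid) + 4 (pageno) + 4 + 2048     ~ 2092 bytes
    -- TOAST row          ~ 4 + 24 + 4 (chunk_id) + 4 (chunk_seq) + 4 + 1996 ~ 2036 bytes
    SELECT 8192 - (24 + 3 * 2092) AS lo_leftover_bytes,     -- ~1892: a 4th 2048-byte chunk cannot fit
           8192 - (24 + 4 * 2036) AS toast_leftover_bytes;  -- ~24: four 1996-byte chunks just fit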
