Bad PostgreSQL planner decisions on column with uneven distribution


Facts

I have the following simple table that represents financial transactions:

CREATE TABLE transaction (
    "user_id" text not null,
    "offset" serial not null,
    "tx_data" text not null
);

CREATE INDEX transaction_offset_idx ON transaction("offset");
CREATE INDEX transaction_user_offset_idx ON transaction("user_id", "offset");
  • The table contains ~1 billion rows.
  • A few users account for most of the transactions (stats taken from pg_stats; see the query sketch after this list):
most_common_vals: {'userA', 'userB', .... 'userC'}
most_common_freqs: {0.6, 0.07, ...., 0.00001}
  • The first transaction for userA is at offset 2_000_000.
  • Transactions are not distributed evenly over time. Even though userA has a lot of transactions, those transactions are "batched" close to each other: there are periods when the user was very active and periods when it was not active at all.
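
These per-column estimates can be read directly from the pg_stats view. A minimal sketch, assuming the table and column names from the schema above:

-- Inspect the planner's frequency estimates for user_id
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'transaction' AND attname = 'user_id';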

Problem

If I simply try to query the first 100 transactions for a user:

select * from transaction where user_id = 'someUser' order by "offset" limit 100

If I query by userC (which accounts for only 0.0001 of transactions), it's very fast and uses the transaction_user_offset_idx index:

Limit  (cost=0.71..370.76 rows=100 width=294)
  ->  Index Scan using transaction_user_offset_idx on transaction  (cost=0.71..112511575.20 rows=30404045 width=294)
        Index Cond: (user_id = 'userC')

If I query by userA (which accounts for 60% of transactions), it's extremely slow. The query planner uses only transaction_offset_idx: it figures that, since userA is so frequent, it isn't worth using the index on user_id at all, and instead scans in offset order and filters the rows:

Limit  (cost=0.58..11.36 rows=100 width=294)
  ->  Index Scan using transaction_offset_idx on transaction  (cost=0.58..359836920.47 rows=401760686 width=294)
        Filter: (user_id = 'userA')

It is so slow because the table is very big and the first row for userA is at offset 2_000_000. The query therefore has to walk roughly 2 million rows in offset order, filtering each one out, before it reaches the first row for userA.
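
One way to confirm where the time goes (a sketch; note that EXPLAIN ANALYZE actually executes the query) is to look at the "Rows Removed by Filter" counter in the actual plan:

-- Actual row counts, plus how many rows the offset-ordered scan discarded
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM transaction
WHERE user_id = 'userA'
ORDER BY "offset" LIMIT 100;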

db-fiddle to reproduce the issue

Solutions I tried

  • Lowering random_page_cost. I don't like this solution, as there are surely queries for which sequential scans are more optimal. On top of that, even with this parameter set to 1.0, the frequency of userA is so high that the planner still chooses the filtering plan.
  • Disabling planner statistics completely for the user_id column. This works for this query (it always uses the transaction_user_offset_idx index), but it feels "hackish" and I can't predict how it will affect other queries. (Both attempts are sketched in SQL right after this list.)
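
For completeness, here is roughly what those two attempts look like in SQL (a sketch; the session-level SET and the statistics target of 0 are my reading of the bullets above):

-- Attempt 1: make random index access look cheaper (session-level here)
SET random_page_cost = 1.0;

-- Attempt 2: stop collecting statistics for user_id, then refresh the stats
ALTER TABLE transaction ALTER COLUMN user_id SET STATISTICS 0;
ANALYZE transaction;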

What other possible solutions are out there?

How to solve


Method 1

A more localized solution than munging the entire column's stats just for the sake of one query is to hide the actual constant from the planner at planning time by putting it into a dummy subselect:

select * from transaction 
where "user_id" = (select 'userA') 
order by "offset" limit 100;

I don't understand why the planner doesn't do the right thing in the first place, though. I can sort of reproduce it, but for me, lowering random_page_cost to 1.1 is enough to get back to the right plan (and that value also happens to make sense given my hardware).

[Edit] I've explored a bit more of what is going on in the planner. I think you already know the gross error: it doesn't know how the orderings of user_id and "offset" are related. It thinks that to find 100 qualifying rows it only needs to scan 100/0.6 ≈ 167 rows of the transaction_offset_idx index, rather than the ~2,000,100 it actually does. But even so, that estimate would leave the one-column index slightly more expensive, not less expensive, than the better index. However, the one-column index is smaller than the two-column index, and index size does play a role in the cost estimation. A weak role, but enough to tip the balance.

This does leave me with a naughty thought. If you pad the transaction_offset_idx index with NUL bytes (ASCII 0) until it is the same size as transaction_user_offset_idx, that should remove the total size difference as a factor in planning. In my hands this is enough to get the planner to start making the correct decisions again. This is obviously not something to do cavalierly on a production system.


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.
