Postgres: Faster query when adding useless JOIN

I was trying to make a query more performant and ran into an interesting case: the query executes much faster when I add a (redundant) JOIN that doesn’t change the actual result set.

Original Query:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON)
select
    *
from
    product_reservations l1_
inner join product_occupancy_items l0_ on l1_.id = l0_.id
inner join products l2_ on l0_.product_id = l2_.id
where
    l2_.customer_id = 'a19917c2-5ee8-47c2-a757-7799c0e54b0d'
    and l0_.date_range && '[2019-08-09,2019-08-11)' = true

The query executes in ~88 ms; this is the plan:
https://explain.dalibo.com/plan/yJG

Adding a redundant JOIN:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON)
select
    *
from
    product_reservations l1_
inner join product_occupancy_items l0_ on l1_.id = l0_.id
inner join products l2_ on l0_.product_id = l2_.id
inner join products l3_ on l0_.product_id = l3_.id
where
    l2_.customer_id = 'a19917c2-5ee8-47c2-a757-7799c0e54b0d'
    and l0_.date_range && '[2019-08-09,2019-08-11)' = true

This produces the following plan: https://explain.dalibo.com/plan/Op3

This query runs in ~20 ms, about 4 times faster. As you may have noticed, the only difference is a useless JOIN on a table that is already joined (products).

Comparing the two EXPLAIN outputs, the difference shows up in the product_occupancy_items section:

  • In the 1st query, the planner uses only the index on the foreign key to products and fetches all matching rows, applying the date-range filter without an index. This is much faster even though there are ~12k matching rows to deal with (of which only 65 survive the filter).
  • In the 2nd query, the planner uses the index idx_occupancy_items_date_range (a GiST index on date_range) to filter rows by overlap. That looks good on paper, but it takes a whole 80 ms. This is understandable since it is used as a bitmap index scan and there are many rows matching that date range (~40k).

What’s going on is fairly clear from a "what" point of view; I’m not sure I understand the "why". It looks to me that, purely as an implementation side effect, the extra join changed the part of the plan that accesses product_occupancy_items and got lucky.

I’m wondering whether we can do anything to let the planner make the same decision without having to modify the query just to fool it.

Note that while the difference may sound like just a few milliseconds, this query is executed many times as part of a bulk process. The whole bulk run takes about 3 s with the double join and about 30 s without it, so the difference is significant.

EDIT: Adding the relevant part of the schema definition (columns not relevant to this question have been redacted)

CREATE TABLE public.product_occupancy_items (
    id uuid NOT NULL,
    product_id uuid NOT NULL,
    date_range daterange NOT NULL,
    CONSTRAINT product_occupancy_items_pkey PRIMARY KEY (id),
    CONSTRAINT fk_4a50848387335af1 FOREIGN KEY (product_id) REFERENCES public.products(id) ON DELETE CASCADE
);
CREATE INDEX idx_4a50848387335af1 ON public.product_occupancy_items USING btree (product_id);
CREATE INDEX idx_occupancy_items_date_range ON public.product_occupancy_items USING gist (date_range);

CREATE TABLE public.products (
    id uuid NOT NULL,
    customer_id uuid NOT NULL,
    quantity int2 NOT NULL,
    CONSTRAINT products_pkey PRIMARY KEY (id),
    CONSTRAINT fk_c03833c58565851 FOREIGN KEY (customer_id) REFERENCES public.customers(id) ON DELETE CASCADE
);
CREATE INDEX idx_c03833c58565851 ON public.products USING btree (customer_id);

CREATE TABLE public.product_reservations (
    id uuid NOT NULL,
    order_id uuid NOT NULL,
    CONSTRAINT product_reservations_pkey PRIMARY KEY (id),
    CONSTRAINT fk_93e513ae8d9f6d38 FOREIGN KEY (order_id) REFERENCES public.orders(id) ON DELETE CASCADE
);
CREATE INDEX idx_93e513ae8d9f6d38 ON public.product_reservations USING btree (order_id);

How to solve:

Method 1

The easiest thing would be to just get rid of the GiST index. It can’t be misused if it doesn’t exist. Assuming you created that index for a good (but un-shown) reason and still need it, then if your goal is not to change the original query, I think the best shot is to create a new index:

USING gist (date_range, product_id);

Or maybe

USING gist (product_id, date_range);

The 2nd of those is probably better for your shown query, while the 1st is probably "good enough" for that query and will likely also work well with the other queries that motivated you to create the original GiST index in the first place (so it can replace that original index, saving both space and future index-maintenance work). You will need the btree_gist extension to include product_id in a GiST index.
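Putting that together, the setup would look something like this (a sketch; the index name is illustrative, not from the original schema):

```sql
-- btree_gist lets a scalar column like product_id (uuid) participate in a GiST index
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- Composite index matching the shown query: equality-ish lookup on product_id,
-- overlap search on date_range
CREATE INDEX idx_occupancy_items_product_range
    ON public.product_occupancy_items
    USING gist (product_id, date_range);

-- Only if the new index covers every query that used the original one:
-- DROP INDEX public.idx_occupancy_items_date_range;
```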

I don’t know why the extra join makes things better. But if your goal is to not change the original query, then figuring out why is probably not worth the effort as I doubt it will lead to any insights that can be acted upon without changing the query.

Regarding the 12,000 rows that get filtered out in your truly faster plan: PostgreSQL thinks it will need to read them from disk before it can filter them, so it estimates that step as quite expensive. In reality, it seems all that data was already cached (or maybe you just have really fast disks), so the expense never materialized. So you might get a plan similar to the fast one, with the original query and original indexes, just by changing random_page_cost and maybe seq_page_cost. But if that works, I suspect settings tuned for this query would not be good for the system as a whole (unless your current settings are simply wrong to start with).
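A session-level experiment along those lines might look like this (values are illustrative; the defaults are 4.0 and 1.0 respectively):

```sql
-- Make random I/O look cheaper, as if most pages were cached
SET random_page_cost = 1.1;

-- Re-run EXPLAIN (ANALYZE, BUFFERS) on the original query here
-- and compare the chosen plan against the two plans above.

RESET random_page_cost;
```

SET only affects the current session, so this is safe to try interactively before considering any change to postgresql.conf or a per-table/per-role setting.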

What is your distribution of date_range like? Is each range about the same length, or do they vary greatly in length? What is the total time period covered by all ranges combined? Your GiST index seems to be underperforming compared to what I can simulate, but if rebuilding the index didn’t help then there is probably nothing you can do about this, it is probably baked into your data distribution.
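To answer those distribution questions, a quick diagnostic query over the schema shown above might be (a sketch; it assumes no unbounded ranges, which would yield NULLs):

```sql
-- Range-length distribution and total period covered
SELECT
    avg(upper(date_range) - lower(date_range)) AS avg_length_days,
    min(upper(date_range) - lower(date_range)) AS min_length_days,
    max(upper(date_range) - lower(date_range)) AS max_length_days,
    min(lower(date_range))                     AS earliest_start,
    max(upper(date_range))                     AS latest_end
FROM public.product_occupancy_items;
```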

Method 2

It may be that all you need is to prevent use of the date-range index. On Oracle I would have done it like this:

select
    *
from
    product_reservations l1_
inner join product_occupancy_items l0_ on l1_.id = l0_.id
inner join products l2_ on l0_.product_id = l2_.id
where
    l2_.customer_id = 'a19917c2-5ee8-47c2-a757-7799c0e54b0d'
    and l0_.date_range + 0 && '[2019-08-09,2019-08-11)' = true

Any expression that prevents PostgreSQL from using the index will do (note that `+ 0` itself is an Oracle idiom and is not valid on a daterange in PostgreSQL).
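One possible PostgreSQL equivalent (a sketch): wrapping the column in a no-op COALESCE turns it into an expression, so the GiST index on the bare column cannot be used.

```sql
select
    *
from
    product_reservations l1_
inner join product_occupancy_items l0_ on l1_.id = l0_.id
inner join products l2_ on l0_.product_id = l2_.id
where
    l2_.customer_id = 'a19917c2-5ee8-47c2-a757-7799c0e54b0d'
    -- coalesce(x, x) returns x unchanged but defeats the index match
    and coalesce(l0_.date_range, l0_.date_range) && '[2019-08-09,2019-08-11)'
```

The trade-off is that this bakes a planner hint into the query text, which is exactly what the question hoped to avoid, so Method 1 is still the cleaner route.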

The methods above were sourced from stackoverflow.com and stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
