Efficient unique Inner Join

All we need is an easy explanation of the problem, so here it is.

Context

I have a query that is something like where I want to query all of the days that each product was on sale at at least 1 store. This query runs as part of a larger query and runs whenever a user loads a specific page.

TL;DR Schema and query

(data is for example purposes only)

https://dbfiddle.uk/?rdbms=postgres_14&fiddle=0ad70a90a446e811e08aa1bd779550d1

Schema

I have a table sales_periods that tracks the date periods that products were on sale at each store.

product_id since till store_id
42 Aug 12, 2016 Jan 27, 2018 19
42 Jan 1, 2020 Jan 27, 2021 19
43 Feb 14, 2019 Jan 27, 2022 20

Query

WITH dates(day) AS(
  SELECT day FROM calendar WHERE day BETWEEN '2021-04-14'::date AND '2022-04-13'::date
)
SELECT
  sales_periods.product_id,
  COUNT(DISTINCT dates.day)
FROM
  sales_periods
  INNER JOIN dates ON dates.day >= sales_periods.since
  AND dates.day <= sales_periods.till
GROUP BY
  sales_periods.product_id

The problem

There could be 100s of stores and 10,000s of products. And I want to look at an entire year! From what I think I see from EXPLAIN, Postgres will do the JOIN (creating a lot of data in memory = 100 stores x 10,000 products x 365 days) and then it will reduce the data with the group (max 10,000 products x 365 days).

Goal

I would like the query to run <100ms. I think <<50ms is possible if the query can be made properly. I can see what it’s doing is wrong but I don’t know how to do better!

What I’ve tried

  1. Combining the sales_periods <- Takes too much time compared with the amount of time saved
  2. Use DISTINCT ON instead of GROUP BY <- slower
  3. Switching to daterange (https://dbfiddle.uk/?rdbms=postgres_14&fiddle=55fa23c2bbb7d44374ef6ab2acc570a2) <- slower
  4. I played around a lot with the indexes, including trying out GIST indexes, sorted index, etc. <- no change, sometimes worse

What I think I want

If there’s a way to tell Postgres to stop as soon as it knows that each product is available for purchase from at least 1 store on each day.

Environment

Postgres 13

Constraints

It would be very hard to change the column structure of the sales_periods table but it could be done. Adding and removing columns, adding index and views are all easy.

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

There are a couple of indexes that might help a query of your kind. Like, a btree index with inverted sort order (possibly multicolumn, or "covering"). See:

Or a GiST or SP-GiST index on daterange. See:

However, while your filters are hardly selective, indexes can only do so much. What makes your query expensive is "unnesting" rows with long time ranges to many rows before folding duplicate days with DISTINCT and counting. That can be improved gradually.

Faster EXISTS variant

Your attempt at an EXISTS query has an expensive flaw. You still COUNT(DISTINCT dates.day), but there’s no need any more as EXISTS already eliminated duplicates. count(*) is substantially faster. Plus some other minor modifications:

WITH dates AS (
   SELECT day FROM calendar
   WHERE  day BETWEEN '2021-04-14' AND '2022-04-13'
   )
SELECT p.product_id, count(*) AS days
FROM   dates d
CROSS  JOIN products p
WHERE  EXISTS (
   SELECT FROM sales_periods s
   WHERE  s.product_id = p.product_id
   AND    d.day >= s.since
   AND    d.day <= s.till
   )
GROUP  BY 1;

db<>fiddle here

Pre-select rows from sales_periods

Another possible optimization, especially while indexes on the main table don’t help anyway, or for more selective filter ranges: eliminate irrelevant sales early:

WITH s AS (
   SELECT product_id, since, till
   FROM   sales_periods
   WHERE  since <= '2022-04-13'
   AND    till  >= '2021-04-14'
   )
SELECT p.product_id, count(*) AS days
FROM  (
   SELECT day FROM calendar
   WHERE  day BETWEEN '2021-04-14' AND '2022-04-13'
   ) d
CROSS  JOIN products p
WHERE  EXISTS (
   SELECT FROM s
   WHERE  s.product_id = p.product_id
   AND    d.day >= s.since
   AND    d.day <= s.till
   )
GROUP BY 1;

db<>fiddle here

Aside:
Your calendar table holds days till the year 2222, which seems like excessive waste. There is no need to handle a table of 81083 rows. But that’s a minor issue.

Much faster with range aggregation

What I really want to post is the following query making use of range aggregation, and then count the days in the range. 200x faster in my hands with your sample data (~ 3 ms vs ~ 650 ms for your original query). But there’s a snag: we need multiranges introduced with Postgres 14.

Create this auxiliary function to count days in a datemultirange:

CREATE OR REPLACE FUNCTION f_days_in_multirange(datemultirange)
  RETURNS int
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE
BEGIN ATOMIC
SELECT sum(upper(r) - lower(r))::int
FROM   unnest($1) t(r);
END;

COMMENT ON FUNCTION f_days_in_multirange(datemultirange) IS 'Counts days in given datemultirange.
Input with any unbounded range results in NULL value!';

See:

Then the query can be:

SELECT product_id
     , CASE WHEN sales_ct = 1 THEN upper(date_range) - lower(date_range)
            ELSE f_days_in_multirange(date_range) END AS days
FROM  (
   SELECT product_id, count(*) AS sales_ct
        , range_agg(daterange(since, till, '[]'))
        * multirange(daterange('2021-04-14', '2022-04-13', '[]')) AS date_range        
   FROM   sales_periods
   WHERE  since <= '2022-04-13'
   AND    till  >= '2021-04-14'
   GROUP  BY 1
   ) sub;

db<>fiddle here

Most of your resulting date ranges consist of a single range, and most of those come from a single source. So I added a cheap count sales_ct and used that to take a shortcut to great effect (almost 10x). CASE WHEN sales_ct = 1 THEN .... Depending on your actual data distribution, other shortcuts may be possbile.

You need to understand range and multirange types, range aggregation, and the (multi-)range operator anyrange * anyrange → anyrange to compute the intersection of the ranges.

Method 2

In your example data, most products are only sold in one store. How does this compare to your real data? Will that matrix be dense? Will it be over populated (multiple different ranges per store/product pairing)?

If this exists as part of a larger query, maybe the best optimization opportunities only exist in the context of that larger query.

You said that combining ranges took too much time. Why would you have to do that for every execution? Store the combined ranges (or for that matter, the results of this query) in a materialized view which is only refreshed daily.

If there’s a way to tell Postgres to stop as soon as it knows that each product is available for purchase from at least 1 store on each day.

Yes, but since each product times each day >= 3,650,000, I doubt this will actually be as fast as you are hoping. You can get a semi-join by using EXISTS (...).

This formulation requires you to have a table named "products" which lists each product, which I assume you have already.

SELECT
    product_id,
    COUNT(DISTINCT dates.day)
FROM generate_series(<START_DATE>, <END_DATE>, '1 day'::interval) AS dates(day)
CROSS JOIN products
WHERE EXISTS (
    SELECT 1 from sales_periods 
    WHERE sales_periods.product_id=products.product_id
    AND dates.day >= sales_periods.since
    AND dates.day <= sales_periods.till
)
GROUP BY
    product_id

So it is possible, but as I said I don’t think it will have the benefit you want.

By the way, in your fiddle to show the usage of daterange type, you didn’t add the GiST index to index it. I doubt that it would make much difference anyway, but it is rather unfair to not even try it.

Method 3

There is two problems in your query…

FIRST generating data on the fly by using generate_series is a bad trick that will drag down performances… To optimize a query, the optimizer (planer in PostGreSQL terminology…) must know how many rows will be used in every table, before executing the query, thus to build a good execution plan, that will be used to solve the "demand". It is impossible to know in advance the number of rows with generate_series because this number will be known, after executing the function… This trouble is known as an X/Y problem…

SECOND, due to MVCC, PostGreSQL is unnable to perform a quick execution of a COUNT because it have to check every rows in every pages of the table to do so, while professionnal RDBMS like Oracle or MS SQL Server does not need to read everything, only the header of the pages that contains exact "live" rows…

To see the difference in terms of performances between SQL Server and PostGreSQL for COUNT queries, read the benchmark I have made… You can reproduce the benchmark I have used for your own purpose… The differences are quite important (between 61 and 1,533 times faster for SQL Server)

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply