Optimizing a PostgreSQL query with many small lookups


I have a database about books and book sales.
The tables look like this:

CREATE TABLE purchase (
    person_id integer,
    book_id integer,
    CONSTRAINT purchase_pk PRIMARY KEY (person_id, book_id)
);

CREATE TABLE person (
    person_id integer PRIMARY KEY
);

CREATE TABLE book (
    book_id integer PRIMARY KEY
);

CREATE INDEX purchase_idx ON purchase (person_id, book_id);

I need to know, for each pair of books, the number of people who have bought both books, so basically:

book_1_id  book_2_id  people_count_who_bought_both_books
        1          2                                  45
        1          3                                 789

I have defined the following query for this:

CREATE MATERIALIZED VIEW book_sales_stat AS
SELECT b1.book_id AS book_1_id,
       b2.book_id AS book_2_id,
       (SELECT count(*)
        FROM person AS per
        WHERE EXISTS (SELECT FROM purchase pur
                      WHERE pur.person_id = per.person_id
                      AND pur.book_id = b1.book_id)
        AND EXISTS (SELECT FROM purchase pur
                      WHERE pur.person_id = per.person_id
                      AND pur.book_id = b2.book_id)
       ) AS person_count
FROM book AS b1, book AS b2
WHERE b1.book_id < b2.book_id;
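For reference, the view's logic can be reproduced on a tiny made-up dataset (sketched here with Python's sqlite3 purely for illustration; SQLite requires an explicit select list, so SELECT 1 stands in for Postgres's bare SELECT in the EXISTS subqueries):

```python
import sqlite3

# Made-up toy data: person 1 bought books 1 and 2; person 2 bought books 1, 2 and 3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY);
CREATE TABLE book (book_id INTEGER PRIMARY KEY);
CREATE TABLE purchase (person_id INTEGER, book_id INTEGER,
                       CONSTRAINT purchase_pk PRIMARY KEY (person_id, book_id));
INSERT INTO person VALUES (1), (2);
INSERT INTO book VALUES (1), (2), (3);
INSERT INTO purchase VALUES (1, 1), (1, 2), (2, 1), (2, 2), (2, 3);
""")

# Same correlated-subquery shape as the materialized view above.
rows = conn.execute("""
SELECT b1.book_id, b2.book_id,
       (SELECT count(*) FROM person per
        WHERE EXISTS (SELECT 1 FROM purchase pur
                      WHERE pur.person_id = per.person_id
                        AND pur.book_id = b1.book_id)
          AND EXISTS (SELECT 1 FROM purchase pur
                      WHERE pur.person_id = per.person_id
                        AND pur.book_id = b2.book_id)) AS person_count
FROM book b1, book b2
WHERE b1.book_id < b2.book_id
ORDER BY b1.book_id, b2.book_id
""").fetchall()
print(rows)  # [(1, 2, 2), (1, 3, 1), (2, 3, 1)]
```

The subquery runs once per pair of books, which is where the cost explodes on real data.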

But unfortunately it is incredibly slow.

The execution plan looks like this:

 Nested Loop  (cost=0.29..1564886083010.59 rows=90420300 width=16)
   ->  Seq Scan on book b1  (cost=0.00..237.70 rows=16470 width=4)
   ->  Index Only Scan using book_pk on book b2  (cost=0.29..96.38 rows=5490 width=4)
         Index Cond: (book_id > b1.book_id)
   SubPlan 1
     ->  Aggregate  (cost=17306.76..17306.77 rows=1 width=8)
           ->  Nested Loop  (cost=1.14..17306.76 rows=1 width=0)
                 ->  Nested Loop  (cost=0.72..17238.11 rows=123 width=8)
                       ->  Index Only Scan using purchase_idx on purchase pur_1  (cost=0.42..16791.97 rows=123 width=4)
                             Index Cond: (book_id = b2.book_id)
                       ->  Index Only Scan using person_pk on person per  (cost=0.29..3.63 rows=1 width=4)
                             Index Cond: (person_id = pur_1.person_id)
                 ->  Index Only Scan using purchase_idx on purchase pur  (cost=0.42..0.56 rows=1 width=4)
                       Index Cond: ((person_id = per.person_id) AND (book_id = b1.book_id))
 JIT:
   Functions: 13
   Options: Inlining true, Optimization true, Expressions true, Deforming true

This seems pretty good, insofar as the purchase_idx index is used for pretty much everything inside the loop.

But why does it not use a parallel execution plan? I went over all the requirements for parallel execution in the Postgres docs, but did not find any requirement that is not satisfied.

What can I do to speed the query up and get it to execute in parallel?

How to solve:


Method 1

This lists all pairs of books that were actually bought together by at least one person. (All other pairs were never bought together, so there is no point in diluting the result with them):

SELECT p1.book_id AS book1_id
     , p2.book_id AS book2_id
     , count(*) AS count_of_people_who_bought_both_books
FROM   purchase p1
JOIN   purchase p2 USING (person_id)
WHERE  p1.book_id < p2.book_id
GROUP  BY 1, 2
ORDER  BY 1, 2;

Should be substantially faster.

All it needs is the existing PK index on (person_id, book_id) – with person_id first, just like you have it.

Postgres gets it done with two index-only scans in my hands, provided the table purchase has been vacuum’ed enough.

Assuming a proper many-to-many implementation with referential integrity enforced by FKs, there is no need to involve the tables person and book. That would just add cost.

As opposed to your correlated subqueries (see Jeff’s answer), this query plan can be parallelized, too.
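A quick sanity check of the self-join on the same kind of made-up toy data (using Python's sqlite3 rather than Postgres, purely to illustrate the logic):

```python
import sqlite3

# Made-up toy data: person 1 bought books 1 and 2; person 2 bought books 1, 2 and 3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchase (person_id INTEGER, book_id INTEGER,
                       PRIMARY KEY (person_id, book_id));
INSERT INTO purchase VALUES (1, 1), (1, 2), (2, 1), (2, 2), (2, 3);
""")

# Each row of the self-join is one person who bought both books of a pair,
# so count(*) per (book_1, book_2) group is the number of such people.
rows = conn.execute("""
SELECT p1.book_id, p2.book_id, count(*)
FROM   purchase p1
JOIN   purchase p2 USING (person_id)
WHERE  p1.book_id < p2.book_id
GROUP  BY 1, 2
ORDER  BY 1, 2
""").fetchall()
print(rows)  # [(1, 2, 2), (1, 3, 1), (2, 3, 1)]
```

Note that pairs never bought together simply do not appear in the output, which is the point of Method 1.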

Method 2

Your query falls under this condition:

The following operations are always parallel restricted:

  • Plan nodes that reference a correlated SubPlan.

The top nested loop could be parallelized, but it would have to be gathered before dispatching to the SubPlan. There is just no point in doing it that way.

Method 3

You can achieve the same with something like

select  book_1_id, book_2_id, sum(cnt) as person_count
from (
    select  b1.book_id as book_1_id, b2.book_id as book_2_id, 0 as cnt
    from    book b1
    join    book b2
      on    b1.book_id < b2.book_id
    union all
    select  p1.book_id, p2.book_id, count(*) as cnt
    from    purchase p1
    join    purchase p2
      on    p1.person_id = p2.person_id
     and    p1.book_id < p2.book_id
    group by p1.book_id, p2.book_id
) sq
group by book_1_id, book_2_id;

This assumes that there are no rows in purchase that don't correspond to a row in person or book (there is no foreign key at the moment). It also requires (person_id, book_id) to be unique in purchase – it is right now, but if that is just for demo purposes you will need a DISTINCT in the query, like:

with unique_purchase as (
    select distinct person_id, book_id from purchase
)
select  book_1_id, book_2_id, sum(cnt) as person_count
from (
    select  b1.book_id as book_1_id, b2.book_id as book_2_id, 0 as cnt
    from    book b1
    join    book b2
      on    b1.book_id < b2.book_id
    union all
    select  p1.book_id, p2.book_id, count(*) as cnt
    from    unique_purchase p1
    join    unique_purchase p2
      on    p1.person_id = p2.person_id
     and    p1.book_id < p2.book_id
    group by p1.book_id, p2.book_id
) sq
group by book_1_id, book_2_id;

This performs faster for me, but it is still not going to be instant if you have lots of data, due to the nature of the joins involved.
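To see what the union all branch buys you (pairs that were never bought together stay in the result with a count of 0), here is a minimal sketch on made-up data, again with Python's sqlite3 instead of Postgres, purely for illustration:

```python
import sqlite3

# Made-up toy data: books 2 and 3 are never bought together, so only the
# zero-count rows from the book cross join keep that pair in the result.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (book_id INTEGER PRIMARY KEY);
CREATE TABLE purchase (person_id INTEGER, book_id INTEGER,
                       PRIMARY KEY (person_id, book_id));
INSERT INTO book VALUES (1), (2), (3);
INSERT INTO purchase VALUES (1, 1), (1, 2), (2, 1), (2, 3);
""")

rows = conn.execute("""
SELECT book_1_id, book_2_id, sum(cnt) AS person_count
FROM (
    SELECT b1.book_id AS book_1_id, b2.book_id AS book_2_id, 0 AS cnt
    FROM   book b1
    JOIN   book b2 ON b1.book_id < b2.book_id
    UNION ALL
    SELECT p1.book_id, p2.book_id, count(*)
    FROM   purchase p1
    JOIN   purchase p2 ON p1.person_id = p2.person_id
                      AND p1.book_id < p2.book_id
    GROUP  BY p1.book_id, p2.book_id
) sq
GROUP BY book_1_id, book_2_id
ORDER BY book_1_id, book_2_id
""").fetchall()
print(rows)  # [(1, 2, 1), (1, 3, 1), (2, 3, 0)]
```

Whether you want those zero-count pairs is the main functional difference between Method 1 and Method 3.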


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0.
