How to get Postgres to use a Merge Join?

I’ve got two tables with the same type of primary key (and thus sorted on that key, I’m assuming). I want to update rows in one table based on matching / joined rows in the other table. I expect a Merge Join since the keys of both tables are sorted, but instead I’m getting a Hash Join. How do I get it to use a Merge Join?

CREATE TABLE b (
  id int PRIMARY KEY,
  num int NOT NULL
);
INSERT INTO b
VALUES
(1, 2),
(2, 3),
(3, 4),
(4, 5);

CREATE TABLE a (
  id int PRIMARY KEY,
  num int NOT NULL
);
INSERT INTO a
VALUES
(2, 1),
(4, 1);

EXPLAIN
UPDATE b
SET num = b.num - a.num
FROM a
WHERE b.id = a.id;

                             QUERY PLAN
--------------------------------------------------------------------
 Update on b  (cost=1.04..2.11 rows=2 width=20)
   ->  Hash Join  (cost=1.04..2.11 rows=2 width=20)
         Hash Cond: (b.id = a.id)
         ->  Seq Scan on b  (cost=0.00..1.04 rows=4 width=14)
         ->  Hash  (cost=1.02..1.02 rows=2 width=14)
               ->  Seq Scan on a  (cost=0.00..1.02 rows=2 width=14)
  • PostgreSQL 13.4
  • Ubuntu 21.04

How to solve :

Method 1

“I’ve got two tables with the same type of primary key (and thus sorted on that key, I’m assuming)”

Your assumption is incorrect. Key values are stored in order in the index, but the table rows can be in any order on disk. Since your statement requires that columns other than key columns be read, and both tables must be accessed in their entirety, it is more efficient to perform table scans (and subsequently a hash join) than index scans with random row reads: the latter effectively doubles the number of I/O operations.
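To see what the planner would do if a hash join were not available, one option is to discourage the competing join methods for a single transaction and re-run EXPLAIN. This is a minimal sketch for experimentation only, not a recommendation for production use. `enable_hashjoin` and `enable_nestloop` are standard planner settings, but they only penalise those plan types rather than strictly forbidding them, so the resulting plan may still vary.

```sql
-- Experiment inside a transaction so the settings do not leak into the session.
BEGIN;

-- Discourage the alternatives to a merge join (these are cost penalties, not hard bans).
SET LOCAL enable_hashjoin = off;
SET LOCAL enable_nestloop = off;

-- Plain EXPLAIN only plans the statement; it does not execute the UPDATE.
EXPLAIN
UPDATE b
SET num = b.num - a.num
FROM a
WHERE b.id = a.id;

-- Discard the SET LOCAL changes.
ROLLBACK;
```

Comparing the estimated cost of this plan with the original Hash Join plan shows why the planner preferred the hash join in the first place.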

On tables with so few rows there is no reason whatsoever to use any access method other than a table scan anyway, since all rows of each table can be retrieved by a single I/O operation.
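The size claim can be checked from the catalog. The sketch below assumes the tables `a` and `b` from the question are the only relations with those names visible in the database; `relpages` and `reltuples` are planner estimates that are refreshed by ANALYZE.

```sql
-- Refresh the planner statistics for both tables.
ANALYZE a;
ANALYZE b;

-- relpages is the estimated number of disk pages (8 kB each by default) the table occupies.
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname IN ('a', 'b')
  AND relkind = 'r';  -- ordinary tables only, which excludes the primary key indexes
```

Assuming the default cost settings (`seq_page_cost = 1.0`, `cpu_tuple_cost = 0.01`), the plan in the question is consistent with one page per table: the sequential scan on `b` costs 1 × 1.0 + 4 × 0.01 = 1.04 and the one on `a` costs 1 × 1.0 + 2 × 0.01 = 1.02.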

With more reasonably sized tables, creating an index on (id, num) or clustering the table by the primary key may or may not lead to the optimiser choosing a merge join.
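A rough way to try this out is sketched below. The row counts and the index names `a_id_num_idx` and `b_id_num_idx` are made up for illustration, and `a_pkey` / `b_pkey` are assumed to be the default names PostgreSQL gave the primary key indexes. Whether the plan actually changes depends on table sizes, statistics and cost settings.

```sql
-- Hypothetical larger data set: extend b to 1,000,000 rows and give a every even id.
INSERT INTO b SELECT g, g FROM generate_series(5, 1000000) AS g;
INSERT INTO a SELECT g, 1 FROM generate_series(6, 1000000, 2) AS g;

-- Option 1: indexes that contain both columns used by the statement.
CREATE INDEX a_id_num_idx ON a (id, num);
CREATE INDEX b_id_num_idx ON b (id, num);

-- Option 2: physically reorder each table by its primary key index.
-- CLUSTER is a one-off operation; later writes are not kept in order.
CLUSTER a USING a_pkey;
CLUSTER b USING b_pkey;

-- Refresh statistics so the planner sees the new sizes and physical ordering.
ANALYZE a;
ANALYZE b;

-- Plan only; the UPDATE is not executed.
EXPLAIN
UPDATE b
SET num = b.num - a.num
FROM a
WHERE b.id = a.id;
```

The two options are independent, so each can be tried separately to see which one, if either, changes the chosen join method.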


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
