UPDATE FROM with a large table is slow and uses Seq Scans

All we need is an easy explanation of the problem, so here it is.

I have a large table (ultimately maybe a billion rows but currently ~26 million) for which I want to set a flag on the highest PK for a given grouping, in a one-off batch.

I chose to create a temporary table that stores the PKs that should be set current=true, the rest should be set current=false. I made a temporary table rather than a materialized view, but I think it would make no real difference.

The process of discovering the maximum ID for each is not too painful:

CREATE TABLE assertion (
    pk integer NOT NULL,
    a bigint NOT NULL,
    b bigint NOT NULL,
    c bigint NOT NULL,
    d integer NOT NULL,
    current boolean DEFAULT false NOT NULL
);

CREATE INDEX assertion_current_idx ON assertion USING btree (current) WHERE (current = true);
CREATE INDEX assertion_current_idx1 ON assertion USING btree (current);
CREATE UNIQUE INDEX assertion_a_b_c_d_idx ON assertion USING btree (a, b, c, d) WHERE (current = true);

SELECT COUNT(pk) FROM assertion;

-- 26916858
-- Time: 2912.403 ms (00:02.912)

CREATE TEMPORARY TABLE assertion_current AS
    (SELECT MAX(pk) as pk, a, b, c, d
      FROM assertion
      GROUP BY a, b, c, d);

-- Time: 72218.755 ms (01:12.219)

ANALYZE assertion_current;

CREATE INDEX ON assertion_current(pk);

-- Time: 22107.698 ms (00:22.108)

SELECT COUNT(pk) FROM assertion_current;

-- 26455092
-- Time: 15650.078 ms (00:15.650)

Per the count of assertion_current, we need to set the ‘current’ flag true for 98% of rows.

The tricky thing is how to update the assertion table in reasonable time based on the current values. There is a unique constraint on a, b, c, d, current which must be maintained, therefore the update to the current column needs to be atomic to avoid breaking the constraint.

I have a few options:

Option 1

Update only those current values that change. This has the benefit of updating the smallest number of rows necessary based on an indexed field:


BEGIN;
UPDATE assertion
   SET current = false
   WHERE assertion.current = true AND PK NOT IN (SELECT pk FROM assertion_current);
UPDATE assertion
   SET current = true
   WHERE assertion.current = false AND PK IN (SELECT pk FROM assertion_current);
COMMIT;

But both queries involve sequence scans on assertion_current which (I think) would have to be multiplied by a large number of rows.

Update on assertion  (cost=0.12..431141.55 rows=0 width=0)
   ->  Index Scan using assertion_current_idx on assertion  (cost=0.12..431141.55 rows=1 width=7)
         Index Cond: (current = true)
         Filter: (NOT (SubPlan 1))
         SubPlan 1
           ->  Materialize  (cost=0.00..787318.40 rows=29982560 width=4)
                 ->  Seq Scan on assertion_current  (cost=0.00..520285.60 rows=29982560 width=4)

and

 Update on assertion  (cost=595242.56..596693.92 rows=0 width=0)
   ->  Nested Loop  (cost=595242.56..596693.92 rows=17974196 width=13)
         ->  HashAggregate  (cost=595242.00..595244.00 rows=200 width=10)
               Group Key: assertion_current.pk
               ->  Seq Scan on assertion_current  (cost=0.00..520285.60 rows=29982560 width=10)
         ->  Index Scan using assertion_pkey on assertion  (cost=0.56..8.58 rows=1 width=10)
               Index Cond: (pk = assertion_current.pk)
               Filter: (NOT current)

This means that one of these queries (many current true or many current false) always takes a very long time.

Option 2

Single pass but has to touch every row unnecessarily.

UPDATE assertion
   SET current =
     (CASE WHEN assertion.pk IN (select PK from assertion_current)
     THEN TRUE ELSE FALSE END)

but this results in a sequence scan on assertion_current again

 Update on assertion  (cost=0.00..15498697380303.70 rows=0 width=0)
   ->  Seq Scan on assertion  (cost=0.00..15498697380303.70 rows=35948392 width=7)
         SubPlan 1
           ->  Materialize  (cost=0.00..787318.40 rows=29982560 width=4)
                 ->  Seq Scan on assertion_current  (cost=0.00..520285.60 rows=29982560 width=4)

Option 3

Like option 1, but use WHERE in the update:

BEGIN;
UPDATE assertion SET current = false WHERE current = true;
UPDATE assertion SET current = true FROM assertion_current
  WHERE assertion.pk = assertion_current.pk;
COMMIT;

but the second query involves two seq scans:

 Update on assertion  (cost=1654256.82..2721576.65 rows=0 width=0)
   ->  Hash Join  (cost=1654256.82..2721576.65 rows=29982560 width=13)
         Hash Cond: (assertion_current.pk = assertion.pk)
         ->  Seq Scan on assertion_current  (cost=0.00..520285.60 rows=29982560 width=10)
         ->  Hash  (cost=1029371.92..1029371.92 rows=35948392 width=10)
               ->  Seq Scan on assertion  (cost=0.00..1029371.92 rows=35948392 width=10)

Option 4

Thanks @jjanes, this took > 6 hours, so I cancelled it.

UPDATE assertion
   SET current = not current
   WHERE current <>
     (CASE WHEN assertion.pk IN (select PK from assertion_current)
     THEN TRUE ELSE FALSE END)

produces

 Update on assertion  (cost=0.00..11832617068493.14 rows=0 width=0)
   ->  Seq Scan on assertion  (cost=0.00..11832617068493.14 rows=27307890 width=7)
         Filter: (current <> CASE WHEN (SubPlan 1) THEN true ELSE false END)
         SubPlan 1
           ->  Materialize  (cost=0.00..787318.40 rows=29982560 width=4)
                 ->  Seq Scan on assertion_current  (cost=0.00..520285.60 rows=29982560 width=4)

Option 5

Thanks @a_horse_with_no_name. This takes 24 minutes on my machine.

UPDATE assertion tg SET current = EXISTS (SELECT pk FROM assertion_current cr WHERE cr.pk = tg.pk);

gives

 Update on assertion tg  (cost=0.00..233024784.94 rows=0 width=0)
   ->  Seq Scan on assertion tg  (cost=0.00..233024784.94 rows=27445116 width=7)
         SubPlan 1
           ->  Index Only Scan using assertion_current_pk_idx on assertion_current cr  (cost=0.44..8.46 rows=1 width=0)
                 Index Cond: (pk = tg.pk)

Is there a better way to achieve this in a timely fashion?

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

… we need to set the ‘current’ flag true for 98% of rows

So NOT current will be the rare case. Seems like you have been trying to do things back to front so far.

Your current indexes are actively unhelpful for the given data distribution:

CREATE INDEX assertion_current_idx ON assertion USING btree (current) WHERE (current = true);  
CREATE INDEX assertion_current_idx1 ON assertion USING btree (current);  
CREATE UNIQUE INDEX assertion_a_b_c_d_idx ON assertion USING btree (a, b, c, d) WHERE (current = true);

We need to keep the UNIQUE index to enforce your requirements, and it’s useful, too:

CREATE UNIQUE INDEX assertion_a_b_c_d_idx ON assertion (a, b, c, d) WHERE current;

But simplify (current = true) to just current. No point in executing a redundant expression, just use the boolean value.

assertion_current_idx is absolutely pointless and will never be used. But it still has to be kept up to date. Drop it.

assertion_current_idx1 is almost as pointless. At least it would make sense for queries looking for current = false. But it’s much cheaper to have this partial index instead – which also supports my suggested 2nd query below:

CREATE INDEX assertion_not_current_idx ON assertion (a, b, c, d, pk) WHERE NOT current;

Create this index after the "initial query" below.

Note that I skip the temporary table completely in favor of a CTE. Less overhead.

Initial query

You disclosed in a later comment that "all rows are set current = false" initially. We can use a much simpler and faster query to init. Just update the one row per group where there is no other with a bigger PK. No unique violations possible, no updates to "not current":

UPDATE assertion a
SET    current = true
WHERE  NOT EXISTS (
   SELECT FROM assertion x
   WHERE (x.a, x.b, x.c, x.d)
       = (a.a, a.b, a.c, a.d)
   AND   x.pk > a.pk
   );

General query

Assuming there can be no current row per group.

WITH new_current AS (  -- only these groups require updates
   SELECT *
   FROM  (
      -- get row with greatest non-current pk per group
      SELECT a, b, c, d, MAX(pk) AS new_pk
          , (SELECT a2.pk  -- get current row of the same group
             FROM   assertion a2
             WHERE (a2.a, a2.b, a2.c, a2.d)
                 = (a1.a, a1.b, a1.c, a1.d)
             AND    a2.current
            ) AS old_pk
      FROM   assertion a1
      WHERE  NOT current
      GROUP  BY a, b, c, d
      ) a1
   WHERE (old_pk < new_pk   -- only if old pk is lower (!)
       OR old_pk IS NULL)   -- or does not exist
   )
, up1(dummy) AS (  -- update current to false FIRST
   UPDATE assertion a
   SET    current = false
   FROM   new_current n
   WHERE  (a.a, a.b, a.c, a.d, a.pk)
        = (n.a, n.b, n.c, n.d, n.old_pk)  -- only matches existing old pk
   RETURNING true
   )
UPDATE assertion a  -- then update the new current row
SET    current = true
FROM   new_current n
LEFT   JOIN (SELECT FROM up1 LIMIT 1) AS force_update_order ON true  -- !!!
WHERE  (a.a, a.b, a.c, a.d, a.pk)
     = (n.a, n.b, n.c, n.d, n.new_pk);

Should dwarf anything else you tried so far.

Subquery a1 gets the row with greatest non-current pk per group with a simple max() aggregation. That’s the optimum while there are few candidate rows (NOT current) per group. Else, optimize this step further with an emulated index skip scan:

The tricky part is to keep the UNIQUE constraint happy. It does not allow two current rows per group at any given time. There is no order of execution among CTEs of the same query – as long as one does not reference the other. I put in such a dummy reference to force the order of updates.

When there was no current row, yet, the CTE up1 returns no row. So we use LEFT JOIN. And LIMIT 1 to never duplicate rows.

Method 2

I don’t think you should try to avoid every seq scan. But you do not want a seq scan which will be executed a large number of times (like in a un-hashed subplan, or in the 2nd child of a nested loop).

In your first plan, it does have a seq scan in an unhashed subplan, but it claims this will only be executed once, so it would not be too bad if that is accurate. But that does seem to contradict your description of "the ‘current’ flag is true for 98% of rows", so maybe the stats are just grossly wrong.

You could increase our work_mem until the subplan switches to a hashed subplan, or you could rewrite your query from NOT IN to NOT EXISTS.

In the case of option 2, you could just add a WHERE to eliminate the inoperative updates:

UPDATE assertion
   SET current = not current
   WHERE current <>  
     (CASE WHEN assertion.pk IN (select PK from assertion_current)
     THEN TRUE ELSE FALSE END)

But again using EXISTS rather IN is likely to be better.

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply