Can PostgreSQL use indexes to expedite count(distinct) queries?


Consider the following example:

CREATE TABLE test (
  id SERIAL,
  some_integer INT
);

INSERT INTO test (some_integer)
SELECT FLOOR(RANDOM()*100000) from generate_series(1,100000) s(i);

CREATE INDEX some_integer_idx ON test (some_integer);

EXPLAIN ANALYZE SELECT COUNT(DISTINCT some_integer) from test;

It returns the following query plan:

                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1693.00..1693.01 rows=1 width=8) (actual time=26.343..26.344 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.011..7.282 rows=100000 loops=1)
 Planning Time: 0.052 ms
 Execution Time: 26.365 ms
(4 rows)

I was surprised it still did a sequential scan on test. Wouldn’t it be faster to simply count the number of rows in the index?

How to solve:


Method 1

There is a separate entry in a B-tree index for every row, for duplicates, too. So we can never "simply count the number of rows in the index". There are as many index rows to read and aggregate as there are table rows.

Index deduplication in Postgres 13 or later can compress duplicates, offering more potential for optimization, but the basic logic stands: one table row, one index tuple. Deduplication merely stores duplicate entries more compactly; it does not reduce their number.

Since a B-tree index is typically smaller than its table, an index-only scan is still an attractive option. But that adds some overhead: checking the visibility map, heap fetches for data pages that are not "all-visible". And indexes tend to bloat more than tables (voiding the initial advantage). And indexes default to fillfactor 90, while tables default to fillfactor 100. And the visibility map has to be up to date to begin with – the table has to be vacuumed enough. Add VACUUM ANALYZE to your test for a start.
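To see whether the visibility map is in good shape, you can compare relallvisible to relpages in pg_class. A quick diagnostic sketch (the table name test comes from the example above):

```sql
-- After VACUUM, relallvisible should be close to relpages.
-- Index-only scans need heap fetches for any page not marked all-visible.
VACUUM ANALYZE test;

SELECT relpages, relallvisible
FROM   pg_class
WHERE  relname = 'test';
```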

Your test table has 8 bytes of data, which is the smallest row size possible. And you create very few duplicates. So very little to gain from an index-only scan, even under perfect conditions. Postgres doesn’t even try.

Add a payload column to grow the table row, and create some more duplicates, and voilà, we see an index-only scan:

CREATE TABLE test (
  id serial
, some_integer int
, payload text
);

INSERT INTO test (some_integer, payload)
SELECT trunc(random() * 10000)  -- fewer distinct values, some dupes
     , 'Lorem ipsum dolor sit amet, consectetur adipisicing elit'  -- bigger row !!
FROM   generate_series(1,100000) s;

CREATE INDEX some_integer_idx ON test (some_integer);
VACUUM ANALYZE test;
EXPLAIN ANALYZE SELECT count(DISTINCT some_integer) FROM test;
Aggregate  (cost=2210.29..2210.30 rows=1 width=8) (actual time=25.763..25.764 rows=1 loops=1)
  ->  Index Only Scan using some_integer_idx on test  (cost=0.29..1960.29 rows=100000 width=4) (actual time=0.030..12.562 rows=100000 loops=1)
        Heap Fetches: 0
Planning Time: 0.184 ms
Execution Time: 25.891 ms

db<>fiddle here

The benefit grows with the relative size handicap table >> index, the percentage of duplicates, and the portion of "all-visible" data pages (updated visibility map).
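That size handicap is easy to measure. A sketch using pg_relation_size() on the example table and index (names taken from the example above):

```sql
-- Compare on-disk size of table vs. index.
-- The bigger the gap, the more an index-only scan has to gain.
SELECT pg_size_pretty(pg_relation_size('test'))             AS table_size
     , pg_size_pretty(pg_relation_size('some_integer_idx')) AS index_size;
```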

Optimization for many duplicates

Now, if index skip scans were implemented, the index could be put to good use for cases with many duplicates (unlike your example), even without a payload column. But we are not there yet (as of Postgres 15).

Until then we can emulate an index skip scan:

WITH RECURSIVE cte AS (
   (
   SELECT some_integer AS i
   FROM   test
   ORDER  BY 1
   LIMIT  1
   )
   
   UNION ALL
   SELECT (SELECT t.some_integer
           FROM   test t
           WHERE  t.some_integer > c.i
           ORDER  BY 1
           LIMIT  1)
   FROM   cte c
   WHERE  c.i IS NOT NULL
   )
SELECT count(i) FROM cte;

db<>fiddle here

Much faster.
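If you run this often, the emulation can be wrapped in a function for reuse. A sketch only; the function name count_distinct_some_integer is my own invention, not from the original:

```sql
-- Hypothetical wrapper around the skip-scan emulation above.
CREATE OR REPLACE FUNCTION count_distinct_some_integer()
  RETURNS bigint
  LANGUAGE sql STABLE AS
$$
WITH RECURSIVE cte AS (
   (
   SELECT some_integer AS i
   FROM   test
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT t.some_integer
           FROM   test t
           WHERE  t.some_integer > c.i
           ORDER  BY 1
           LIMIT  1)
   FROM   cte c
   WHERE  c.i IS NOT NULL
   )
SELECT count(i) FROM cte;
$$;
```

Being declared STABLE, it can be inlined into larger queries without re-planning penalties.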
