Why is the optimizer not using the clustering index on my table?

All we need is an easy explanation of the problem, so here it is.

I have this table

                     Table "public.lineitem"
     Column      |     Type      | Collation | Nullable | Default 
-----------------+---------------+-----------+----------+---------
 l_orderkey      | integer       |           |          | 
 l_partkey       | integer       |           |          | 
 l_suppkey       | integer       |           |          | 
 l_linenumber    | integer       |           |          | 
 l_quantity      | integer       |           |          | 
 l_extendedprice | numeric(12,2) |           |          | 
 l_discount      | numeric(12,2) |           |          | 
 l_tax           | numeric(12,2) |           |          | 
 l_returnflag    | character(1)  |           |          | 
 l_linestatus    | character(1)  |           |          | 
 l_shipdate      | date          |           |          | 
 l_commitdate    | date          |           |          | 
 l_receiptdate   | date          |           |          | 
 l_shipinstruct  | character(25) |           |          | 
 l_shipmode      | character(10) |           |          | 
 l_comment       | character(44) |           |          | 
 l_partsuppkey   | character(20) |           |          | 
Indexes:
    "l_shipdate_c_idx" btree (l_shipdate) CLUSTER
    "l_shipmode_h_idx" hash (l_shipdate)
Foreign-key constraints:
    "lineitem_l_orderkey_fkey" FOREIGN KEY (l_orderkey) REFERENCES orders(o_orderkey)
    "lineitem_l_partkey_fkey" FOREIGN KEY (l_partkey) REFERENCES part(p_partkey)
    "lineitem_l_partsuppkey_fkey" FOREIGN KEY (l_partsuppkey) REFERENCES partsupp(ps_partsuppkey)
    "lineitem_l_suppkey_fkey" FOREIGN KEY (l_suppkey) REFERENCES supplier(s_suppkey)

and this query:

explain analyze select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice*(1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice*(1 - l_discount)*(1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate<='31/08/1998'
GROUP by
    l_returnflag,
    l_linestatus
ORDER by
    l_returnflag,
    l_linestatus

returning this query plan:

"Finalize GroupAggregate  (cost=2631562.25..2631564.19 rows=6 width=212) (actual time=28624.012..28624.466 rows=4 loops=1)"
"  Group Key: l_returnflag, l_linestatus"
"  ->  Gather Merge  (cost=2631562.25..2631563.65 rows=12 width=212) (actual time=28623.998..28624.442 rows=12 loops=1)"
"        Workers Planned: 2"
"        Workers Launched: 2"
"        ->  Sort  (cost=2630562.23..2630562.24 rows=6 width=212) (actual time=28620.633..28620.633 rows=4 loops=3)"
"              Sort Key: l_returnflag, l_linestatus"
"              Sort Method: quicksort  Memory: 27kB"
"              Worker 0:  Sort Method: quicksort  Memory: 27kB"
"              Worker 1:  Sort Method: quicksort  Memory: 27kB"
"              ->  Partial HashAggregate  (cost=2630562.03..2630562.15 rows=6 width=212) (actual time=28620.607..28620.611 rows=4 loops=3)"
"                    Group Key: l_returnflag, l_linestatus"
"                    Batches: 1  Memory Usage: 24kB"
"                    Worker 0:  Batches: 1  Memory Usage: 24kB"
"                    Worker 1:  Batches: 1  Memory Usage: 24kB"
"                    ->  Parallel Seq Scan on lineitem  (cost=0.00..1707452.35 rows=24616258 width=24) (actual time=0.549..19028.353 rows=19701655 loops=3)"
"                          Filter: (l_shipdate <= '1998-08-31'::date)"
"                          Rows Removed by Filter: 293696"
"Planning Time: 0.374 ms"
"Execution Time: 28624.523 ms"
  • Why is the optimizer preferring the sequential scan over lineitem table instead of using l_shipdate_c_idx? Should I drop it?

Postgres version: PostgreSQL 14.2 on x86_64-apple-darwin20.6.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

Your filter

l_shipdate<='31/08/1998'

Is not very selective, we can see from the plan that it was responsible for removing only 293,696 rows and it ended up needing to use 19,701,655. If it was to use an index to read those rows one by one it would probably have been much slower than the sequential scan of the table.

Should I drop it?

If this is the only query you are running and the only filter being used, then probably. Otherwise, there’s not enough information to go on. The index would probably be useful if you wanted to see the rows for 1 particular day. The index might be better off with some additional columns in it. Impossible to say without knowing your application.

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply