Optimize a GROUP BY / ORDER BY query with many SUM and AVG operations

I have this query from the TPC-H benchmark:

explain analyze select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice*(1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice*(1 - l_discount)*(1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate<='31/08/1998'
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus

returning this:

"Finalize GroupAggregate  (cost=2300777.06..2300779.00 rows=6 width=212) (actual time=38289.923..38290.426 rows=4 loops=1)"
"  Group Key: l_returnflag, l_linestatus"
"  ->  Gather Merge  (cost=2300777.06..2300778.46 rows=12 width=212) (actual time=38289.907..38290.390 rows=12 loops=1)"
"        Workers Planned: 2"
"        Workers Launched: 2"
"        ->  Sort  (cost=2299777.04..2299777.05 rows=6 width=212) (actual time=38284.169..38284.169 rows=4 loops=3)"
"              Sort Key: l_returnflag, l_linestatus"
"              Sort Method: quicksort  Memory: 27kB"
"              Worker 0:  Sort Method: quicksort  Memory: 27kB"
"              Worker 1:  Sort Method: quicksort  Memory: 27kB"
"              ->  Partial HashAggregate  (cost=2299776.84..2299776.96 rows=6 width=212) (actual time=38284.129..38284.133 rows=4 loops=3)"
"                    Group Key: l_returnflag, l_linestatus"
"                    Batches: 1  Memory Usage: 24kB"
"                    Worker 0:  Batches: 1  Memory Usage: 24kB"
"                    Worker 1:  Batches: 1  Memory Usage: 24kB"
"                    ->  Parallel Seq Scan on lineitem  (cost=0.00..1493832.54 rows=21491848 width=24) (actual time=0.281..29321.949 rows=17236798 loops=3)"
"                          Filter: (l_shipdate <= '1998-08-31'::date)"
"                          Rows Removed by Filter: 256933"
"Planning Time: 3.870 ms"
"Execution Time: 38290.784 ms"

and it involves this relation:

CREATE TABLE LINEITEM (
    L_ORDERKEY      INTEGER REFERENCES ORDERS(O_ORDERKEY),
    L_PARTKEY       INTEGER REFERENCES PART(P_PARTKEY),
    L_SUPPKEY      INTEGER REFERENCES SUPPLIER(S_SUPPKEY),
    L_LINENUMBER    INTEGER,
    L_QUANTITY      INTEGER,
    L_EXTENDEDPRICE NUMERIC(12,2),
    L_DISCOUNT      NUMERIC(12,2),
    L_TAX           NUMERIC(12,2),
    L_RETURNFLAG    CHAR(1),
    L_LINESTATUS    CHAR(1),
    L_SHIPDATE      DATE,
    L_COMMITDATE    DATE,
    L_RECEIPTDATE   DATE,
    L_SHIPINSTRUCT  CHAR(25),
    L_SHIPMODE      CHAR(10),
    L_COMMENT       CHAR(44),
    L_PARTSUPPKEY   CHAR(20) REFERENCES PARTSUPP(PS_PARTSUPPKEY)
)

As you can see, it takes around 40 seconds, and I would like to optimize this. I added a B-tree index on the L_SHIPDATE column (sort order ASC, NULLs last).
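
The question doesn't show the exact statement, but an index like that would presumably have been created along these lines (the index name here is my assumption; ASC and NULLS LAST are also PostgreSQL's defaults):

-- Assumed statement; the name is illustrative. ASC NULLS LAST matches
-- the description above and is PostgreSQL's default ordering anyway.
CREATE INDEX lineitem_l_shipdate_idx
    ON lineitem (l_shipdate ASC NULLS LAST);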

  • How can I do better?

As you can see here, the optimizer is not using the index on l_shipdate; it prefers to sequentially scan the lineitem table.
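
As a sanity check (my addition, not part of the original question), you can penalize sequential scans for one session to see which index-based plan the optimizer would otherwise pick, and compare its estimated cost; shown here with a reduced version of the aggregate query:

-- Session-local experiment: discourage sequential scans so the planner
-- shows the index-based alternative, then compare the two plans.
SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT l_returnflag, l_linestatus, count(*) AS count_order
FROM lineitem
WHERE l_shipdate <= DATE '1998-08-31'
GROUP BY l_returnflag, l_linestatus;
RESET enable_seqscan;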

How to solve:

Method 1

There is nothing magical you can do to make this faster. The plan is already reasonable: the WHERE condition keeps almost every row (only about 257,000 rows per worker are filtered out of roughly 17.5 million scanned), so the index on l_shipdate cannot help, and a parallel sequential scan is the cheapest way to read the table. Your options are:

  • faster disks

  • more RAM, and make sure the table is cached in RAM

  • throw more workers at it by increasing max_parallel_workers_per_gather (see the sketch after this list)
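
A minimal sketch of the last two options, assuming the pg_prewarm contrib extension is installed and the machine actually has spare cores and RAM (the value 8 is illustrative, not a recommendation):

-- Warm the table into the buffer cache (pg_prewarm ships with the
-- standard PostgreSQL contrib packages).
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('lineitem');

-- Allow more workers for this session's queries; the value must fit
-- within max_parallel_workers and max_worker_processes.
SET max_parallel_workers_per_gather = 8;

-- Optionally override the table's size-based default worker count.
ALTER TABLE lineitem SET (parallel_workers = 8);

Since the parallel scan itself accounts for about 29 of the 38 seconds in the plan above, faster I/O and caching attack the dominant cost.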


The method above was sourced from stackoverflow.com or stackexchange.com and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.
