# Optimize a query with GROUP BY, ORDER BY, and many SUM and AVG operations

## The problem

I have this query, from the TPC-H benchmark:

``````sql
explain analyze
select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    lineitem
where
    l_shipdate <= '31/08/1998'
group by
    l_returnflag,
    l_linestatus
order by
    l_returnflag,
    l_linestatus
``````

returning this:

``````
Finalize GroupAggregate  (cost=2300777.06..2300779.00 rows=6 width=212) (actual time=38289.923..38290.426 rows=4 loops=1)
  Group Key: l_returnflag, l_linestatus
  ->  Gather Merge  (cost=2300777.06..2300778.46 rows=12 width=212) (actual time=38289.907..38290.390 rows=12 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=2299777.04..2299777.05 rows=6 width=212) (actual time=38284.169..38284.169 rows=4 loops=3)
              Sort Key: l_returnflag, l_linestatus
              Sort Method: quicksort  Memory: 27kB
              Worker 0:  Sort Method: quicksort  Memory: 27kB
              Worker 1:  Sort Method: quicksort  Memory: 27kB
              ->  Partial HashAggregate  (cost=2299776.84..2299776.96 rows=6 width=212) (actual time=38284.129..38284.133 rows=4 loops=3)
                    Group Key: l_returnflag, l_linestatus
                    Batches: 1  Memory Usage: 24kB
                    Worker 0:  Batches: 1  Memory Usage: 24kB
                    Worker 1:  Batches: 1  Memory Usage: 24kB
                    ->  Parallel Seq Scan on lineitem  (cost=0.00..1493832.54 rows=21491848 width=24) (actual time=0.281..29321.949 rows=17236798 loops=3)
                          Filter: (l_shipdate <= '1998-08-31'::date)
                          Rows Removed by Filter: 256933
Planning Time: 3.870 ms
Execution Time: 38290.784 ms
``````

and it involves this relation:

``````sql
CREATE TABLE LINEITEM
(
    L_ORDERKEY      INTEGER REFERENCES ORDERS(O_ORDERKEY),
    L_PARTKEY      INTEGER REFERENCES PART(P_PARTKEY),
    L_SUPPKEY      INTEGER REFERENCES SUPPLIER(S_SUPPKEY),
    L_LINENUMBER    INTEGER,
    L_QUANTITY      INTEGER,
    L_EXTENDEDPRICE NUMERIC(12,2),
    L_DISCOUNT      NUMERIC(12,2),
    L_TAX           NUMERIC(12,2),
    L_RETURNFLAG    CHAR(1),
    L_LINESTATUS    CHAR(1),
    L_SHIPDATE      DATE,
    L_COMMITDATE    DATE,
    L_RECEIPTDATE   DATE,
    L_SHIPINSTRUCT  CHAR(25),
    L_SHIPMODE      CHAR(10),
    L_COMMENT       CHAR(44),
    L_PARTSUPPKEY   CHAR(20) REFERENCES PARTSUPP(PS_PARTSUPPKEY)
)
``````

As you can see, it takes around 40 seconds, and I would like to optimize it. I added a b-tree index on the `L_SHIPDATE` column (sort order ASC, NULLs last).
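For reference, the index described above can be created like this (the index name is my choice; `ASC` and `NULLS LAST` are also the defaults for a b-tree index):

``````sql
-- B-tree index on the filter column, ascending, NULLs last
-- (both are the defaults; the index name is illustrative)
CREATE INDEX lineitem_l_shipdate_idx
    ON lineitem (l_shipdate ASC NULLS LAST);
``````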

• How can I do better?

As you can see, the optimizer is not using the index on `l_shipdate`; instead it prefers a sequential scan of the `lineitem` table.

## How to solve it


### Method 1

There is nothing magical you can do to make this faster. The plan shows why the index cannot help: the filter removes only about 770,000 of roughly 52 million rows (under 2%), so nearly the whole table has to be read anyway, and a sequential scan is the cheapest way to do that. The runtime is dominated by that scan. Your options are:

• faster disks

• more RAM, and making sure the table is cached in RAM

• throwing more workers at it by increasing `max_parallel_workers_per_gather`
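The last option can be tried per session; a sketch (the worker counts are illustrative values, and the server-wide limits `max_parallel_workers` and `max_worker_processes` must also allow them):

``````sql
-- Allow up to 4 workers per Gather node in this session (illustrative value;
-- the effective count is also bounded by max_parallel_workers and
-- max_worker_processes).
SET max_parallel_workers_per_gather = 4;

-- The planner scales the planned worker count with table size; to pin it
-- for this table instead, use the per-table storage parameter:
ALTER TABLE lineitem SET (parallel_workers = 4);
``````

Re-run the `EXPLAIN ANALYZE` afterwards and compare the `Workers Planned` / `Workers Launched` lines.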
