Speeding up PostgreSQL GroupAggregate

All we need is an easy explanation of the problem, so here it is.

I currently have two tables: UserRoll (447,633 rows) and UserRollResult (4,476,330 rows). The first table contains data about who rolled, where it was rolled, when it was rolled, and so on, as well as an identifying number for that roll, while the second table contains the actual result(s) of the roll; for now each roll always has 10 results. These are the schemas for both:

                                         Table "public.userroll"
   Column   |           Type           | Collation | Nullable |                  Default
------------+--------------------------+-----------+----------+-------------------------------------------
 roll_id    | bigint                   |           | not null | nextval('userroll_roll_id_seq'::regclass)
 user_id    | bigint                   |           | not null |
 guild_id   | bigint                   |           |          |
 channel_id | bigint                   |           | not null |
 banner_key | text                     |           | not null |
 time       | timestamp with time zone |           |          |
Indexes:
    "userroll_pkey" PRIMARY KEY, btree (roll_id)
    "userroll_roll_id_idx" btree (roll_id)
    "userroll_user_id_idx" btree (user_id)
Referenced by:
    TABLE "userrollresult" CONSTRAINT "userrollresult_roll_id_fkey" FOREIGN KEY (roll_id) REFERENCES userroll(roll_id)

             Table "public.userrollresult"
   Column    |  Type   | Collation | Nullable | Default
-------------+---------+-----------+----------+---------
 roll_id     | bigint  |           | not null |
 operator_id | integer |           | not null |
Indexes:
    "userrollresult_roll_id_idx" btree (roll_id)
Foreign-key constraints:
    "userrollresult_roll_id_fkey" FOREIGN KEY (roll_id) REFERENCES userroll(roll_id)

My issue comes when querying a specific user's rolls: the query slows down when the GROUP BY takes effect.

EXPLAIN ANALYZE 
SELECT result.operator_ids 
FROM UserRoll roll 
JOIN (
  SELECT result.roll_id, jsonb_agg(result.operator_id) AS "operator_ids" 
  FROM UserRollResult result 
  GROUP BY result.roll_id
) result ON result.roll_id = roll.roll_id 
WHERE user_id = <user_id> 
ORDER BY roll.time DESC 
LIMIT 20;

Which gives me the following plan:

                                                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=177989.59..177989.62 rows=10 width=40) (actual time=7918.412..7918.425 rows=10 loops=1)
   ->  Sort  (cost=177989.59..177989.80 rows=81 width=40) (actual time=7874.515..7874.526 rows=10 loops=1)
         Sort Key: roll."time" DESC
         Sort Method: top-N heapsort  Memory: 30kB
         ->  Merge Join  (cost=313.92..177987.84 rows=81 width=40) (actual time=14.073..7874.131 rows=203 loops=1)
               Merge Cond: (roll.roll_id = result.roll_id)
               ->  Sort  (cost=313.48..313.69 rows=83 width=16) (actual time=12.924..13.029 rows=203 loops=1)
                     Sort Key: roll.roll_id
                     Sort Method: quicksort  Memory: 34kB
                     ->  Bitmap Heap Scan on userroll roll  (cost=5.07..310.84 rows=83 width=16) (actual time=2.007..12.826 rows=203 loops=1)
                           Recheck Cond: (user_id = '131858044777791488'::bigint)
                           Heap Blocks: exact=56
                           ->  Bitmap Index Scan on userroll_user_id_idx  (cost=0.00..5.04 rows=83 width=0) (actual time=1.521..1.525 rows=203 loops=1)
                                 Index Cond: (user_id = '131858044777791488'::bigint)
               ->  GroupAggregate  (cost=0.43..172234.38 rows=435100 width=40) (actual time=1.124..7793.725 rows=447570 loops=1)
                     Group Key: result.roll_id
                     ->  Index Scan using userrollresult_roll_id_idx on userrollresult result  (cost=0.43..144414.33 rows=4476260 width=12) (actual time=0.914..3707.996 rows=4475701 loops=1)
 Planning Time: 1.517 ms
 JIT:
   Functions: 15
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 9.359 ms, Inlining 0.000 ms, Optimization 2.699 ms, Emission 40.592 ms, Total 52.651 ms
 Execution Time: 7928.668 ms
(23 rows)

Is there any way I can improve/speed up/rewrite the query or the schema design? One approach I thought of would be to use integer arrays and get rid of the second table, but I want to keep that as a last resort if nothing else works.
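
For reference, the array-based design I have in mind would look roughly like this (table and column names here are only a sketch, not something I have implemented):

-- Sketch only: fold the results into an integer[] column on the roll itself,
-- removing the second table, the join and the GROUP BY entirely.
CREATE TABLE userroll_with_results (
    roll_id      bigint PRIMARY KEY DEFAULT nextval('userroll_roll_id_seq'),
    user_id      bigint NOT NULL,
    guild_id     bigint,
    channel_id   bigint NOT NULL,
    banner_key   text NOT NULL,
    "time"       timestamp with time zone,
    operator_ids integer[] NOT NULL   -- the 10 results of the roll
);

-- The original query then collapses to a simple filtered scan:
SELECT operator_ids
FROM userroll_with_results
WHERE user_id = <user_id>
ORDER BY "time" DESC
LIMIT 20;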

How to solve :

We know you are tired of this bug, so we are here to help you! Take a deep breath and look at the explanation of your problem. There is more than one way to approach it, but we recommend starting with the first method, since it is the simplest and the one we have tested.

Method 1

This doesn’t need an inner select or a lateral join at all. You can do the aggregation directly at the top level:

SELECT jsonb_agg(result.operator_id) AS "operator_ids"
FROM userroll roll
JOIN userrollresult result USING (roll_id)
WHERE user_id = <user_id>
GROUP BY roll_id
ORDER BY roll.time DESC
LIMIT 20;

This may or may not be faster than the lateral join; you would have to try it on your own data and see.
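
One quick way to compare the two candidates is to wrap each one in EXPLAIN (ANALYZE, BUFFERS) and look at the actual node timings, for example:

EXPLAIN (ANALYZE, BUFFERS)
SELECT jsonb_agg(result.operator_id) AS "operator_ids"
FROM userroll roll
JOIN userrollresult result USING (roll_id)
WHERE user_id = <user_id>
GROUP BY roll_id
ORDER BY roll.time DESC
LIMIT 20;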

Method 2

A lateral join might help to reduce the work the group aggregate needs to do, as apparently the planner isn't smart enough to push the roll_id condition down into the derived table:

SELECT result.operator_ids 
FROM userroll roll 
  JOIN LATERAL (
    SELECT result.roll_id, jsonb_agg(result.operator_id) AS "operator_ids" 
    FROM userrollresult result 
    WHERE result.roll_id = roll.roll_id --<< here
    GROUP BY result.roll_id
  ) result ON true -- dummy join as we already filtered on the inside
WHERE user_id = <user_id> 
ORDER BY roll.time DESC 
LIMIT 20;
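
If it is still not fast enough, an index matching the outer WHERE and ORDER BY may help; this is only a suggestion based on the query shape (the index name is made up), not something taken from the original question:

-- Lets the planner read this user's rolls in time order and stop after 20,
-- so the lateral subquery only runs for the 20 rolls that are returned.
CREATE INDEX userroll_user_id_time_idx ON userroll (user_id, "time" DESC);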

Note: we recommend Method 1, as it is the method that has been tested on our system.
Thank you 🙂

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
