How to optimize a self-join view on a table with 50 million rows

I have the following simplified table with ~50M rows.

CREATE TABLE sample (
    id uuid not null primary key,
    measured_date timestamp with time zone not null,
    segment_id uuid not null,
    activity_id uuid not null,
    value integer not null
);
Indexes:
    "sample_pkey" PRIMARY KEY, btree (id)
    "sample_idx" btree (segment_id, measured_date)
    "sample_uniq" UNIQUE CONSTRAINT, btree (segment_id, activity_id, measured_date)
    "sample_activity_idx" btree (activity_id)

For each measured gap (between two different dates), I would like to get my calculated value.

My query is as follows:

SELECT ROW_NUMBER () OVER () AS id,
        t1.segment_id AS segment_id,
        t1.activity_id AS activity_id,
        t1.measured_date AS from_date,
        t2.measured_date AS to_date,
        t2.value AS cumulative_progress,
        (t2.value - t1.value) AS marginal_progress
    FROM sample AS t1 JOIN sample AS t2
    ON t1.activity_id = t2.activity_id AND t1.segment_id = t2.segment_id AND t1.measured_date < t2.measured_date
    WHERE t1.segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53' AND t1.activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09' ORDER BY from_date asc, to_date asc;

And the results are quite fast:

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                                              |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Sort  (cost=17.23..17.23 rows=1 width=154) (actual time=1.011..1.028 rows=171 loops=1)                                                                                  |
|  Sort Key: t1.measured_date, t2.measured_date                                                                                                                          |
|  Sort Method: quicksort  Memory: 70kB                                                                                                                                  |
|  ->  WindowAgg  (cost=1.13..17.22 rows=1 width=154) (actual time=0.063..0.914 rows=171 loops=1)                                                                        |
|        ->  Nested Loop  (cost=1.13..17.18 rows=1 width=124) (actual time=0.056..0.698 rows=171 loops=1)                                                                |
|              Join Filter: (t1.measured_date < t2.measured_date)                                                                                                        |
|              Rows Removed by Join Filter: 190                                                                                                                          |
|              ->  Index Scan using sample_uniq on sample t1  (cost=0.56..8.58 rows=1 width=70) (actual time=0.021..0.043 rows=19 loops=1)                               |
|                    Index Cond: ((segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'::uuid) AND (activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'::uuid))          |
|              ->  Index Scan using sample_uniq on sample t2  (cost=0.56..8.58 rows=1 width=86) (actual time=0.005..0.030 rows=19 loops=19)                              |
|                    Index Cond: ((segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'::uuid) AND (activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'::uuid))          |
|Planning Time: 0.321 ms                                                                                                                                                 |
|Execution Time: 1.097 ms                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

However, when I create a view so that I can use this query often, the results are poor.

Creating the view:

CREATE OR REPLACE VIEW sample_marginal AS
        SELECT ROW_NUMBER () OVER () AS id,
        t1.segment_id AS segment_id,
        t1.activity_id AS activity_id,
        t1.measured_date AS from_date,
        t2.measured_date AS to_date,
        t2.value AS cumulative_progress,
        (t2.value - t1.value) AS marginal_progress
    FROM sample AS t1 JOIN sample AS t2
    ON t1.activity_id = t2.activity_id AND t1.segment_id = t2.segment_id AND t1.measured_date < t2.measured_date;

Querying the view:

SELECT * FROM sample_marginal WHERE segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53' AND activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09' ORDER BY from_date asc, to_date asc;

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|QUERY PLAN                                                                                                                                                                |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Subquery Scan on sample_marginal  (cost=14106816.44..23333603.42 rows=2 width=154)                                                                                        |
|  Filter: ((sample_marginal.segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'::uuid) AND (sample_marginal.activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'::uuid))  |
|  ->  WindowAgg  (cost=14106816.44..22564147.32 rows=51297073 width=154)                                                                                                  |
|        ->  Gather  (cost=14106816.44..20768749.77 rows=51297073 width=124)                                                                                               |
|              Workers Planned: 2                                                                                                                                          |
|              ->  Merge Join  (cost=14105816.44..15638042.47 rows=21373780 width=124)                                                                                     |
|                    Merge Cond: ((t2.activity_id = t1.activity_id) AND (t2.segment_id = t1.segment_id))                                                                   |
|                    Join Filter: (t1.measured_date < t2.measured_date)                                                                                                    |
|                    ->  Sort  (cost=4751689.83..4797948.93 rows=18503642 width=86)                                                                                        |
|                          Sort Key: t2.activity_id, t2.segment_id                                                                                                         |
|                          ->  Parallel Seq Scan on sample t2  (cost=0.00..1632749.42 rows=18503642 width=86)                                                              |
|                    ->  Materialize  (cost=9354126.62..9576170.32 rows=44408740 width=70)                                                                                 |
|                          ->  Sort  (cost=9354126.62..9465148.47 rows=44408740 width=70)                                                                                  |
|                                Sort Key: t1.activity_id, t1.segment_id                                                                                                   |
|                                ->  Seq Scan on sample t1  (cost=0.00..1891800.40 rows=44408740 width=70)                                                                 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I actually never waited for this query to finish, since it took an extremely long time; I wanted to fix it before using it.

Running count(*) on the view, however, returned ~1.8B rows (execution was slow, of course).

Any ideas on how to improve my view?

How to solve:

Method 1

The parts of an SQL statement are processed in a certain order (at least logically). In the case at hand, the WHERE condition is evaluated before a window function like row_number(). Your view definition and the query on the view reverse that order by effectively introducing a subquery:

SELECT *
FROM (SELECT ROW_NUMBER () OVER () AS id,
             t1.segment_id AS segment_id,
             t1.activity_id AS activity_id,
             t1.measured_date AS from_date,
             t2.measured_date AS to_date,
             t2.value AS cumulative_progress,
             (t2.value - t1.value) AS marginal_progress
      FROM sample AS t1
         JOIN sample AS t2
            ON t1.activity_id = t2.activity_id
               AND t1.segment_id = t2.segment_id
               AND t1.measured_date < t2.measured_date
     ) AS sample_marginal
WHERE segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'
  AND activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'
ORDER BY from_date asc, to_date asc;

There is nothing that the optimizer can do about that: pushing the filter down would change which rows ROW_NUMBER() numbers, and with it the results, so the window function must first be computed over the entire join result. That result is huge, because a group with n samples produces n·(n−1)/2 pairs; the fast plan above turns 19 rows per side into 171 output rows, and across the whole table this adds up to the ~1.8B rows that count(*) reported.

If you remove the window function call (which seems pretty pointless anyway) from the view definition, performance should improve, as sketched below.
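
For example, the same view without the window function: this is just the definition above minus the id column.

CREATE OR REPLACE VIEW sample_marginal AS
        SELECT t1.segment_id AS segment_id,
        t1.activity_id AS activity_id,
        t1.measured_date AS from_date,
        t2.measured_date AS to_date,
        t2.value AS cumulative_progress,
        (t2.value - t1.value) AS marginal_progress
    FROM sample AS t1 JOIN sample AS t2
    ON t1.activity_id = t2.activity_id AND t1.segment_id = t2.segment_id AND t1.measured_date < t2.measured_date;

With the window function gone, the planner can push the WHERE conditions down into the view and use the sample_uniq index, just as it does for the standalone query. If callers still need a row number, compute it outside the view, after the filter has already reduced the result to a handful of rows (the ORDER BY inside OVER is my addition, to make the numbering deterministic):

SELECT ROW_NUMBER() OVER (ORDER BY from_date, to_date) AS id, *
FROM sample_marginal
WHERE segment_id = '00021c8d-7162-467d-8e6a-4cb62926bf53'
  AND activity_id = '34a4b908-4613-422a-b6d0-4bb112737f09'
ORDER BY from_date asc, to_date asc;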

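As an aside: if you only ever need the gap between consecutive measurements, rather than every pair of dates (the all-pairs join is what inflates the view to ~1.8B rows), a window function partitioned by segment and activity avoids the self-join entirely. Here is a sketch under that assumption, with a hypothetical view name; the first row of each group has no predecessor, so its from_date and marginal_progress come out NULL.

CREATE OR REPLACE VIEW sample_marginal_consecutive AS
        SELECT segment_id,
        activity_id,
        lag(measured_date) OVER w AS from_date,
        measured_date AS to_date,
        value AS cumulative_progress,
        value - lag(value) OVER w AS marginal_progress
    FROM sample
    WINDOW w AS (PARTITION BY segment_id, activity_id ORDER BY measured_date);

Because a filter on segment_id and activity_id references only the PARTITION BY columns, PostgreSQL should be able to push it down past the window functions, so this view should stay fast to filter as well.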

The method above was sourced from stackoverflow.com or stackexchange.com and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.
