Seemingly simple query with unexpected execution time


I have the following schema:

[Schema diagram image]
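Since the schema image did not survive, here is a minimal DDL sketch inferred from the join conditions and indexes later in the question; column types and constraints are assumptions, not the original schema:

```sql
-- Hypothetical reconstruction: types and constraints are guesses
-- based on the joins, casts, and indexes shown below.
CREATE TABLE test_acts (
    test_act_id bigint PRIMARY KEY,
    code        varchar NOT NULL
);

CREATE TABLE test_invoice_items (
    test_invoice_item_id bigint PRIMARY KEY,
    test_act_id          bigint REFERENCES test_acts (test_act_id)
);

CREATE TABLE test_transactions (
    test_transaction_id  bigint PRIMARY KEY,
    transaction_datetime timestamp NOT NULL
);

CREATE TABLE test_transaction_acts (
    test_transaction_act_id bigint PRIMARY KEY,
    test_transaction_id     bigint REFERENCES test_transactions (test_transaction_id),
    test_invoice_item_id    bigint REFERENCES test_invoice_items (test_invoice_item_id),
    price                   numeric  -- nullable; the query filters price IS NOT NULL
);
```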

I want to get the last price (based on transaction_datetime) per test_acts.code using the following query:

    SELECT code, price
    FROM test_acts test_acts_main,
         test_transaction_acts,
         test_transactions,
         test_invoice_items
    WHERE test_transaction_acts.test_transaction_id = test_transactions.test_transaction_id AND
          test_invoice_items.test_invoice_item_id = test_transaction_acts.test_invoice_item_id AND
          test_acts_main.test_act_id = test_invoice_items.test_act_id AND
          test_transaction_acts.price IS NOT NULL AND
          test_transactions.transaction_datetime = (
              SELECT max(transaction_datetime)
              FROM test_transaction_acts,
                   test_transactions,
                   test_invoice_items,
                   test_acts
              WHERE test_transaction_acts.test_transaction_id = test_transactions.test_transaction_id AND
                    test_invoice_items.test_invoice_item_id = test_transaction_acts.test_invoice_item_id AND
                    test_invoice_items.test_act_id = test_acts.test_act_id AND
                    test_acts.code = test_acts_main.code AND
                    test_transaction_acts.price IS NOT NULL)
    ORDER BY code

The tables contain the following amounts of data:

  • test_transaction_acts = 2240823 rows
  • test_transactions = 832746 rows
  • test_invoice_items = 2421058 rows
  • test_acts = 24 rows

And the result is the following:

[Query result image]

My question is: even with indexes, I can't get the query under 14 seconds, and I can't decide whether this is normal and I simply should be smarter about storing the information so that it is easier to access for my purpose, or whether I should keep digging to make my query faster.

I understand that some "test_acts" were only used in old "test_transactions", which implies that the database must go through a lot of "test_transactions" before actually hitting one with the desired "code", so that could explain the length of the query.

I rephrased my query using different nomenclatures to see if the query itself was the issue, looked at the EXPLAIN ANALYZE output to make sure indexes were properly used, tried to understand the plan as much as I could, and tested various multi-column indexes to see if the results differed:

Sort  (cost=2673720.31..2673748.25 rows=11177 width=33) (actual time=16972.155..16972.156 rows=24 loops=1)
  Sort Key: test_acts_main.code
  Sort Method: quicksort  Memory: 26kB
  ->  Hash Join  (cost=184630.73..2672968.75 rows=11177 width=33) (actual time=16677.629..16972.074 rows=24 loops=1)
        Hash Cond: (((test_transaction_acts.test_invoice_item_id)::bigint = test_invoice_items.test_invoice_item_id) AND (test_acts_main.test_act_id = (test_invoice_items.test_act_id)::bigint))
        ->  Nested Loop  (cost=99195.28..2566538.13 rows=279421 width=45) (actual time=1828.289..15974.502 rows=400 loops=1)
              ->  Nested Loop  (cost=99194.85..2501879.00 rows=104093 width=36) (actual time=1828.266..15973.431 rows=145 loops=1)
                    ->  Index Scan using test_acts_test_act_id_idx on test_acts test_acts_main  (cost=0.14..12.51 rows=25 width=32) (actual time=0.017..0.040 rows=25 loops=1)
                    ->  Index Scan using test_transactions_transaction_datetime_idx on test_transactions  (cost=99194.71..100033.02 rows=4164 width=12) (actual time=0.030..0.034 rows=6 loops=25)
                          Index Cond: (transaction_datetime = (SubPlan 1))
                          SubPlan 1
                            ->  Aggregate  (cost=99194.27..99194.28 rows=1 width=8) (actual time=638.893..638.893 rows=1 loops=25)
                                  ->  Hash Join  (cost=29468.13..98970.74 rows=89415 width=8) (actual time=252.679..633.311 rows=89420 loops=25)
                                        Hash Cond: ((test_transaction_acts_1.test_transaction_id)::bigint = test_transactions_1.test_transaction_id)
                                        ->  Nested Loop  (cost=2162.34..65769.09 rows=89415 width=8) (actual time=5.944..248.034 rows=89420 loops=25)
                                              ->  Nested Loop  (cost=2161.91..17844.20 rows=96842 width=4) (actual time=4.661..57.592 rows=96842 loops=25)
                                                    ->  Seq Scan on test_acts  (cost=0.00..1.31 rows=1 width=4) (actual time=0.004..0.006 rows=1 loops=25)
                                                          Filter: ((code)::text = (test_acts_main.code)::text)
                                                          Rows Removed by Filter: 24
                                                    ->  Bitmap Heap Scan on test_invoice_items test_invoice_items_1  (cost=2161.91..16690.01 rows=115288 width=12) (actual time=4.635..48.117 rows=96842 loops=25)
                                                          Recheck Cond: ((test_act_id)::bigint = test_acts.test_act_id)
                                                          Heap Blocks: exact=130166
                                                          ->  Bitmap Index Scan on test_invoice_items_test_act_id_idx  (cost=0.00..2133.09 rows=115288 width=0) (actual time=4.097..4.097 rows=96842 loops=25)
                                                                Index Cond: ((test_act_id)::bigint = test_acts.test_act_id)
                                              ->  Index Scan using test_transaction_acts_test_invoice_item_id_idx on test_transaction_acts test_transaction_acts_1  (cost=0.43..0.48 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=2421058)
                                                    Index Cond: ((test_invoice_item_id)::bigint = test_invoice_items_1.test_invoice_item_id)
                                                    Filter: (price IS NOT NULL)
                                                    Rows Removed by Filter: 0
                                        ->  Hash  (cost=12829.46..12829.46 rows=832746 width=12) (actual time=241.633..241.633 rows=832746 loops=25)
                                              Buckets: 131072  Batches: 16  Memory Usage: 3262kB
                                              ->  Seq Scan on test_transactions test_transactions_1  (cost=0.00..12829.46 rows=832746 width=12) (actual time=0.012..99.733 rows=832746 loops=25)
              ->  Index Scan using test_transaction_acts_test_transaction_id_idx on test_transaction_acts  (cost=0.43..0.59 rows=3 width=21) (actual time=0.005..0.006 rows=3 loops=145)
                    Index Cond: ((test_transaction_id)::bigint = test_transactions.test_transaction_id)
                    Filter: (price IS NOT NULL)
        ->  Hash  (cost=37297.58..37297.58 rows=2421058 width=12) (actual time=696.049..696.049 rows=2421058 loops=1)
              Buckets: 131072  Batches: 64  Memory Usage: 2795kB
              ->  Seq Scan on test_invoice_items  (cost=0.00..37297.58 rows=2421058 width=12) (actual time=0.026..200.291 rows=2421058 loops=1)
Planning time: 2.175 ms
Execution time: 16972.606 ms

If I can’t get the query to run faster, my conclusion would be to store the last price per code every time a new transaction is created, effectively creating a duplicate of information in the database, so that I have the information readily available when I need it.

Before building a "workaround" mechanism like that, I need to be sure that there is not a better way of writing this kind of query, or a way to fully understand the EXPLAIN ANALYZE output and see if there is more speed readily available for my query.

I also understand that I could "simplify" the database design to, for instance, include the test_acts.code inside test_invoice_items or test_transaction_acts, but I’m trying to keep concepts separated as much as possible to represent real business-logic concepts and avoid information duplication.

What steps would you take to try to make this query faster? Would you be comfortable duplicating information in your database just to have one copy in a format that solves a performance issue? Is having a copy of the data in a different format considered bad practice? I expect you would want to avoid this as much as possible.

If it helps, here are the indexes used in my latest example, as shown in the EXPLAIN ANALYZE output posted earlier:

CREATE INDEX test_transaction_acts_test_invoice_item_id_idx ON test_transaction_acts (test_invoice_item_id);
CREATE INDEX test_transaction_acts_test_transaction_act_id_idx ON test_transaction_acts (test_transaction_act_id);
CREATE INDEX test_transaction_acts_test_transaction_id_idx ON test_transaction_acts (test_transaction_id);

CREATE INDEX test_transactions_test_transaction_id_idx ON test_transactions (test_transaction_id);
CREATE INDEX test_transactions_transaction_datetime_idx ON test_transactions (transaction_datetime);

CREATE INDEX test_invoice_items_test_act_id_idx ON test_invoice_items (test_act_id);
CREATE INDEX test_invoice_items_test_invoice_item_id_idx ON test_invoice_items (test_invoice_item_id);

CREATE INDEX test_acts_test_act_id_idx ON test_acts (test_act_id);

This is my first question on Stack Exchange; I tried to follow all the guidelines, so let me know if I messed up!

How to solve:


Method 1

Your time is spent in the many repetitions of the subquery. Try to write that smarter using DISTINCT ON:

SELECT DISTINCT ON (ta.code)
       ta.code, tta.price
FROM test_transaction_acts AS tta
   JOIN test_transactions AS tt USING (test_transaction_id)
   JOIN test_invoice_items AS tii USING (test_invoice_item_id)
   JOIN test_acts AS ta USING (test_act_id)
WHERE tta.price IS NOT NULL
ORDER BY ta.code, tt.transaction_datetime DESC;
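DISTINCT ON keeps only the first row per distinct ta.code value in ORDER BY order, which here is the row with the latest transaction_datetime. A minimal illustration, using a hypothetical single table `prices` for clarity:

```sql
-- DISTINCT ON keeps the first row per group in ORDER BY order.
-- Given rows (code, price, transaction_datetime):
--   ('A', 10, '2021-01-01'), ('A', 12, '2021-03-01'), ('B', 5, '2021-02-01')
-- this returns ('A', 12) and ('B', 5): the latest price per code.
SELECT DISTINCT ON (code)
       code, price
FROM prices  -- hypothetical table for illustration
ORDER BY code, transaction_datetime DESC;
```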

Some style tips:

  • use the standard JOIN syntax for readability and to avoid forgetting join conditions

  • use table aliases for brevity and readability

  • if a query contains more than one table, prefix column references with the table name

Method 2


A CTE avoids building the whole join for each code.

I made these changes to your query:

  1. Switched to using explicit joins instead of using the comma join operator. Remember that comma joins are implicitly CROSS JOIN, Cartesian products of the tables.
  2. Created a CTE of the join that the correlated subquery can reference.

On my server (running PostgreSQL 13.3) with the sample data, my query drops the runtime from the original 35ms down to 8ms. Running this query against all the codes still entails a LOT of data scanning; if this is something you need to do a lot then a trigger-updated cache table might be in order.

WITH cte AS (
        SELECT ta.code, tta.price, tt.transaction_datetime
        FROM test_acts ta
             INNER JOIN test_invoice_items tti USING (test_act_id)
             INNER JOIN test_transaction_acts tta USING (test_invoice_item_id)
             INNER JOIN test_transactions tt USING (test_transaction_id)
        WHERE tta.price IS NOT NULL
)
SELECT cte.code, cte.price
FROM cte
WHERE cte.transaction_datetime = (
        SELECT max(transaction_datetime)
        FROM cte cte2
        WHERE cte.code = cte2.code
        GROUP BY code)
ORDER BY cte.code;
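If the query still proves too slow, the trigger-updated cache table mentioned earlier could look like the following sketch. All object names here are hypothetical, and the upsert logic assumes each new row in test_transaction_acts should overwrite the stored price only when its transaction is at least as recent:

```sql
-- Hypothetical cache of the latest price per code,
-- maintained by a trigger on test_transaction_acts (PostgreSQL 11+).
CREATE TABLE last_price_per_code (
    code       varchar PRIMARY KEY,
    price      numeric NOT NULL,
    updated_at timestamp NOT NULL
);

CREATE OR REPLACE FUNCTION refresh_last_price() RETURNS trigger AS $$
BEGIN
    IF NEW.price IS NOT NULL THEN
        INSERT INTO last_price_per_code (code, price, updated_at)
        SELECT ta.code, NEW.price, tt.transaction_datetime
        FROM test_invoice_items tii
             JOIN test_acts ta USING (test_act_id)
             JOIN test_transactions tt
                  ON tt.test_transaction_id = NEW.test_transaction_id
        WHERE tii.test_invoice_item_id = NEW.test_invoice_item_id
        ON CONFLICT (code) DO UPDATE
            SET price = EXCLUDED.price,
                updated_at = EXCLUDED.updated_at
            WHERE EXCLUDED.updated_at >= last_price_per_code.updated_at;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_refresh_last_price
AFTER INSERT ON test_transaction_acts
FOR EACH ROW EXECUTE FUNCTION refresh_last_price();
```

With this in place, reading the latest prices becomes a trivial `SELECT code, price FROM last_price_per_code ORDER BY code;`, at the cost of the duplication the question is worried about.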


All methods are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0.
