I have a heavy function, let's call it fcalc(x, y) -> my_z. I need the result my_z to serve both as a filter (too low and the row is discarded) and as a column in the result set (so my client can see it). I write the query like so:

SELECT *, my_z
FROM big_table t, (SELECT * FROM fcalc(t.x, t.y)) AS my_z
WHERE condition1 AND condition2 AND ... AND my_z > $threshold

My question is: will all the other conditions apply first, filtering out a very large number of rows, before fcalc is applied? I'm very new to databases.
Answer
No, Postgres typically does not evaluate the function in the LATERAL subquery for all rows. It applies simple filters on big_table first and executes the function only for rows still in the race.
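You can check this behavior empirically. Here is a minimal sketch of a test bed; the table layout, data, and function body are all made up for illustration and stand in for your real big_table and fcalc():

```sql
-- Hypothetical toy setup to experiment with:
CREATE TABLE big_table (id int, x numeric, y numeric);
INSERT INTO big_table SELECT g, g, g FROM generate_series(1, 8) g;

-- A deliberately slow stand-in for the heavy fcalc():
CREATE FUNCTION fcalc(x numeric, y numeric)
  RETURNS numeric
  LANGUAGE plpgsql STABLE AS
$func$
BEGIN
   PERFORM pg_sleep(0.01);  -- simulate expensive work
   RETURN x + y;
END
$func$;
```

Running the fixed queries from this answer under EXPLAIN ANALYZE against a setup like this makes it easy to see how often the function was actually called.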
Fixed query

Your query is syntactically invalid. Assuming fcalc() returns a single value, this would work:

SELECT *, my_z
FROM big_table t, LATERAL (SELECT * FROM fcalc(t.x, t.y)) AS f(my_z)
WHERE $condition1 AND $condition2 AND ... AND f.my_z > $threshold
And it should be untangled to just:
SELECT *
FROM big_table t
JOIN LATERAL fcalc(t.x, t.y) AS f(my_z) ON f.my_z > $threshold
WHERE $condition1
AND $condition2
AND ...
Moving the f.my_z > $threshold from the WHERE clause to the join condition makes the query easier to read and has no effect on the query plan whatsoever (while using [INNER] JOIN). This produces the exact same query plan:
SELECT *
FROM big_table t, fcalc(t.x, t.y) f(my_z)
WHERE $condition1
AND f.my_z > $threshold
AND $condition2
AND ...
Query plan
Either of the fixed queries will first apply predicates filtering rows in big_table, before executing fcalc() and filtering on the result. You can check with EXPLAIN ANALYZE. Say your big_table has 8 rows, 5 of which don't pass your $conditionN filters, and 1 of the remaining 3 does not pass f.my_z > $threshold. You'll see something like:
Nested Loop  (cost=0.00..1.17 rows=3 width=79) (actual time=0.026..0.027 rows=1 loops=1)
  ->  Seq Scan on big_table t  (cost=0.00..1.10 rows=3 width=75) (actual time=0.007..0.009 rows=3 loops=1)
        Filter: (id > 5)
        Rows Removed by Filter: 5
  ->  Function Scan on fcalc f  (cost=0.00..0.02 rows=1 width=4) (actual time=0.005..0.005 rows=0 loops=3)
        Filter: (my_z > 9)
        Rows Removed by Filter: 1
Planning Time: 0.101 ms
Execution Time: 0.043 ms
Meaning, fcalc() was only executed 3 times in the example. In reality, you should see index scans for the big table, but the principle stays the same.
You can further verify this if you set the GUC track_functions to pl before executing the query, with or without EXPLAIN ANALYZE. The manual:
Enables tracking of function call counts and time used. Specify pl to track only procedural-language functions, all to also track SQL and C language functions. The default is none, which disables function statistics tracking. Only superusers can change this setting.

Note: SQL-language functions that are simple enough to be "inlined" into the calling query will not be tracked, regardless of this setting.
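Enabling the setting for the current session might look like this (as the quoted passage notes, changing it requires superuser privileges):

```sql
SET track_functions = 'pl';  -- track procedural-language functions only
```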
Then check how often your function has actually been called, before and after executing your query:
SELECT calls
FROM pg_catalog.pg_stat_user_functions
WHERE funcid = 'fcalc'::regproc
Consult the manual for details about the cast 'fcalc'::regproc.
Aside
Postgres will also prioritize filters on the same level by their estimated cost. You can verify this with the tools I laid out above. Tinker with the COST setting of simple plpgsql functions …
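For example, assuming fcalc() takes two numeric arguments (the signature here is an assumption; adapt it to your actual function), you could declare it expensive so the planner prefers to evaluate cheaper predicates first:

```sql
-- Hypothetical signature; the default COST for a plpgsql function is 100.
ALTER FUNCTION fcalc(numeric, numeric) COST 10000;
```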