Nested CTEs Returning Incorrect Results in Certain Situations

All we need is an easy explanation of the problem, so here it is.

OK, first let me say I got this issue from:

I was trying to assist to figure out the issue, but got stumped when trying to debug the code one step at a time. I know the issue is due to the nested CTEs (because during debug if you dump each step aka cteX into temp tables the correct results are achieved) but not knowing how they work "under the hood" I cannot explain it in a sensible fashion outside of "it dont work yo." I suspect that it has something to do with how the compiler is trying to evaluate them all at the same time during run time but without more context I cant say for sure.

My question is merely about trying to understand how they work under the hood and how it relates to this situation. Now that I’m involved, I just want to understand the issue so I can speak to it in the future and learn something fun in the mean time.

Whoever answers here can also cross post on SO and answer there as well.

Code Set up:

declare @t1 TABLE (ID varchar(max),Action varchar(max), DateTime datetime );
INSERT INTO @t1
Select *
from
(
VALUES 
('w2337','Open','2020-11-06 12:28:10.000'),
('w2337','Hold','2021-06-14 14:50:59.000'),
('w2337','Open','2021-06-14 14:51:26.000'),
('w2337','Hold','2021-06-15 14:50:59.000'),
('w2337','Open','2021-06-17 14:51:26.000'),
('w2337','Open','2021-06-18 14:51:26.000')

) t (ID, Action, DateTime);

with cte1 as (
select [ID],[Action],[DateTime]
,ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) as [RegIndex]
,DENSE_RANK () OVER (ORDER BY ID) as [Index by ID]
,ROW_NUMBER() OVER (PARTITION BY ID ORDER BY [DateTime]) as [Index by DateTimeID]
,CASE when [Action]='Hold' then ROW_NUMBER() OVER (PARTITION BY ID,Action ORDER BY DateTime) end as [TimesHeld]
FROM @t1 
)
,cte2 as (
select *, MAX([TimesHeld]) OVER (PARTITION BY ID ORDER BY RegIndex ROWS UNBOUNDED PRECEDING) as [FD] from cte1 
)
,cte3 as(
select *, CASE when [Action]='Open' then ROW_NUMBER() OVER (PARTITION BY ID,Action ORDER BY DateTime) end as [TimesOpened]
from cte2 
where FD is not null
)
select 
    a.*, ' ' as thing, b.DateTime  -- b.*   alternating between the direct column versus all is the issue
from cte3 a 
LEFT OUTER JOIN cte3 b
ON a.ID=b.ID and a.TimesHeld=b.TimesOpened 
where a.TimesHeld is not null and b.TimesOpened is not null

Details:

When compiling the queries below, in the final LEFT OUTER JOIN, if you select b.* you get the correct results. However if you just select the one column (datetime for example) the results are not correct.

Correct:

Nested CTEs Returning Incorrect Results in Certain Situations

Incorrect:

Nested CTEs Returning Incorrect Results in Certain Situations

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

The problem isn’t incorrect results, but rather non-deterministic ordering of the ROW_NUMBER window function.

I’m going to reference Itzik Ben-Gan’s great article, definitely check it out in full when you have the time: Row numbers with nondeterministic order

The way RegIndex is calculated in cte1 is explicitly non-deterministic:

ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) as [RegIndex]

Ordering by a subquery that produces a constant value lets the optimizer know that order does not matter here:

…with this solution, SQL Server’s optimizer recognizes that there’s no ordering relevance, and therefore doesn’t impose an unnecessary sort or limit the storage engine’s choices to ones that must guarantee order

Following from that, FD in cte2 is based on RegIndex, and is thus non-deterministic as well:

MAX([TimesHeld]) OVER (PARTITION BY ID ORDER BY RegIndex ROWS UNBOUNDED PRECEDING) as [FD]

Finally, the query in cte3 is filtered by the value of FD:

where FD is not null

The end result is that a different set of rows can end up in cte3 depending on how the query gets optimized. And the number and data types of the columns included in the SELECT list can absolutely influence how the query is optimized, thus resulting in different results here.

Method 2

How do you expect deterministic results when the first cte (cte1) uses:

ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) as RegIndex

and then that Regindex column is used in the following CTEs?

OVER (ORDER BY (SELECT NULL)) basically says to the optimizer "choose any order you like" and give me row numbers.

It’s very well expected that the next time you run it, you may get (as you do) a different order – since you say again "choose any order you like" – and thus different results.

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply