Order By causes a scan on a large table

I have the following query:

SELECT TOP 100 ID
FROM [dbo].[TableName] WITH (NOLOCK)
WHERE TypeId = 2
    AND DateTimeUTC < '2022-Aug-04 07:02:40'
    AND DateTimeUTC > '4/26/2022 7:36:36 AM'
ORDER BY ID ASC

The table [dbo].[TableName] (Not its real name, btw) has just over 118 million rows.

I’ve created the following index on this table:

CREATE INDEX [ix_TableName_DateTimeUTC_TypeId] 
ON [dbo].[TableName] (DateTimeUTC, TypeId)
    WITH FILLFACTOR = 90;

If I run this query (excluding the ORDER BY), the query performs a SEEK on the above index, and completes instantly. However, as soon as I include the ORDER BY, the query performs a SCAN instead on the PK, reading all 118+ million rows. As you can imagine, this tanks the performance and the query takes a long time to finish.

The simplest way to resolve this problem would be to remove the ORDER BY clause altogether. However, I don’t think that’s possible, because the application that makes this call requires the data to be returned in order.

Any suggestions on how to improve this?

How to solve :

Method 1

I would change the index to look like this:

CREATE INDEX 
    [TypeId_Id_DateTimeUTC] 
ON [dbo].[TableName] 
(
    TypeId, 
    Id, 
    DateTimeUTC
)
WITH 
(
    FILLFACTOR = 100,
    SORT_IN_TEMPDB = ON
);

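The idea is to make the initial data location and sorting free, and also support the range predicate: the equality on TypeId becomes a seek, the rows for that TypeId are already stored in Id order so no sort is needed, and DateTimeUTC is present in the index so the date range can be checked there as a residual predicate. I discuss this in some detail in these blog posts: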

Let’s Design A SQL Server Index Together Part 1, Part 2, Part 3.

It is usually better, as a practical matter, to avoid a sort than to avoid a residual predicate.
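
One way to check whether the new index actually removes the scan is to compare logical reads before and after creating it. This is only a sketch: it reuses the query from the question, with the date literals rewritten in ISO 8601 form, and assumes you can run it against a test copy of the table.

```sql
-- Report I/O and timing for the statements that follow.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT TOP 100 ID
FROM [dbo].[TableName] WITH (NOLOCK)
WHERE TypeId = 2
    AND DateTimeUTC < '2022-08-04T07:02:40'
    AND DateTimeUTC > '2022-04-26T07:36:36'
ORDER BY ID ASC;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

If the index is being used as intended, the logical reads reported in the Messages output should drop sharply compared with the clustered index scan, and the actual execution plan should show a seek with no Sort operator.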

Method 2

You should use a consistent unambiguous format for datetime literals. It is weird having two entirely different formats for the > and < predicates.
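
As an illustration, here is the original query with both bounds written in the ISO 8601 form 'YYYY-MM-DDThh:mm:ss', which SQL Server interprets the same way regardless of the session's language or DATEFORMAT settings. The values assume the second literal in the question was meant as month/day/year (26 April 2022).

```sql
SELECT TOP 100 ID
FROM [dbo].[TableName] WITH (NOLOCK)
WHERE TypeId = 2
    AND DateTimeUTC < '2022-08-04T07:02:40'  -- was '2022-Aug-04 07:02:40'
    AND DateTimeUTC > '2022-04-26T07:36:36'  -- was '4/26/2022 7:36:36 AM'
ORDER BY ID ASC;
```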

DateTimeUTC, TypeId is not the optimal order for that index.

Columns used in equality conditions should be listed first, so if this index is specifically to optimise that query, TypeId should be listed first (TypeId, DateTimeUTC). Otherwise the best it can do is a range seek on the date part and a residual predicate on TypeId.
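
A minimal sketch of that reordered index follows. The index name is made up for illustration, and the definition assumes ID is the clustered primary key (the question mentions a scan on the PK), in which case it is carried in the nonclustered index automatically and does not need to be listed.

```sql
-- Hypothetical name; equality column first, range column second.
CREATE INDEX [ix_TableName_TypeId_DateTimeUTC]
ON [dbo].[TableName] (TypeId, DateTimeUTC)
    WITH (FILLFACTOR = 90);
```

With this key order, a single seek can position on TypeId = 2 and then read only the requested DateTimeUTC range. The rows come back in date order, though, so a sort on ID is still needed before the TOP.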

If you do make that indexing change and still see the scan on the clustered index, this is presumably because SQL Server thinks it is quicker to read the rows from a source that already has them in the desired order and discard the non-matching ones than to sort them at run time. Because of the TOP 100, it only needs to find the first 100 matches and can then stop the scan.

You may well have a case similar to the issue here, where date is largely correlated with id rather than being independent of it, so the optimizer underestimates the number of rows that will need to be read in id order before it finds 100 matching the predicate.

Assuming ID is an ascending identity column, and given that your DateTimeUTC predicate ends today, the matching rows will likely all be at the end of the index rather than scattered evenly throughout it, so this is pretty much the worst case.

Possible query hints to look at are DISABLE_OPTIMIZER_ROWGOAL, to remove the row goal effect from the TOP, or FORCESEEK, to just tell it to use the seek anyway.
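
For reference, the two hints could be applied to the query roughly as below. Treat this as a sketch to test rather than a recommendation: the date literals are rewritten in ISO 8601 form, and as far as I know the USE HINT syntax needs SQL Server 2016 SP1 or later.

```sql
-- Option A: remove the row goal so the TOP no longer makes the
-- ordered clustered index scan look cheap.
SELECT TOP 100 ID
FROM [dbo].[TableName] WITH (NOLOCK)
WHERE TypeId = 2
    AND DateTimeUTC < '2022-08-04T07:02:40'
    AND DateTimeUTC > '2022-04-26T07:36:36'
ORDER BY ID ASC
OPTION (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'));

-- Option B: force an index seek on this table instead of a scan.
SELECT TOP 100 ID
FROM [dbo].[TableName] WITH (NOLOCK, FORCESEEK)
WHERE TypeId = 2
    AND DateTimeUTC < '2022-08-04T07:02:40'
    AND DateTimeUTC > '2022-04-26T07:36:36'
ORDER BY ID ASC;
```

Either hint overrides the optimizer's own choice, so it is worth re-checking the plan whenever the data distribution or the indexes change.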

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.