How does SQL Server estimate the number of rows for a less than (<) predicate?


I have been doing some testing to better understand how SQL Server uses a histogram to estimate the number of rows that will match an equality predicate, and also a < or > predicate.

I am using the AdventureWorks2016 OLTP database.

I can understand SQL Server’s estimation process for = and > predicates:

/* update stats with fullscan first */    
UPDATE STATISTICS Production.TransactionHistory WITH FULLSCAN

Then I can see the histogram for the column TransactionHistory.Quantity

DBCC SHOW_STATISTICS (
    'Production.TransactionHistory', 
    'Quantity')

The screenshot below shows the top end of the histogram, where I ran my tests:

[Screenshot: top end of the histogram for Quantity]

The following query estimates 6 rows because the predicate value is a RANGE_HI_KEY, so SQL Server uses the EQ_ROWS for that bucket:

SELECT  * 
FROM    Production.TransactionHistory
WHERE   Quantity = 2863

The following query estimates 1.36 rows because the value is not a RANGE_HI_KEY, so SQL Server uses the AVG_RANGE_ROWS for the bucket it falls in:

SELECT  * 
FROM    Production.TransactionHistory
WHERE   Quantity = 2862
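The equality rule described above can be checked against the histogram itself. The following is a minimal sketch, assuming the `sys.dm_db_stats_histogram` function is available (SQL Server 2016 SP1 CU2 or later) and that a statistics object exists with Quantity as its leading column. The `@value` variable and the `estimate` alias are illustrative names, and the CASE expression only mirrors the behaviour observed in these tests; it is not a statement of how the optimizer is implemented.

```sql
DECLARE @value int = 2862; -- literal from the query above; try 2863 as well

SELECT TOP (1)
    s.name AS stats_name,
    h.range_high_key,
    -- EQ_ROWS when the value is a step boundary, otherwise AVG_RANGE_ROWS
    estimate = CASE
                   WHEN CAST(h.range_high_key AS int) = @value THEN h.equal_rows
                   ELSE h.average_range_rows
               END
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
    ON  sc.object_id = s.object_id
    AND sc.stats_id = s.stats_id
    AND sc.stats_column_id = 1          -- leading column of the statistic
JOIN sys.columns AS c
    ON  c.object_id = sc.object_id
    AND c.column_id = sc.column_id
CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
WHERE s.object_id = OBJECT_ID(N'Production.TransactionHistory')
  AND c.name = N'Quantity'
  AND CAST(h.range_high_key AS int) >= @value
ORDER BY CAST(h.range_high_key AS int); -- first step whose high key covers the value
```

If more than one statistics object leads on Quantity, add a filter on `s.name` so the row comes from the histogram shown by DBCC SHOW_STATISTICS.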

The following "greater than" query estimates 130 rows, which appears to be the sum of the RANGE_ROWS and the EQ_ROWS for all the buckets with a RANGE_HI_KEY > 2863:

SELECT  * 
FROM    Production.TransactionHistory
WHERE   Quantity > 2863
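The sum described above can be reproduced directly from the histogram. This is a sketch under the same kind of assumptions as the observation itself: it relies on `sys.dm_db_stats_histogram` (SQL Server 2016 SP1 CU2 or later) and on a statistics object whose leading column is Quantity. The `rows_above` alias is an illustrative name.

```sql
SELECT
    s.name AS stats_name,
    -- RANGE_ROWS + EQ_ROWS of every step above the predicate value
    SUM(h.range_rows + h.equal_rows) AS rows_above
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
    ON  sc.object_id = s.object_id
    AND sc.stats_id = s.stats_id
    AND sc.stats_column_id = 1          -- leading column of the statistic
JOIN sys.columns AS c
    ON  c.object_id = sc.object_id
    AND c.column_id = sc.column_id
CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
WHERE s.object_id = OBJECT_ID(N'Production.TransactionHistory')
  AND c.name = N'Quantity'
  AND CAST(h.range_high_key AS int) > 2863
GROUP BY s.name;
```

If the observation holds, `rows_above` should match the 130 rows shown in the estimated plan.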

A similar query is below, but the value is not a RANGE_HI_KEY in the histogram. SQL Server again estimates 130 rows and appears to use the same method as above:

SELECT  * 
FROM    Production.TransactionHistory
WHERE   Quantity > 2870 

This all makes sense so far, so my testing moved on to a "less than" query:

SELECT  * 
FROM    Production.TransactionHistory
WHERE   Quantity < 490 

For this query, SQL Server estimates 109,579 rows, but I can’t work out where it has got that from:

EQ_ROWS + RANGE_ROWS of all buckets up to and including RANGE_HI_KEY 470 = 109,566, so we are 11 short somewhere.

How does SQL Server use the histogram to estimate the number of rows that will be returned by a "less than" predicate?

How to solve :


Method 1

For this query, SQL Server estimates 109,579 rows, but I can’t work out where it has got that from:

EQ_ROWS + RANGE_ROWS of all buckets up to and including RANGE_HI_KEY 470 = 109,566, so we are 11 short somewhere.

You’re 13 short, not 11: 109,579 – 109,566 = 13.

The general idea, as shown in my related answer, is to use linear interpolation within the partial step, assuming uniformity.

In your case:

[Screenshot: the histogram step with RANGE_HI_KEY 500, which follows the step with RANGE_HI_KEY 470]

So the question is how many of those 23 RANGE_ROWS do we expect to match the predicate < 490 when they are assumed to be distributed uniformly within the histogram step with RANGE_HI_KEY 500:

DECLARE
    @ARR float = 23e0 / 6e0, -- AVG_RANGE_ROWS
    @DRR float = 6e0,        -- DISTINCT_RANGE_ROWS
    @PR float = 490 - 470,   -- predicate range
    @SR float = 499 - 470    -- whole step range (excluding high key)

SELECT (@DRR - 1) * ((@PR - 1) / @SR) / ((@SR - 1) / @SR) * @ARR;

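This computation gives 13.00595. Added to the 109,566 rows from the steps up to and including RANGE_HI_KEY 470, that makes 109,579.006, which matches the estimate of 109,579.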

The -1 factors account for using <, which is assumed to exclude one of the DISTINCT_RANGE_ROWS values. When <= is used, that value is assumed to match the predicate.

The whole thing is a modification of applying the fraction of the range you are asking for versus the range covered by the histogram step. Without excluding the unmatched value, it would be simply @PR/@SR.
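To make the effect of the adjustment concrete, the sketch below puts the adjusted calculation next to the unadjusted fraction for the same step. It uses the step values quoted above (23 RANGE_ROWS, 6 DISTINCT_RANGE_ROWS). The `@RR` variable and both column aliases are illustrative names, and the first expression is only an algebraic simplification of the formula given earlier, since the two divisions by `@SR` cancel.

```sql
DECLARE
    @RR  float = 23e0,       -- RANGE_ROWS of the step with RANGE_HI_KEY 500
    @DRR float = 6e0,        -- DISTINCT_RANGE_ROWS of that step
    @PR  float = 490 - 470,  -- predicate range
    @SR  float = 499 - 470;  -- whole step range (excluding high key)

SELECT
    -- same value as the formula above: one distinct value is excluded for <
    adjusted_for_less_than = (@DRR - 1) * (@PR - 1) / (@SR - 1) * (@RR / @DRR),
    -- plain linear interpolation: RANGE_ROWS scaled by @PR / @SR
    plain_fraction         = @RR * @PR / @SR;
```

The first column comes out at roughly 13.006 and the second at roughly 15.862, so only the adjusted form accounts for the 13 rows missing from the running total.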


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
