Why does SQL Server not have 200 buckets in the statistics histogram when there are >100k distinct values in the table?


Given that I am using the AdventureWorks2016 OLTP database, why does the statistics histogram for the index PK_TransactionHistory_TransactionID on table Production.TransactionHistory contain only 3 histogram "buckets" when there are 113k distinct values in that column?

An example below:

USE AdventureWorks2016

/* ensure statistics are as accurate as they can be */
UPDATE STATISTICS Production.TransactionHistory WITH FULLSCAN

Then we can look at the updated histogram:

/* look at the statistics for the primary key column */
DBCC SHOW_STATISTICS (
    'Production.TransactionHistory', 
    'PK_TransactionHistory_TransactionID')
WITH HISTOGRAM;

and I see the output:

(Screenshot: DBCC SHOW_STATISTICS histogram output showing only 3 steps)

Note the max and min Transaction IDs:

SELECT MIN(TransactionID) FROM Production.TransactionHistory /* 100000 */
SELECT MAX(TransactionID) FROM Production.TransactionHistory /* 213442 */

SQL Server seems to have created one "bucket" for the minimum value, one for the maximum value, and one for all the values in between (which it knows are all distinct).
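As a sanity check (plain Python arithmetic, not part of the original post), a contiguous run of unique keys is fully described by the endpoints alone: the middle step's DISTINCT_RANGE_ROWS is simply max − min − 1, so no extra steps would add information.

```python
# MIN and MAX TransactionID values from the queries above
min_id, max_id = 100000, 213442

# The endpoint steps each cover exactly one value (EQ_ROWS = 1);
# the middle step counts every value strictly between them.
distinct_range_rows = max_id - min_id - 1
print(distinct_range_rows)  # 113441
```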

I note that if I remove the primary key from this table:

ALTER TABLE Production.TransactionHistory DROP CONSTRAINT PK_TransactionHistory_TransactionID

and then insert some duplicate values:

SET IDENTITY_INSERT Production.TransactionHistory ON; /* TransactionID is an IDENTITY column */

INSERT INTO [Production].[TransactionHistory]
(
    TransactionID,
    [ProductID],
    [ReferenceOrderID],
    [ReferenceOrderLineID],
    [TransactionDate],
    [TransactionType],
    [Quantity],
    [ActualCost],
    [ModifiedDate]
)
VALUES
(200001,1,1,1,GETDATE(),'P',1,1,GETDATE()),
(200011,1,1,1,GETDATE(),'P',1,1,GETDATE()),
(200021,1,1,1,GETDATE(),'P',1,1,GETDATE()),
(200031,1,1,1,GETDATE(),'P',1,1,GETDATE());

SET IDENTITY_INSERT Production.TransactionHistory OFF;

we can update the stats on the table and then look at the statistics for the column (rather than the PK, which we have dropped):

USE AdventureWorks2016

/* ensure statistics are as accurate as they can be */
UPDATE STATISTICS Production.TransactionHistory WITH FULLSCAN

/* look at the auto-created statistics for the TransactionID column */
DBCC SHOW_STATISTICS (
    'Production.TransactionHistory', 
    'TransactionID')
WITH HISTOGRAM;

We still have two buckets, though DISTINCT_RANGE_ROWS has been updated accordingly:

(Screenshot: DBCC SHOW_STATISTICS histogram output after inserting the duplicates)

Why does SQL Server not make use of the 200 "buckets" available in a histogram here? Is it something to do with the resources required to fill the 8KB statistics page, and that using all 200 buckets would mean they might need to be redefined whenever new data is added to the table?

How to solve:


Method 1

The histogram in this case is nearly indistinguishable from the one that existed before the 4 duplicate values were inserted. At that time, the unique and sequential series could be completely described by three steps.

The only difference is that range rows was 113441 instead of 113445, distinct range rows was still 113441, and avg range rows was 1 instead of 1.000035.
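Those figures are internally consistent: avg range rows is just range rows divided by distinct range rows. A quick check in plain Python (an illustrative calculation, using the numbers quoted above):

```python
distinct_range_rows = 113441   # unchanged by the 4 duplicate inserts
range_rows_before = 113441     # before the duplicates: all rows distinct
range_rows_after = 113445      # after: 4 extra rows in the same range

# AVG_RANGE_ROWS = RANGE_ROWS / DISTINCT_RANGE_ROWS
print(range_rows_before / distinct_range_rows)           # 1.0
print(round(range_rows_after / distinct_range_rows, 6))  # 1.000035
```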

So. Wouldn’t it be better to capture the four duplicates in what can be an up-to-200 plus NULL slot histogram?

Nah, not necessarily.

Why? Because optimizer stats aren't just for a moment in time; they have to serve until the next time the stats are updated. Since the table has more than 25,000 rows, the default auto-update threshold in SQL Server 2016 and onward is SQRT(1000 * rows). In this case that works out to COLMODCTR > 10651.06, so there will be no auto-update until at least 10,652 modifications to TransactionID, a column which, as we've already seen, can contain duplicates. What general value is there in recording 4 duplicates among an otherwise unique sequential series, when at least 10,652 more modifications will occur before the next auto-update? Those modifications may be deletes creating holes in the series, duplicates of a few or many values, or a range of unique, sequential numbers starting with the previous max + 1.

Optimizer stats, like all of the work done by the optimizer, aren't meant to achieve the best case for all circumstances regardless of effort or time. Rather, they provide a "good enough" result given the effort and time, taking into account modeling limitations within cardinality estimation and other optimizer work.

This is one reason that query-informed schema shaping with constraints, indexes and statistics will always be important. Also a reason that schema-informed query shaping, including both T-SQL code format and hints, will always be important 🙂


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under cc by-sa 2.5, cc by-sa 3.0 or cc by-sa 4.0.
