Efficient dimension and fact joining

I have a large fact table and a much smaller dimension table in a simple star schema:

--1.
CREATE TABLE dbo.Dim
(
Id INT NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
CustomerName VARCHAR(2000)
)
--index
CREATE UNIQUE NONCLUSTERED INDEX uniqueindex1 ON Dim(CustomerName);


--2. 
CREATE TABLE dbo.Fact
(
...
PurchaseDate DATE,
CustomerNameId INT CONSTRAINT fk1 FOREIGN KEY (CustomerNameId) REFERENCES dbo.Dim(Id)
...
)
--index
CREATE CLUSTERED COLUMNSTORE INDEX ccs ON dbo.Fact;

I run the following simple query, which filters the fact table and joins in the dimension:

SELECT sd.CustomerName,f.*
FROM dbo.Fact f
INNER JOIN dbo.Dim sd ON sd.Id = f.CustomerNameId
WHERE f.PurchaseDate IN (
'20000506',
'20000507',
'20000508',
'20000509',
'20000501',
'20000502',
'20000503'
)

We get the following ugly query plan:
[Image: query plan]

Interestingly, the dimension table is scanned in full, all 500,000 rows, in 4 iterations,
but in the end only a few thousand rows are needed for that date range of the fact table.

This is very inefficient with larger dimension tables: all the rows are scanned every time, as if the lookup table's indexes were not even there.

I would expect SQL Server to first restrict the fact table to the date range,
then use this limited set of CustomerNameId values to look up the CustomerName in the small dimension table with an index seek.
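
As a rough check of that expectation, the following sketch counts how many fact rows and how many distinct dimension keys the date filter touches. It uses only the tables and columns defined above; the column aliases FactRows and DistinctCustomers are illustrative names, not part of the original schema.

```sql
-- Count the fact rows in the date range and the distinct
-- dimension keys (CustomerNameId values) those rows reference.
SELECT COUNT(*) AS FactRows,
       COUNT(DISTINCT f.CustomerNameId) AS DistinctCustomers
FROM dbo.Fact f
WHERE f.PurchaseDate IN (
'20000506',
'20000507',
'20000508',
'20000509',
'20000501',
'20000502',
'20000503'
);
```

A DistinctCustomers value that is small relative to the size of dbo.Dim makes seeks look attractive on paper; whether they are actually cheaper than one scan plus a hash join depends on the optimizer's cost estimates and is best confirmed by measuring.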

  1. Is the star schema really this inefficient, or am I missing something here?
  2. In other words, how can I force SQL Server to build the limited set of CustomerNameId values first and look up only those (with a CTE, somehow)?

How to solve:

Method 1

Here’s a sample to play with:

--1.
CREATE TABLE dbo.Dim
(
Id INT NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
CustomerName VARCHAR(2000)
)
--index
CREATE UNIQUE NONCLUSTERED INDEX uniqueindex1 ON Dim(CustomerName);

with q as
(
   select top 100000 row_number() over (order by (select null)) rn
   from sys.messages m, sys.objects o
)
insert into dim(CustomerName) 
select concat('CustomerName',rn)
from q

--2. 

CREATE TABLE dbo.Fact
(
PurchaseDate DATE,
CustomerNameId INT CONSTRAINT fk1 FOREIGN KEY (CustomerNameId) REFERENCES dbo.Dim(Id)
)
--index
CREATE CLUSTERED COLUMNSTORE INDEX ccs ON dbo.Fact;


with q as
(
   select top 10000000 row_number() over (order by (select null)) rn
   from sys.messages m, sys.objects o
)
insert into Fact(PurchaseDate,CustomerNameId) 
select dateadd(day,rn%1000,'20000101'), 1+rn%100000
from q


SELECT sd.CustomerName,f.*
FROM dbo.Fact f
INNER JOIN dbo.Dim sd ON sd.Id = f.CustomerNameId
WHERE f.PurchaseDate IN (
'20000506',
'20000507',
'20000508',
'20000509',
'20000501',
'20000502',
'20000503'
)


SELECT sd.CustomerName,f.*
FROM dbo.Fact f
INNER LOOP JOIN dbo.Dim sd ON sd.Id = f.CustomerNameId
WHERE f.PurchaseDate IN (
'20000506',
'20000507',
'20000508',
'20000509',
'20000501',
'20000502',
'20000503'
)

The plan is here.

You’ll see that the loop join with the index seek is more expensive than scanning the dimension on each thread of the parallel execution and doing a hash join (the first timing below is the unhinted query, the second is the forced loop join):

(70000 rows affected)

 SQL Server Execution Times:
   CPU time = 62 ms,  elapsed time = 64 ms.

(70000 rows affected)

 SQL Server Execution Times:
   CPU time = 108 ms,  elapsed time = 90 ms.
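
Timings in this format are what SET STATISTICS TIME ON prints, so the comparison can be reproduced roughly as sketched below. The third query is an addition that is not part of the measurements above: it replaces the join hint with a FORCESEEK table hint on dbo.Dim, which asks for an index seek on the dimension without dictating the join type. Numbers will vary with hardware, SQL Server version, and data.

```sql
SET STATISTICS TIME ON;  -- print CPU and elapsed time for each statement
SET STATISTICS IO ON;    -- print logical reads per table

-- 1. Unhinted query: the optimizer picks the plan (a hash join in the test above).
SELECT sd.CustomerName, f.*
FROM dbo.Fact f
INNER JOIN dbo.Dim sd ON sd.Id = f.CustomerNameId
WHERE f.PurchaseDate IN ('20000506','20000507','20000508','20000509',
                         '20000501','20000502','20000503');

-- 2. Join hint: forces a nested loops join for this pair of tables.
SELECT sd.CustomerName, f.*
FROM dbo.Fact f
INNER LOOP JOIN dbo.Dim sd ON sd.Id = f.CustomerNameId
WHERE f.PurchaseDate IN ('20000506','20000507','20000508','20000509',
                         '20000501','20000502','20000503');

-- 3. Table hint (not measured above): requires an index seek on dbo.Dim.
SELECT sd.CustomerName, f.*
FROM dbo.Fact f
INNER JOIN dbo.Dim sd WITH (FORCESEEK) ON sd.Id = f.CustomerNameId
WHERE f.PurchaseDate IN ('20000506','20000507','20000508','20000509',
                         '20000501','20000502','20000503');

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

One design difference worth keeping in mind: a join hint such as INNER LOOP JOIN also enforces the join order as written in the query, while a table hint like FORCESEEK leaves the join order to the optimizer. Either way, hints override the cost-based choice, so they are worth keeping only if the measured times are actually better.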

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.