Optimal Indexing Strategy for Datawarehouse and Data Lake Updates

All we need is an easy explanation of the problem, so here it is.

We have a sql server database we use as a data lake and a datawarehouse. Each table in the database has some standardized definition as we are at 600 or so tables now, so maintenance needs to be somewhat automated.

The general process for loading each table is to first load a copy of the table into a hash table in the changeLog schema (sometimes with only the changed records if we can determine what the changed records are), then compare the changeLog table to the destination table. The destination table is used for reporting so this changeLog approach allows us to persist the destination table and only apply minimal UPDATE/INSERT operations.

Each destination table has a Unique Key/Business Key that is identifiable through a configuration table AND has standardized audit columns that are named the same in every table. The audit columns tell us

  1. when the record was added to the datawarehouse
  2. when the record was last updated in the datawarehouse
  3. whether or not the record has been deleted in the source
  4. a changed record identifier using HASHBYTES(‘SHA2_256’, CONCAT())

The changed record identifier used to be CHECKSUM() but we found that CHECKSUM() has too high a collision rate to be trustworthy. I have just added a HASHBYTES() column to every table and populated it.

I created the HASHBYTES() column as VARBINARY(MAX). Now, every time a table is being loaded, we can tell if a record needs to be updated by comparing the new HASHBYTES() value, calculated in the changeLog table, to the persisted one in the destination table.

I have immediately noticed that the switch from the INT CHECKSUM() to the VARBINARY(MAX) HASHBYTES() has caused the update checking process to slow significantly. I have NONCLUSTERED indexes on every CHECKSUM column but not on the HASHBYTES column that I just added. There are also clustered indexes for each tables’ unique key.

  1. What is the ideal index to add to check for updates?
  2. Is there a standardized index I can add to each table?
  3. Is VARBINARY(MAX) the proper data type or can I safely bring it down to a smaller size?

Hopefully that is enough background for this question to make sense. I need to speed up this process as soon as possible.

EDIT: I am adding a large SQL script as an example, it has an example table definition for the table’s changeLog version and destination version as well as the query that is run to update the destination version.

--OBJECT DEFINITIONS
CREATE TABLE [changeLog].[DimSalesOffice](
    [Sales Office Code] [varchar](3) NOT NULL,
    [CCN_Key] [uniqueidentifier] NOT NULL,
    [Sales Office Name (Short)] [nvarchar](14) NOT NULL,
    [Sales Office Name (Long)] [nvarchar](50) NOT NULL,
    [Sales Office City] [nvarchar](50) NOT NULL,
    [Sales Office StateProvince] [nvarchar](50) NOT NULL,
    [Sales Office Postal Code] [nvarchar](20) NOT NULL,
    [Sales Office Country Code] [varchar](3) NOT NULL,
    [Sales Office Address] [nvarchar](65) NOT NULL,
    [Sales Office Address (Line 2)] [nvarchar](65) NOT NULL,
    [Sales Office Name (Short - Native Language)] [nvarchar](14) NOT NULL,
    [Sales Office Name (Long - Native Language)] [nvarchar](50) NOT NULL,
    [Sales Office Address (Native Language)] [nvarchar](65) NOT NULL,
    [Sales Office Address (Line 2 - Native Language)] [nvarchar](65) NOT NULL,
    [Sales Office City (Native Language)] [nvarchar](100) NOT NULL,
    [Sales Office StateProvince (Native Language)] [nvarchar](100) NOT NULL,
    [Sales Office Region] [nvarchar](100) NULL,
    [Native Language Code] [varchar](2) NOT NULL
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[DimSalesOffice](
    [SalesOffice_Key] [uniqueidentifier] NOT NULL,
    [Sales Office Code] [varchar](3) NOT NULL,
    [CCN_Key] [uniqueidentifier] NOT NULL,
    [Sales Office Name (Short)] [nvarchar](14) NOT NULL,
    [Sales Office Name (Long)] [nvarchar](50) NOT NULL,
    [Sales Office City] [nvarchar](50) NOT NULL,
    [Sales Office StateProvince] [nvarchar](50) NOT NULL,
    [Sales Office Postal Code] [nvarchar](20) NOT NULL,
    [Sales Office Country Code] [varchar](3) NOT NULL,
    [Sales Office Address] [nvarchar](65) NOT NULL,
    [Sales Office Address (Line 2)] [nvarchar](65) NOT NULL,
    [Sales Office Name (Short - Native Language)] [nvarchar](14) NOT NULL,
    [Sales Office Name (Long - Native Language)] [nvarchar](50) NOT NULL,
    [Sales Office Address (Native Language)] [nvarchar](65) NOT NULL,
    [Sales Office Address (Line 2 - Native Language)] [nvarchar](65) NOT NULL,
    [Sales Office City (Native Language)] [nvarchar](100) NOT NULL,
    [Sales Office StateProvince (Native Language)] [nvarchar](100) NOT NULL,
    [Sales Office Region] [nvarchar](100) NULL,
    [Native Language Code] [varchar](2) NOT NULL,
    [DW_CreatedOn] [datetime2](7) NULL,
    [DW_ModifiedOn] [datetime2](7) NULL,
    [DW_IsDeleted?] [bit] NULL,
    [DW_Checksum] [int] NULL,
    [Source_ModifiedOn] [datetime2](7) NULL,
    [DW_Hashbytes] [varbinary](max) NULL,
PRIMARY KEY NONCLUSTERED 
(
    [SalesOffice_Key] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

ALTER TABLE [dbo].[DimSalesOffice] ADD  DEFAULT (newsequentialid()) FOR [SalesOffice_Key]
GO

CREATE UNIQUE CLUSTERED INDEX [IX_UK_DimSalesOffice] ON [dbo].[DimSalesOffice]
(
    [Sales Office Code] ASC
)
GO

--MERGE QUERY
DECLARE @InsertRecordCount INT, @UpdateRecordCount INT;

/*****UPDATE*****/
UPDATE [dbo].[DimSalesOffice] SET
    [CCN_Key] = [Source].[CCN_Key],
    [Sales Office Name (Short)] = [Source].[Sales Office Name (Short)],
    [Sales Office Name (Long)] = [Source].[Sales Office Name (Long)],
    [Sales Office City] = [Source].[Sales Office City],
    [Sales Office StateProvince] = [Source].[Sales Office StateProvince],
    [Sales Office Postal Code] = [Source].[Sales Office Postal Code],
    [Sales Office Country Code] = [Source].[Sales Office Country Code],
    [Sales Office Address] = [Source].[Sales Office Address],
    [Sales Office Address (Line 2)] = [Source].[Sales Office Address (Line 2)],
    [Sales Office Name (Short - Native Language)] = [Source].[Sales Office Name (Short - Native Language)],
    [Sales Office Name (Long - Native Language)] = [Source].[Sales Office Name (Long - Native Language)],
    [Sales Office Address (Native Language)] = [Source].[Sales Office Address (Native Language)],
    [Sales Office Address (Line 2 - Native Language)] = [Source].[Sales Office Address (Line 2 - Native Language)],
    [Sales Office City (Native Language)] = [Source].[Sales Office City (Native Language)],
    [Sales Office StateProvince (Native Language)] = [Source].[Sales Office StateProvince (Native Language)],
    [Sales Office Region] = [Source].[Sales Office Region],
    [Native Language Code] = [Source].[Native Language Code],
    [DW_Checksum] =
        CHECKSUM(
            [Source].[CCN_Key],
            [Source].[Sales Office Name (Short)],
            [Source].[Sales Office Name (Long)],
            [Source].[Sales Office City],
            [Source].[Sales Office StateProvince],
            [Source].[Sales Office Postal Code],
            [Source].[Sales Office Country Code],
            [Source].[Sales Office Address],
            [Source].[Sales Office Address (Line 2)],
            [Source].[Sales Office Name (Short - Native Language)],
            [Source].[Sales Office Name (Long - Native Language)],
            [Source].[Sales Office Address (Native Language)],
            [Source].[Sales Office Address (Line 2 - Native Language)],
            [Source].[Sales Office City (Native Language)],
            [Source].[Sales Office StateProvince (Native Language)],
            [Source].[Sales Office Region],
            [Source].[Native Language Code],
            0
        ),
    [DW_Hashbytes] = 
        HASHBYTES(
            'SHA2_256',
            ISNULL(CAST([Source].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
            + '0'
        ),
    [Source_ModifiedOn] = NULL,
    [DW_ModifiedOn] = GETUTCDATE(),
    [DW_IsDeleted?] = 0
FROM [changeLog].[DimSalesOffice] [Source]
JOIN [dbo].[DimSalesOffice]
    ON [Source].[Sales Office Code] = [DimSalesOffice].[Sales Office Code]
    AND ISNULL([DimSalesOffice].[DW_Hashbytes], HASHBYTES('SHA2_256', '')) <> HASHBYTES(
            'SHA2_256',
            ISNULL(CAST([Source].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
            + '0'
        )
SET @UpdateRecordCount = @@ROWCOUNT;

/*****Soft Deletes*****/
UPDATE [dbo].[DimSalesOffice] SET
    [DW_Checksum] = 0,
    [DW_Hashbytes] = 
        HASHBYTES(
            'SHA2_256',
            ISNULL(CAST([DimSalesOffice].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([DimSalesOffice].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
            + '1'
        ),
    [Source_ModifiedOn] = NULL,
    [DW_ModifiedOn] = GETUTCDATE(),
    [DW_IsDeleted?] = 1
FROM [dbo].[DimSalesOffice]
WHERE NOT EXISTS
    (
        SELECT 1
        FROM [changeLog].[DimSalesOffice] [Source]
        WHERE [Source].[Sales Office Code] = [DimSalesOffice].[Sales Office Code]
    )
SET @UpdateRecordCount = @UpdateRecordCount + @@ROWCOUNT;

/*****INSERT*****/
INSERT INTO [dbo].[DimSalesOffice]
    (
        [Sales Office Code],
        [CCN_Key],
        [Sales Office Name (Short)],
        [Sales Office Name (Long)],
        [Sales Office City],
        [Sales Office StateProvince],
        [Sales Office Postal Code],
        [Sales Office Country Code],
        [Sales Office Address],
        [Sales Office Address (Line 2)],
        [Sales Office Name (Short - Native Language)],
        [Sales Office Name (Long - Native Language)],
        [Sales Office Address (Native Language)],
        [Sales Office Address (Line 2 - Native Language)],
        [Sales Office City (Native Language)],
        [Sales Office StateProvince (Native Language)],
        [Sales Office Region],
        [Native Language Code],
        [DW_Checksum],
        [DW_Hashbytes],
        [Source_ModifiedOn],
        [DW_ModifiedOn],
        [DW_IsDeleted?],
        [DW_CreatedOn]
    )
SELECT
    [Source].[Sales Office Code],
    [Source].[CCN_Key],
    [Source].[Sales Office Name (Short)],
    [Source].[Sales Office Name (Long)],
    [Source].[Sales Office City],
    [Source].[Sales Office StateProvince],
    [Source].[Sales Office Postal Code],
    [Source].[Sales Office Country Code],
    [Source].[Sales Office Address],
    [Source].[Sales Office Address (Line 2)],
    [Source].[Sales Office Name (Short - Native Language)],
    [Source].[Sales Office Name (Long - Native Language)],
    [Source].[Sales Office Address (Native Language)],
    [Source].[Sales Office Address (Line 2 - Native Language)],
    [Source].[Sales Office City (Native Language)],
    [Source].[Sales Office StateProvince (Native Language)],
    [Source].[Sales Office Region],
    [Source].[Native Language Code],
    [DW_Checksum] = 
        CHECKSUM(
            [Source].[CCN_Key],
            [Source].[Sales Office Name (Short)],
            [Source].[Sales Office Name (Long)],
            [Source].[Sales Office City],
            [Source].[Sales Office StateProvince],
            [Source].[Sales Office Postal Code],
            [Source].[Sales Office Country Code],
            [Source].[Sales Office Address],
            [Source].[Sales Office Address (Line 2)],
            [Source].[Sales Office Name (Short - Native Language)],
            [Source].[Sales Office Name (Long - Native Language)],
            [Source].[Sales Office Address (Native Language)],
            [Source].[Sales Office Address (Line 2 - Native Language)],
            [Source].[Sales Office City (Native Language)],
            [Source].[Sales Office StateProvince (Native Language)],
            [Source].[Sales Office Region],
            [Source].[Native Language Code],
            0
        ),
    [DW_Hashbytes] = 
        HASHBYTES(
            'SHA2_256',
            ISNULL(CAST([Source].[CCN_Key] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Short)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Long)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office City] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office StateProvince] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Postal Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Country Code] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Line 2)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Short - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Name (Long - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Address (Line 2 - Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office City (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office StateProvince (Native Language)] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Sales Office Region] AS NVARCHAR(MAX)), '') + '|'
            + ISNULL(CAST([Source].[Native Language Code] AS NVARCHAR(MAX)), '') + '|'
            + '0'
        ),
    [Source_ModifiedOn] = NULL,
    [DW_ModifiedOn] = GETUTCDATE(),
    [DW_IsDeleted?] = 0,
    [DW_CreatedOn] = GETUTCDATE()
FROM [changeLog].[DimSalesOffice] [Source]
WHERE NOT EXISTS
    (
        SELECT 1
        FROM [dbo].[DimSalesOffice]
        WHERE [Source].[Sales Office Code] = [DimSalesOffice].[Sales Office Code]
    )
SET @InsertRecordCount = @@ROWCOUNT;

SELECT [Update Record Count] = @UpdateRecordCount, [Insert Record Count] = @InsertRecordCount;

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

Answering in reverse because it makes more linguistic sense:

  1. Is VARBINARY(MAX) the proper data type or can I safely bring it down to a smaller size?

As I mentioned in the comments, SHA2_256 means it hashes an output to 256 bits aka 32 bytes (8 bits in 1 byte) in length, meaning the max VARBINARY size you’ll ever need is VARBINARY(32). This is mentioned in the docs for HASBYTES():

The output conforms to the algorithm standard: 128 bits (16 bytes) for MD2, MD4, and MD5; 160 bits (20 bytes) for SHA and SHA1; 256 bits (32 bytes) for SHA2_256, and 512 bits (64 bytes) for SHA2_512.

So, yes you can safely drop the size of your VARBINARY down to VARBINARY(32).

  1. What is the ideal index to add to check for updates?

  2. Is there a standardized index I can add to each table?

Yes, after you reduce the size of your VARBINARY field, it can then be added to an index on the table. I would recommend a nonclustered index that leads with the primary key field and then has the HASHBYTES() calculated field second in the definition.

The reason for making it nonclustered is because you typically want to avoid hot columns in your indexes, especially the clustered index which gets stored with every nonclustered index. Hot columns that are updated frequently cause a lot of writes to the index, and in the case of a clustered index those writes would have to happen to every nonclustered index too (since the clustered index is stored with it). A row hash on all columns will certainly change frequently.

Leading with the primary key field makes sense because you’ll want to first join by that field to match up the same rows and then by the HASHBYTES() calculated field to check if they’re different.

How would you improve the hashbyte calculation? I was considering indexing a computed column.

Yes you can use a computed column (doesn’t even need to be persisted but can be) and index it, alternatively you can create an indexed view too, I’ve done both before. I’d shoot for the computed column first as it’s a little more flexible than indexed views, couples the actual row hash in the row in the table itself, and is one less object to manage.

Method 2

The output length of HASHBYTES depends on the algorithm used. SHA2_256 produces 256 bits or 32 bytes. This is in the documentation. Declaring the columns as binary(32) is fine. The system I currently work on does this.

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply