query performance gains by removing operator hash match inner join

All we need is an easy explanation of the problem, so here it is.

While trying to apply the contents of this question below to my own situation, I am a bit confused as how I could get rid of the operator Hash Match (Inner Join) if any way possible.

SQL Server query performance – removing need for Hash Match (Inner Join)

I noticed the cost of 10% and was wondering if I could reduce it.
See the query plan below.

enter image description here

This work comes from a query thad I had to tune today:

SELECT c.AccountCode, MIN(d.CustomerSID) 
FROM   Stage.Customer c 
INNER JOIN Dimensions.Customer d  ON c.Email = d.Email
                                  OR (
                                          c.HomePostCode = d.HomePostCode
                                       AND c.StrSurname = d.strSurname
                                                                    )
GROUP BY c.AccountCode

and after adding these indexes:

---------------------------------------------------------------------
-- Create the indexes
---------------------------------------------------------------------

CREATE NONCLUSTERED INDEX IDX_Stage_Customer_HOME_SURNAME_INCL
ON Stage.Customer(HomePostCode ,strSurname)
INCLUDE (AccountCode)
--WHERE HASEMAIL = 0
--WITH (ONLINE=ON, DROP_EXISTING = ON)
go


CREATE NONCLUSTERED INDEX IDX_Dimensions_Customer_HOME_SURNAME_INCL
ON Dimensions.Customer(HomePostCode ,strSurname)
INCLUDE (AccountCode,CustomerSID)
--WHERE HASEMAIL = 0
--WITH (ONLINE=ON, DROP_EXISTING = ON)
go



CREATE NONCLUSTERED INDEX IDX_Stage_Customer_EMAIL_INCL
ON Stage.Customer(EMAIL)
INCLUDE (AccountCode)
--WHERE HASEMAIL = 1
--WITH (ONLINE=ON, DROP_EXISTING = ON)
go


CREATE NONCLUSTERED INDEX IDX_Dimensions_Customer_EMAIL_INCL
ON Dimensions.Customer(EMAIL)
INCLUDE (AccountCode,CustomerSID)
--WHERE HASEMAIL = 1
--WITH (ONLINE=ON, DROP_EXISTING = ON)
go

this is the new query:

----------------------------------------------------------------------------
-- new query 
----------------------------------------------------------------------------

SELECT * 
FROM (    
SELECT AccountCode
     ,RO=ROW_NUMBER () OVER (PARTITION BY AccountCode ORDER BY CustomerSID)
     --,CustomerSID=MIN(CustomerSID) OVER (PARTITION BY AccountCode ORDER BY AccountCode)
       ,CustomerSID
FROM (    
          SELECT c.AccountCode, D.CustomerSID
       FROM   Stage.Customer c 
       INNER JOIN Dimensions.Customer d  ON c.Email = d.Email

          UNION ALL

          SELECT c.AccountCode, D.CustomerSID
       FROM   Stage.Customer c 
       INNER JOIN Dimensions.Customer d  ON c.HomePostCode = d.HomePostCode
                                        AND c.StrSurname = d.strSurname
) RADHE
) R1
WHERE RO = 1

This has reduced the query execution time from 8 minutes to 1 second.

Everybody is happy, but still I would like to know if I could get more done,
I.e. by somehow removing the hash match operator.

Why is it there at the first place, I am matching all the fields, why hash?

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

the following links will provide a good source of knowledge regarding execution plans.

From Execution Plan Basics — Hash Match Confusion I found:

From
http://sqlinthewild.co.za/index.php/2007/12/30/execution-plan-operations-joins/

“The hash join is one of the more expensive join operations, as it
requires the creation of a hash table to do the join. That said, it’s
the join that’s best for large, unsorted inputs. It is the most
memory-intensive of any of the joins

The hash join first reads one of the inputs and hashes the join column
and puts the resulting hash and the column values into a hash table
built up in memory. Then it reads all the rows in the second input,
hashes those and checks the rows in the resulting hash bucket for the
joining rows.”

which links to this post:

http://blogs.msdn.com/b/craigfr/archive/2006/08/10/687630.aspx

Can you explain this execution plan? provides good insights about the execution plan with, not specific to hash match but relevant.

The constant scans are a way for SQL Server to create a bucket into
which it’s going to place something later in the execution plan. I’ve
posted a more thorough explanation of it here. To understand what the
constant scan is for, you have to look further into the plan. In this
case, it’s the Compute Scalar operators that are being used to
populate the space created by the constant scan.

The Compute Scalar operators are being loaded up with NULL and the
value 1045876, so they’re clearly going to be used with the Loop Join
in an effort to filter the data.

The really cool part is that this plan is Trivial. It means that it
went through a minimal optimization process. All the operations are
leading up to the Merge Interval. This is used to create a minimal set
of comparison operators for an index seek (details on that here).

In this question:
Can I get SSMS to show me the Actual query costs in the Execution plan pane?
I’m fixing performance issues on a multistatement stored procedure in SQL Server. I want to know which part(s) I should spend time on.

I understand from How do I read Query Cost, and is it always a percentage? that even when SSMS is told to Include Actual Execution Plan, the “Query cost (relative to the batch)” figures is still based on cost estimates, which can be far off actuals

Measuring Query Performance : “Execution Plan Query Cost” vs “Time Taken”
gives good info for when you need to compare the performance of 2 different queries.

In Reading a SQL Server Execution plan you can find great tips for reading the execution plan.

Other questions/answers that I really liked because they are relevant to this subject, and for my personal reference I would like to quote are:

How to optimise T-SQL query using Execution Plan

can sql generate a good plan for this procedure?

Execution Plans Differ for the Same SQL Statement

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply