NUMAnodes and SQL Server performance

All we need is an easy explanation of the problem, so here it is.

Pretty sure I understand this, but wanted to be sure I do.
We have a SQL Server 2016, that is running with 2 NumaNodes, each with 8 vCPUs. The Max Degrees of Parallelism (MAXDOP) is set to 8.

This doesn’t sound right to me. First question: Is that as bad an idea as I think it is?

From my research, I need to tell them to reduce the VM Settings to make this run in a single NUMANode. We seem to be having random periods where queries that were running in 170 ms, are now timing out at 30+ seconds! So, we do a quick look, and it is 5% CPU use, and low disk I/O use, and reasonable network use… Basically, the machine is about idle.. We also looked for queries waiting on locks, and there were none. We are running the queries on the Secondary in a AG Group (a Read Only Query)

So, my guess: It has gotten enough load that it switched and ran one the view in question (gets run about 4,000 times per day) on one of the vCPUs in the 2nd NUMANode, and then decided that execution plan should be always run on that node. The result is that all the data it is accessing is cached in the other node’s memory, and it needs to fetch it across the inter-node link (remote memory), so it does that, but that ends up being a lot slower (170 times?), and the queries are now all stacking up running more and more of them across this remote link… until it will always timeout because the remote memory is saturated…

Is that analysis valid? I would hate to submit this as a solution to why the query suddenly jumps up in time if this is totally not correct. And it will be hard to convince them they will get better performance with 8 CPUs than with 16.

Oh, additional evidence to back up my claim: If I select * into #tmp from myView OPTION (MAXDOP 16) then I get a performance change of about -5% to -12% – meaning that it takes longer to run the query than if I just used the 8 vCPUs. However, not massively so.

So my question is: Is there any validity to my analysis?

Update: A couple other things, I got a lot of the information from: https://codenotary-compliance.medium.com/vmware-vsphere-why-checking-numa-configuration-is-so-important-9764c16a7e73

Secondly, If I do a select * from sys.[dm_os_nodes] Then I get foreign_commited_KB as 5,414,260 or 5 GB, from the information above, that would sound like it is committed out of the other node which (is bad?)

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

So my question is: Is there any validity to my analysis?

No, not really. There absolutely is a (very small) performance penalty when doing cross-NUMA memory access. But that isn’t the cause of your problems. The "performance penalty" is going to be a small percentage, not a massive 170x increase.

If you are looking at a slow query, you should start by reviewing that query for performance issues. Indexing & good old-fashioned performance tuning might get you a 170x improvement, but NUMA configurations won’t get you very far.

Secondly, If I do a select * from sys.[dm_os_nodes] Then I get foreign_commited_KB as 5,414,260 or 5 GB, from the information above, that would sound like it is committed out of the other node which (is bad?)

The sys.dm_os_nodes DMV is proof that SQL Server is NUMA-aware. SQL Server knows what’s going on with the server you’re running it on, and it will handle things itself.

VMware has created a white paper related to configuration considerations for running SQL Server on VMs. Note that Section 3.6 on pages 27-39 discusses NUMA in depth.

That said–it sounds like your problem is that a query is slow. I don’t see any evidence in your question that points to NUMA being the reason behind it. Your best bet would be to start at the beginning with basic performance tuning of the query. The comment from Erik is good advice on how this community has found to be an effective way for us to help you. If, after going through query-tuning steps, you are not able to improve performance, you can post a new question following Erik’s suggested guidelines:

If you want help with the view, please follow this advice: How To Get Answers To SQL Server Performance Questions.

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply