Stuck REDO Thread of an availability group

All we need is an easy explanation of the problem, so here it is.

We are facing issues with REDO queue fluctuating as we have readable secondary configuration.

Based on my understanding in newer versions of SQL Server, redo threads have been made parallel. Current version is SQL Server 2017 running on VM with 64 cores.

I am seeing problems when select queries on the secondary run on the same cpu/scheduler where AG threads are running. I am assuming AG thread yields its scheduler and takes a back seat to let other queries run. And I believe this is the problem because queries coming in seem to be hitting same cpu_id where AG threads are doing redo work.

Example below: spid 91 is select query on cpu_id 6 and at same time spid 123 and 144 running threads for AG’s as found from sys.dm_exec_requests dmv. REDO queue starts building up when 91 came up.

Stuck REDO Thread of an availability group

I am not aware if there is any process which makes your queries go to different cores rather than one where AGs thread might be running.

  1. Can we control this to make queries hit other schedulers and not one
    where AG working?

  2. I see lots of waits on PARALLEL_REDO_TRAN_TURN when queries running on secondary while AG threads are suspended. Can trace flag help as mentioned here

  3. There might be some many other schedulers available. I am not able to understand why certain process is stuck in slot of Numa. I guess newer versions have all soft numa enabled so that makes bucket of 8 soft numas. 8*8 schedulers for 64 available. Is disabling soft numa a good idea here?

Database isolation is RCSI but I don’t think that matters here because internally it’s changed to snapshot based on design of readable secondaries.

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

While this might not be the answer you are looking for as it requires a migration in the end, you could try running your workload on SQL Server 2019 to be able to use "SQL Server 2019 Intelligent Performance -Worker Migration" or also commonly called Worker Stealing.

Worker migration (AKA “worker stealing”) allows an idle SOS scheduler
to migrate a worker from the runnable queue of another scheduler on
the same NUMA node and immediately resume the task of the migrated
worker. This enhancement provides more balanced CPU usage and reduces
the amount of time long-running tasks spend in the runnable queue.

A long-running task that is enabled for worker migration is no longer
bound to a fixed scheduler. Instead, it will frequently move across
schedulers within the same NUMA node which naturally results in less
loaded schedulers. Together with the existing load factor mechanism,
worker migration provides SQL Server with an enriched solution for
balanced CPU usage.

Source

Parallel Redo is eligible for worker migration as noted in the same source:

In SQL Server 2019, workers associated with availability group
parallel redo tasks are enabled for worker migration to address a
commonly observed scheduler contention issue among redo tasks on
secondary replicas.

When migrating to SQL Server 2019 this feature is enabled by default.

Addition

I see lots of waits on PARALLEL_REDO_TRAN_TURN when queries running on
secondary while AG threads are suspended. Can trace flag 3459 help?

That trace flag can provide benefit if the queries running on the secondary cannot be adapted ( for example long running queries on the secondary reduced in time). But there could be other reasons such as frequent page splits.

One way to see if running the redo thread serial is beneficial for your set up is looking into the PARALLEL_REDO_TRAN_TURN wait stats but you should always compare these to other information such as redo queue size increase/decrease on the secondary .

Be mindful about enabling this traceflag as redo performance can suffer from going serial instead of parallel.

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply