Can someone help me figure out whether my understanding of the following is correct?
On the AG Dashboard for my readable secondary replica I see:
- Estimated recovery time (secs) – 4598
- REDO queue – almost 24 GB
What exactly does this mean if my secondary replica goes through a node failover or a SQL Server restart as part of a planned activity?
Does it mean my secondary will take 4598 seconds to bring this database online with a 24 GB redo queue?
I am concerned because one of our production secondaries spends most of the daytime with a redo queue around 400 GB and an estimated recovery time of almost 10 hours on the AG dashboard. Does that mean our so-called DR is compromised?
I just ran a test failover and, as expected, the database went into recovery (per the error log messages); it completed in 1235 seconds.
I'm just curious why the estimated recovery time was so far off. I need this to explain to my business users what outage window we are talking about.
Answer:
Estimated Recovery Time is how long SQL Server thinks it will take to run the recovery process required to bring the database into a read-write, usable state. 4,598 seconds is a LONG TIME. You should be concerned.
The REDO queue is the amount of log data that still needs to be replayed into the secondary database before it can come online as the primary. 24 GB is a lot.
What are the company's Recovery Point Objective (RPO) and Recovery Time Objective (RTO)? Those two metrics will tell you whether that Estimated Recovery Time and REDO queue are actually a problem.
From Microsoft’s Docs:
For a secondary database (DB_sec), calculation and display of its RTO is based on its redo_queue_size and redo_rate:
The formula to calculate RTO is:

RTO of DB_sec = redo_queue_size / redo_rate
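Worked against the numbers in the question, the formula is a simple division. A minimal sketch (the redo rate here is not a measured value; it is back-derived from the dashboard's own estimate, and `rto_seconds` is an illustrative helper, not a SQL Server function):

```python
# Sanity-check the AG dashboard's RTO estimate with
# RTO = redo_queue_size / redo_rate.

KB_PER_GB = 1024 * 1024  # the DMV columns report sizes/rates in KB

redo_queue_size_kb = 24 * KB_PER_GB  # ~24 GB redo queue from the dashboard
estimated_rto_sec = 4598             # the dashboard's estimate

# Implied redo rate the dashboard must have sampled (~5,473 KB/s):
redo_rate_kb_per_sec = redo_queue_size_kb / estimated_rto_sec
print(f"implied redo rate: {redo_rate_kb_per_sec:,.0f} KB/s")

def rto_seconds(queue_kb: float, rate_kb_per_sec: float) -> float:
    """Estimated recovery time, given a queue size and a measured redo rate."""
    return queue_kb / rate_kb_per_sec

print(f"RTO: {rto_seconds(redo_queue_size_kb, redo_rate_kb_per_sec):,.0f} s")
```

On a live system the two inputs come from `sys.dm_hadr_database_replica_states` on the secondary (the `redo_queue_size` and `redo_rate` columns), which is what the dashboard reads.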
Clearly, the redo rate (the speed at which recovery can take place) is a defining factor in how fast a secondary can be brought online as a primary.
If the speed of the underlying disk fluctuates, as is likely with lower-quality HDDs or cloud storage, you may well get an estimate that does not reflect reality.
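That sensitivity is enough to explain the gap between the 4598-second dashboard estimate and the observed 1235-second failover: the same queue divided by a different sampled redo rate gives a very different number. A small illustration (both rates are assumptions, chosen only to roughly bracket the numbers in the question):

```python
# How a fluctuating redo rate moves the RTO estimate around.
# Rates below are illustrative, not measured values.

redo_queue_kb = 24 * 1024 * 1024  # same 24 GB queue as in the question

for label, rate_kb_s in [("slow sampled rate", 5_500),    # ~4,600 s estimate
                         ("fast sampled rate", 20_000)]:  # ~1,260 s estimate
    print(f"{label}: {redo_queue_kb / rate_kb_s:,.0f} s estimated")
```

The estimate is only as good as the redo rate sampled at the moment you look at the dashboard, which is why a test failover under real load is a far better guide for the outage window you quote to the business.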