All we need is an easy explanation of the problem, so here it is.
How to measure the delay between secondary node to catch up with the primary node?
I am using red gate to monitor the databases and this is what mentioned there
Guideline values: To estimate how long a secondary replica will take to catch up with the primary replica, divide the log send queue by the rate of log bytes received.
Based on this graph can you advise me how can I measure this delay ? it is not clear for me how to divide these 2 values
If there is another way to measure the delay I will appreciate to share
How to solve :
I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.
I’m not experienced with RedGate’s monitoring tools, but SSMS has an Availability Group Dashboard out-of-the-box that I’ve always found particularly helpful with basic metrics like this. Specifically the metric in the dashboard that would be helpful to you is called the Estimated Recovery Time (seconds):
Indicates the time in seconds it takes to redo the catch-up time. The catch-up time is the time it will take for the secondary replica to catch up with the primary replica. This value is hidden by default.
I also found the Redo Queue Size (KB):
Indicates the number of log records in the log files of the secondary replica that have not yet been redone. This value is hidden by default.
And the Redo Rate (KB/sec) particularly helpful in monitoring the health of my secondary replicas:
Indicates the rate in KB per second at which the log records are being redone. This value is hidden by default.
If you want to access this information more natively so you can collect the metrics over time, the above Availability Group Dashboard just leverages the following DMVs:
You can find more information on querying the DMVs for this information in SQLPerformance’s Monitoring Availability Group Replica Synchronization. This is the query from the article (with one additional calculated column to get you the Estimated Recovery Time):
SELECT ar.replica_server_name, adc.database_name, ag.name AS ag_name, drs.is_local, drs.is_primary_replica, drs.synchronization_state_desc, drs.is_commit_participant, drs.synchronization_health_desc, drs.recovery_lsn, drs.truncation_lsn, drs.last_sent_lsn, drs.last_sent_time, drs.last_received_lsn, drs.last_received_time, drs.last_hardened_lsn, drs.last_hardened_time, drs.last_redone_lsn, drs.last_redone_time, drs.log_send_queue_size, drs.log_send_rate, drs.redo_queue_size, drs.redo_rate, drs.redo_queue_size / see.redo_rate AS EstimatedRecoveryTime -- Additional helpful calculated column drs.filestream_send_rate, drs.end_of_log_lsn, drs.last_commit_lsn, drs.last_commit_time FROM sys.dm_hadr_database_replica_states AS drs INNER JOIN sys.availability_databases_cluster AS adc ON drs.group_id = adc.group_id AND drs.group_database_id = adc.group_database_id INNER JOIN sys.availability_groups AS ag ON ag.group_id = drs.group_id INNER JOIN sys.availability_replicas AS ar ON drs.group_id = ar.group_id AND drs.replica_id = ar.replica_id ORDER BY ag.name, ar.replica_server_name, adc.database_name;
You can create a SQL job to routinely log this information to a table so you have a historical comparison to achieve your goals with. (This is essentially what RedGate’s Performance Monitor is likely doing under the hood.)
The above article also briefly mentions a third way to monitor these metrics via Perfmon Counters.
And finally a third and equally effective way to log your Availability Groups health metrics is via the built-in Extended Events.
Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂