High availability primary does not automatically take secondary role when back online

Setup:

  • Basic high availability

  • 2 replicas (1 primary, 1 secondary).

    DB01 => initial primary.

    DB02 => initial secondary

  • Synchronous commit on both

  • Both are in synchronized state

  • There is no listener configured

  • Cluster type: NONE

We stop the SQL Server service on DB01 (the initial and current primary) using services.msc, simulating a "friendly" server crash, and then initiate a forced failover on DB02 (the initial and current secondary) using:

ALTER AVAILABILITY GROUP [TestHA] FORCE_FAILOVER_ALLOW_DATA_LOSS;

The secondary database comes online, which is what we want.

However, when the SQL Server service on DB01 is started again using services.msc, the database on DB01 assumes the primary role again.

So there are now two instances that are both readable/writable and out of sync. We expected the initial primary to detect that the secondary had taken over the primary role and either assume the secondary role itself or at least become inaccessible, so that apps cannot work on old data.

The same procedure with the deprecated database mirroring setup does behave this way.

How to solve:

Method 1

Since this is a clusterless (read-scale) availability group, there is nothing automatically coordinating which role each node is in – that process is completely manual.

This is why the former primary comes back up as the primary – nothing has told it to change its role.
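
One way to see this for yourself is to ask each instance which role it believes it holds. The query below is a minimal sketch using the standard availability group catalog views and DMVs; run it separately on DB01 and DB02. After the scenario described in the question, both instances would be expected to report PRIMARY.

```sql
-- Run on each instance (DB01 and DB02) to see the role the local replica reports.
SELECT ag.name                 AS ag_name,
       ar.replica_server_name,
       ars.role_desc,            -- PRIMARY, SECONDARY or RESOLVING
       ars.connected_state_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = ars.replica_id
JOIN sys.availability_groups AS ag
    ON ag.group_id = ars.group_id
WHERE ars.is_local = 1;          -- only the replica hosted by this instance
```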

You’ll want to follow the instructions outlined here:

Fail over the primary replica on a read-scale availability group – Forced manual failover with data loss

…if the original primary replica recovers after failover, it will assume the primary role. To avoid having each replica be in a different state, remove the original primary from the availability group after a forced failover with data loss. Once the original primary comes back online, remove the availability group from it entirely.
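
Applied to the names from the question (availability group TestHA, replicas DB01 and DB02), the two removal steps in that quote could look roughly like the sketch below. It assumes the replica is registered under the server name DB01; check replica_server_name in sys.availability_replicas if in doubt. If DB01 holds changes that never reached DB02, consider backing them up before dropping the group there.

```sql
-- Step 1: run on DB02 (the new primary) right after the forced failover.
-- Removes the original primary from the availability group definition.
ALTER AVAILABILITY GROUP [TestHA] REMOVE REPLICA ON N'DB01';

-- Step 2: run on DB01 once its SQL Server service is back online.
-- Removes the stale copy of the availability group from the old primary.
DROP AVAILABILITY GROUP [TestHA];
```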

In the end, you can add that former primary back as a secondary manually:

  1. (Optional) If desired, you can now add N1 back as a new secondary replica to the availability group AGRScale.
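
As a rough sketch of that optional step with the names from the question: the endpoint URL below is a placeholder, and the choice of automatic seeding is an assumption, not something stated in the original post. Automatic seeding generally expects that the database does not already exist on the target, so the old copy on DB01 would have to be removed first. With manual seeding you would instead restore the database WITH NORECOVERY and join it to the group.

```sql
-- Run on DB02 (the current primary).
-- ENDPOINT_URL is a hypothetical value; use DB01's real mirroring endpoint.
ALTER AVAILABILITY GROUP [TestHA]
ADD REPLICA ON N'DB01'
WITH (
    ENDPOINT_URL      = N'TCP://DB01.example.com:5022',
    AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
    FAILOVER_MODE     = MANUAL,
    SEEDING_MODE      = AUTOMATIC
);

-- Run on DB01 (rejoining as a secondary).
ALTER AVAILABILITY GROUP [TestHA] JOIN WITH (CLUSTER_TYPE = NONE);

-- Needed for automatic seeding so the group can create the database on DB01.
ALTER AVAILABILITY GROUP [TestHA] GRANT CREATE ANY DATABASE;
```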

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
