As far as I understand,
BEFORE_ON_PRIMARY_FAILOVER behaves exactly like
EVENTUAL during normal operation. The difference appears after a failover: the former blocks all new transactions until the new primary has applied its backlog, while the latter allows transactions to run immediately.
I have tried both options in a lab environment. When replication lag is relatively high, it takes a very long time for the new primary to apply its backlog and open up to new transactions. This significantly affects HA, rendering the auto-failover feature of Group Replication almost meaningless.
So my question is: is it really beneficial to use
BEFORE_ON_PRIMARY_FAILOVER instead of
EVENTUAL? Is it really a "safer" option?
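For context, the behavior described above is controlled by the group_replication_consistency system variable (available since MySQL 8.0.14). A minimal sketch of switching between the two values under discussion:

```sql
-- Applies to new sessions; can also be set per session.
SET GLOBAL group_replication_consistency = 'BEFORE_ON_PRIMARY_FAILOVER';

-- Or, to allow transactions immediately after failover:
SET GLOBAL group_replication_consistency = 'EVENTUAL';
```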
Some info about my environment in case that matters:
MySQL 8.0.25, installed from the tarball distribution.
How to solve:
Nothing makes a DBA’s blood run cold quite like split-brain.
That’s the term for the scenario where you start writing updates to one of the nodes in a cluster while it is still applying updates replicated from one of the other nodes. So you have a high chance of committing an update to some row of data, but then it is overwritten by a replicated change that originally occurred earlier.
Then the transaction you just committed executes on the other node. But the change is based on an outdated version of data, so it has the wrong values. Now neither node has the "true" state of data.
Recovering from this kind of mistake is incredibly difficult. You’d have to reconstruct the correct sequence of changes from logs somehow, then restore the whole cluster from backup, then re-run the changes in the correct order. And of course the application owner says you must do it without downtime, so you can’t interrupt ongoing traffic or take the database offline to restore it.
It’s virtually impossible to do this, but you soon have managers all the way up the hierarchy yelling at you to fix it immediately.
This is why a stricter consistency level such as BEFORE_ON_PRIMARY_FAILOVER is worth it, even if it delays failover. It protects against split-brain, which is, frankly, more important than high throughput.
If you see frequent replication lag, switching to EVENTUAL is not the right fix. The real solution is either to reduce the query traffic or to increase the servers’ capacity, until the Group Replication cluster can keep up without frequent replication lag.
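To judge whether the cluster is keeping up, you can watch each member’s certification and applier queues (queue depths in transactions, not wall-clock lag):

```sql
-- Non-zero, growing applier queues indicate the member
-- cannot apply changes as fast as they arrive.
SELECT MEMBER_ID,
       COUNT_TRANSACTIONS_IN_QUEUE,
       COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE
FROM performance_schema.replication_group_member_stats;
```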
This is done in one of two ways:
Scale up: get faster CPUs, faster storage drives, more RAM.
Scale out: split the data over multiple clusters, and distribute data updates more or less evenly between the clusters, so each cluster only needs to handle a fraction of the traffic.
Eventually you may need to use both, because there’s only so far you can scale up.
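Before buying hardware, it can also be worth checking the applier configuration. A sketch for 8.0.25 (these variables were renamed to replica_* in later releases, and changing the worker count typically requires restarting replication; tune the values to your workload):

```sql
-- Let the applier run transactions in parallel based on
-- write-set dependency tracking on the source.
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_parallel_workers = 8;        -- example value
SET GLOBAL slave_preserve_commit_order = ON;
SET GLOBAL binlog_transaction_dependency_tracking = 'WRITESET';
```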