At the company I work for, we use a simple Primary-Secondary replication setup. Should the Primary break down for whatever reason, we do the switch manually. This also means that MySQL hardly ever gets updated. I want to make it possible to update the servers without downtime. For various reasons we do not want an (over-)complicated solution, so I'm wondering whether achieving my goal can be as simple as this:
- A Primary-Primary replication with GTID enabled and semi-synchronous replication.
- Pacemaker to switch a virtual IP from one server to the other, so I can stop one server to update it, then switch back and update the other server.
For the Primary-Primary replication I have not configured auto_increment_increment and auto_increment_offset differently on the two servers. All write processes would use the Pacemaker virtual IP and would therefore write to only one host at a time.
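To make it concrete, here is a sketch of the settings I have in mind (MySQL 8.0 SET PERSIST syntax; the semi-sync plugin and variable names are the pre-8.0.26 ones):

```sql
-- On both servers: GTID-based replication.
SET PERSIST enforce_gtid_consistency = ON;
SET PERSIST gtid_mode = OFF_PERMISSIVE;  -- on a running server, gtid_mode
SET PERSIST gtid_mode = ON_PERMISSIVE;   -- must be changed one step
SET PERSIST gtid_mode = ON;              -- at a time

-- Semi-synchronous replication (the plugins must be installed first):
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET PERSIST rpl_semi_sync_master_enabled = ON;  -- takes effect when primary
SET PERSIST rpl_semi_sync_slave_enabled = ON;   -- takes effect when replica

-- auto_increment_increment / auto_increment_offset stay at their defaults,
-- since all writes arrive through the single Pacemaker VIP anyway.
```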
How to solve:
We do this at my company, where we run thousands of MySQL master-master pairs that use this method. We developed a service that performs the switch you describe. We call it a "successover", because that has a more positive connotation than "failover". We can and do run it at any time of day, without even notifying the app team.
Here are the general steps; a SQL sketch of the database side follows the list.
"Master1" is the current writeable master MySQL instance, where the VIP is defined.
"Master2" is a replica, in read-only mode.
- Make Master1 read-only.
- Remove VIP from Master1.
- Start killing any outstanding queries on Master1, and keep killing queries in a loop in case the apps try to run new ones. We shouldn't rely on the prior steps terminating queries or connections; for example, clients might have bypassed the VIP and connected directly to the physical IP address.
- Wait for the replication lag on Master2 to reach 0. Ideally, compare the slave status on Master2 to the master status on Master1. If the failover occurred because Master1 is down, you can't do that comparison, but you can check whether the slave status on Master2 shows that the SQL thread has caught up to the IO thread.
- Make Master2 writeable.
- Add VIP to Master2.
- Stop killing queries on Master1.
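In MySQL terms, the database side of those steps might look like the following sketch (classic pre-8.0.22 command and variable names; the VIP moves of steps 2 and 7 and the kill loop of step 3 live in an external script, and the excluded account names below are placeholders):

```sql
-- Step 1, on Master1: stop accepting writes. Enabling super_read_only
-- also enables read_only and blocks even privileged users from writing.
SET GLOBAL super_read_only = ON;

-- Step 3, on Master1: generate KILL statements for remaining client
-- connections; the external script runs this in a loop.
SELECT CONCAT('KILL ', id, ';')
  FROM information_schema.processlist
 WHERE user NOT IN ('system user', 'repl', 'monitor');

-- Step 4, on Master2: wait until replication has caught up. Proceed when
-- Seconds_Behind_Master = 0; if Master1 is down, check instead that
-- Exec_Master_Log_Pos has reached Read_Master_Log_Pos (SQL thread caught
-- up to the IO thread).
SHOW SLAVE STATUS\G

-- Step 5, on Master2: start accepting writes. Disabling read_only also
-- disables super_read_only.
SET GLOBAL read_only = OFF;
```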
When it’s automated, this results in less than a second of downtime, provided there is no significant replication lag. Moving a VIP is a lot faster than updating DNS.
Then the apps must reconnect to the VIP, and thus get access to the new primary instance and re-run any queries they need.
Obviously apps should not use the standby MySQL instance, because the point is to allow it to be taken offline for updates, config changes, etc. This also gives you a quick way to respond if there’s a problem with the host server that MySQL is running on. We get a few host crashes or failed disks per week, so we do this switch quickly to make sure the app can continue.
It's unavoidable that this causes a brief "blip" where the apps have their connections dropped and have to reconnect, but it's a briefer interruption than with any other solution we know of. Still, apps have to be designed to detect a dropped connection and reconnect, and our biggest issue is educating the app developers to do this. They keep complaining that they get alerted for a lost connection, and we tell them: "We've already documented what you need to do; it shouldn't alert for a single blip."
This system has been working for several years, but we now have a mixed environment where we have many cloud MySQL instances. We need a new solution, because we can’t create VIPs using BGP in the cloud.
So we have prototyped using Envoy as a proxy in front of MySQL. We believe this works, but we need to develop a service that notifies the Envoy proxy when we do a successover. Envoy supports dynamic configuration over gRPC (its xDS APIs), so we can send it a message at runtime and it will start routing traffic to a different target MySQL instance. This proxy-based solution should work identically in the cloud and in the datacenter, but it's probably more work than you had in mind.
We could also use ProxySQL to do something similar, but Envoy already has a lot of adoption within our company for service-to-service traffic, so if we can leverage it instead of introducing a new type of proxy, we have one less piece of technology to run.
Update re comments:
Step 4 in the above list waits for replication to catch up while both instances are read-only. So there are no new updates allowed on Master1 during that time, and Master2 only has to execute a set of remaining updates that were committed before Master1 was set to read-only. Hopefully, this is a very brief wait, unless Master2 was already lagging behind by a lot.
The service that runs the successover refuses to even begin the operation if it detects that Master2 has high replication lag. The user is encouraged to try again later.
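That pre-flight check can be as simple as the following sketch (the exact threshold would be a tunable, not a literal value from our service):

```sql
-- Hypothetical pre-flight check, run on Master2 before the switch begins.
SHOW SLAVE STATUS\G
-- Refuse to start if Seconds_Behind_Master is NULL (replication stopped or
-- broken) or above a small threshold, and ask the user to retry later.
```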
Of course, in the real world sometimes you have to override that sort of restriction because of urgent circumstances, so there’s an option to force the successover. But this comes with a risk, because if you make Master2 the primary and start allowing new queries directly on it, while there are still outstanding updates to process because of replication lag, you end up in a split-brain situation: You could make an update which will then be replaced by an event from the binary log that actually occurred in the past.
So in theory you could enable the VIP on Master2, but leave Master2 read-only until it catches up fully with respect to replication. This at least performs part of the successover and allows clients to read data, but not update, temporarily. This might be acceptable for some apps for a short time, but it depends on the app’s requirements.
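On Master2, that variant would amount to something like this sketch:

```sql
-- Hypothetical partial successover: the VIP already points at Master2, but
-- writes stay blocked until replication has fully caught up.
SHOW SLAVE STATUS\G           -- loop until Seconds_Behind_Master reaches 0
SET GLOBAL read_only = OFF;   -- only then open for writes
                              -- (this also clears super_read_only)
```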
In practice, our implementation of successover doesn’t do this temporary read-only mode. We just try to be very reluctant to use the forced-successover option, because a split-brain is extremely difficult to clean up (it may not be possible). We’d rather try the successover when there is no replication lag.