Monitoring replication on MySQL

I have managed to set up master and slave replication.

It is working fine. What could cause it to go down?

Is there any alerting tool to monitor that?

Another thing: Can I run a separate db in my replication db which I just run for testing purposes?

Method 1

Replication can break or misbehave in all sorts of fun and exciting ways. You need to monitor for three things:

1. Replication is running and has not stopped due to error

To check whether replication is running, programmatically run SHOW SLAVE STATUS and look at the values of Slave_IO_Running and Slave_SQL_Running. Both should be “Yes”. pmp-check-mysql-replication-running from the Percona Monitoring Plugins for Nagios is written for this task.
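As a rough illustration of that check, here is a minimal Python sketch. It assumes you have already captured the vertical (\G) output of SHOW SLAVE STATUS as text, for example from the mysql command-line client. The function names and the sample output are made up for this example and are not part of any Percona tool.

```python
def parse_slave_status(text: str) -> dict:
    """Parse the vertical (\\G) output of SHOW SLAVE STATUS into a dict."""
    status = {}
    for line in text.splitlines():
        # Field lines look like "   Slave_IO_Running: Yes".
        # The "*** 1. row ***" separator has no colon and is skipped.
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def replication_running(status: dict) -> bool:
    """True only when both the IO thread and the SQL thread report Yes."""
    return (status.get("Slave_IO_Running") == "Yes"
            and status.get("Slave_SQL_Running") == "Yes")


# Hypothetical sample: the SQL thread has stopped, e.g. after an error.
SAMPLE = """\
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
        Seconds_Behind_Master: NULL
"""

status = parse_slave_status(SAMPLE)
print(replication_running(status))  # False
```

A monitoring script would typically alert whenever this returns False. Treating a missing field as "not running" is a deliberate choice here, so that an empty result (replication not configured at all) also raises an alert.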

2. Replication is performing well (slave lag behind master is within an acceptable range)

You need to make sure that the slave has not fallen too far behind the master. “Too far” is determined by what your application can tolerate and by how many binary logs you keep on the master server. Because replication on the slave is traditionally single-threaded (newer MySQL versions can apply changes in parallel), slaves can easily lag behind. SHOW SLAVE STATUS has the Seconds_Behind_Master value, but it is not a reliable indicator of actual lag and will frequently jump around. To measure replication lag accurately, you need an external application that periodically inserts a timestamp into a table on the master. You can then read that value on the slave and compare it against the current time to get the actual replication delay. pt-heartbeat is a daemon that inserts such a heartbeat into a table on your server. You can then alert on that value with pmp-check-mysql-replication-delay to make sure it stays within your specified parameters.
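The heartbeat comparison itself is simple arithmetic. The sketch below assumes you have already read the replicated heartbeat timestamp on the slave and converted it to a timezone-aware datetime. The function names and the 60 and 300 second thresholds are illustrative assumptions, not defaults of pt-heartbeat or the Nagios plugin.

```python
from datetime import datetime, timezone


def replication_lag_seconds(heartbeat_ts: datetime, now: datetime) -> float:
    """Lag = current time minus the last heartbeat seen on the slave."""
    # Clamp at zero so small clock differences never yield negative lag.
    return max(0.0, (now - heartbeat_ts).total_seconds())


def classify_lag(lag: float, warn: float = 60.0, crit: float = 300.0) -> str:
    """Map a lag value to a Nagios-style state (thresholds are assumptions)."""
    if lag >= crit:
        return "CRITICAL"
    if lag >= warn:
        return "WARNING"
    return "OK"


# Hypothetical values: heartbeat written at 12:00:00, checked at 12:02:30.
heartbeat = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 12, 2, 30, tzinfo=timezone.utc)

lag = replication_lag_seconds(heartbeat, now)
print(lag, classify_lag(lag))  # 150.0 WARNING
```

This only works if the clocks on the master and the slave are synchronized, since the timestamp is written on one machine and compared on another.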

3. The data on the servers is in sync.

There are many ways that a master and slave can get out of sync so that the data differs. You need to detect those differences and correct them periodically because a small difference can, over time, turn into a very large difference, especially with statement-based replication. This is no small task, and pt-table-checksum is designed to calculate these differences. Run this weekly. pmp-check-pt-table-checksum is a Nagios plugin to alert when the slave has data discrepancies relative to the master. To actually fix the differences, use pt-table-sync.
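To make the idea of a checksum discrepancy concrete, here is a small Python sketch of the comparison. It assumes rows shaped like the checksum table that pt-table-checksum maintains, with db, tbl, this_crc, master_crc, this_cnt and master_cnt columns as I understand them from the tool's documentation. The function name and the sample rows are invented for illustration.

```python
def out_of_sync_tables(rows):
    """Return sorted (db, tbl) pairs having at least one differing chunk."""
    bad = set()
    for r in rows:
        # A chunk differs if its checksum or its row count differs from
        # what the master computed. A None on only one side also counts.
        crc_differs = r["this_crc"] != r["master_crc"]
        cnt_differs = r["this_cnt"] != r["master_cnt"]
        if crc_differs or cnt_differs:
            bad.add((r["db"], r["tbl"]))
    return sorted(bad)


# Hypothetical rows as they might be read from the checksum table on a slave.
rows = [
    {"db": "shop", "tbl": "users", "this_crc": "a1b2", "master_crc": "a1b2",
     "this_cnt": 100, "master_cnt": 100},
    {"db": "shop", "tbl": "orders", "this_crc": "ffff", "master_crc": "0000",
     "this_cnt": 50, "master_cnt": 50},
    {"db": "shop", "tbl": "orders", "this_crc": "c3d4", "master_crc": "c3d4",
     "this_cnt": 49, "master_cnt": 50},
]

print(out_of_sync_tables(rows))  # [('shop', 'orders')]
```

In practice you would let pmp-check-pt-table-checksum do this for you; the sketch only shows what "out of sync" means at the chunk level.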

pt-table-checksum has been recently rewritten and is pretty easy to use. pt-table-sync has a lot of options and can be confusing. Read the documentation for these thoroughly as you can really shoot yourself in the foot if you aren’t careful. Here is a webinar about these tools.

Another thing: Can I run a separate db in my replication db which I just run for testing purposes?

There is nothing preventing you from modifying (or supplementing) the data on the slave, though generally I would recommend against it. Best practice is to run the slave with read_only=1. However, real life tends to trump best practices, and slaves are often used as reporting servers. My suggestion would be to define very clear access privileges for anyone who modifies data on the slave and to keep all additional tables in a separate schema.

Method 2

  1. Replication can ‘go down’ for several reasons. The main one is that the slave hits an SQL error while applying a statement that was executed on the master (e.g. updating a row that exists on the master but doesn’t exist on the slave). Another cause is a difference in variable settings between master and slave, such as max_allowed_packet. All in all, replication is a solid feature.

  2. I use Server Density to monitor replication (among other parameters on the server). It can monitor whether replication is running, how many seconds the slave is behind the master, and lots of other server parameters (CPU, memory). They have a very clear web app and an iPhone app, and they can send push notifications when things go south. The best thing is that integration takes about 5 minutes.

  3. As for the separate db, I didn’t understand what you are trying to achieve there.

Hope this helps,

R

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.