Fix MySQL replication broken after a crash caused by underlying storage corruption

We have a critical database server running on two virtual machines. One virtual machine crashed because of an underlying storage problem, and we had to move it to a different storage medium and recover the XFS file systems.

After starting the crashed VM, we noticed that MySQL replication is broken and our applications are not running properly.

From server_01 (slave)

mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: x.x.x.10
                  Master_User: slave
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.006822
          Read_Master_Log_Pos: 484378856
               Relay_Log_File: mysqld-server-relay-bin.015091
                Relay_Log_Pos: 404689852
        Relay_Master_Log_File: mysql-bin.006822
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB: mysql
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 484378856
              Relay_Log_Space: 404690059
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Client requested master to start replication from impossible position; the first event 'mysql-bin.006822' at 484378856, the last event read from '/var/log/mysql/mysql-bin.006822' at 4, the last byte read from '/var/log/mysql/mysql-bin.006822' at 4.'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 2
1 row in set (0.00 sec)

From server_02 (master, crashed and recovered)

mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: x.x.x.11
                  Master_User: slave
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.011022
          Read_Master_Log_Pos: 480910234
               Relay_Log_File: mysqld-server-relay-bin.003852
                Relay_Log_Pos: 162009
        Relay_Master_Log_File: mysql-bin.011022
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB: mysql
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 480909976
              Relay_Log_Space: 0
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 0
1 row in set (0.00 sec)

Comparing the two outputs, I can see that the Relay_Log_Pos on the slave is ahead of the one on the master. Does this mean the slave has newer data than the master?

I was reading about recovering from this with the options below, but I’m not sure this is the right thing to do.

STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;

If the slave is ahead of the master, can I make the slave the master DB and replicate the other way? Would I lose any data with this method?

Or what is the best way to recover from this with minimum data loss? We don’t have any backups at the moment.

How to solve:

Method 1

You don’t mention which version of MySQL you have or what type of replication you are using, so I am guessing from the output that it is 5.7 or earlier and that you’re not using GTIDs.

In my experience this error normally occurs following an ‘unexpected shutdown’ of the master server.

What happens is:

[normal operation]
  • Slave I/O Thread requests new data from the master and starts to receive this data
  • The master rotates its binary log to the next one informing the slave to start reading from the next binary log
  • Slave starts reading from the next binary log
[CRASH]
  • Master Crashes
  • Slave is disconnected
  • Master comes back up
  • Master starts a new binlog (as is normal for start up) in case of corruption in the old log
  • Slave reconnects and asks for the next position from the old binary log
  • This position doesn’t exist as the log has now been closed and the master moved on. The slave is trying to read from an “impossible position”
  • Replication stops

SOLUTION:
On the slave, run SHOW SLAVE STATUS and note the Relay_Master_Log_File value. This should be the binary log on the master that the slave was reading from when the crash occurred.

Check on the master: there should be a binary log with the corresponding name whose modification time matches the time of the crash.
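The same check can also be made from the MySQL client. The statements below are a minimal sketch, assuming the file name from the question's error message (mysql-bin.006822); the follow-on name mysql-bin.006823 is an assumption (crashed file number plus one), so use whatever name the master actually reports. A File_size for the old log that is smaller than the position the slave requested (484378856 in the question) would be consistent with the "impossible position" error.

```sql
-- Run on the master (server_02 in the question).

-- List every binary log the master knows about, with its size in bytes.
-- Compare the File_size of the log named in Relay_Master_Log_File
-- with the position the slave asked for.
SHOW BINARY LOGS;

-- Show the binary log the master is currently writing to.
SHOW MASTER STATUS;

-- Peek at the first events of the log that follows the crashed one.
-- 'mysql-bin.006823' is an assumed name; substitute the next file
-- listed by SHOW BINARY LOGS.
SHOW BINLOG EVENTS IN 'mysql-bin.006823' LIMIT 5;
```

The first row of SHOW BINLOG EVENTS should be a Format_desc event at position 4, which is the first event offset in any binary log (the first four bytes are the file header). If the master was restarted more than once, there may be several newer logs; the one to resume from is the first log created after the crash.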

On the slave, issue a CHANGE MASTER TO command to point it at the beginning of the next binary log.

For example, if it was reading from binlog.000001 when the crash occurred, start reading from the beginning of binlog.000002:

STOP SLAVE;
CHANGE MASTER TO MASTER_LOG_FILE = 'binlog.000002', MASTER_LOG_POS = 1;
START SLAVE;

Replication should now continue without errors.
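To confirm this, a quick check on the slave might look like the sketch below. The table name mydb.mytable is purely hypothetical and stands in for any table your application writes to; the checksum comparison is an optional extra step, not part of the original fix.

```sql
-- Run on the slave after START SLAVE.
SHOW SLAVE STATUS\G
-- Fields to look at in the output:
--   Slave_IO_Running:      Yes
--   Slave_SQL_Running:     Yes
--   Last_IO_Errno:         0
--   Seconds_Behind_Master: a number rather than NULL

-- Optional spot check: run on BOTH servers and compare the results.
-- mydb.mytable is a hypothetical name; replace it with a real table.
CHECKSUM TABLE mydb.mytable;
```

Checksums can differ briefly while writes are still being replicated, so compare them at a quiet moment. A difference that persists may indicate that some events around the time of the crash exist on one server but not the other, which is worth investigating before relying on the pair again.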

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
