MySQL 5.6: Slave_IO thread stops working

Standard replication breaks for no apparent reason.

mysql> SELECT @@version, @@version_comment;
+---------------+----------------------------------------------------------------------------+
| @@version | @@version_comment |
+---------------+----------------------------------------------------------------------------+
| 5.6.15-56-log | Percona XtraDB Cluster (GPL), Release 25.5, Revision 759, wsrep_25.5.r4061 |
+---------------+----------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'wsrep_on';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_on | OFF |
+---------------+-------+
1 row in set (0.00 sec)

Crash-safe replication is enabled:

master_info_repository = TABLE
relay_log_info_repository = TABLE
relay_log_recovery = 1

The slave appears to be running fine:

# mysql -e "SHOW SLAVE STATUS\G" | grep "Slave"
               Slave_IO_State: Waiting for master to send event
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it

but the master shows no connected slaves, and the slave's IO thread disappears from the master's process list after some time:

mysql> SELECT * FROM information_schema.processlist WHERE command = 'Binlog Dump';
Empty set (0.10 sec)

mysql> SHOW SLAVE HOSTS;
Empty set (0.00 sec)

Master:

mysql> SHOW MASTER STATUS;
+------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------+
| mysql-bin.000003 | 568210 | | | |
+------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

Slave:

# mysql -e "SHOW SLAVE STATUS\G" | grep "Master_Log"
              Master_Log_File: mysql-bin.000003
          Read_Master_Log_Pos: 568210
        Relay_Master_Log_File: mysql-bin.000003
          Exec_Master_Log_Pos: 568210

Master:

mysql> CREATE DATABASE IF NOT EXISTS repl_test; SHOW MASTER STATUS;
Query OK, 1 row affected (0.00 sec)

+------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------+
| mysql-bin.000003 | 568333 | | | |
+------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

No changes on the slave:

# mysql -e "SHOW SLAVE STATUS\G" | grep "Master_Log"
    Master_Log_File: mysql-bin.000003
    Read_Master_Log_Pos: 568210
    Relay_Master_Log_File: mysql-bin.000003
    Exec_Master_Log_Pos: 568210

The slave picks up the changes after an IO thread restart:

# mysql -e "STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;"
# mysql -e "SHOW SLAVE STATUS\G" | grep "Master_Log"
    Master_Log_File: mysql-bin.000003
    Read_Master_Log_Pos: 568333
    Relay_Master_Log_File: mysql-bin.000003
    Exec_Master_Log_Pos: 568333
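
The manual comparison above can be scripted. The following is only a minimal sketch, assuming a POSIX shell; the helper names slave_field and slave_io_behind are hypothetical and do not come from the original post. It extracts a field from SHOW SLAVE STATUS\G text and reports whether the slave's read position differs from the master's coordinates.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of MySQL or the original post.

# Print the value of one field from `SHOW SLAVE STATUS\G` output read on stdin.
# The match is anchored after leading whitespace, so "Master_Log_File" does
# not also match "Relay_Master_Log_File".
slave_field() {
    sed -n "s/^[[:space:]]*$1: //p" | head -n 1
}

# Exit status 0 when the slave's I/O thread has not read up to the master's
# binlog coordinates, non-zero when the coordinates match.
#   $1, $2: File and Position from SHOW MASTER STATUS
#   $3, $4: Master_Log_File and Read_Master_Log_Pos from the slave
slave_io_behind() {
    [ "$1" != "$3" ] || [ "$2" -gt "$4" ]
}
```

On the slave, something like `mysql -e "SHOW SLAVE STATUS\G" | slave_field Read_Master_Log_Pos` would supply the fourth argument. With the values shown earlier, `slave_io_behind mysql-bin.000003 568333 mysql-bin.000003 568210` succeeds, which corresponds to the stalled state before the IO thread restart.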

Update: Fri Jul 18 14:15:59 BST 2014

mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: $IP
                  Master_User: repluser
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000007
          Read_Master_Log_Pos: 433
               Relay_Log_File: mysql-relay.000265
                Relay_Log_Pos: 283
        Relay_Master_Log_File: mysql-bin.000007
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 433
              Relay_Log_Space: 710
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File: /etc/mysql/ca-cert.pem
           Master_SSL_CA_Path: 
              Master_SSL_Cert: /etc/mysql/client-cert.pem
            Master_SSL_Cipher: 
               Master_SSL_Key: /etc/mysql/client-key.pem
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1124732721
                  Master_UUID: 4412a455-e1d0-11e3-835a-5254007fe78d
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: /etc/mysql/ca-cert.pem
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
                Auto_Position: 0
1 row in set (0.00 sec)

How to solve:

Method 1

What you’re experiencing can easily occur in a low-traffic situation, particularly if the two servers are separated by a firewall or other device that implements stateful packet inspection (this can happen, for example, within Amazon EC2/VPC). The intermediate networking hardware can “forget” about the TCP connection between the servers because, when there’s no data being replicated, the connection sits idle. The slave still believes the connection is open and keeps waiting, which is why it reports Slave_IO_Running: Yes while the master no longer lists a Binlog Dump thread. Enabling a replication heartbeat keeps the connection from going idle:

mysql> STOP SLAVE;
Query OK, 0 rows affected (0.09 sec)

mysql> CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 60;
Query OK, 0 rows affected (0.13 sec)

mysql> START SLAVE;
Query OK, 0 rows affected (0.00 sec)

When the slave connects to the master, it will request that the master inject a heartbeat message into the binlog stream every 60 seconds, but only when there have been no replication events for that amount of time — so it has no impact when there’s a lot of traffic, but when traffic is light, the heartbeat events will be sent, and the connection will stay alive.

Note that CHANGE MASTER TO is typically a disruptive command that can reset your replication configuration. In this case, if MASTER_HEARTBEAT_PERIOD is the only argument provided, the slave configuration does not get reset.

http://dev.mysql.com/doc/refman/5.6/en/change-master-to.html

Also consider setting the global variable slave_net_timeout to a value shorter than the default (3600 seconds in MySQL 5.6), but not less than twice the value you use for the master heartbeat period. This causes the slave to drop and retry the connection to the master if nothing, not even a heartbeat, arrives on the replication stream within the configured period.
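
The heartbeat and timeout settings described above can be generated together. This is a rough sketch only, assuming a POSIX shell; the function names min_net_timeout and replication_keepalive_sql are hypothetical, and choosing exactly twice the heartbeat period is just the lower bound mentioned above, not a tuned value.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of MySQL or the original post.

# Smallest slave_net_timeout (in seconds) that is still at least twice the
# heartbeat period given as $1 (whole seconds).
min_net_timeout() {
    echo $(( $1 * 2 ))
}

# Print the statements that set the heartbeat period and a matching
# slave_net_timeout. The slave is stopped first so that the new timeout
# applies when the replication threads start again.
replication_keepalive_sql() {
    period=$1
    timeout=$(min_net_timeout "$period")
    printf 'STOP SLAVE;\n'
    printf 'CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = %s;\n' "$period"
    printf 'SET GLOBAL slave_net_timeout = %s;\n' "$timeout"
    printf 'START SLAVE;\n'
}
```

Running `replication_keepalive_sql 60` prints the three statements from the transcript above plus `SET GLOBAL slave_net_timeout = 120;`, and the output could be passed to the client with `mysql -e "$(replication_keepalive_sql 60)"`. SET GLOBAL does not survive a server restart, so the same slave_net_timeout value would also need to go into the server configuration file. On MySQL 5.6 the slave's Slave_received_heartbeats status counter should then increase during idle periods.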

Method 2

Edit: After my first assumptions, my next bet was on network problems. Follow Michael's advice; it has my bet and my vote.

Your problem is probably the combination of wsrep_on = OFF and DDL. Strange things may happen to the binlog in that case (that is, statements may not be logged correctly). Make sure to enable wsrep_on even if you only have one node or you isolate it using Galera-aware commands.

Make sure you also activate log_slave_updates.

Are master and slave part of the same Galera cluster?

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
