All we need is an easy explanation of the problem, so here it is.
Since a couple of days my mariadb database randomly becomes unresponsive after random intervals (which have been between 4 and 20 hours).
The database comes back to life when rebooting the ct.
And after some time these queries start to fail with
MySqlException (0x80004005): The Command Timeout expired before the operation completed.
I have also noticed that sometimes restarting the mariadb-service also gets stuck.
Below is all information I have on the issue:
Ubuntu: Ubuntu 22.04 LTS
Fresh install (vm) importing old data (only copying the database
fivem) using mysqldump
mysqlcheck --all-databases --auto-repair(while mysql is working. When it is in the unresponsive state, there’s no output.
This is a screenshot of all information on the proxmox CT: https://i.imgur.com/nR6n3nt.png
Data and structure
It happens in a database called fivem which has multiple tables. Notably only the bigger ones are stuck (meaning that
select * from TABLE_NAME will not finish for a long time).
I have done these modification to the default config:
skip-name-resolve max_allowed_packet = 1G max_connections = 100000 table_cache = 640 thread_handling = pool-of-threads innodb_buffer_pool_size = 48103633715
SHOW ENGINE INNODB STATUS; was also unresponsive.
Shows multiple queries on mostly the states
sending data (some
Hastebin-Link to full processlist: https://www.toptal.com/developers/hastebin/sapajapohi.sql
Shows a lot of
Got an error reading communication packets and
(This connection closed normally without authentication.
Note that these happen while mysql is operational too.
Some examples: https://www.toptal.com/developers/hastebin/rewozofana.yaml
Output while starting: https://www.toptal.com/developers/hastebin/uracoqutab.yaml
Restarting the mariadb service also did not work. The last warning it shows are: https://www.toptal.com/developers/hastebin/usewerugut.less
If anyone has ideas of what the issue might be or just ideas on how to further debug this I would really appreciate it.
Edit / Updates
- max_connections has been lowered to 250 but will need to be increased
- innodb_buffer has also been lowered
- It has become apparent that shortly after (or right before) the unresponsivness a lot of Disk IO is observed (20x the usual amount)
- mysql status shows some warnings the quickly increase in size (https://i.imgur.com/TxMxC35.png). I am not sure if these could be a cause and how to fix them for now
- I noticed that restarting the service (mariadb) occasionally got it stuck at
Waiting to flush the buffer poolwith the last messages being
May 19 08:11:06 database systemd: Stopping MariaDB 10.6.7 database server... May 19 08:11:06 database mariadbd: 2022-05-19 8:11:06 0 [Note] InnoDB: FTS optimize thread exiting. May 19 08:11:06 database mariadbd: 2022-05-19 8:11:06 0 [Note] InnoDB: Starting shutdown... May 19 08:11:06 database mariadbd: 2022-05-19 8:11:06 0 [Note] InnoDB: Dumping buffer pool(s) to /var/lib/mysql/ib_buffer_pool May 19 08:11:06 database mariadbd: 2022-05-19 8:11:06 0 [Note] InnoDB: Buffer pool(s) dump completed at 220519 8:11:06
How to solve :
I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.
I have probably found the cause and it’s a weird edge case.
I have the cts (from proxmox) running on lvm instead of lvm-thin. And had a backup job using snapshot for the database every 4 hours. My guess is that since snapshot was unavailable it froze the ct for the backup and somehow mariadb did not like that.
I have now changed that to a backup once a day with the shutdown option.
Turns out it is a known issue on proxmox+mariadb
Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂