All we need is an easy explanation of the problem, so here it is.
Since a couple of days my mariadb database randomly becomes unresponsive after random intervals (which have been between 4 and 20 hours).
The database comes back to life when rebooting the ct.
And after some time these queries start to fail with MySqlException (0x80004005): The Command Timeout expired before the operation completed
.
I have also noticed that sometimes restarting the mariadb-service also gets stuck.
Below is all information I have on the issue:
System Info
Ubuntu: Ubuntu 22.04 LTS
Mariadb: 10.6.7
Proxmox: 7.2-3
Fixing attemps
-
Fresh install (vm) importing old data (only copying the database
fivem
) using mysqldump -
mysqlcheck --all-databases --auto-repair
(while mysql is working. When it is in the unresponsive state, there’s no output.
Proxmox
This is a screenshot of all information on the proxmox CT: https://i.imgur.com/nR6n3nt.png
Data and structure
It happens in a database called fivem which has multiple tables. Notably only the bigger ones are stuck (meaning that select * from TABLE_NAME
will not finish for a long time).
Phpmyadmin table (content): https://i.imgur.com/cUdhvs1.png
Phpmyadmin table (size and "is hung"): https://i.imgur.com/C6gxn1N.png
MariaDB Config
I have done these modification to the default config:
skip-name-resolve
max_allowed_packet = 1G
max_connections = 100000
table_cache = 640
thread_handling = pool-of-threads
innodb_buffer_pool_size = 48103633715
INNODB
SHOW ENGINE INNODB STATUS;
was also unresponsive.
Processlist
Shows multiple queries on mostly the states commit
and sending data
(some commits
).
Hastebin-Link to full processlist: https://www.toptal.com/developers/hastebin/sapajapohi.sql
Syslog
Shows a lot of Got an error reading communication packets
and (This connection closed normally without authentication
.
Note that these happen while mysql is operational too.
Some examples: https://www.toptal.com/developers/hastebin/rewozofana.yaml
Mariadb service
Output while starting: https://www.toptal.com/developers/hastebin/uracoqutab.yaml
Restarting the mariadb service also did not work. The last warning it shows are: https://www.toptal.com/developers/hastebin/usewerugut.less
Conclusion
If anyone has ideas of what the issue might be or just ideas on how to further debug this I would really appreciate it.
Edit / Updates
- max_connections has been lowered to 250 but will need to be increased
- innodb_buffer has also been lowered
- It has become apparent that shortly after (or right before) the unresponsivness a lot of Disk IO is observed (20x the usual amount)
- mysql status shows some warnings the quickly increase in size (https://i.imgur.com/TxMxC35.png). I am not sure if these could be a cause and how to fix them for now
- I noticed that restarting the service (mariadb) occasionally got it stuck at
Waiting to flush the buffer pool
with the last messages being
May 19 08:11:06 database systemd[1]: Stopping MariaDB 10.6.7 database server...
May 19 08:11:06 database mariadbd[659]: 2022-05-19 8:11:06 0 [Note] InnoDB: FTS optimize thread exiting.
May 19 08:11:06 database mariadbd[659]: 2022-05-19 8:11:06 0 [Note] InnoDB: Starting shutdown...
May 19 08:11:06 database mariadbd[659]: 2022-05-19 8:11:06 0 [Note] InnoDB: Dumping buffer pool(s) to /var/lib/mysql/ib_buffer_pool
May 19 08:11:06 database mariadbd[659]: 2022-05-19 8:11:06 0 [Note] InnoDB: Buffer pool(s) dump completed at 220519 8:11:06
How to solve :
I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.
Method 1
I have probably found the cause and it’s a weird edge case.
I have the cts (from proxmox) running on lvm instead of lvm-thin. And had a backup job using snapshot for the database every 4 hours. My guess is that since snapshot was unavailable it froze the ct for the backup and somehow mariadb did not like that.
I have now changed that to a backup once a day with the shutdown option.
Method 2
Turns out it is a known issue on proxmox+mariadb
Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0