The following statement fails with the error "Time-out occurred while waiting for buffer latch type 2 for page (1:4365112), database ID 12.":
alter table [T] add Id bigint identity(1, 1)
The table T in question has 5 million rows, each of which has some large text columns, for a total of 60GB. It is poorly designed so we’re trying to replace its current identity column, a GUID string column set up to be a clustered index, with a simple integer autoincrement ID and nonclustered index. The table is only ever appended to – it’s an audit log. Its data is rarely viewed, but nevertheless uses up a ton of RAM due to the clustered index and the table being so huge.
The ALTER statement works on small copies but fails with the given error on a larger table, and we might need it to work on tables with 10x as many rows. The above message appears after about 1 hour.
If we put the database in single user mode, we get a similar error but slightly faster, in about 45 minutes: Time-out occurred while waiting for buffer latch type 4 for page (1:4495724), database ID 12.
We can achieve the same task differently, e.g., define a new table and then write some T-SQL to insert rows gradually and build up the new index, then drop the original table. But if there’s a way to make the ALTER work, that would be simpler, so any suggestions would be appreciated.
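For reference, the new-table alternative described above could look roughly like the following T-SQL. This is a minimal sketch, not a tested migration: the table name dbo.T_new, the columns GuidKey and Payload, and the batch size are hypothetical placeholders, since the real schema is not shown here. It also assumes that no rows are inserted into T while the copy runs and that the GUID key is never an empty string.

```sql
-- Hypothetical target table; replace the column list with the real schema.
CREATE TABLE dbo.T_new (
    Id      bigint IDENTITY(1, 1) NOT NULL,
    GuidKey varchar(36)   NOT NULL,
    Payload nvarchar(max) NULL
);

DECLARE @batch int = 50000;       -- assumed batch size; tune to log and I/O capacity
DECLARE @last  varchar(36) = '';  -- sorts before any non-empty key
DECLARE @rows  int = 1;
DECLARE @keys  TABLE (GuidKey varchar(36) NOT NULL);

WHILE @rows > 0
BEGIN
    -- Copy the next slice in clustered-key order, so each batch is a range seek on T.
    INSERT INTO dbo.T_new (GuidKey, Payload)
    OUTPUT inserted.GuidKey INTO @keys (GuidKey)
    SELECT TOP (@batch) GuidKey, Payload
    FROM dbo.T
    WHERE GuidKey > @last
    ORDER BY GuidKey;

    SET @rows = @@ROWCOUNT;

    -- Remember the highest key copied so far, then clear the scratch table.
    IF @rows > 0
        SELECT @last = MAX(GuidKey) FROM @keys;
    DELETE FROM @keys;
END;

-- Build the nonclustered index once, after the data is loaded.
CREATE UNIQUE NONCLUSTERED INDEX IX_T_new_Id ON dbo.T_new (Id);
```

Each batch commits on its own, so a failure partway through loses at most one batch and the loop can be restarted from the last copied key.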
Environment: SQL Server 2019, on a test server with almost no load. Physical server, plenty of disk and RAM, SSD drive.
How to solve:
The issue appears to have been caused by slow disk access combined with contention from other processes.
We looked in the SQL Server error logs, Windows Event Viewer and SQL Activity Monitor. A hardware I/O error had been logged in Event Viewer, and we found Acronis backups (which use VSS) running continuously. We also found that this particular database had been installed, because of its size, on a separate SATA drive and not on the main SSD drive that holds the other databases on this server. (My original post was therefore mistaken about the SSD.)
So in summary, we found that the database was on a slower drive, and that the drive was being exercised continuously by the backup subsystem. This strained the SATA disk to the point where it could not complete the operation and threw errors in Event Viewer.
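One way to confirm this kind of problem from inside SQL Server, in addition to the logs mentioned above, is to check where the database files physically live and how long I/O against them has been stalling. The query below is only a sketch of such a check and is not something we ran at the time. It uses database ID 12 from the error message, and the latency figures are averages accumulated since the instance last started.

```sql
-- Physical file location and average I/O latency for database ID 12.
SELECT
    DB_NAME(vfs.database_id)                             AS database_name,
    mf.physical_name,                                    -- shows which drive the file is on
    vfs.num_of_reads,
    vfs.num_of_writes,
    vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
    vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(12, NULL) AS vfs
JOIN sys.master_files AS mf
    ON  mf.database_id = vfs.database_id
    AND mf.file_id     = vfs.file_id;
```

Consistently high averages on the data file, compared with databases on the SSD, would point to the storage rather than to the ALTER statement itself.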
We moved the database to the SSD/RAID drive set and reduced the backup frequency, and then this operation completed in 30 minutes. I did not try it with backups off and on the SATA drive, but I imagine that would be way slower.
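For anyone who needs to do the same move, relocating a user database's files generally follows the pattern below. This is a sketch under assumptions: the database name AuditDb, the logical file names and the target paths are all hypothetical, so look up the real ones first. The database is unavailable while the files are copied.

```sql
-- Find the real logical names and current paths first.
SELECT name, physical_name
FROM sys.master_files
WHERE database_id = DB_ID(N'AuditDb');   -- hypothetical database name

-- Take the database offline, rolling back any open transactions.
ALTER DATABASE [AuditDb] SET OFFLINE WITH ROLLBACK IMMEDIATE;

-- Move the .mdf and .ldf files to the new drive at the OS level, then
-- point SQL Server at the new locations (hypothetical names and paths).
ALTER DATABASE [AuditDb]
    MODIFY FILE (NAME = N'AuditDb_Data', FILENAME = N'E:\SqlData\AuditDb.mdf');
ALTER DATABASE [AuditDb]
    MODIFY FILE (NAME = N'AuditDb_Log',  FILENAME = N'E:\SqlData\AuditDb_log.ldf');

ALTER DATABASE [AuditDb] SET ONLINE;
```

The SQL Server service account needs permissions on the new folder, otherwise the database will fail to come back online.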