I’m currently working on understanding the performance of snapshot restores in my company’s test suite. I’ve read a lot of material online about how the time it takes to perform a restore is directly proportional to the size of the snapshot – that is, the number of pages that have been copied to the snapshot sparse file. I’ve mostly found this to be true, and it’s an intuitive conclusion. However, there does appear to be a "floor" performance that you hit when the snapshot is sufficiently small. I haven’t seen this discussed anywhere, and I don’t quite understand it.
Essentially, what I’ve found is that the size of the snapshot does NOT have a linear relationship to the time taken to restore the snapshot. Instead, as snapshot size approaches zero, restore time approaches about 3.5 seconds. I’m sure this floor is a bit different depending on your setup, but both on my local dev machine and on the dozens of build servers we’ve studied, we have not been able to reduce the restore time of a snapshot below around three and a half seconds.
I’ve spent the last day running a local test to help illustrate this phenomenon. I performed the test by creating a fresh database containing a single table with a single integer valued column. I created a snapshot of the database when it was empty, then filled the table and restored the snapshot, using SSMS client statistics to measure how long the restore operation took. I graphed the restore time in milliseconds against the number of rows in the table when restore was called:
Notably, when there is nothing to restore, the restore time is a relatively constant 0.15-0.3 s. That’s the 0 on the X axis here. But as soon as we have even just one row to restore, the time goes up to a little over 3 seconds. I have other data that shows this phenomenon on our build servers – it’s not local to my machine, so it must have something to do with the implementation of snapshot restore. But I can’t find anything explaining this phenomenon online, anywhere. Can anyone help me understand it? Is this something that only affects me? Is it possible to increase the performance of snapshot restores past this limit? Any help is appreciated – thanks!
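For anyone wanting to rerun the measurement described above, the fill step can be sketched roughly as follows. This is only an illustration: the database name test_snapshot and table name dbo.TestTable are borrowed from the repro further down, the row count is an arbitrary placeholder, and the original test may have populated the table differently.

```sql
-- Sketch only: test_snapshot and dbo.TestTable are names taken from the repro
-- in this thread; @rows is a hypothetical value to vary between runs.
USE test_snapshot;
GO
DECLARE @rows int = 100000;

-- Generate @rows sequential integers by cross joining a system catalog view
-- with itself, then insert them in a single statement.
INSERT INTO dbo.TestTable (id)
SELECT TOP (@rows)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;
```

Varying @rows between runs and restoring the snapshot after each fill gives one data point per row count.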
How to solve :
I’m attempting to repro your scenario, and am seeing snapshot restore times between 600 and 800 milliseconds.
This is the repro:
    USE master;
    IF DB_ID('test_snapshot') IS NOT NULL
    BEGIN
        ALTER DATABASE test_snapshot SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
        DROP DATABASE test_snapshot;
    END
    GO
    CREATE DATABASE test_snapshot
    ON (NAME = 'test_snapshot_data', FILENAME = '/data/mssql/data/test_snapshot_data.mdf')
    LOG ON (NAME = 'test_snapshot_log' , FILENAME = '/data/mssql/logs/test_snapshot_log.ldf');
    GO
    USE test_snapshot;
    GO
    CREATE TABLE dbo.TestTable (id int NOT NULL);
    GO
    USE master;
    GO
    SET STATISTICS TIME OFF;
    DECLARE @msg nvarchar(1000);
    SET @msg = 'Create Snapshot';
    RAISERROR (@msg, 0, 0) WITH NOWAIT;
    SET STATISTICS TIME ON;
    CREATE DATABASE test_snapshot_snap
    ON (NAME = test_snapshot_data , FILENAME = '/data/mssql/data/test_snapshot_data.ss')
    AS SNAPSHOT OF test_snapshot;
    SET STATISTICS TIME OFF;
    GO
    USE test_snapshot;
    GO
    INSERT INTO dbo.TestTable (id) VALUES (0);
    GO
    USE master;
    GO
    DECLARE @msg nvarchar(1000);
    SET @msg = 'Restore Snapshot';
    RAISERROR (@msg, 0, 0) WITH NOWAIT;
    SET STATISTICS TIME ON;
    RESTORE DATABASE [test_snapshot] FROM DATABASE_SNAPSHOT = 'test_snapshot_snap';
    SET STATISTICS TIME OFF;

    SET @msg = 'Drop Snapshot';
    RAISERROR (@msg, 0, 0) WITH NOWAIT;
    SET STATISTICS TIME ON;
    DROP DATABASE test_snapshot_snap;
    SET STATISTICS TIME OFF;
A couple of things to note: this is running on SQL Server 15.0.4102.2 (SQL Server 2019 on Red Hat Enterprise Linux), on my laptop in a Hyper-V VM. The VM is hosted on a very fast NVMe SSD.
The statistics time results are consistently similar to:
    Create Snapshot

     SQL Server Execution Times:
       CPU time = 37 ms,  elapsed time = 118 ms.

    (1 row affected)
    Restore Snapshot

     SQL Server Execution Times:
       CPU time = 193 ms,  elapsed time = 605 ms.
    Drop Snapshot

     SQL Server Execution Times:
       CPU time = 12 ms,  elapsed time = 25 ms.

    Completion time: 2021-07-06T12:44:46.0651782-05:00
If I comment out the CREATE TABLE and INSERT INTO statements, I see no statistically significant difference in run time.
After experimenting with the example given by Hannah Vernon, I believe I’ve found the source of my issues with snapshot restore.
In my attempt to repro this issue on my box, and in our test suite, we set the restore target database to single-user mode prior to the snapshot restore, with a statement along the lines of alter database x set single_user with rollback immediate. We set it back to multi-user afterwards. This is where the floor time is coming from, as far as I can tell. Running those two alter database statements on their own takes around 3 s on my machine, which explains the ~3.5 s floor for restore operations: ~3 s to roll back open connections, ~0.5 s to do the restore. That tracks with our "empty" restore time as well.
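One way to confirm that breakdown is to timestamp each step separately rather than timing the whole sequence. The following is a minimal sketch, not the exact script from our test suite: it assumes the database and snapshot names from Hannah's repro (test_snapshot and test_snapshot_snap) and that the snapshot already exists.

```sql
-- Sketch only: test_snapshot / test_snapshot_snap are assumed names from the
-- repro above; substitute your own database and snapshot.
USE master;
GO
DECLARE @t0 datetime2(7) = SYSDATETIME();

-- Step 1: kick out other connections and roll back their open transactions.
ALTER DATABASE test_snapshot SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DECLARE @t1 datetime2(7) = SYSDATETIME();

-- Step 2: the snapshot restore itself.
RESTORE DATABASE test_snapshot FROM DATABASE_SNAPSHOT = 'test_snapshot_snap';
DECLARE @t2 datetime2(7) = SYSDATETIME();

-- Step 3: reopen the database to other connections.
ALTER DATABASE test_snapshot SET MULTI_USER;
DECLARE @t3 datetime2(7) = SYSDATETIME();

-- Report how long each step took, in milliseconds.
SELECT DATEDIFF(MILLISECOND, @t0, @t1) AS set_single_user_ms,
       DATEDIFF(MILLISECOND, @t1, @t2) AS restore_ms,
       DATEDIFF(MILLISECOND, @t2, @t3) AS set_multi_user_ms;
```

If the floor really comes from the mode switches, restore_ms should stay small while the two ALTER DATABASE columns account for most of the elapsed time.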
It’s still a bit of a mystery to me that an "empty" restore took so little time, even when bookended by the set single/multi user statements the same way our other restores are. My guess is that, in my test and in our test suite, this can occur under special conditions, maybe when no other users are or were connected to the DB. I can’t seem to repro that 0.15-0.3 s time on my machine today to confirm those conditions, but I am still seeing it on our build machines. In any case, it’s not particularly relevant to 99% of snapshot restore cases, at least not mine, so I’m dropping the investigation here. This was the major red herring in my approach.
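If someone does want to test the "no other connections" guess, one possible check is to list the sessions attached to the database just before switching it to single-user mode. This is a sketch under assumptions: it uses the test_snapshot name from the repro and requires VIEW SERVER STATE permission to see other users' sessions.

```sql
-- Sketch only: test_snapshot is an assumed database name from the repro above.
-- Lists every session currently using the database, excluding this one.
SELECT s.session_id,
       s.login_name,
       s.host_name,
       s.program_name,
       s.status
FROM sys.dm_exec_sessions AS s
WHERE s.database_id = DB_ID('test_snapshot')
  AND s.session_id <> @@SPID;
```

Comparing the number of rows returned against the time the single-user switch takes would show whether the fast runs coincide with an idle database.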
Major thanks to Hannah for attempting to reproduce the issue and going back and forth with me on this! I’m going to mark my post as the answer for this question since it’s a more complete answer should anyone else encounter a similar issue, but all credit goes to Hannah.