In ETL, is it better to drop the index before inserting millions of rows and recreate it right afterwards, or to simply insert into the empty table with the index in place?
I know that I can test it and measure it (I have not done that yet), but I want to understand the reason, that is, which is more expensive: sorting and inserting into a clustered index, or creating the index afterwards.
I have kept my index in place and when I insert I see a sort at the end of the execution plan. Also the clustered index insert operator right before the root node is quite expensive (basically all the cost is divided between the sort and the clustered index insert operators).
I use TABLOCK for my insert, recovery model is simple and table is rowstore.
How to solve :
I would keep the clustered index in place, especially if the data is being inserted in large lumps (rather than as lots of individual inserts).
You should drop the non-clustered indexes if rebuilding the data from scratch.
NOTE: As per your mention of truncating, this answer is talking about rebuilding a table from scratch.
Considerations would be different if you were adding millions of rows to a table that already contained billions.
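As a rough sketch of that load pattern in T-SQL: the table, column, and index names below (dbo.FactSales, dbo.StagingSales, IX_FactSales_CustomerId) are hypothetical and only stand in for your own objects. It assumes the target has a clustered index plus one non-clustered index, and that the table is rebuilt from scratch on each load. Disabling and rebuilding the non-clustered index is used here in place of dropping and recreating it, so that its definition does not have to be repeated.

```sql
-- Hypothetical objects: dbo.FactSales (target), dbo.StagingSales (source),
-- IX_FactSales_CustomerId (a non-clustered index on the target).

-- Disable only the non-clustered index; its definition stays in metadata.
-- Do not disable the clustered index: that makes the table inaccessible.
ALTER INDEX IX_FactSales_CustomerId ON dbo.FactSales DISABLE;

-- Empty the table; the clustered index stays in place.
TRUNCATE TABLE dbo.FactSales;

-- Load in one large batch with a table lock, as described in the question.
INSERT INTO dbo.FactSales WITH (TABLOCK) (SaleId, CustomerId, Amount)
SELECT SaleId, CustomerId, Amount
FROM dbo.StagingSales;

-- Rebuild the non-clustered index once, after all rows are loaded.
ALTER INDEX IX_FactSales_CustomerId ON dbo.FactSales REBUILD;
```

With this shape the clustered index is written once, in sorted order, during the insert, and each non-clustered index is built once from the finished table.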
"the clustered index insert operator right before the root node is quite expensive"
It will be, as that is the step that writes all the data to permanent storage. You’ll get a similarly expensive step with a heap.
"(basically all the cost is divided between the sort and the clustered index insert operators)."
This is expected. If you remove the index so that you load into a heap, then re-adding the clustered index has to reread the data, perform the same sort, and rewrite the pages in the new order, so it will be as expensive as, and probably more expensive than, the initial insert with the clustered index in place.
"I know that I can test it and measure it"
This is a very good point!
Don’t just measure the data insert, though: creating the index on an already populated table is expensive too, so it is "unfair" to compare "insert into heap" with "insert with CI". The fair comparison is "insert into heap + build CI" against "insert with CI". Also, building the index afterwards needs more space in the relevant data file, because two copies of the data exist while the index is being built (the heap and the newly forming clustered index that replaces it when complete).
Try both in fresh, otherwise empty test DBs (so the data files start at near zero length) to see the different file growth effects too.
"but I want to understand the reason"
I suggest trying it, with the index rebuilds too, and looking at the work done as displayed in the query plans and the IO statistics.
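One way to set up that comparison is sketched below. All object names (dbo.FactSales_CI, dbo.FactSales_Heap, dbo.StagingSales, CX_FactSales_Heap) are hypothetical. The sketch assumes dbo.FactSales_CI already has its clustered index on SaleId, and that dbo.FactSales_Heap starts out as an empty heap with the same columns.

```sql
-- Report logical/physical reads and CPU/elapsed time for each statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Option A: insert with the clustered index in place.
INSERT INTO dbo.FactSales_CI WITH (TABLOCK) (SaleId, CustomerId, Amount)
SELECT SaleId, CustomerId, Amount
FROM dbo.StagingSales;

-- Option B: insert into a heap, then build the clustered index.
-- Both statements count towards option B's total cost.
INSERT INTO dbo.FactSales_Heap WITH (TABLOCK) (SaleId, CustomerId, Amount)
SELECT SaleId, CustomerId, Amount
FROM dbo.StagingSales;

CREATE CLUSTERED INDEX CX_FactSales_Heap
    ON dbo.FactSales_Heap (SaleId);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

-- File sizes for the current database; size is reported in 8 KB pages.
SELECT name, size * 8 / 1024 AS size_mb
FROM sys.database_files;
```

Running each option in its own fresh test database, and checking the file sizes after each one, keeps the file growth of one approach from hiding that of the other.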