All we need is an easy explanation of the problem, so here it is.
For an ETL reporting system, is it normal for the total execution time for a pull of 15 minutes with no data to be similar to a 24 hour pull with data?
I had expected that the total time for ETL when there is NO data would be shorter but that is not the situation between a 15 minute and 24 hour pull. But I must confess that I know nothing about the internals of the T and L phases within a report server.
Can someone shed light on whether the T and L phases are typically fixed in duration (up to a point)?
How to solve :
I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.
There’s nothing specifically quantifiable about the abstract concepts of Transform and Load, only their concrete implementations are measurable. To be able to comment on your case, we’d need to know specifically what your Transformation and Loading processes are actually doing. Obviously some Transformations can take multitudes longer than others.
But generally speaking, the amount of data being processed definitely should affect the overall runtime of an ETL process. If there’s a significant difference in the amount of data between a 24 hour timeframe vs a 15 minute timeframe, yet your ETL process is executing at roughly the same average runtime for both cases, then something is definitely suspect, and that’s not normal.
Even if an index scan is occuring in both cases, if there’s a significant difference in the amount of data, the total runtime should certainly reflect it. The runtime of an index scan is linear (generally speaking) based on the number of rows in the index.
I would also add few things about Power BI.
Often in a Power BI models you will use "import data" mode. In this case when you update even few rows in your source data Power BI storage engine will create a brand new full copy of all source data (or if the process is optimized just specific partitions). Power BI uses the same engine like non-updatable columns store indexes, so the whole index partition needs to be rebuild after every change.
You can read more about power BI refresh here: https://docs.microsoft.com/en-us/power-bi/connect-data/refresh-data.
Usually you will not use partitions smaller than one day, so it is expected that refreshing 15 minutes or 24h will cause rebuild of the same partitions and this phase of ETL will take similar amount of time.
Of course it is just one of the phases of the ETL process, but often the longest one.
The answers from J.D. and Piotr are useful and contain worthwhile data but unfortunately not the real answer for this problem. That is not their fault.
As I started digging into this I found that this is simply an SSIS package extraction issue. At the time I did not have enough knowledge to dig into the SSIS integration report dashboard to understand what I was seeing.
The final step I needed to take was opening the project within visual studio and seeing the SSIS toolbox designer showing the multitude of steps. Learning how the process works. Very interesting and powerful!
I finally arrived at a table that was being pulled (extracted) in its entirety because it lacks a timestamp column. This table contains 4 million rows of 3 small columns and the SSIS logic uses a ‘lookup’ action to decide whether the report database needs to be updated or inserted.
This SSIS toolbox lookup action had the in memory cache option disabled ! Yikes!
It was taking 40 minutes to process this table every single time regardless of how many minutes the original extraction was set for.
Power BI has nothing to do with this. My apologies for the confusion.
Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂