I am unsure whether to create a data lake or a data warehouse, and I hope someone with real-world experience can give me some guidance.
I would like to store, visualise, and perform machine learning on data that I ingest from multiple sources (IoT devices, APIs, etc.). I have read that a business needs both a data lake and a data warehouse these days.
My questions are:
- Should I create a data lake first, then transform/process the raw data from the lake and load it into a data warehouse?
- Or is the data lake a separate data processing pipeline on its own?
- Or does this depend on the use case?
PS: If this is the wrong StackExchange, do let me know. Thanks 🙂
Answer:
There are a lot of similar and overlapping terms these days (Data Lake, Data Swamp, Data Warehouse, etc.) that I wouldn’t get too hung up on, IMO.
Data Lakes are informal places to centralize different sources of data. They can be flexible and don’t necessarily need to adhere to a fixed schema but can follow one.
Data Warehouses are more formally defined and unify those different sources of data into a common structure, so that it’s easy to build consuming applications and reports on top of it.
So the answer to your question is that it depends on your use cases: how many different types of data and sources you need to consume, and whether having a Data Lake as an intermediary step makes those use cases easier to accomplish before you apply the ETL (really the Transform part) processes to that data.
If all of your sources of data already follow a fairly common schema, then you can usually ETL straight into your Data Warehouse and skip the Lake altogether. But sometimes it’s good to use a Data Lake to preserve the original data as it was extracted, in case some reconciliation or debugging is needed later on. It gives you a record of what the data looked like before you transformed it into the Warehouse.
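To make the "lake first, then warehouse" flow concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: a local folder stands in for the lake, an in-memory SQLite database stands in for the warehouse, and the source names, field names (`dev`, `temp_c`, `sensorId`, `reading`) and the `readings` table are made up. A real setup would more likely use object storage and a proper warehouse engine, but the shape is the same: land the data untouched, then transform it into one common schema.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def land_raw(lake_dir: Path, source: str, name: str, payload: dict) -> Path:
    """Write the payload to the lake as received, in a folder per source."""
    target = lake_dir / source / f"{name}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(payload))
    return target


def to_common_schema(source: str, payload: dict) -> tuple:
    """Map each source's own field names onto one (source, device, ts, value) row."""
    if source == "iot":  # hypothetical IoT payload shape
        return (source, payload["dev"], payload["t"], float(payload["temp_c"]))
    if source == "api":  # hypothetical API payload shape
        return (source, payload["sensorId"], payload["timestamp"], float(payload["reading"]))
    raise ValueError(f"unknown source: {source}")


def load_warehouse(lake_dir: Path, conn: sqlite3.Connection) -> int:
    """Rebuild the warehouse table from whatever raw files are in the lake."""
    conn.execute("DROP TABLE IF EXISTS readings")
    conn.execute("CREATE TABLE readings (source TEXT, device TEXT, ts TEXT, value REAL)")
    rows = [
        to_common_schema(path.parent.name, json.loads(path.read_text()))
        for path in sorted(lake_dir.glob("*/*.json"))
    ]
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def demo() -> list:
    """Land two differently shaped records, then load them into one table."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        land_raw(lake, "iot", "0001",
                 {"dev": "pump-1", "t": "2024-01-01T00:00:00Z", "temp_c": "21.5"})
        land_raw(lake, "api", "0001",
                 {"sensorId": "pump-2", "timestamp": "2024-01-01T00:05:00Z", "reading": 19})
        conn = sqlite3.connect(":memory:")
        load_warehouse(lake, conn)
        rows = conn.execute(
            "SELECT source, device, value FROM readings ORDER BY device"
        ).fetchall()
        conn.close()
        return rows
```

Because the raw files are never modified, `load_warehouse` can be re-run at any time to rebuild the table after a transform bug is fixed, which is the reconciliation and debugging benefit described above. If every source already produced the same fields, `land_raw` could be dropped and the rows inserted directly.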