We want to build a data warehouse to distribute valid information across the organization (reporting) and get some insights into our data (analytics). After some research we found MariaDB with ColumnStore to be an attractive option. However, we are all non-professionals in the field of data warehousing. That’s why I would like to hear some opinions on its applicability and the hardware required.
The data from the source systems (Oracle, IBM DB2, MariaDB instances) sum up to 1 TB (including PK/FK and indexes) and cover 36 months of history. The data consist of tables with mostly integer keys, a mix of integers/doubles (65%), mostly shorter VARCHARs (30%), and 5% geo data. The data come from 15-20 different topics with different weights, meaning some take up 50GB and some 5GB in the data warehouse. Not all topics (tables) can or will be joined, because there are simply no common keys. For the reporting case, the different topics will mostly be queried in isolation (at most five topics will be joined). Queries over all time points (months) will be rare. I think it is reasonable to assume that up to 17 months of data will be queried for product performance comparison. New data will be written overnight in batch jobs.
The MariaDB database server is planned to run on a machine with 8 physical cores, 1,25 TB of SSD storage and 64GB of RAM. In total, we expect about 50 people to query the data: 45 use aggregated reporting tables and 5 are heavy analytics users (joining tables, grouping, exporting data to other ML tools). Complex string searches are out of scope for the data warehouse. We do not expect all 45 reporting users to query the data at the same time, but up to 20 concurrent users are possible. We do not want basic reporting queries to take 10 seconds or longer. More complex queries for the 5 analytical users can take longer.
Some considerations we had and I would be happy to get your opinions on:
- Do commercial vendors offer dramatically better solutions than MariaDB column store, which we should consider for our case? Or is it possible to do it with an open source RDBMS?
- Does it make sense to have one stronger machine, or 2-3 weaker ones balancing the load a little better? Would they then need a full copy of all data? (so 1,25 TB SSDs x 3 Workers?).
- Does it make sense to not have one big SSD, but five (256 GB each) smaller ones to increase IO? (Saving topics of similar tables on the same SSD, because they will be joined more frequently.)
- Does it make sense to have 64GB of RAM, or is more needed? What determines how much RAM is needed?
As you can see, our lack of experience raises some fundamental questions about the hardware required. Is it even possible to draw conclusions about the likely performance from the information provided? If not, what kind of information is needed?
How to solve :
This is what I say about MySQL/MariaDB without columnstore:
Not good: "sum up to 1 TB" versus "1,25 TB of SSD storage". As a rule of thumb, you should have half your disk free for maintenance and growth. As a minimum, there should be enough room for an extra copy of the largest table (data plus indexes). This allows any ALTER to run without running out of disk space.
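To make the two rules of thumb concrete, here is a minimal sketch of the arithmetic in Python. The 1,25 TB disk and 1 TB of data come from the question; the 50 GB largest table is only an assumed placeholder (the real figure is not known yet), and `disk_headroom` is a hypothetical helper name.

```python
def disk_headroom(disk_gb, used_gb, largest_table_gb):
    """Check the two rules of thumb against the available disk space."""
    free_gb = disk_gb - used_gb
    return {
        "free_gb": free_gb,
        "free_fraction": free_gb / disk_gb,
        # Rule of thumb: about half of the disk should stay free.
        "half_free": free_gb >= disk_gb / 2,
        # Bare minimum: room for one extra copy of the largest table.
        "alter_fits": free_gb >= largest_table_gb,
    }


# Disk and data sizes are from the question; the largest-table size
# (50 GB) is an assumption for illustration only.
report = disk_headroom(disk_gb=1250, used_gb=1000, largest_table_gb=50)
print(report)
```

With these inputs only 20% of the disk is free, so the half-free rule fails even though a copy of a 50 GB table would still fit.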
Let’s discuss the largest table. Please show us SHOW CREATE TABLE and the main queries: for inserting, for updating and deleting (if needed), and for selecting. With that we can discuss shrinking the disk footprint, indexing, and partitioning (or not).
Here’s my discussion of DW: http://mysql.rjweb.org/doc.php/datawarehouse See also its link to a Summary table discussion.
Please elaborate on "simply no common keys".
With "overnight batch loading", I highly recommend:
- load into a temp table;
- normalize (tips on that in another link);
- augment the summary table(s) from the temp table;
- copy the data into the main Fact table;
- drop the temp table.
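The steps above can be sketched end to end. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module (SQLite 3.24 or newer for the upsert) instead of MariaDB so that it runs anywhere, and the tables `product`, `fact_sales`, `summary_sales` and the sample rows are all hypothetical. In MariaDB the upsert in step 3 would typically be written with INSERT ... ON DUPLICATE KEY UPDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Long-lived tables (hypothetical schema, for illustration only).
cur.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE fact_sales (day TEXT, product_id INTEGER, amount REAL);
CREATE TABLE summary_sales (
    day TEXT, product_id INTEGER, total REAL, n INTEGER,
    PRIMARY KEY (day, product_id)
);
""")


def nightly_load(rows):
    # 1. Load the raw batch into a temp (staging) table.
    cur.execute(
        "CREATE TEMP TABLE staging (day TEXT, product_name TEXT, amount REAL)"
    )
    cur.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)
    # 2. Normalize: add product names not seen before to the lookup table.
    cur.execute(
        "INSERT OR IGNORE INTO product (name) "
        "SELECT DISTINCT product_name FROM staging"
    )
    # 3. Augment the summary table from the temp table.
    #    ("WHERE 1" keeps SQLite from reading ON CONFLICT as a join clause.)
    cur.execute("""
        INSERT INTO summary_sales (day, product_id, total, n)
        SELECT s.day, p.product_id, SUM(s.amount), COUNT(*)
        FROM staging s JOIN product p ON p.name = s.product_name
        WHERE 1
        GROUP BY s.day, p.product_id
        ON CONFLICT (day, product_id) DO UPDATE
        SET total = total + excluded.total, n = n + excluded.n
    """)
    # 4. Copy the normalized rows into the main fact table.
    cur.execute("""
        INSERT INTO fact_sales (day, product_id, amount)
        SELECT s.day, p.product_id, s.amount
        FROM staging s JOIN product p ON p.name = s.product_name
    """)
    # 5. Drop the temp table.
    cur.execute("DROP TABLE staging")
    conn.commit()


# Two hypothetical nightly batches; the second adds to an existing summary row.
nightly_load([
    ("2024-01-01", "widget", 10.0),
    ("2024-01-01", "widget", 5.0),
    ("2024-01-01", "gadget", 2.5),
])
nightly_load([("2024-01-01", "widget", 1.0)])
```

The point of the ordering is that the summary tables are fed from the small temp table, not by rescanning the large fact table.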
If you are "purging" "old" data, then PARTITION the table so as to replace the big, slow DELETE with a fast DROP PARTITION. More details: http://mysql.rjweb.org/doc.php/partitionmaint
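As a rough sketch of how such a purge could be scripted, the Python helper below picks the monthly partitions that fall outside a retention window and builds the ALTER TABLE ... DROP PARTITION statement. The `pYYYYMM` naming scheme, the table name `fact_sales`, the dates and both function names are assumptions for illustration; the 36-month window matches the history mentioned in the question.

```python
from datetime import date


def partitions_to_drop(partition_names, today, keep_months):
    """Return monthly partitions (named pYYYYMM) older than the retention window."""
    # Index of the oldest month to keep, counted in months since year 0.
    cutoff = today.year * 12 + (today.month - 1) - (keep_months - 1)
    old = []
    for name in partition_names:
        year, month = int(name[1:5]), int(name[5:7])
        if year * 12 + (month - 1) < cutoff:
            old.append(name)
    return sorted(old)


def drop_partition_sql(table, names):
    """Build one ALTER TABLE ... DROP PARTITION statement, or None if nothing to drop."""
    if not names:
        return None
    return f"ALTER TABLE {table} DROP PARTITION {', '.join(names)};"


# Hypothetical partition list and run date.
existing = ["p202403", "p202102", "p202104", "p202103"]
old = partitions_to_drop(existing, today=date(2024, 3, 15), keep_months=36)
print(drop_partition_sql("fact_sales", old))
# prints: ALTER TABLE fact_sales DROP PARTITION p202102, p202103;
```

Dropping a partition removes its rows in one cheap operation, which is why it replaces the row-by-row DELETE.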
Your other numeric specs are not scary but could be problematic. Need to get into the actual analytic queries to see.
Note: if your team grows and the system fails to provide adequate concurrent queries, you can build a Replication system, wherein the Primary copies to read-only Replicas. Each Replica can independently handle several big analysis queries.
The amount of RAM will be determined by the table sizes, query complexity, etc. More RAM is better, but there is a point of "diminishing returns" that cannot be predicted. Start with your 64GB, try to optimize the "worst queries"; then decide whether to add Replicas.
The layout of the SSDs won’t matter much. (Aside from needing more room.) With multiple small drives, you could use RAID striping and/or parity. I would want a RAID controller with battery-backed write cache (costly, good, but probably overkill since you write only nightly, and possibly don’t care if it takes 2 hours instead of 1?) That would allow for a RAID-5 deployment of 4x500GB or 7x250GB drives; either would give you effectively 1.5TB plus parity.
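The capacity figure follows from the usual RAID-5 arithmetic, where one drive's worth of space goes to parity. A small sketch in Python (the function name is hypothetical; drive counts and sizes are the two layouts mentioned above):

```python
def raid5_usable_gb(drive_count, drive_gb):
    """Usable RAID-5 capacity: one drive's worth of space holds parity."""
    if drive_count < 3:
        raise ValueError("RAID-5 needs at least three drives")
    return (drive_count - 1) * drive_gb


for count, size in [(4, 500), (7, 250)]:
    print(f"{count} x {size} GB -> {raid5_usable_gb(count, size)} GB usable")
```

Both layouts come out at 1500 GB usable; the 7-drive option spreads the I/O over more spindles but has more drives that can fail.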
(Years ago, I did a similar project: 0.5TB data; HDD; RAID-5; hourly load; 7 summary tables loaded in about 7 minutes. Primary-Replica was implemented, but was overkill in my case. The Summary tables turned hour-long queries into minute-long queries; YMMV.)
Columnstore shrinks typical data by 10x. (I suspect that a well designed schema on InnoDB can’t be shrunk by that much.) Some of the performance characteristics come from this compression; some from parallel operations.
Where does your 1TB figure come from? Does it mean that you would need only 0.1TB of disk for Columnstore? Or that InnoDB would need 10TB? (My discussions above may need adjusting accordingly.)
Columnstore is excellent at querying on an arbitrary column because every column is indexed. But when you filter on more than one column, it will efficiently filter on one, then (perhaps) have to brute force on the others.
Without more insight into your queries, I can’t judge between
- MySQL/MariaDB with regular indexing
- MySQL/MariaDB with Partitioning / FULLTEXT / Spatial
It is likely that CS will outperform or underperform an equivalent database without CS — depending on the query.