Manage billions of rows of sensor data in MariaDB


I’m the new "IT guy" in a small company that manages ecological footprint reduction in industry. For contractual reasons, they need to keep, for 5 years, every data change from each sensor they put in their installations (some change every second, some every minute). They run calculations on these data to prove the ecological footprint reduction each year and to detect errors or illogical behaviour in the installations.

Currently we have one project ongoing and two others coming in the next weeks.

The previous IT guy set up a MariaDB server on a VPS with this structure:

CREATE TABLE `machine` (
  `Name` varchar(100) CHARACTER SET utf8 NOT NULL,
  `Site` int(11) DEFAULT NULL,
  `Emplacement` varchar(200) DEFAULT NULL,
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`Name`),
  UNIQUE KEY `ID` (`ID`),
  KEY `FK_machine_site` (`Site`),
  CONSTRAINT `FK_machine_site` FOREIGN KEY (`Site`) REFERENCES `site` (`ID`)
) ENGINE=InnoDB;
CREATE TABLE `mesure` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `Machine` varchar(100) NOT NULL,
  `Date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `Valeur` decimal(18,5) NOT NULL,
  PRIMARY KEY (`ID`),
  UNIQUE KEY `machine_timestamp` (`Machine`,`Date`) USING BTREE,
  KEY `Date` (`Date`),
  CONSTRAINT `FK_valeur_machine` FOREIGN KEY (`machine`) REFERENCES `machine` (`Name`)
) ENGINE=InnoDB;
CREATE TABLE `site` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `Nom` varchar(50) NOT NULL,
  `Ville` varchar(100) DEFAULT NULL,
  `Code_Postal` varchar(100) DEFAULT NULL,
  `Rue` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB;

In about 7 months we got about 600 million rows inside the "mesure" table, 80 rows inside "machine", just one inside "site", and the database is about 40 GB.
This is working quite well for the moment; the data can be accessed in a reasonable time for the monthly data extraction (I have written a script that retrieves the data and generates one Excel file per week, with all the per-second calculations done inside).

We have decided to set up a new server on a new VM per project for the moment.

To summarize the context:

  • One database and server per project; MariaDB for the moment
  • About 100 sensors per project with variable update rates; about 600 million records in 7 months for the current project, 40 GB
  • The data cannot be reduced or normalized, and we need to keep it available for 5 years
  • Low interaction with the DB: only the sensor-reading application writes to it, plus one extraction each month to recover the monthly data
  • Limited IT budget; we cannot afford big servers. I’m working on a 4-core, 8 GB RAM, SSD VPS used only for the data-recovery app (low resource usage) and the DB

My questions

  • It seems we’re going to reach the max INT value for the mesure ID PK fast. Do I need to change it to BIGINT? I was also thinking about changing the PK to the (Machine, Date) couple; is this a good idea or not?
  • Will we face any limitations in the future with this way of doing things?
  • Is it a good idea to stay on MariaDB, or should I look at other databases? I was looking at TimescaleDB; any comments or positive/negative reviews on it?
  • Are there any optimisations I can do to reduce the size of the DB? I was thinking about yearly "archive" compressed DB dumps for each project, stored on a single, less costly server with a big HDD (the dumps would also be backed up locally and on a cloud drive). This would reduce the size of the backups we’re doing on the MariaDB servers; a dump would only be restored to a local DB in the very rare cases where old data needs to be retrieved. Any comments on this?

How to solve


Method 1

Since you are likely to have a billion rows in this one table, I will focus on it:

CREATE TABLE `mesure` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `Machine` varchar(100) NOT NULL,
  `Date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `Valeur` decimal(18,5) NOT NULL,
  PRIMARY KEY (`ID`),
  UNIQUE KEY `machine_timestamp` (`Machine`,`Date`) USING BTREE,
  KEY `Date` (`Date`),
  CONSTRAINT `FK_valeur_machine` FOREIGN KEY (`machine`) REFERENCES `machine` (`Name`)
) ENGINE=InnoDB;

Each row takes about 70 bytes (including overhead). At your current rate, 600 million rows in 7 months is roughly 5 billion rows over 5 years, or about 350 GB for this one table per project; multiply that by three projects and… get the picture? We need to shrink this table.

decimal(18,5) — That seems much too big. It takes 9 bytes. Consider using FLOAT (no, not FLOAT(18,5)). That takes 4 bytes and has about 7 significant digits. That is more than adequate for any sensor I know of.

Machine varchar(100) — Shrink that down to an "id" that joins to the machine table. How many machines might you eventually have? Probably between 256 and 64K? Use a 2-byte SMALLINT UNSIGNED both here and in the machine table.
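
That also means shrinking machine.ID itself. A minimal sketch; it is safe to run this first, because mesure currently references machine.Name, not machine.ID:

ALTER TABLE machine
  -- Shrink machine.ID to a 2-byte type so the foreign key in the
  -- revised mesure table (below) will match its type exactly;
  -- InnoDB rejects foreign keys whose column types differ.
  MODIFY `ID` SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT;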

To allow for the unlikely duplicate, do INSERT IGNORE instead of simply INSERT.
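
To illustrate (a sketch only, using the column names of the revised schema below; the values are made up):

-- A duplicate (MachineId, Date) pair is silently skipped instead of
-- aborting the whole statement with a duplicate-key error.
INSERT IGNORE INTO mesure (MachineId, `Date`, Valeur)
VALUES (42, '2023-04-01 12:00:00', 21.5);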

Can you get 2 readings for the same machine in the same second? Probably not. And even if you do, you can (should?) throw away the redundant reading. So, I recommend PRIMARY KEY(MachineId, Date) (and eliminating the existing UNIQUE key with the same columns).

The change to the PK also eliminates the 4-byte ID that you currently have. Keeping INT would be a disaster waiting to happen when it hits the limit of about 2.1 billion; at your current rate of 600 million rows per 7 months, that is less than two years away, and sooner if the insert rate grows.

Revised schema

CREATE TABLE `mesure` (
  `MachineId` SMALLINT UNSIGNED NOT NULL,
  `Date` datetime NOT NULL,
  `Valeur` FLOAT NOT NULL,
  PRIMARY KEY (`MachineId`,`Date`),
  KEY `Date` (`Date`),
  CONSTRAINT `FK_valeur_machine`
        FOREIGN KEY (`MachineId`) REFERENCES `machine` (`ID`)
) ENGINE=InnoDB;
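
Migrating the existing 600 million rows could then be one big INSERT ... SELECT; in practice you may want to copy in chunks (for example, month by month) to keep transactions small. A hedged sketch, where mesure_new is a hypothetical name for the revised table created alongside the old one:

-- Copy the rows across, mapping each machine name to its numeric ID.
-- (Give the new table's FK a different name while both tables exist,
-- or drop the old FK first: InnoDB constraint names are unique per schema.)
INSERT INTO mesure_new (MachineId, `Date`, Valeur)
SELECT m.ID, v.`Date`, v.Valeur
FROM mesure AS v
JOIN machine AS m ON m.Name = v.Machine;

-- Then swap the tables atomically:
RENAME TABLE mesure TO mesure_old, mesure_new TO mesure;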

Those changes might cut the table’s disk footprint in half. You say the queries are fast enough now, but they will get slower over time. This schema will help keep them "fast enough".

Summary table(s)

You will probably want "reports" or graphs or other queries that do a big SELECT ... WHERE date between... GROUP BY machineId. And you will find that this is slower than you like. See http://mysql.rjweb.org/doc.php/summarytables ; we can discuss that in another Question.
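
For a flavor of what that looks like (a sketch only; the column choices depend on which aggregates your reports actually need):

CREATE TABLE mesure_daily (
  MachineId SMALLINT UNSIGNED NOT NULL,
  `Day` DATE NOT NULL,
  Cnt INT UNSIGNED NOT NULL,   -- number of readings that day
  Total DOUBLE NOT NULL,       -- sum of readings, for computing averages
  MinVal FLOAT NOT NULL,
  MaxVal FLOAT NOT NULL,
  PRIMARY KEY (MachineId, `Day`)
) ENGINE=InnoDB;

-- Nightly job: summarize yesterday's readings.
INSERT INTO mesure_daily
SELECT MachineId, DATE(`Date`), COUNT(*), SUM(Valeur), MIN(Valeur), MAX(Valeur)
FROM mesure
WHERE `Date` >= CURDATE() - INTERVAL 1 DAY
  AND `Date` <  CURDATE()
GROUP BY MachineId, DATE(`Date`);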

Done properly, you can consider not saving the raw data for 5 years, while keeping the summarized data "forever". The summary should be one-tenth the size and 10 times as fast to query. (YMMV – Your Mileage May Vary.)

Deleting old data

If you keep the data for only 5 years, how will you delete the "old" data? You will find that a big DELETE is grossly inefficient. Plan ahead by partitioning with PARTITION BY RANGE(TO_DAYS(Date)) now, with monthly partitions; DROP PARTITION is much faster than DELETE. More: http://mysql.rjweb.org/doc.php/partitionmaint
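
A hedged sketch (partition names and boundary dates here are illustrative; you would also script the periodic creation of new partitions, for example by reorganizing the pfuture partition):

ALTER TABLE mesure
-- Note: partitioned InnoDB tables cannot have foreign keys, so
-- FK_valeur_machine must be dropped first. Also, this statement
-- rewrites the whole table, so do it early, while the table is small.
PARTITION BY RANGE (TO_DAYS(`Date`)) (
  PARTITION p2023_04 VALUES LESS THAN (TO_DAYS('2023-05-01')),
  PARTITION p2023_05 VALUES LESS THAN (TO_DAYS('2023-06-01')),
  PARTITION pfuture  VALUES LESS THAN MAXVALUE
);

-- Five years later, removing a month of old data is nearly instantaneous:
ALTER TABLE mesure DROP PARTITION p2023_04;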


This answer was sourced from stackoverflow.com or stackexchange.com and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.
