Correct column order for a composite index

I have a MySQL (MariaDB) database with a table, `sensors`, which collects data from IoT devices.

Each device may have 4-6 parameters that it records, like temperature, humidity, air quality, etc. Each device sends a measurement once every minute.

There are 10-15 such devices. Each device has its own deviceid.

The table has six columns:

`autoid`(INT, AUTO_INCREMENT)
`deviceid`(varchar)
`pname`(varchar)         /* name of the parameter, e.g. temperature, humidity */
`pcode`(INT)             /* code for each parameter, e.g. 11 for temperature, 12 for humidity */
`datavalue`(double)      /* value of the parameter */
`rectime`(INT)           /* UNIX timestamp */

Here is a sample of the table data:

autoid  deviceid  pname  pcode  datavalue  rectime
1       sdbjs4b   temp   11     30.54      1621702300
2       sdbjs4b   hum    12     104        1621702300
3       sdbjs4b   gas    13     768        1621702300
4       vsf5bjs   temp   11     31.45      1621702300
5       vsf5bjs   volt   15     5.10       1621702300

There are about 4-5 million rows in the sensors table.

My query requirements: I have to get data for some arbitrary time values for each day for each device and parameter.

Here is the query that is used:

SELECT * FROM sensors WHERE deviceid = ? AND pcode = ? AND rectime = ?

This is guaranteed to give me only one result. The problem is that I need to run this query inside nested loops, 500 times in the worst case. Why do I need the loops? I need to create a report for each device and parameter over a given set of time slots between two dates, so I have to loop over the time slot values.

I have a composite index on (deviceid,rectime,pcode).

What is the difference if I change this index to (rectime,deviceid,pcode)?

In general, will column order matter in composite indexing if my query uses all indexed columns in the where clause?

How to solve:

Method 1

I think your approach with nested loops is suboptimal.

Why can’t you do something like:

SELECT * FROM sensors WHERE deviceid = ? AND pcode = ? AND rectime BETWEEN ? AND ?

This would return the whole dataset in one go, and you could then process it locally.
Selecting 500 or even more rows in one well-formed select is better than running 500 single-row selects.
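
As a rough sketch of what the single-query approach could look like with the sample data from the question (all literal values are illustrative only): the first statement fetches every row for one device and parameter in a time window. The second is a variation that is not part of the original answer; it passes the required time slots as an IN list so that only those rows come back.

```sql
-- One round trip for a whole time window (bounds are example UNIX timestamps).
SELECT rectime, datavalue
FROM sensors
WHERE deviceid = 'sdbjs4b'
  AND pcode = 11
  AND rectime BETWEEN 1621702300 AND 1621788700
ORDER BY rectime;

-- Variation (an assumption, not from the answer): fetch only the needed slots.
SELECT rectime, datavalue
FROM sensors
WHERE deviceid = 'sdbjs4b'
  AND pcode = 11
  AND rectime IN (1621702300, 1621705900, 1621709500)
ORDER BY rectime;
```

Either form replaces the inner loop over time slots; the outer loops over devices and parameters could be folded in the same way if needed.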

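In this case I would change the clustered index to

(deviceid,pcode,rectime)

In MySQL/MariaDB (InnoDB) the clustered index is always the primary key, so this means making these three columns the primary key. You can still keep autoid as an AUTO_INCREMENT column, as long as it has its own (unique) index.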
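
In MySQL/MariaDB with InnoDB the clustered index is the primary key, so one way to sketch this change is the statement below. It assumes that autoid is currently the primary key and that (deviceid, pcode, rectime) is unique and contains no NULLs; the index name uq_autoid is made up for illustration. The statement rebuilds the table, so it is worth trying on a copy first.

```sql
-- Assumption: autoid is the current PRIMARY KEY of sensors.
ALTER TABLE sensors
    DROP PRIMARY KEY,
    -- An AUTO_INCREMENT column must stay indexed, so give autoid its own key.
    ADD UNIQUE KEY uq_autoid (autoid),
    -- Cluster the rows by the columns the report query filters on.
    ADD PRIMARY KEY (deviceid, pcode, rectime);
```

After this, the existing secondary index on (deviceid,rectime,pcode) is probably redundant for the query in the question and could be dropped.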

In addition, with the current secondary index each select has to do a lookup into the clustered index afterwards to fetch the remaining columns of that one row. MySQL, as far as I know, doesn't support INCLUDE columns on indexes, and even if it did, an index covering every column would be a de facto reordered copy of the table, so a clustered index makes sense.

As for whether the order matters: yes and no. When every indexed column is tested with =, any column order lets the index find the row directly, and given your data types and table size the difference would be minimal.
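
A simple way to check this on your own data is to look at the execution plan for the query with each index definition in place. The literal values below are taken from the sample rows and are only illustrative.

```sql
-- Show which index the optimizer picks and how many rows it expects to read.
EXPLAIN
SELECT *
FROM sensors
WHERE deviceid = 'sdbjs4b'
  AND pcode = 11
  AND rectime = 1621702300;
```

In the output, the key column names the index used, key_len indicates how much of the composite key is applied, and rows is the estimated number of rows examined.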

Method 2

Consider trying to make these changes:

  • One row per minute.
  • The various metrics are in columns, not rows.
  • The Primary key is the time, truncated to the minute. (No AUTO_INCREMENT column is needed.)
  • INSERT .. ON DUPLICATE KEY UPDATE .. is used for each of the 4-6 updates, the first of which will be an insert. (That is, it does not matter which device gets there first.)
  • The metrics are DEFAULT NULL.
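
The list above could be sketched roughly as follows. The table name metrics and the columns device, rectime, temp and humidity follow the names used in this answer; the column types, the extra gas and volt columns, and the sample values are assumptions made for illustration.

```sql
-- One row per device per minute; each metric is a nullable column.
CREATE TABLE metrics (
    device   VARCHAR(32) NOT NULL,
    rectime  DATETIME    NOT NULL,   -- truncated to the minute by the writer
    temp     DOUBLE DEFAULT NULL,
    humidity DOUBLE DEFAULT NULL,
    gas      DOUBLE DEFAULT NULL,    -- hypothetical extra metric
    volt     DOUBLE DEFAULT NULL,    -- hypothetical extra metric
    PRIMARY KEY (device, rectime)
);

-- The first metric to arrive for this minute inserts the row ...
INSERT INTO metrics (device, rectime, temp)
VALUES ('sdbjs4b', '2021-05-22 16:51:00', 30.54)
ON DUPLICATE KEY UPDATE temp = VALUES(temp);

-- ... and later metrics for the same minute update that row.
INSERT INTO metrics (device, rectime, humidity)
VALUES ('sdbjs4b', '2021-05-22 16:51:00', 104)
ON DUPLICATE KEY UPDATE humidity = VALUES(humidity);
```

The VALUES() form in the UPDATE clause works in MariaDB; newer MySQL versions deprecate it in favour of row aliases.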

No looping. I think your desired query can be performed in a single query. Example:

SELECT temp, humidity
    FROM metrics
    WHERE device = ?
      AND rectime >= ?
      AND rectime  < ? + INTERVAL 1 MONTH

will get two metrics (temp and humidity) from one device for every minute over a period of one month (assuming the last two "?" are the same).

This is likely to be optimal, both for the updates and for many queries:

PRIMARY KEY(device, rectime)

If you are testing against a date range, rectime must be after the columns being tested with =. This is not about cardinality, it is about equality versus range.

Furthermore, for multiple columns tested with =, cardinality does not impact the optimal order of such columns in the index.
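
Applied to the original sensors table, the equality-before-range rule could look like the sketch below. The index name is made up for illustration and the literal values come from the sample rows.

```sql
-- Equality columns first (in either order), the range column last.
CREATE INDEX idx_device_pcode_rectime
    ON sensors (deviceid, pcode, rectime);

-- Both "=" tests narrow the search, then the range on rectime is read as
-- one contiguous slice of the index.
SELECT rectime, datavalue
FROM sensors
WHERE deviceid = 'sdbjs4b'
  AND pcode = 11
  AND rectime >= 1621702300
  AND rectime <  1621788700;
```

With (deviceid, rectime, pcode) instead, the range on rectime comes before pcode in the index, so pcode can no longer narrow the scanned slice and has to be checked entry by entry.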

I used that way of testing a date/datetime range because BETWEEN includes both end-points. (This is a mistake a lot of people make.)
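
To illustrate the end-point difference, here is a small comparison on the metrics table, assuming rectime is a DATETIME column and using made-up dates.

```sql
-- Half-open range: every reading on 2021-05-22, nothing from 2021-05-23.
SELECT rectime, temp
FROM metrics
WHERE device = 'sdbjs4b'
  AND rectime >= '2021-05-22'
  AND rectime <  '2021-05-22' + INTERVAL 1 DAY;

-- BETWEEN is inclusive at both ends, so this also matches a row stamped
-- exactly 2021-05-23 00:00:00.
SELECT rectime, temp
FROM metrics
WHERE device = 'sdbjs4b'
  AND rectime BETWEEN '2021-05-22' AND '2021-05-23';
```

With the half-open form, consecutive day or month ranges never count the same row twice.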


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.
