Performance problems on historical data – inner join using MySQL


I have two tables, one containing products (produits) and one containing sales (ventes). The sales table holds millions of records, and an inner join between the two is really slow.

Following is an example query. The WHERE clause can contain more fields describing the products (upc, category, etc.) depending on what the user wants.

USE mysql_reception;
SELECT p.upc, p.description, SUM(v.quantity)
FROM produits p INNER JOIN ventes v ON p.upc = v.upc
WHERE v.date > '2021-01-01' AND v.date < '2021-04-30' AND p.sap = '32015201'
GROUP BY p.upc;

I have indexes on the following columns:

  • sap
  • date and upc
  • date

It takes minutes to get the result.

I was wondering whether I should create a third table that contains all the historical data, so that no inner join is required when I query it. Or is that a bad idea? Is there a better way to improve performance?

EDIT 31-05-2021 :

On database

Column date is on table Ventes (Sales)

Column sap is on table Produits (Products)

Both tables have a upc column for the join.

This is used interactively by people, so I'd like the query to run in about 10 seconds.

Configuration

OS : Windows Server 2012

RAM : 6 GB

CPU : Intel Xeon E5-2660 @ 2.2GHz

MySQL version 8.0.21

Plan

{
"query_block": {
    "select_id": 1,
    "cost_info": {
        "query_cost": "452260.74"
    },
    "grouping_operation": {
        "using_temporary_table": true,
        "using_filesort": false,
        "nested_loop": [
            {
                "table": {
                    "table_name": "p",
                    "access_type": "ref",
                    "possible_keys": [
                        "IX_upc",
                        "IX_sap"
                    ],
                    "key": "IX_sap",
                    "used_key_parts": [
                        "sap"
                    ],
                    "key_length": "48",
                    "ref": [
                        "const"
                    ],
                    "rows_examined_per_scan": 1,
                    "rows_produced_per_join": 1,
                    "filtered": "100.00",
                    "cost_info": {
                        "read_cost": "1.00",
                        "eval_cost": "0.10",
                        "prefix_cost": "1.10",
                        "data_read_per_join": "736"
                    },
                    "used_columns": [
                        "pharmacieIdBJC",
                        "upc",
                        "sap",
                        "idItem",
                        "description",
                        "coutant"
                    ]
                }
            },
            {
                "table": {
                    "table_name": "v",
                    "access_type": "ALL",
                    "possible_keys": [
                        "IX_DATE_UPC"
                    ],
                    "rows_examined_per_scan": 4207042,
                    "rows_produced_per_join": 1985946,
                    "filtered": "47.21",
                    "using_join_buffer": "hash join",
                    "cost_info": {
                        "read_cost": "253665.04",
                        "eval_cost": "198594.60",
                        "prefix_cost": "452260.74",
                        "data_read_per_join": "378M"
                    },
                    "used_columns": [
                        "id",
                        "Date",
                        "upc",
                        "quantite",
                        "coutMoyen",
                        "montantVente"
                    ],
                    "attached_condition": "((`mysql_reception`.`v`.`Date` > DATE'2021-01-01') and (`mysql_reception`.`v`.`Date` < DATE'2021-04-30') and (`mysql_reception`.`p`.`upc` = convert(`mysql_reception`.`v`.`upc` using utf8)))"
                }
            }
        ]
    }
}
}

How to solve :

Method 1

Are you deliberately leaving out January 1 and April 30? Or is the column a DATETIME, in which case you are leaving out only midnight of January 1 but all of April 30? I suggest

WHERE v.date >= '2021-01-01'
  and v.date  < '2021-01-01' + INTERVAL 3 MONTH

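That works "correctly" regardless of the datatype. (And it handles leap year, year wrap, etc.) Adjust the interval to the period you actually want; for example, INTERVAL 4 MONTH covers all of January through April.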

Add this flipped index:

INDEX(upc, date)

in case the Optimizer prefers to start with p, then move on to v.
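
As a rough sketch, the flipped index could be added as follows. The index name IX_UPC_DATE is an assumption, chosen only to mirror the existing IX_DATE_UPC. The table and column names come from the plan above, which lists the quantity column as quantite (the sample query writes quantity), so use whichever name your table actually has.

```sql
-- Composite index with upc first: for each product row found via sap,
-- MySQL can look up the matching sales by upc and then range-scan on Date.
-- IX_UPC_DATE is a hypothetical name.
ALTER TABLE ventes ADD INDEX IX_UPC_DATE (upc, `Date`);

-- Re-check the plan afterwards. Ideally table v is no longer read
-- with access_type "ALL".
EXPLAIN FORMAT=JSON
SELECT p.upc, p.description, SUM(v.quantite)
FROM produits p
INNER JOIN ventes v ON p.upc = v.upc
WHERE p.sap = '32015201'
  AND v.`Date` >= '2021-01-01'
  AND v.`Date` <  '2021-05-01'
GROUP BY p.upc, p.description;
```

If the plan still shows a full scan of v after this, the character set mismatch on upc discussed under Charset/Collation is a likely reason.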

If that does not speed it up enough, please provide SHOW CREATE TABLE.

After that, we can talk about building and maintaining a "summary table". And we can probably get it down to less than 1 second.
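
To give an idea of what such a summary table might look like, here is a minimal sketch that assumes one row per product per day is enough for the reports. The table name ventes_par_jour, its column names, and the column types are all assumptions, since SHOW CREATE TABLE has not been provided.

```sql
-- Hypothetical daily summary: one row per product per day.
-- The upc definition is a placeholder. Copy the exact type, CHARACTER SET
-- and COLLATION of produits.upc so the join needs no conversion.
CREATE TABLE ventes_par_jour (
    upc            VARCHAR(20)   NOT NULL,
    jour           DATE          NOT NULL,
    quantite_total DECIMAL(14,2),
    PRIMARY KEY (upc, jour)
);

-- Initial load from the detail table.
INSERT INTO ventes_par_jour (upc, jour, quantite_total)
SELECT upc, DATE(`Date`), SUM(quantite)
FROM ventes
WHERE upc IS NOT NULL
GROUP BY upc, DATE(`Date`);

-- Reports then read the much smaller summary table.
SELECT p.upc, p.description, SUM(s.quantite_total) AS quantite
FROM produits p
INNER JOIN ventes_par_jour s ON s.upc = p.upc
WHERE p.sap = '32015201'
  AND s.jour >= '2021-01-01'
  AND s.jour <  '2021-05-01'
GROUP BY p.upc, p.description;
```

The summary has to be refreshed as new sales arrive, for example by a scheduled job that re-aggregates the most recent days. Queries that filter on sales columns not kept in the summary would still have to go to ventes.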

Charset/Collation

p.upc = convert(mysql_reception.v.upc using utf8)

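This conversion in the plan shows that the two upc columns differ in character set. Because v.upc has to be converted before it can be compared, an index on v.upc cannot be used for the join. Wherever possible, use the same CHARACTER SET and COLLATION for any column used in a JOIN.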
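
One way to check and then align the two columns is sketched below. The VARCHAR(20) length and the utf8_general_ci collation in the ALTER statement are placeholders, not values taken from the question. Use whatever the first query reports for produits.upc.

```sql
-- Compare the type, character set and collation of the two join columns.
SELECT TABLE_NAME, COLUMN_NAME, COLUMN_TYPE, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'mysql_reception'
  AND TABLE_NAME IN ('produits', 'ventes')
  AND COLUMN_NAME = 'upc';

-- Align ventes.upc with produits.upc.
-- Placeholder definition: keep the column's real length and its other
-- attributes (NULL / NOT NULL, DEFAULT), changing only charset and collation.
ALTER TABLE ventes
    MODIFY upc VARCHAR(20) CHARACTER SET utf8 COLLATE utf8_general_ci;
```

Changing the column rewrites the table, so on millions of rows it is best run during a quiet period.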

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
