When the user selects a product in my web app, I run a query to search for related alternatives based on the product categories linked to the product via a many-to-many relationship with a category reference table.
With only 6000 products, the alternatives query takes more than 500ms, and I expect the product count to grow tenfold in the short term. Is there a better way to construct my query?
The relationship is pretty simple: I have a reference table of categories (e.g. Shorts, pants, shirts, S, M, L, XL, cotton, etc.) and each product will have one or more categories linked to it. Appropriate FK indexes are in place.
My "alternatives" query tries to maximise the number of matching categories: the more matching categories a product has, the higher its matching score. It also excludes the selected product and conflates product variants, which are stored as separate product records ( ROW_NUMBER() OVER(PARTITION… ). The conflation avoids suggesting very similar products as alternatives (e.g. the same product in a different color).
To further rank products with the same category match score, a text relevance score is calculated (MATCH AGAINST fulltext) and the results are ordered by category match count and text relevance. The fulltext matching has very little effect on query performance, but the variant conflation takes about 200ms.
At this point, I cannot change the application/data model to normalise product variants.
This is the SQL:
SELECT *
FROM (
    SELECT P.*,
           (SUM(CASE WHEN C.category_id = 3 OR C.category_id = 11 OR C.category_id = 18 THEN 1 ELSE 0 END)) AS cat_score,
           MATCH(P.name, P.description) AGAINST('blue short sleeve shirt' IN NATURAL LANGUAGE MODE) AS rel_score,
           ROW_NUMBER() OVER(PARTITION BY variant_group ORDER BY price ASC) variantIndex
    FROM RPDB.Product P
    JOIN Product_has_Category PC ON PC.product_id = P.product_id
    JOIN Category C ON C.category_id = PC.category_id
    WHERE P.product_id <> 123
      AND variant_group <> 65
    GROUP BY P.product_id
) result
WHERE variantIndex = 1
ORDER BY cat_score DESC
LIMIT 0, 5
Explain output (note that I simplified some field names in the above SQL):
I don’t care where the performance improvement comes from, so any aspect of the query that could be improved would be very welcome!
How to solve:
It is quite common for many-to-many mapping tables to be inadequately indexed. See http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table for the two indexes such a table should have and the indexes it should not have.
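As a rough illustration of that advice, the mapping table could look something like the sketch below. The column names are taken from the rewritten query further down; the data types, the index name and the assumption that the table has no separate surrogate id column are mine, so adjust them to the real schema.

```sql
-- Sketch only: column types and the index name are assumptions.
CREATE TABLE Product_has_Category (
    Product_idProduct   INT UNSIGNED NOT NULL,
    Category_idCategory INT UNSIGNED NOT NULL,
    -- Serves "which categories does this product have?"
    -- and keeps each (product, category) pair unique.
    PRIMARY KEY (Product_idProduct, Category_idCategory),
    -- Serves "which products are in this category?"
    -- and covers a lookup that filters on category and reads the product id.
    INDEX idx_category_product (Category_idCategory, Product_idProduct)
) ENGINE=InnoDB;
```

With both composite indexes in place, a lookup in either direction can usually be answered from the index alone, without an extra AUTO_INCREMENT id column on the mapping table.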
I thought through the problem again, this time starting from Product_has_Category to get the set of relevant products (those matching one or more of the categories), and then joined that set to the Product table before conflating variants and ordering by score.
This executes in ~150ms rather than 500ms.
SELECT *
FROM (
    SELECT *,
           MATCH(P.name, P.desc_short, P.desc_long) AGAINST('blue short sleeve shirt' IN NATURAL LANGUAGE MODE) AS rel_score,
           ROW_NUMBER() OVER(PARTITION BY Organization_idOrganization, variant_group ORDER BY price ASC) variantIndex
    FROM Product P
    JOIN (
        SELECT Product_idProduct,
               COUNT(Product_idProduct) AS cat_score
        FROM Product_has_Category PC
        JOIN Category C ON PC.Category_idCategory = C.idCategory
        WHERE PC.Category_idCategory IN (3, 11, 18)
        GROUP BY Product_idProduct
    ) AS R ON P.idProduct = R.Product_idProduct
    WHERE idProduct <> 123
      AND variant_group <> 65
      AND (P.price BETWEEN 0.0 AND 999999.0)
) AS PR
WHERE variantIndex = 1
ORDER BY cat_score DESC, rel_score DESC
LIMIT 0, 5
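One further simplification that might be worth trying (untested here): the inner subquery joins Category but reads nothing from it, so if a foreign key already guarantees that every Category_idCategory in Product_has_Category exists in Category, that join could probably be dropped. The sketch below shows only the derived table, and assumes Product_idProduct is NOT NULL so that COUNT(*) gives the same result.

```sql
-- Derived table R without the Category join.
-- Assumes a foreign key from Product_has_Category.Category_idCategory
-- to Category.idCategory, and a NOT NULL Product_idProduct column.
SELECT Product_idProduct,
       COUNT(*) AS cat_score
FROM Product_has_Category
WHERE Category_idCategory IN (3, 11, 18)
GROUP BY Product_idProduct;
```

If the mapping table has an index on (Category_idCategory, Product_idProduct), this subquery only needs to touch that index. Comparing EXPLAIN output before and after would confirm whether it helps on the real data.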