All we need is an easy explanation of the problem, so here it is.
I have a SQLite table named
hashbands that looks like this:
hashband | file_id | window_id ------------------------------ potato | 0 | 0 potato | 1 | 0
The table has 100M+ rows (lol). I need to find all hashbands that have two or more distinct file_id values, and then I need to fetch all rows that contain those hashbands and order the results by hashband. Right now I’m using this:
WITH file_id_counts AS ( SELECT hashband, COUNT(DISTINCT(file_id)) as count FROM hashbands GROUP BY hashband HAVING COUNT > 1 ) SELECT hashband, file_id, window_id FROM hashbands WHERE hashband IN (SELECT hashband from file_id_counts) ORDER BY hashband
Does anyone see ways to expedite this query? Any pointers would be helpful!
How to solve :
I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.
Likely the majority of the performance issue you’re seeing is just the normal constraints of the data limitations of SQLite. Your query is of sound mind, and I don’t believe there’s much you can do to optimize it other than re-write the
WHERE predicate to something more efficiently relational with an
INNER JOIN like this:
WITH file_id_counts AS ( SELECT hashband, COUNT(DISTINCT(file_id)) as count FROM hashbands GROUP BY hashband HAVING COUNT > 1 ) SELECT hashband, file_id, window_id FROM hashbands INNER JOIN file_id_counts ON hashbands.hashband = file_id_counts.hashband ORDER BY hashband
This is logically equivalent and the reasoning it could be faster is because the
IN clause is syntactical sugar for a bunch of
OR clauses which can be less efficient than a direct single equality operator like the
INNER JOIN is doing above.
Additionally, ensuring your
hashbands table has an index on at least
(hashband) or possibly on
(hashband, file_id) should help (if one doesn’t exist already).
Finally, if possible, removing your
ORDER BY clause, and instead doing the sorting in your consuming application would probably help a small amount as well. While this mostly just moves the onus of the sort to a different part of the call stack, generally sorting in the database can add a little extra complexity that sometimes can be resolved and slightly more efficient to do in the consuming application. Plus, sorting is really presentation logic in my opinion (at least when not used for a functional purpose of the query).
Where does this SQLite database live?…a mobile application, or on its own server?
Using a window aggregate would presumably be substantially faster than the self-join that you currently have. This is available from version 3.25
Unfortunately, in SQLite,
DISTINCT cannot be used in an
OVER aggregate, but we can simulate it with
file_id is nullable then it can be a little more complex)
The calculation is quite simple:
DENSE_RANK will give us a ranking, but with tied results, e.g. 1,1,2,3,3,3,4. Then we use
MAX to get the highest of that.
PARTITION BY works like
GROUP BY but for window aggregates, in other words the function is calculated separately over each partition.
WITH file_id_ranked AS ( SELECT *, DENSE_RANK() OVER (PARTITION BY hashband ORDER BY file_id) rnk FROM hashbands ), file_id_counts AS ( SELECT *, MAX(rnk) OVER (PARTITION BY hashband) AS count FROM file_id_ranked ) SELECT hashband, file_id, window_id FROM file_id_counts WHERE count > 1 ORDER BY hashband;
I strongly recommend an index on
hashband, file_id, window_id in that order.
Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂