Optimizing nested SQLite query

All we need is an easy explanation of the problem, so here it is.

I have a SQLite table named hashbands that looks like this:

hashband | file_id | window_id
------------------------------
potato   | 0       | 0
potato   | 1       | 0

The table has 100M+ rows (lol). I need to find all hashbands that have two or more distinct file_id values, and then I need to fetch all rows that contain those hashbands and order the results by hashband. Right now I’m using this:

  WITH file_id_counts AS (
    SELECT hashband, COUNT(DISTINCT(file_id)) as count
    FROM hashbands
    GROUP BY hashband
    HAVING COUNT > 1
  ) SELECT hashband, file_id, window_id
    FROM hashbands
    WHERE hashband IN (SELECT hashband from file_id_counts)
    ORDER BY hashband

Does anyone see ways to expedite this query? Any pointers would be helpful!

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

Likely the majority of the performance issue you’re seeing is just the normal constraints of the data limitations of SQLite. Your query is of sound mind, and I don’t believe there’s much you can do to optimize it other than re-write the WHERE predicate to something more efficiently relational with an INNER JOIN like this:

WITH file_id_counts AS (
    SELECT hashband, COUNT(DISTINCT(file_id)) as count
    FROM hashbands
    GROUP BY hashband
    HAVING COUNT > 1
  ) SELECT hashband, file_id, window_id
    FROM hashbands
    INNER JOIN file_id_counts
        ON hashbands.hashband = file_id_counts.hashband
    ORDER BY hashband

This is logically equivalent and the reasoning it could be faster is because the IN clause is syntactical sugar for a bunch of OR clauses which can be less efficient than a direct single equality operator like the INNER JOIN is doing above.

Additionally, ensuring your hashbands table has an index on at least (hashband) or possibly on (hashband, file_id) should help (if one doesn’t exist already).

Finally, if possible, removing your ORDER BY clause, and instead doing the sorting in your consuming application would probably help a small amount as well. While this mostly just moves the onus of the sort to a different part of the call stack, generally sorting in the database can add a little extra complexity that sometimes can be resolved and slightly more efficient to do in the consuming application. Plus, sorting is really presentation logic in my opinion (at least when not used for a functional purpose of the query).

Where does this SQLite database live?…a mobile application, or on its own server?

Method 2

Using a window aggregate would presumably be substantially faster than the self-join that you currently have. This is available from version 3.25

Unfortunately, in SQLite, DISTINCT cannot be used in an OVER aggregate, but we can simulate it with MAX of DENSE_RANK (if file_id is nullable then it can be a little more complex)

The calculation is quite simple: DENSE_RANK will give us a ranking, but with tied results, e.g. 1,1,2,3,3,3,4. Then we use MAX to get the highest of that. PARTITION BY works like GROUP BY but for window aggregates, in other words the function is calculated separately over each partition.

WITH file_id_ranked AS (
    SELECT *,
        DENSE_RANK() OVER (PARTITION BY hashband ORDER BY file_id) rnk
    FROM hashbands
),
file_id_counts AS (
    SELECT *,
        MAX(rnk) OVER (PARTITION BY hashband) AS count
    FROM file_id_ranked
)
SELECT hashband, file_id, window_id
    FROM file_id_counts
    WHERE count > 1
    ORDER BY hashband;

I strongly recommend an index on hashband, file_id, window_id in that order.

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply