Should I store data pre-ordered rather than ordering on the fly?


I’m using MySQL and I’m wondering whether it’s a good strategy to presort my data so that, when a user accesses the information, it doesn’t have to be sorted on the fly.

Basically, I have an HTML table which is populated with paginated data from the database. The data is ordered by a particular column and can sometimes be a little sluggish. I was thinking about reordering the table on a nightly basis so the ORDER BY can be removed from the query.

Is this general practice or should I avoid this?

Update

My query is as follows:

'select keyword, position, impressions, clicks, ctr 
 from keywords where profile_id=%s
 order by impressions desc limit %s, %s', (profile_id, start, end))

My table looks like this:

+---------------------+---------------+------+-----+---------+----------------+
| Field               | Type          | Null | Key | Default | Extra          |
+---------------------+---------------+------+-----+---------+----------------+
| id                  | bigint(20)    | NO   | PRI | NULL    | auto_increment |
| profile_id          | int(11)       | YES  | MUL | NULL    |                |
| landing_page_id     | int(11)       | YES  | MUL | NULL    |                |
| keyword             | varchar(2083) | YES  |     | NULL    |                |
| position            | int(11)       | YES  | MUL | NULL    |                |
| impressions         | int(11)       | YES  | MUL | NULL    |                |
| ctr                 | float         | YES  | MUL | NULL    |                |
| clicks              | int(11)       | YES  | MUL | NULL    |                |
| unique_key          | varchar(200)  | YES  | UNI | NULL    |                |
| position_30_days    | int(11)       | YES  |     | NULL    |                |
| impressions_30_days | int(11)       | YES  |     | NULL    |                |
| clicks_30_days      | int(11)       | YES  |     | NULL    |                |
| ctr_30_days         | float         | YES  |     | NULL    |                |
| position_60_days    | int(11)       | YES  |     | NULL    |                |
| impressions_60_days | int(11)       | YES  |     | NULL    |                |
| clicks_60_days      | int(11)       | YES  |     | NULL    |                |
| ctr_60_days         | float         | YES  |     | NULL    |                |
| position_90_days    | int(11)       | YES  |     | NULL    |                |
| impressions_90_days | int(11)       | YES  |     | NULL    |                |
| clicks_90_days      | int(11)       | YES  |     | NULL    |                |
| ctr_90_days         | float         | YES  |     | NULL    |                |
+---------------------+---------------+------+-----+---------+----------------+

How to solve:


Method 1

Storing the data in an ordered way may be useful in some rare cases, but it doesn’t guarantee that the selected rows will be returned in that order. You still have to use ORDER BY to guarantee the order of the returned rows.

Is it general practice? I don’t think so.

Should I avoid this? Yes, at least for this specific case.

Alternative solution:

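To make this query run faster and reduce the sorting work, add a composite index on profile_id ASC and impressions DESC:

ALTER TABLE keywords ADD INDEX (profile_id ASC, impressions DESC);

Note that MySQL versions before 8.0 parse DESC in an index definition but ignore it, so the index is stored ascending. It still serves this query, because the index can be scanned backward for a single profile_id.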

IMPORTANT: drop the other index on profile_id, since the composite index makes it redundant. (The name of the index to drop is displayed if you run “SHOW CREATE TABLE keywords”.)

Other factors that could affect the performance:

  • The cardinality, or data distribution. For example, some profiles may have many more entries than others. A useful way to check that is:

    `SELECT profile_id, count(*) cc FROM keywords GROUP BY profile_id ORDER BY cc ASC limit 10;`
    `SELECT profile_id, count(*) cc FROM keywords GROUP BY profile_id ORDER BY cc DESC limit 10;`
    

    If the numbers are hugely different, the same query may vary in performance based on the number of rows a profile has.

  • If a profile has a huge number of rows, LIMIT x, y gets gradually slower as x (the offset) gets higher.

Method 2

(DESCRIBE is not as descriptive as SHOW CREATE TABLE; we can’t see what indexes you have.)

That one SELECT would benefit from this ‘composite’ index:

INDEX(profile_id, impressions) -- in that order.

Do you have a keyword that is 2083 characters long? If not, why have such a big VARCHAR?

Why have both unique_key and id? Is unique_key some form of UUID? They are notoriously inefficient when the table gets huge.

LIMIT ?, ? ... ($start, $end) — The two numbers in LIMIT are start and count, not end.
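
Since the second number is a row count, the page arithmetic can be sketched as follows. This is a minimal Python sketch; `limit_args`, `page`, and `per_page` are hypothetical names that do not appear in the original code.

```python
def limit_args(page, per_page):
    """Return (offset, row_count) for MySQL's `LIMIT offset, row_count`.

    `page` is 1-based. The function and parameter names are illustrative
    assumptions, not part of the asker's application.
    """
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    offset = (page - 1) * per_page
    # The second value is how many rows to return, not the end position.
    return offset, per_page


# Page 3 with 25 rows per page skips 50 rows and returns the next 25.
start, count = limit_args(3, 25)
```

The pair would then replace `(start, end)` in the query parameters, so that every page returns the same number of rows.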

By using the index above and changing to “remember where you left off”, you can make the ORDER BY ... LIMIT work a lot faster. This suggestion, if practical for your application, will be faster (at least after the first ‘page’) than your original idea of pre-ordering the data could ever be. Why? Because OFFSET (the first number in LIMIT) requires work: the server has to read and discard all the skipped rows. My blog shows how to get rid of that work.

More

When you could have multiple rows with the same value, and you need to be deterministic in ordering:

ORDER BY profile_id DESC, impressions DESC, id DESC
INDEX   (profile_id,      impressions,      id)

Notes:

  • In the ORDER BY, all the items are in the same direction (DESC is usually what is wanted).
  • Mixing ASC and DESC prevents use of the index (until 8.0).
  • Since you are looking for a single profile_id, ASC and DESC on it have identical effect.

To deal with a ‘compound’ $leftoff, let’s look at the above example. After assuming that profile_id is constant, we want to somehow remember where you left off as a pair of $impression, $id, then do

WHERE   impressions <= $impression
  AND ( impressions <  $impression OR id < $id )

Alternatively (it is unclear whether these optimize differently in different versions of MySQL):

WHERE ( impressions = $impression AND id < $id
     OR impressions < $impression )
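
To make the “remember where you left off” idea concrete, here is a small runnable sketch. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for MySQL, and a cut-down `keywords` table with made-up rows; `fetch_page`, `leftoff`, and the index name are hypothetical. Only the shape of the WHERE and ORDER BY clauses is meant to carry over.

```python
import sqlite3

# In-memory SQLite stands in for MySQL (an assumption made for runnability).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE keywords ("
    " id INTEGER PRIMARY KEY, profile_id INTEGER, impressions INTEGER)"
)
conn.execute(
    "CREATE INDEX idx_profile_impr ON keywords (profile_id, impressions, id)"
)
# Made-up sample rows (id, profile_id, impressions) with deliberate ties.
rows = [(1, 7, 50), (2, 7, 50), (3, 7, 40), (4, 7, 40), (5, 7, 10), (6, 8, 99)]
conn.executemany("INSERT INTO keywords VALUES (?, ?, ?)", rows)


def fetch_page(conn, profile_id, per_page, leftoff=None):
    """Return one page ordered by impressions DESC, id DESC.

    `leftoff` is the (impressions, id) pair of the last row on the
    previous page, or None for the first page. No OFFSET is used.
    """
    sql = "SELECT id, impressions FROM keywords WHERE profile_id = ?"
    params = [profile_id]
    if leftoff is not None:
        impressions, last_id = leftoff
        # Same predicate as the first WHERE form above.
        sql += " AND impressions <= ? AND (impressions < ? OR id < ?)"
        params += [impressions, impressions, last_id]
    sql += " ORDER BY impressions DESC, id DESC LIMIT ?"
    params.append(per_page)
    return conn.execute(sql, params).fetchall()


page1 = fetch_page(conn, 7, 2)
last_id, last_impressions = page1[-1]
page2 = fetch_page(conn, 7, 2, leftoff=(last_impressions, last_id))
```

Including `id` as a tie-breaker is what keeps rows with equal impressions from being repeated or skipped across pages.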

Method 3

You can’t remove the ORDER BY from the query. SQL is set-based, and a set is unordered. What you could do, however, if you can’t optimise your query, is to use MySQL’s “poor man’s materialized view”: a secondary table that is regularly updated from your primary table.

How often your secondary table needs to be updated, depends on how often the data in your primary table changes and how quickly you need that change to be reflected on your web page.

DROP TABLE IF EXISTS my_secondary_table;

CREATE TABLE my_secondary_table
SELECT *
FROM my_primary_table
ORDER BY my_ordering;

First you drop the secondary table if needed, then you recreate it by copying your primary table.

Now you need to schedule this to happen once a night or once an hour, depending on your needs.
Note that the secondary table is inaccessible while it is dropped and created again.
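
One possible way to avoid that window, which is not part of the original answer, is to build the copy under a temporary name and swap it in with MySQL's `RENAME TABLE`, which renames several tables in one atomic operation. The sketch below only builds the statements; `rebuild_statements` and the `_new` / `_old` suffixes are hypothetical, and the table names are assumed to be trusted, hard-coded identifiers.

```python
def rebuild_statements(primary, secondary, ordering):
    """Build MySQL statements that rebuild `secondary` and swap it in.

    Identifiers are interpolated directly, so this assumes trusted,
    hard-coded names; the sketch does no escaping.
    """
    new, old = f"{secondary}_new", f"{secondary}_old"
    return [
        # Clean up leftovers from a previously failed run.
        f"DROP TABLE IF EXISTS {new}, {old}",
        # Make sure the swap below has something to rename on the first run.
        f"CREATE TABLE IF NOT EXISTS {secondary} LIKE {primary}",
        # Build the fresh copy under a temporary name.
        f"CREATE TABLE {new} SELECT * FROM {primary} ORDER BY {ordering}",
        # Swap old and new in a single RENAME TABLE statement.
        f"RENAME TABLE {secondary} TO {old}, {new} TO {secondary}",
        f"DROP TABLE {old}",
    ]


statements = rebuild_statements(
    "my_primary_table", "my_secondary_table", "my_ordering"
)
```

Each statement would be executed in order by whatever scheduler runs the nightly or hourly job. Readers keep seeing the old copy until the rename completes.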

You can specify your columns and create a primary key for your secondary table that aligns with your ordering. Since MySQL stores the data of a table “behind” its primary key, retrieval in that order should be quite fast.
If you cannot use any (combination) of your columns as the primary key, you can emulate a row numbering in your select and use that as the primary key, as follows.

CREATE TABLE my_secondary_table (pk INT, PRIMARY KEY(pk))
SELECT
   @rownum := @rownum + 1 AS pk,
   mpt.*
FROM (
   SELECT *
   FROM my_primary_table
   ORDER BY my_ordering
) mpt
CROSS JOIN (SELECT @rownum := 0) rn;


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
