A quick way to ORDER all IDs randomly in MySQL

I need to assign random IDs to the rows of a table, so I created a mapping table:

CREATE TABLE t2
(
ID int(11) unsigned NOT NULL AUTO_INCREMENT,
SourceID int(11) unsigned NOT NULL,
UNIQUE INDEX(SourceID),
PRIMARY KEY(ID)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE utf8_general_ci KEY_BLOCK_SIZE=1

and then added the IDs from the main table t1 as

INSERT IGNORE INTO t2 (SourceID) SELECT ID FROM t1 ORDER BY RAND()

For example, imagine t1 holds the students' test results, and we do not want to reveal the student ID (t1.ID) to reviewers (for an anonymous review). We then show each record with the new ID stored in t2.

SELECT t2.ID AS NewID, t1.results FROM t1 JOIN t2 ON t1.ID=t2.SourceID

The problem is that t1 has tens of millions of rows, and ORDER BY RAND() is very slow at that size.

I do not need a perfect RAND() here; I just need to assign new IDs in a somewhat random arrangement. Can you think of an approach that makes this query faster?

How to solve :

Method 1

A possible trick:

Create an intermediate table and copy the IDs from t1 into it. Add a virtual generated column that computes a hash of the ID value, and index that column. Then use this table as the source for the insertion, sort by the indexed hash expression, and force the index. Without an index hint the optimizer may ignore the index because 100% of the rows are selected; on the other hand, it ought to be used because it is a covering index.

An example may be found there.
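
A minimal sketch of this idea, assuming MySQL 5.7 or later with InnoDB. The table name t_shuffle, the column name h, the index name idx_h, and the choice of MD5 as the hash are illustrative assumptions, not something the answer specifies.

```sql
-- Hypothetical intermediate table (names are assumptions for illustration).
CREATE TABLE t_shuffle
(
  SourceID int unsigned NOT NULL,
  -- Virtual generated column: 16-byte MD5 digest of the ID,
  -- used only as a pseudo-random sort key.
  h binary(16) AS (UNHEX(MD5(SourceID))) VIRTUAL,
  PRIMARY KEY (SourceID),
  -- An InnoDB secondary index also stores the primary key,
  -- so it covers the SELECT below.
  INDEX idx_h (h)
) ENGINE=InnoDB;

-- Copy the IDs; this is the expensive step because the index is built here.
INSERT INTO t_shuffle (SourceID) SELECT ID FROM t1;

-- Read the IDs in hash order by scanning the index, instead of ORDER BY RAND().
INSERT IGNORE INTO t2 (SourceID)
SELECT SourceID FROM t_shuffle FORCE INDEX (idx_h) ORDER BY h;

DROP TABLE t_shuffle;
```

Because MD5 is deterministic, the resulting order is repeatable and could be recomputed by anyone who knows the scheme. If that matters for the anonymity requirement, hashing CONCAT(SourceID, 'some-secret-string') instead is one way to vary it.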

If you think that BINARY(16) is too long, you can cut out part of the checksum value and convert it from a hexadecimal string to, for example, an INT. The index will of course no longer be close to unique, but as I understand it that does not matter here.

PS. Copying the data into the temporary table will of course be time-expensive, but the insertion itself should be fast. I cannot predict whether the whole process will be faster overall; test it.
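
As a rough illustration of the shorter key, the sketch below keeps only the first 8 hexadecimal digits of an MD5 checksum and converts them to an integer. The table name t_shuffle_int, the column and index names, and the use of MD5 are assumptions made for the example.

```sql
-- Hypothetical variant of the intermediate table with a 4-byte sort key.
CREATE TABLE t_shuffle_int
(
  SourceID int unsigned NOT NULL,
  -- First 8 hex digits of the MD5 digest, converted from base 16 to an integer.
  -- The maximum is 0xFFFFFFFF, so it fits INT UNSIGNED; collisions are possible.
  h int unsigned AS (CAST(CONV(LEFT(MD5(SourceID), 8), 16, 10) AS UNSIGNED)) VIRTUAL,
  PRIMARY KEY (SourceID),
  INDEX idx_h (h)
) ENGINE=InnoDB;
```

Rows whose keys collide simply end up next to each other in index order, which is acceptable when only an approximately random arrangement is needed.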

Method 2

I experimented a lot and measured the performance of several approaches; the results may help others.

The fastest way (by far) is to do the random rearrangement outside the SQL query.

SELECT ID INTO OUTFILE '/tmp/id.csv' 
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' FROM t1

then

shuf -o /tmp/id.csv < /tmp/id.csv

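and finally the fast INSERT step (with LOCAL the file is read on the client side, so this assumes the client runs on the same host where the server wrote /tmp/id.csv)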

LOAD DATA LOCAL INFILE '/tmp/id.csv' INTO TABLE t2 
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (SourceID)

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
