I need to assign random IDs to a table, so I created a mapping table:
CREATE TABLE t2 (
  ID int(11) unsigned NOT NULL AUTO_INCREMENT,
  SourceID int(11) unsigned NOT NULL,
  UNIQUE INDEX (SourceID),
  PRIMARY KEY (ID)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE utf8_general_ci KEY_BLOCK_SIZE=1
and then filled it with the IDs from the main table:
INSERT IGNORE INTO t2 (SourceID) SELECT ID FROM t1 ORDER BY RAND()
For example, imagine t1 holds students' test results, and we do not want to reveal the student ID (t1.ID) to reviewers (for anonymous review). Then we show each record with the new ID stored in t2:
SELECT t2.ID AS NewID, t1.results FROM t1 JOIN t2 ON t1.ID=t2.SourceID
The problem is that t1 has tens of millions of rows, and ORDER BY RAND() is very slow. I do not need a perfect random order here; I just need new IDs assigned in a somewhat random arrangement. Can you think of a way to make this query faster?
How to solve:
Create an intermediate table and copy the IDs from t1 into it. Add a virtual generated column that computes some hash of the ID value, and index it. Use this table as the source for the insertion, sort by the indexed expression, and force the index with a hint (without the hint it may be ignored, since 100% of the rows are selected; on the other hand, it should be usable because the index is covering).
An example may be found there.
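For illustration, here is a minimal sketch of this approach, assuming MySQL 5.7+ (which supports indexes on virtual generated columns). The table and index names t1_ids and idx_hash are my own, and MD5 stands in for "some hash":

```sql
-- Hypothetical intermediate table; the MD5 of the ID acts as a pseudo-random key.
CREATE TABLE t1_ids (
  ID int(11) unsigned NOT NULL PRIMARY KEY,
  Hash binary(16) AS (UNHEX(MD5(ID))) VIRTUAL,
  INDEX idx_hash (Hash)
) ENGINE=InnoDB;

INSERT INTO t1_ids (ID) SELECT ID FROM t1;

-- An InnoDB secondary index also contains the primary key, so it is covering;
-- force it so the rows stream out in hash (i.e. pseudo-random) order.
INSERT IGNORE INTO t2 (SourceID)
SELECT ID FROM t1_ids FORCE INDEX (idx_hash)
ORDER BY Hash;
```

Because the hash is deterministic, re-running the insert produces the same order; that is acceptable here since the goal is only to decouple the new IDs from the original ones.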
If you think that BINARY(16) is too long, you may cut out part of the checksum value and convert it from a hexadecimal string to, for example, an INT. Of course the index will then no longer be close to unique, but that does not matter here, as I understand it.
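For example, a 32-bit slice of the MD5 converted to an unsigned integer could look like this (the column and index names are my own; with only 32 bits, collisions become possible, which is why the close-to-unique property is lost):

```sql
-- Take the first 8 hex digits of the MD5 (32 bits) and store them as an integer.
ALTER TABLE t1_ids
  ADD COLUMN HashInt int unsigned
    AS (CAST(CONV(LEFT(MD5(ID), 8), 16, 10) AS UNSIGNED)) VIRTUAL,
  ADD INDEX idx_hashint (HashInt);
```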
PS. It will, of course, be time-expensive at the stage of copying the data into the intermediate table, but the insertion itself should be fast. I cannot predict whether the total process will be faster overall – test it.
I did a lot of experimentation checking the performance; it may help others. The fastest way (by far) is to do the random rearrangement outside the SQL query:
SELECT ID INTO OUTFILE '/tmp/id.csv' FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' FROM t1
shuf -o /tmp/id.csv < /tmp/id.csv
and finally the fast load:
LOAD DATA LOCAL INFILE '/tmp/id.csv' INTO TABLE t2 FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (SourceID)
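As a sanity check, the shuffle step can be verified on its own: shuf only permutes lines, so the shuffled file must contain exactly the same IDs as the input. A minimal sketch with a stand-in ID file (the file names and the seq stand-in for SELECT ... INTO OUTFILE are my own):

```shell
# Generate a stand-in ID file (in the real workflow this comes from INTO OUTFILE).
seq 1 100000 > /tmp/id.csv

# Shuffle into a new file instead of overwriting, so we can compare.
shuf -o /tmp/id_shuf.csv /tmp/id.csv

# Sorting both files back must yield identical contents: no IDs lost or duplicated.
sort -n /tmp/id.csv > /tmp/id_sorted.csv
sort -n /tmp/id_shuf.csv > /tmp/id_shuf_sorted.csv
cmp -s /tmp/id_sorted.csv /tmp/id_shuf_sorted.csv && echo "same set"
```

Writing the shuffled output to a separate file (rather than `shuf -o file < file` as above) also avoids any doubt about reading and writing the same file at once.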