I need to assign random IDs to a table. Thus, I create a mapping table:
CREATE TABLE t2
(
    ID int(11) unsigned NOT NULL AUTO_INCREMENT,  -- the new, anonymized ID
    SourceID int(11) unsigned NOT NULL,           -- the original t1.ID
    UNIQUE INDEX(SourceID),
    PRIMARY KEY(ID)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE utf8_general_ci KEY_BLOCK_SIZE=1
and then add the IDs from the main table t1:
INSERT IGNORE INTO t2 (SourceID) SELECT ID FROM t1 ORDER BY RAND()
For example, imagine t1 holds students' test results, and we do not want to reveal the student ID (t1.ID) to reviewers (for an anonymous review). Then we show each record with the new ID stored in t2:
SELECT t2.ID AS NewID, t1.results FROM t1 JOIN t2 ON t1.ID=t2.SourceID
The problem is that t1 has tens of millions of rows, and ORDER BY RAND() is very slow. I do not need a perfect RAND() here; I just need to assign new IDs, arranged somehow randomly. Can you think of a faster approach?
How to solve:
Method 1
A possible trick: create an intermediate table and copy the IDs from t1 into it. Add a virtual generated column that computes some hash of the ID value, and index it. Use this table as the source for the insertion, sort by the indexed expression, and force the index (without an index hint it may be ignored, since 100% of the rows are selected; on the other hand, it should be used, because it is covering).
An example is sketched below.
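Here is a minimal sketch of that idea, assuming the t1 and t2 definitions from the question; the table name t1_shuffle and index name idx_hash are illustrative placeholders. Indexed virtual generated columns require MySQL 5.7 or later.

CREATE TABLE t1_shuffle
(
    ID int(11) unsigned NOT NULL,
    -- virtual generated column: 16-byte MD5 digest of the ID
    Hash BINARY(16) AS (UNHEX(MD5(ID))) VIRTUAL,
    -- (Hash, ID) makes the index covering for the SELECT below
    INDEX idx_hash (Hash, ID)
) ENGINE=InnoDB;

INSERT INTO t1_shuffle (ID) SELECT ID FROM t1;

-- reading in index order yields a pseudo-random order without RAND()
INSERT IGNORE INTO t2 (SourceID)
SELECT ID FROM t1_shuffle FORCE INDEX (idx_hash)
ORDER BY Hash;

Since MD5 scatters consecutive IDs across the whole digest range, scanning the index in order produces an effectively shuffled sequence without evaluating RAND() for every row.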
If you think that BINARY(16) is too long, you can cut out part of the checksum value and convert it from a hexadecimal string to, for example, an INT. Of course the index will then no longer be close to unique, but that does not matter here, as I understand it.
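For instance, in the hypothetical t1_shuffle sketch above, the generated column could keep only the first 8 hex digits (4 bytes) of the digest:

-- collisions become possible, but rough scattering is enough here
Hash int unsigned AS (CAST(CONV(LEFT(MD5(ID), 8), 16, 10) AS UNSIGNED)) VIRTUAL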
PS. The stage of copying the data into the intermediate table will, of course, be time-expensive, but the insertion itself should be fast. I cannot predict whether the total process will be faster; test it.
Method 2
I did a lot of experimentation to check the performance; it may help others.
The fastest way (by far) is to do the random rearrangement outside the SQL query. First, export the IDs:
SELECT ID FROM t1 INTO OUTFILE '/tmp/id.csv'
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
then shuffle the file in place (shuf reads its whole input before writing, so reusing the same file is safe):
shuf -o /tmp/id.csv < /tmp/id.csv
and finally run the fast INSERT step:
LOAD DATA LOCAL INFILE '/tmp/id.csv' INTO TABLE t2
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (SourceID)
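As a quick sanity check after the load, the row counts of the two tables should match; LOAD DATA inserts the rows in file order, so the AUTO_INCREMENT ID pairs each SourceID with an effectively random NewID. A minimal check:

SELECT (SELECT COUNT(*) FROM t1) AS t1_rows,
       (SELECT COUNT(*) FROM t2) AS t2_rows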
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.