Best way for processing over 150 million rows MySql

All we need is an easy explanation of the problem, so here it is.

I wrote a function on nodejs to SELECT 150 million records, process the data and UPDATE those same 150 million records using

UPDATE 'table' SET value1='somevalue' WHERE idRo2 = 1;

Single Update for each row concatenate send one single query string allowing multipleStatements querys on the db connection.

I’ve being encountering multiple errors as

  • Error: Connection lost: The server closed the connection.
  • JavaScript heap out of memory.
  • RangeError: Invalid string length.
  • Killed process.

I think i might not be using the right technologies or programing technique

*Edit:

The data process i need to do is to take the ‘wholeApiResponse’ column value, parse that string and then insert the parsed values into new added columns (askPrice, lastPrice, lowPrice, highPrice, volumePrice, platformTimestamp), this end up modifying the existing row by adding new values from an existing string.
Best way for processing over 150 million rows MySql

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

Big Updates (and Deletes) are problematic.

Plan A: Avoid the big update. If value1 should always be ‘somevalue’ when idRo2 = 1, then don’t store it in the table; store it elsewhere and use a JOIN. Then, instead of checking 150M rows, you are changing exactly 1 row.

Plan B: Do the Update in chunks of 1K rows at a time. This avoids timeouts, and a number of other potential problems. Details: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks . Even 10K can be problematic, hence my recommendation of 1K. Anyway, going beyond 1K is getting into "diminishing returns".

Plan C: Tells us what value1 and idRo2 are really called. That might lead to some specific suggestions.

Minor issues with datatypes. Shrinking the table size will help performance some.

  • INT SIGNED has a limit of 2 billion. 150M is getting kinda close to it. Keep your eye on any AUTO_INCREMENTs.
  • Similarly, INT for idCurrencyPair may be wasting space. (INT takes 4 bytes; there are smaller datatypes)
  • DOUBLE can lead to rounding errors. And it take 8 bytes. Consider DECIMAL(...).
  • platformTimestamp DOUBLE — Huh?
  • I’ll bet that LONGTEXT is rarely used? We should discuss better ways to store it.

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply