Percentage-based scan on time-series data


I have a 20 GB CSV file with 650 million rows. The data looks like this:

trade_time,price
2020-01-01 00:00:01.481,7189.42
2020-01-01 00:00:01.708,7189.42
2020-01-01 00:00:06.290,7189.5
2020-01-01 00:00:06.291,7190.52
2020-01-01 00:00:07.161,7188.97
2020-01-01 00:00:08.274,7189.93
2020-01-01 00:00:09.277,7190.47
2020-01-01 00:00:09.384,7190.47
2020-01-01 00:00:09.630,7190.11
2020-01-01 00:00:09.848,7189.74
2020-01-01 00:00:10.098,7189.46
2020-01-01 00:00:10.197,7189.16
2020-01-01 00:00:10.351,7189.1

I would like to check whether price is up or down by 0.5%. If the price hits +0.5% first, the result is 1. If the price hits -0.5% first, then the result is 0.
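As a quick sanity check on what this rule means in absolute prices, here is a minimal Python sketch that computes the two trigger levels for a starting price. The function name `thresholds` and its `pct` parameter are illustrative and not part of the original post.

```python
def thresholds(start_price, pct=0.005):
    """Return (lower, upper) trigger prices for a move of -pct / +pct from start_price."""
    upper = start_price * (1 + pct)  # reaching this level first gives result 1
    lower = start_price * (1 - pct)  # reaching this level first gives result 0
    return lower, upper


# First price in the sample data above.
lower, upper = thresholds(7189.42)
print(f"lower={lower:.2f} upper={upper:.2f}")
```

For the first sample price of 7189.42 the levels come out at roughly 7153.47 and 7225.37, so none of the sample rows shown above would trigger either side.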

At the moment, I’m using this Python solution. If a database can perform better for my use case, then I would like to move to a database solution. I don’t have a database yet; I only have the CSV file.

I’m worried a disk-based solution will be slow. I’m not seeking persistence here. I’m only looking for a way to finish my task faster. After finishing the task, I don’t need the database. So even an in-memory solution is okay for my use case.

  1. Since my task involves timeseries data, which database is better for my use case? SQL or NoSQL?
  2. Is it really possible to do percentage based comparison in databases? e.g. +0.5%
  3. How long does the indexing process usually take for 1 billion rows?

db<>fiddle can be found here.

CREATE TABLE Trades (
    trade_time datetime(3)  NOT NULL PRIMARY KEY,
    price      NUMERIC(7,2) NOT NULL
);

INSERT INTO Trades(trade_time,price) VALUES
 ('2020-01-01 00:00:01.481',7189.42)
,('2020-01-01 00:00:01.708',7189.42)
,('2020-01-01 00:00:06.290',7189.5)
,('2020-01-01 00:00:06.291',7190.52)
,('2020-01-01 00:00:07.161',7188.97)
,('2020-01-01 00:00:08.274',7189.93)
,('2020-01-01 00:00:09.277',7190.47)
,('2020-01-01 00:00:09.384',7190.47)
,('2020-01-01 00:00:09.630',7190.11)
,('2020-01-01 00:00:09.848',7189.74);

CSV Import to MySQL

LOAD DATA INFILE 'trades.csv' INTO TABLE Trades
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(trade_time, price);

To clarify, I have 5-minute candle data. Each month has 8640 entries. I’m checking 2 years, so there will be 207,360 entries. I need to take the timestamp from each of these 207,360 entries and then look in the Trades table to check whether the price goes up or down from that point.

I’m not checking all 650 million rows. I’m checking up/down only for each 5-minute candlestick. Those 650M rows are tick data. A single 5-minute candlestick can contain 10k trades, or even 100k trades if there is momentum during that period. I actually have two tables: 1) 5-minute candles and 2) Trades. I have only 200K records in the 5-minute candles table.

Here are the steps:

  1. Get a record from 5 minute candle table.
  2. Find the nearest timestamp in 650M table.
  3. Loop until the price hits +0.5% or -0.5%. Whichever is hit first, record the result as either 1 or 0.

Up/down by 0.5% is relative to the starting value, and only future values are considered. Here the starting value is the price at the "nearest timestamp" to the 5-minute record's timestamp. For more details, check the code found in my SO question as linked before.
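The three steps can be sketched in plain Python as follows. This is only an illustration under stated assumptions: ticks are held in two parallel, time-sorted lists (timestamps as epoch milliseconds), "nearest" means the closest tick on either side with ties going to the earlier one, and a price exactly at the level counts as a hit. The names `nearest_index` and `first_hit` are hypothetical and are not taken from the linked code.

```python
from bisect import bisect_left


def nearest_index(times, t):
    """Index of the tick whose timestamp is closest to t (times must be sorted)."""
    i = bisect_left(times, t)
    if i == 0:
        return 0
    if i == len(times):
        return len(times) - 1
    # Pick the closer neighbour; ties go to the earlier tick.
    return i if times[i] - t < t - times[i - 1] else i - 1


def first_hit(times, prices, candle_time, pct=0.005):
    """Return 1 if +pct is hit first, 0 if -pct is hit first, None if neither."""
    i = nearest_index(times, candle_time)       # step 2: nearest tick
    start = prices[i]
    upper, lower = start * (1 + pct), start * (1 - pct)
    for j in range(i + 1, len(prices)):         # step 3: only future ticks count
        if prices[j] >= upper:
            return 1
        if prices[j] <= lower:
            return 0
    return None


# Hypothetical ticks: epoch milliseconds and prices.
times = [0, 1000, 2000, 3000, 4000]
prices = [100.0, 100.2, 99.9, 100.6, 99.0]
print(first_hit(times, prices, candle_time=100))
```

The binary search makes step 2 logarithmic in the number of ticks; the forward scan in step 3 is linear in how long the price takes to move, which is the part a database would also have to do.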

How to solve:


Method 1

(Discussing an implementation using MySQL)

CREATE TABLE Trades (
    trade_time DATETIME(3) NOT NULL, -- millisecond resolution
    price DECIMAL(8,2) NOT NULL,     -- or possibly something else
    PRIMARY KEY(trade_time)          -- UNIQUE and an INDEX
) ENGINE=InnoDB;

You seem to be concerned about only one stock, correct? (If not then PRIMARY KEY (stock_id, trade_time) in that order.)

($start and $start_ts come from your app that is building and running the query. They need to hold the beginning price and time.)

SELECT trade_time, price
     FROM Trades
     WHERE trade_time > $start_ts
       AND ABS(price - $start) / $start > 0.005  -- up or down .5%
     ORDER BY trade_time ASC
     LIMIT 1                                     -- stop at the first such row

to find the next time when the price has moved by 0.5%. Similarly, use < and DESC to find the previous time.

SELECT trade_time, price,
       price > $start AS result   -- 1 when TRUE (up); 0 when FALSE (down)
     FROM Trades
     WHERE trade_time > $start_ts
       AND ( (price - $start) / $start >  0.005  -- up .5%
          OR (price - $start) / $start < -0.003  -- down .3% (use -0.005 for a symmetric -0.5%)
           )
     ORDER BY trade_time ASC
     LIMIT 1

Because of the indexing and clustering, this query will typically take less than 1 ms. (Of course, if the price takes a month to move by .5%, the query will take much longer, since every row in between has to be scanned.)
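To try the query shape without setting up MySQL, the sketch below runs the same first-crossing search against an in-memory SQLite database from Python's standard library. SQLite is a stand-in chosen only so the example is self-contained; the sample rows, the `first_move` helper, and the named parameters are illustrative assumptions, and timings on SQLite say nothing about MySQL/InnoDB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ISO-formatted text timestamps sort chronologically, so TEXT works as the key here.
conn.execute("CREATE TABLE Trades (trade_time TEXT PRIMARY KEY, price REAL NOT NULL)")
conn.executemany(
    "INSERT INTO Trades VALUES (?, ?)",
    [  # hypothetical ticks, not the real data
        ("2020-01-01 00:00:01.000", 100.0),
        ("2020-01-01 00:00:02.000", 100.2),
        ("2020-01-01 00:00:03.000", 99.8),
        ("2020-01-01 00:00:04.000", 100.6),
        ("2020-01-01 00:00:05.000", 99.0),
    ],
)

QUERY = """
SELECT trade_time, price, price > :start AS result
FROM Trades
WHERE trade_time > :start_ts
  AND ABS(price - :start) / :start > :pct
ORDER BY trade_time ASC
LIMIT 1
"""


def first_move(conn, start_ts, start, pct=0.005):
    """Return (trade_time, price, result) for the first move beyond pct, or None."""
    params = {"start_ts": start_ts, "start": start, "pct": pct}
    return conn.execute(QUERY, params).fetchone()


row = first_move(conn, "2020-01-01 00:00:01.000", 100.0)
print(row)
```

Binding `$start` and `$start_ts` as parameters rather than pasting them into the SQL string keeps the query text constant across the 207,360 lookups.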

A 5-minute candle (with just lo, avg, hi) is probably done via:

SELECT  MIN(price) AS lo,
        AVG(price) AS avg,
        MAX(price) AS hi
    FROM Trades
    WHERE trade_time >= $start_ts
      AND trade_time  < $start_ts + INTERVAL 5 MINUTE

The candle query should be run once every 5 minutes and saved in another table with 4 columns: trade_time, lo, avg, hi. It will take time to catch up with the data you already have, but future data should be computed on a 5-minute EVENT or cron.
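For comparison with the Python side of the question, the same lo/avg/hi aggregation can be sketched without a database. This is a minimal illustration that assumes the ticks fit in memory as `(datetime, price)` pairs; the function name `five_minute_candles` is hypothetical.

```python
from datetime import datetime
from statistics import mean


def five_minute_candles(ticks):
    """Group (timestamp, price) ticks into 5-minute buckets of (lo, avg, hi)."""
    buckets = {}
    for ts, price in ticks:
        # Floor the timestamp to the start of its 5-minute window.
        start = ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)
        buckets.setdefault(start, []).append(price)
    return {
        start: (min(prices), mean(prices), max(prices))
        for start, prices in sorted(buckets.items())
    }
```

Each key is the window start, matching the `trade_time >= $start_ts AND trade_time < $start_ts + INTERVAL 5 MINUTE` range used in the SQL version.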

It will take time to get the PRIMARY KEY in place. But once in place, new data will arrive just as fast as before. It may take hour(s) to add the PK.


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
