# Percentage-based scan on time-series data

## The problem

I have a 20 GB CSV file with 650 million rows. The data looks like this:

``````
trade_time,price
2020-01-01 00:00:01.481,7189.42
2020-01-01 00:00:01.708,7189.42
2020-01-01 00:00:06.290,7189.5
2020-01-01 00:00:06.291,7190.52
2020-01-01 00:00:07.161,7188.97
2020-01-01 00:00:08.274,7189.93
2020-01-01 00:00:09.277,7190.47
2020-01-01 00:00:09.384,7190.47
2020-01-01 00:00:09.630,7190.11
2020-01-01 00:00:09.848,7189.74
2020-01-01 00:00:10.098,7189.46
2020-01-01 00:00:10.197,7189.16
2020-01-01 00:00:10.351,7189.1
``````

I would like to check whether the price moves up or down by 0.5% first. If the price hits +0.5% first, the result is 1. If the price hits -0.5% first, the result is 0.

At the moment, I’m using this Python solution. If a database can perform better for my use case, I would like to move to a database solution. I don’t have a database yet; I have only the CSV file.

I’m worried a disk-based solution will be slow. I’m not seeking persistence here; I’m only looking for a way to finish my task faster. After finishing the task, I don’t need the database, so even an in-memory solution is okay for my use case.

1. Since my task involves timeseries data, which database is better for my use case? SQL or NoSQL?
2. Is it really possible to do percentage based comparison in databases? e.g. +0.5%
3. How long does the indexing process usually take for 1 billion rows?

db<>fiddle can be found here.

``````
CREATE TABLE Trades (
trade_time datetime(3)  NOT NULL PRIMARY KEY,
price      NUMERIC(7,2) NOT NULL
);

INSERT INTO Trades (trade_time, price) VALUES
('2020-01-01 00:00:01.481',7189.42)
,('2020-01-01 00:00:01.708',7189.42)
,('2020-01-01 00:00:06.290',7189.5)
,('2020-01-01 00:00:06.291',7190.52)
,('2020-01-01 00:00:07.161',7188.97)
,('2020-01-01 00:00:08.274',7189.93)
,('2020-01-01 00:00:09.277',7190.47)
,('2020-01-01 00:00:09.384',7190.47)
,('2020-01-01 00:00:09.630',7190.11)
,('2020-01-01 00:00:09.848',7189.74);
``````

CSV Import to MySQL

``````
LOAD DATA INFILE 'trades.csv' INTO TABLE Trades
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
``````
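Since persistence is not needed, the same import can also be tried against an in-memory database. The following is only a minimal sketch, assuming SQLite (through Python's standard `sqlite3` module) as a stand-in for MySQL; the function name `load_trades` and the three-row sample are hypothetical, and timestamps are stored as text, which sorts chronologically only because the format is fixed-width.

```python
import csv
import io
import sqlite3


def load_trades(csv_file, conn):
    """Load (trade_time, price) rows from an open CSV file into a Trades table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Trades ("
        " trade_time TEXT NOT NULL PRIMARY KEY,"  # fixed-width text sorts chronologically
        " price REAL NOT NULL)"
    )
    reader = csv.reader(csv_file)
    next(reader)  # skip the header line, like IGNORE 1 LINES
    with conn:  # a single transaction for the whole load
        conn.executemany(
            "INSERT INTO Trades (trade_time, price) VALUES (?, ?)",
            ((ts, float(price)) for ts, price in reader),
        )
    return conn.execute("SELECT COUNT(*) FROM Trades").fetchone()[0]


# Hypothetical sample standing in for trades.csv
sample = io.StringIO(
    "trade_time,price\n"
    "2020-01-01 00:00:01.481,7189.42\n"
    "2020-01-01 00:00:01.708,7189.42\n"
    "2020-01-01 00:00:06.290,7189.5\n"
)
conn = sqlite3.connect(":memory:")
row_count = load_trades(sample, conn)
print(row_count)
```

For the real file, `open("trades.csv", newline="")` would replace the `io.StringIO` sample. Whether 650 million rows fit in memory this way depends on the machine, so treat it as an experiment rather than a recommendation.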

To clarify, I have 5 minute candle data. Each month has 8640 entries. I’m checking for 2 years. So there will be 207,360 entries. I need to take the timestamp from each of these 207,360 entries and then check in the `Trades` table whether the price goes up or down from that point.

I’m not checking all 650 million rows. I’m checking up/down only for each 5 minute candlestick. Those 650M rows are tick data. One 5 minute candlestick can contain 10k trades, or even 100k trades if there is momentum during that period. I actually have two tables: 1) 5 minute candles and 2) Trades. I have only 200K records in the 5 minute candles table.

Here are the steps:

1. Get a record from 5 minute candle table.
2. Find the nearest timestamp in 650M table.
3. Loop until the price hits +0.5% or -0.5%. Whichever is hit first, record the result as either 1 or 0.

Up/down by 0.5% is relative to the starting value, but only future values. Here starting value is "nearest timestamp" from the 5 minute record timestamp. For more details, check the code found in my SO question as linked before.
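The three steps above can be sketched in plain Python. This is not the code from the linked question, only an illustration under stated assumptions: the trades are already in two parallel, time-sorted lists, "nearest timestamp" is taken to mean the first trade at or after the candle time, and a move counts as a hit when it reaches the threshold (`>=` / `<=`). The name `first_hit` and the sample prices are hypothetical.

```python
from bisect import bisect_left


def first_hit(times, prices, candle_time, pct=0.005):
    """Return 1 if +pct is hit first, 0 if -pct is hit first, None if neither is hit."""
    # Step 2: first trade at or after the candle timestamp (assumed meaning of "nearest")
    i = bisect_left(times, candle_time)
    if i == len(times):
        return None  # no trade at or after this candle
    start = prices[i]
    upper = start * (1 + pct)
    lower = start * (1 - pct)
    # Step 3: scan future trades only, stop at the first threshold that is hit
    for j in range(i + 1, len(prices)):
        if prices[j] >= upper:
            return 1
        if prices[j] <= lower:
            return 0
    return None


# Hypothetical data; fixed-width timestamp strings compare chronologically
times = [
    "2020-01-01 00:00:01.000",
    "2020-01-01 00:00:02.000",
    "2020-01-01 00:00:03.000",
    "2020-01-01 00:00:04.000",
]
prices = [100.0, 100.3, 100.6, 99.0]

up_first = first_hit(times, prices, "2020-01-01 00:00:00.500")    # starts at 100.0
down_first = first_hit(times, prices, "2020-01-01 00:00:02.500")  # starts at 100.6
no_hit = first_hit(times, prices, "2020-01-01 00:00:04.000")      # no future trades
print(up_first, down_first, no_hit)
```

The binary search keeps step 2 cheap; the cost is dominated by the forward scan in step 3, which is the part an index-backed query would replace.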

## How to solve


### Method 1

(Discussing an implementation using MySQL)

``````
CREATE TABLE Trades (
trade_time DATETIME(3) NOT NULL, -- millisecond resolution
price DECIMAL(8,2) NOT NULL,     -- or possibly something else
PRIMARY KEY(trade_time)          -- UNIQUE and an INDEX
) ENGINE=InnoDB
``````

You seem to be concerned about only one stock, correct? (If not then `PRIMARY KEY (stock_id, trade_time)` in that order.)

(`$start` and `$start_ts` come from your app that is building and running the query. They need to hold the beginning price and time.)

``````
SELECT trade_time, price FROM Trades
WHERE trade_time > $start_ts
AND ABS(price - $start) / $start > 0.005  -- up or down .5%
ORDER BY trade_time ASC LIMIT 1;
``````

This finds the next time when the `price` has moved by 0.5%. Similarly, use `<` and `DESC` to find the previous time. To also report the direction of the move:

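``````
SELECT trade_time, price,
       price > $start   -- 1 when TRUE; 0 when FALSE
FROM Trades
WHERE trade_time > $start_ts
  AND ( (price - $start) / $start > 0.005   -- up .5%
     OR (price - $start) / $start < -0.003  -- down .3% (use -0.005 for a symmetric band)
      )
ORDER BY trade_time ASC LIMIT 1;
``````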
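The idea behind these queries can be tried without a MySQL server. Below is a minimal sketch, assuming SQLite through Python's standard `sqlite3` module as a stand-in; bound parameters (`:start`, `:start_ts`) take the place of `$start` and `$start_ts`, the band is symmetric at 0.5%, and the helper name `first_move` and the four sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Trades (trade_time TEXT NOT NULL PRIMARY KEY, price REAL NOT NULL)"
)
# Hypothetical trades, chosen so that both directions occur
conn.executemany(
    "INSERT INTO Trades (trade_time, price) VALUES (?, ?)",
    [
        ("2020-01-01 00:00:01.000", 100.0),
        ("2020-01-01 00:00:02.000", 100.3),
        ("2020-01-01 00:00:03.000", 100.6),
        ("2020-01-01 00:00:04.000", 99.0),
    ],
)

QUERY = """
SELECT trade_time, price,
       price > :start AS went_up          -- 1 when TRUE; 0 when FALSE
FROM Trades
WHERE trade_time > :start_ts              -- future trades only
  AND ABS(price - :start) / :start > 0.005  -- up or down .5%
ORDER BY trade_time ASC
LIMIT 1
"""


def first_move(conn, start_ts, start_price):
    """Return (trade_time, price, went_up) for the first 0.5% move, or None."""
    params = {"start_ts": start_ts, "start": start_price}
    return conn.execute(QUERY, params).fetchone()


up = first_move(conn, "2020-01-01 00:00:01.000", 100.0)
down = first_move(conn, "2020-01-01 00:00:03.000", 100.6)
nothing = first_move(conn, "2020-01-01 00:00:04.000", 99.0)
print(up, down, nothing)
```

In the asker's workflow, this function would be called once per 5 minute candle, so roughly 200K short index range scans instead of one pass over all 650M rows.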

Because of the indexing and clustering, this query will take [typically] less than 1ms. (Of course, if the price takes a month to move by .5%, the query will take much longer.)

A 5-minute candle (with just lo, avg, hi) is probably done via:

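``````
SELECT MIN(price) AS lo, AVG(price) AS avg, MAX(price) AS hi
FROM Trades WHERE trade_time >= $start_ts AND trade_time < $start_ts + INTERVAL 5 MINUTE;
``````

It will take time to get the `PRIMARY KEY` in place. But once in place, new data will arrive just as fast as before. It may take hour(s) to add the PK.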