All we need is an easy explanation of the problem, so here it is.
How could one write a Postgresql query that gives both daily and cumulative counts without gaps? E.g. if there is no data for a given day, it would show
0 as a daily count for that that day and the same cumulative sum as the previous day?
I think I’d need
GENERATE_SERIES, but I don’t know how to do it. I’m also not entirely sure if
order by day asc rows between unbounded preceding and current row would work correctly always, but maybe not the biggest issue here.
I have tried to craft the query with
PARTITION BY and
Let’s say I write a table and add data such as
create table test ( id int4 NOT NULL GENERATED ALWAYS AS IDENTITY, data int4, created_at timestamptz NOT NULL DEFAULT now() ); insert into test(data, created_at) values(1, '2021-04-01'); insert into test(data, created_at) values(2, '2021-04-01'); insert into test(data, created_at) values(3, '2021-04-02'); insert into test(data, created_at) values(4, '2021-04-03'); insert into test(data, created_at) values(5, '2021-04-05'); insert into test(data, created_at) values(6, '2021-04-07');
and then create a queries such as
SELECT created_at as "Date", count(1) as "Daily count" FROM test WHERE created_at >= '2021-04-01' AND created_at <= '2021-04-30' GROUP BY 1
with data as ( select date_trunc('day', created_at) as day, count(1) from test group by 1 ) select day, running_total(count) over (order by day asc rows between unbounded preceding and current row) from data
But as noted, how could these two be combined without gaps on daily values? Somehow it feels I get close but bump into some (syntax) problem. Maybe those two queries are the simplest and cleanest examples of what I’m thinking.
How to solve :
I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.
You write in your question:
I think I'd need GENERATE_SERIES, but I don't know how to do it.
Indeed you do – and you can as follows (all the code below is also given in the fiddle here):
CREATE TABLE cal_tab (cal_date) AS ( SELECT GENERATE_SERIES ( '2021-04-01'::DATE, '2021-04-07'::DATE, '1 DAY' ) );
Just to check:
SELECT cal_date::DATE FROM cal_tab;
cal_date 2021-04-01 2021-04-02 2021-04-03 ... ... snipped for brevity - you determine how many rows there are with the ... second parameter to GENERATE_SERIES ...
Then, I use your data:
CREATE TABLE test ( id INT4 NOT NULL GENERATED ALWAYS AS IDENTITY, data INT4, created_at DATE NOT NULL DEFAULT NOW() );
Populate (slightly modified from the question – so as not to have the id the same as the provided data (i.e. 1…6), – instead, I explicitly inserted hundreds – improves legibility):
INSERT INTO test (data, created_at) VALUES (100, '2021-04-01'); INSERT INTO test (data, created_at) VALUES (100, '2021-04-01'); INSERT INTO test (data, created_at) VALUES (200, '2021-04-02'); INSERT INTO test (data, created_at) VALUES (300, '2021-04-03'); -- GAP for 4th of April INSERT INTO test (data, created_at) VALUES (500, '2021-04-05'); -- GAP for 6th of April INSERT INTO test (data, created_at) VALUES (600, '2021-04-07');
And then run the following SQL:
SELECT DISTINCT -- try with and without DISTINCT ct.cal_date::DATE, COALESCE(t.data, 0) AS data, -- t.created_at::DATE, COUNT(t.created_at) OVER (PARTITION BY t.created_at ORDER BY ct.cal_date ASC) AS "Cnt/day", COUNT(t.created_at) OVER (ORDER BY ct.cal_date ASC) AS "Cum. cnt/day", SUM(t.data) OVER (ORDER BY ct.cal_date::DATE ASC) AS "Sum/day" FROM cal_tab ct LEFT OUTER JOIN test t ON ct.cal_date = t.created_at::DATE ORDER BY ct.cal_date::DATE ASC;
cal_date data Cnt/day Cum. cnt/day Cum. sum/day 2021-04-01 100 2 2 200 2021-04-02 200 1 3 400 2021-04-03 300 1 4 700 2021-04-04 0 0 4 700 2021-04-05 500 1 5 1200 2021-04-06 0 0 5 1200 2021-04-07 600 1 6 1800
A result which, I believe covers all your requested elements.
In this case, you don’t have to worry about the FRAME clause (i.e. the
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) bit – FRAMEs are a kind of sub-partitioning method – but the long phrase above is actually the default – see here for a good introduction to window frames in PostgreSQL.
You could also (depending on your requirements) use a CTE (Common Table Expression) for your query if you don’t wish to create a permanent calendar table (see my answer here for an example of this). (+1 for having made me think!).
Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂