Pick the first timestamp before a gap, but the last one of the day if there is no suitable gap

I have a TIMESTAMP column:

dates
2021-06-24 05:47:05
2021-06-24 09:47:05
2021-06-24 13:47:05
2021-06-24 17:47:05

I want to pick the first timestamp of a given day that is 3 hours or more before the next timestamp of that same day.

expected output:

2021-06-24 05:47:05

However, if there is no timestamp that is more than 3 hours before any other (on that given day), then the last timestamp of that day should be returned.

How to solve :

Method 1

This is a completely revised answer which is much more efficient than the previous one. The old answer can be seen either by viewing the edit history or as a footnote at the bottom of this post.

A fiddle with all the code below is available here.

So, we have our test table:

CREATE TABLE test
(
  the_date TIMESTAMPTZ NOT NULL  -- TIMESTAMPTZ, matching the casts in the INSERT and the +01 offsets in the results
);

Populate it – added records for a day with no gaps > 3 hours:

INSERT INTO test VALUES
('2021-06-23 05:47:05'::TIMESTAMPTZ),  -- NO gaps > 3 hours on this date!
('2021-06-23 07:47:05'::TIMESTAMPTZ),
('2021-06-23 09:47:05'::TIMESTAMPTZ),
('2021-06-23 11:47:05'::TIMESTAMPTZ),
('2021-06-23 13:47:05'::TIMESTAMPTZ),
('2021-06-23 14:47:05'::TIMESTAMPTZ),
('2021-06-23 16:47:05'::TIMESTAMPTZ),
('2021-06-23 17:47:05'::TIMESTAMPTZ),

('2021-06-24 05:47:05'::TIMESTAMPTZ),  -- TWO gaps > 3 hours on this date
                                       -- 1st gap > 3 hours
('2021-06-24 09:47:05'::TIMESTAMPTZ),
                                       -- 2nd gap > 3 hours
('2021-06-24 13:47:05'::TIMESTAMPTZ),
('2021-06-24 14:47:05'::TIMESTAMPTZ),  -- added for testing
('2021-06-24 16:47:05'::TIMESTAMPTZ),  -- added for testing
('2021-06-24 17:47:05'::TIMESTAMPTZ);

Then, to demonstrate the logic, run the following SQL:

SELECT
  the_date::DATE AS dat, 
  the_date AS td, 
  LEAD(the_date) 
    OVER (PARTITION BY the_date::DATE 
           ORDER BY the_date ASC) AS l_td,
  LEAD(the_date) 
    OVER (PARTITION BY the_date::DATE 
            ORDER BY the_date ASC) - the_date AS diff  -- for demonstration
FROM                                                   -- purposes - see diffs
  test                                                 -- > 3 HOUR - 2 on 24/06
ORDER BY dat, td;

Result:

       dat                      td                   l_td   diff
2021-06-23  2021-06-23 05:47:05+01  2021-06-23 07:47:05+01  02:00:00
2021-06-23  2021-06-23 07:47:05+01  2021-06-23 09:47:05+01  02:00:00
2021-06-23  2021-06-23 09:47:05+01  2021-06-23 11:47:05+01  02:00:00
2021-06-23  2021-06-23 11:47:05+01  2021-06-23 13:47:05+01  02:00:00
2021-06-23  2021-06-23 13:47:05+01  2021-06-23 14:47:05+01  01:00:00
2021-06-23  2021-06-23 14:47:05+01  2021-06-23 16:47:05+01  02:00:00
2021-06-23  2021-06-23 16:47:05+01  2021-06-23 17:47:05+01  01:00:00
2021-06-23  2021-06-23 17:47:05+01  NULL                    NULL        
2021-06-24  2021-06-24 05:47:05+01  2021-06-24 09:47:05+01  04:00:00
2021-06-24  2021-06-24 09:47:05+01  2021-06-24 13:47:05+01  04:00:00
2021-06-24  2021-06-24 13:47:05+01  2021-06-24 14:47:05+01  01:00:00
2021-06-24  2021-06-24 14:47:05+01  2021-06-24 16:47:05+01  02:00:00
2021-06-24  2021-06-24 16:47:05+01  2021-06-24 17:47:05+01  01:00:00
2021-06-24  2021-06-24 17:47:05+01  NULL                    NULL        
14 rows

We’ve used the LEAD() window function. Window functions are extremely powerful and I would strongly urge you to put some effort into learning how to use them – they will repay that effort many times over!

  • It provides a comparison between the value of the_date and the value following it according to the criteria in the ORDER BY – you can do lots of clever stuff by varying the ORDER BY clause in the LEAD() function itself – that and varying other parameters can be seen here.

  • The PARTITION BY the_date::DATE clause is to give separate results for every date that is in your dataset. Note in particular the NULLs – you can’t have a LEAD that spans days thanks to the partitioning, so the LEAD value for the last timestamp on any given day will always be NULL – this relates to the requirements – see below.

Also, note that NULL minus anything is NULL (same for NULL plus…) – we say that NULLs "propagate".
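
For readers who find it easier to follow procedural code, here is a minimal Python sketch of what LEAD() with PARTITION BY the_date::DATE ORDER BY the_date computes. It is only an illustration of the logic, not part of the SQL solution; the function name lead_per_day and the sample list are made up for this example.

```python
from datetime import datetime
from itertools import groupby


def lead_per_day(timestamps):
    """Pair each timestamp with the next one on the same calendar day.

    Returns (ts, next_ts, diff) tuples; next_ts and diff are None for the
    last timestamp of a day, mirroring the NULLs produced by LEAD().
    """
    rows = []
    ordered = sorted(timestamps)                     # ORDER BY the_date
    for _, group in groupby(ordered, key=lambda ts: ts.date()):  # PARTITION BY
        day = list(group)
        # Zip the day against itself shifted by one; the last row gets None.
        for ts, nxt in zip(day, day[1:] + [None]):
            diff = nxt - ts if nxt is not None else None  # None "propagates"
            rows.append((ts, nxt, diff))
    return rows


# Hypothetical sample: one row on the 23rd, two rows on the 24th.
sample = [
    datetime(2021, 6, 24, 5, 47, 5),
    datetime(2021, 6, 24, 9, 47, 5),
    datetime(2021, 6, 23, 17, 47, 5),
]
rows = lead_per_day(sample)
for row in rows:
    print(row)
```

The row for 2021-06-23 17:47:05 has no successor even though later timestamps exist, because the grouping by day stops the comparison at midnight, just as the partitioning does in the SQL.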

So, now we run this SQL:

WITH leads AS
(
    SELECT
      the_date::DATE AS dat, the_date AS td,
      LEAD(the_date)  -- ORDER BY is required: without it the "next" row is not well defined
          OVER (PARTITION BY the_date::DATE ORDER BY the_date) AS l_td
    FROM
      test
)
SELECT DISTINCT ON(dat)
    dat AS "The date", td AS "Gap start or last ts"
FROM leads
WHERE l_td - td > INTERVAL '3 HOUR'
   OR l_td IS NULL
ORDER BY dat, td;

Result:

The date    Gap start or last ts
2021-06-23  2021-06-23 17:47:05+01
2021-06-24  2021-06-24 05:47:05+01

The desired outcome! But, what’s going on? From here:

PostgreSQL has a really interesting and powerful construct called
SELECT DISTINCT ON. No, this is not a typical DISTINCT. This is
different. It is perfect when you have groups of data that are similar
and want to pull a single record out of each group, based on a
specific ordering.

or, put another way (from the same link):

With DISTINCT ON, You tell PostgreSQL to return a single row for each
distinct group defined by the ON clause. Which row in that group is
returned is specified with the ORDER BY clause.

Or from the PostgreSQL documentation here:

SELECT DISTINCT ON ( expression [, …] ) keeps only the first row of
each set of rows where the given expressions evaluate to equal. The
DISTINCT ON expressions are interpreted using the same rules as for
ORDER BY (see above). Note that the “first row” of each set is
unpredictable unless ORDER BY is used to ensure that the desired row
appears first. For example:

SELECT DISTINCT ON (location) location, time, report
    FROM weather_reports
    ORDER BY location, time DESC;

retrieves the most recent weather report for each location. But if we
had not used ORDER BY to force descending order of time values for
each location, we’d have gotten a report from an unpredictable time
for each location.

The DISTINCT ON expression(s) must match the leftmost ORDER BY
expression(s). The ORDER BY clause will normally contain additional
expression(s) that determine the desired precedence of rows within
each DISTINCT ON group.

As you can see, this (like window functions) is obviously a very powerful tool in the PostgreSQL programmer’s arsenal and is well worth taking the time and the effort to learn.
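
To make the semantics concrete, the sketch below imitates in Python what the DISTINCT ON query does: filter the candidate rows, sort by (day, timestamp), and keep only the first row per day. It is an illustration under assumptions, not how PostgreSQL implements it; distinct_on_first, gap_start_or_last and timestamps_on are hypothetical helper names.

```python
from datetime import datetime, timedelta


def distinct_on_first(rows, key, order):
    """Keep the first row per key after sorting by (key, order)."""
    picked = {}
    for row in sorted(rows, key=lambda r: (key(r), order(r))):
        picked.setdefault(key(row), row)   # later rows of the same group are ignored
    return list(picked.values())


def gap_start_or_last(timestamps, min_gap=timedelta(hours=3)):
    """First timestamp per day followed by a gap > min_gap, else the day's last one."""
    ordered = sorted(timestamps)
    candidates = []
    for i, ts in enumerate(ordered):
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        if nxt is not None and nxt.date() != ts.date():
            nxt = None                     # no lead across the day boundary
        # Same filter as: WHERE l_td - td > INTERVAL '3 HOUR' OR l_td IS NULL
        if nxt is None or nxt - ts > min_gap:
            candidates.append(ts)
    return distinct_on_first(candidates, key=lambda ts: ts.date(), order=lambda ts: ts)


def timestamps_on(day, hours):
    """Build the sample timestamps hh:47:05 for one day in June 2021."""
    return [datetime(2021, 6, day, h, 47, 5) for h in hours]


sample = (timestamps_on(23, [5, 7, 9, 11, 13, 14, 16, 17])
          + timestamps_on(24, [5, 9, 13, 14, 16, 17]))
result = gap_start_or_last(sample)
print(result)
```

The last timestamp of each day always passes the filter, so every day has at least one candidate; because it also sorts last within its day, it is only picked when no earlier gap start exists.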

An interesting alternative approach would be to use the ROW_NUMBER() window function, if say you want the first two gaps or the last record, as follows:

WITH leads AS
(
    SELECT
      the_date::DATE AS dat, the_date AS td,
      LEAD(the_date)
          OVER (PARTITION BY the_date::DATE ORDER BY the_date) AS l_td
    FROM
      test
),
gaps AS
(
    SELECT
      dat, td,
      ROW_NUMBER()
          OVER (PARTITION BY dat ORDER BY td) AS rn
    FROM leads
    WHERE (l_td - td > INTERVAL '3 HOUR')
      OR (l_td IS NULL)
)
SELECT
    dat, td
FROM gaps
WHERE rn <= 2  -- NOTE 2!
ORDER BY dat, td;

Result:

       dat                      td
2021-06-23  2021-06-23 17:47:05+01
2021-06-24  2021-06-24 05:47:05+01
2021-06-24  2021-06-24 09:47:05+01

Note that we now have two records for 2021-06-24.
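
As a rough illustration of the ROW_NUMBER() step, the following Python sketch numbers the filtered rows within each day and keeps those with rn <= n. The function name first_n_per_day is hypothetical, and the candidates list is simply the set of rows from the sample data that pass the WHERE clause.

```python
from collections import defaultdict
from datetime import datetime


def first_n_per_day(candidates, n):
    """Number rows within each day in timestamp order and keep those with rn <= n."""
    counters = defaultdict(int)
    kept = []
    for ts in sorted(candidates):          # ORDER BY td inside each partition
        counters[ts.date()] += 1           # rn of this row within its day
        rn = counters[ts.date()]
        if rn <= n:                        # WHERE rn <= n
            kept.append((ts.date(), ts, rn))
    return kept


# Rows of the sample data that satisfy the gap filter or are last in their day.
candidates = [
    datetime(2021, 6, 23, 17, 47, 5),
    datetime(2021, 6, 24, 5, 47, 5),
    datetime(2021, 6, 24, 9, 47, 5),
    datetime(2021, 6, 24, 17, 47, 5),
]
top_two = first_n_per_day(candidates, 2)
for row in top_two:
    print(row)
```

With n = 1 this reduces to the same rows as the DISTINCT ON query; larger values of n are where the ROW_NUMBER() formulation becomes the more flexible of the two.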

Finally, and just for the record, the original solution:

WITH long_gaps AS
(
  SELECT dat, MIN(td) AS gap
  FROM
  (
    SELECT
      the_date::DATE AS dat, the_date AS td,
      LEAD(the_date)
          OVER (PARTITION BY the_date::DATE ORDER BY the_date) AS l_td
    FROM
      test
  ) AS t1
  WHERE l_td - td > INTERVAL '3 HOUR'
  GROUP BY dat
),
short_gaps AS
(
  SELECT the_date::DATE AS dat2, MAX(the_date)
  FROM test
  WHERE the_date::DATE NOT IN (SELECT dat FROM long_gaps)
  GROUP BY dat2

)
SELECT dat AS "The date", gap AS "Gap start or last ts" FROM long_gaps
UNION 
SELECT * FROM short_gaps
ORDER BY 1;  -- parameter 1 which ORDERs BY the first field in the query

Result:

  The date     Gap start or last ts
2021-06-23   2021-06-23 17:47:05+01
2021-06-24   2021-06-24 05:47:05+01

A performance analysis of the three solutions is given at the bottom of the fiddle. It shows that the DISTINCT ON solution is significantly faster than the others, although the ROW_NUMBER() approach has the potential to be more flexible. A word of warning, though: a performance analysis on a very small dataset, run on a server over which we have no control and with no idea of what else is happening on it, is potentially flawed. I would advise that you benchmark with realistic datasets on your own hardware.

In future, when you are asking questions of this nature, could you please provide a fiddle with sample data covering all of your cases, i.e. in this case, both days with gaps and days without? This reduces the possibility of error and eliminates duplication of effort – help us to help you. Also, please always include your version of PostgreSQL.

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.