# Count number of occurrences per location per hour in PostgreSQL

## The problem

I have a dataset in Postgres of boat locations on waterways. Here is a sample of the table:

| boat_id | ts | waterway_id |
|---|---|---|
| Boat_A | 2019-01-01 16:29:11 | WW_01 |
| Boat_A | 2019-01-01 17:03:04 | WW_02 |
| Boat_B | 2019-01-01 16:11:34 | WW_01 |
| Boat_B | 2019-01-01 16:13:45 | WW_01 |
| Boat_B | 2019-01-01 17:05:13 | WW_01 |
| Boat_C | 2019-01-01 16:03:00 | WW_01 |
| Boat_C | 2019-01-01 16:09:50 | WW_02 |
| Boat_C | 2019-01-01 16:16:22 | WW_01 |
| Boat_C | 2019-01-01 16:45:44 | WW_01 |

`boat_id` is the unique identifier of the boat, `ts` is the timestamp, and `waterway_id` is the unique identifier of the waterway.
I would like to know for each hour in the dataset how many boats passed each waterway. The result should look like this:

| waterway_id | report_ts | passage_count |
|---|---|---|
| WW_01 | 2019-01-01 00:00 | 3 |
| WW_01 | 2019-01-01 01:00 | 1 |
| WW_01 | 2019-12-31 23:00 | 5 |
| WW_02 | 2019-01-01 00:00 | 13 |
| WW_02 | 2019-01-01 01:00 | 11 |

The raw data contains the position of boats, not passages. Thus:

1. Multiple datapoints of the same boat on the same waterway should be counted as a single passage.
2. If a boat has been on another waterway and comes back it should be counted as another passage.
3. If a boat is detected on the same waterway in multiple hours, without being on another waterway in between, it should be counted as a single passage in the hour it was first detected.
In the example data above, Boat_A makes 1 passage on waterway WW_01 at 16h and 1 on WW_02 at 17h; Boat_B makes 1 passage on WW_01 at 16h (there is no new passage at 17h because it did not visit another waterway in between); Boat_C makes 2 passages on waterway WW_01 at 16h and 1 passage on WW_02 at 16h. In a table (waterway-hour combinations with 0 passages do not have to be included in the result):

| waterway_id | report_ts | passage_count |
|---|---|---|
| WW_01 | 2019-01-01 16:00 | 4 |
| WW_02 | 2019-01-01 16:00 | 1 |
| WW_02 | 2019-01-01 17:00 | 1 |

What should the query to get this result look like?
In my mind, it consists of two steps:

1. Computing unique passages per boat per waterway
2. Organizing these in a table as the example above

Fiddle here

## How to solve it


### Method 1

Assuming all involved table columns are `NOT NULL`.

This only counts the first hour of each passage:

```sql
SELECT waterway_id, date_trunc('hour', ts), count(*) AS count
FROM  (
   SELECT waterway_id, ts  -- , boat_id
        , lag(waterway_id, 1, '') OVER (PARTITION BY boat_id ORDER BY ts) <> waterway_id AS switch
   FROM   boat_data
   ) sub
WHERE  switch  -- only the first ts of each passage
GROUP  BY 1, 2
ORDER  BY 1, 2;
```

db<>fiddle here

We just have to consider the first row after a boat switches waterways. Identify it with the window function `lag()`. Using `lag(waterway_id, 1, '')` supplies a default to suppress NULL for the first row in each partition. (This assumes the empty string (`''`) is distinct from any existing `waterway_id`.)
Then truncate to the full hour with `date_trunc()` and count. Voilà.
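The logic can be checked without a Postgres instance. Here is a sketch using Python's built-in `sqlite3` (SQLite 3.25+ also supports `lag()`), with `strftime` standing in for `date_trunc('hour', …)`:

```python
import sqlite3

# Sample data from the question
rows = [
    ("Boat_A", "2019-01-01 16:29:11", "WW_01"),
    ("Boat_A", "2019-01-01 17:03:04", "WW_02"),
    ("Boat_B", "2019-01-01 16:11:34", "WW_01"),
    ("Boat_B", "2019-01-01 16:13:45", "WW_01"),
    ("Boat_B", "2019-01-01 17:05:13", "WW_01"),
    ("Boat_C", "2019-01-01 16:03:00", "WW_01"),
    ("Boat_C", "2019-01-01 16:09:50", "WW_02"),
    ("Boat_C", "2019-01-01 16:16:22", "WW_01"),
    ("Boat_C", "2019-01-01 16:45:44", "WW_01"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE boat_data (boat_id TEXT, ts TEXT, waterway_id TEXT)")
con.executemany("INSERT INTO boat_data VALUES (?, ?, ?)", rows)

# Same shape as the Postgres query: flag rows where the waterway
# differs from the previous row per boat, then count per hour.
result = con.execute("""
    SELECT waterway_id, strftime('%Y-%m-%d %H:00', ts) AS report_ts, count(*) AS n
    FROM  (SELECT waterway_id, ts
                , lag(waterway_id, 1, '') OVER (PARTITION BY boat_id ORDER BY ts)
                      <> waterway_id AS switch
           FROM boat_data) AS sub
    WHERE switch
    GROUP BY 1, 2
    ORDER BY 1, 2
""").fetchall()

print(result)
# [('WW_01', '2019-01-01 16:00', 4),
#  ('WW_02', '2019-01-01 16:00', 1),
#  ('WW_02', '2019-01-01 17:00', 1)]
```

This matches the expected table in the question (4, 1, and 1 passages).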

My original solution counts every hour of each passage, which is a lot more complex:

```sql
SELECT waterway_id, report_ts, count(*) AS count
FROM  (
   SELECT waterway_id
        , generate_series(date_trunc('hour', min(ts))
                        , max(ts)
                        , interval '1 hour') AS report_ts
   FROM  (
      SELECT *
           , count(switch) OVER (PARTITION BY boat_id ORDER BY ts) AS passage
      FROM  (
         SELECT boat_id, ts, waterway_id
              , lag(waterway_id) OVER (PARTITION BY boat_id ORDER BY ts) <> waterway_id OR NULL AS switch
         FROM   boat_data
         ) sub1
      ) sub2
   GROUP  BY boat_id, waterway_id, passage
   ) sub3
GROUP  BY waterway_id, report_ts
ORDER  BY waterway_id, report_ts;
```

db<>fiddle here


### Method 2

Editing to address this (emphasis mine), which was not the case with the original request:

> waterway-hour combinations with 0 passages *do not have to* be included in the result

#### Primary keys are important

But before we get into that, we need to make sure you have the right primary key defined on your data, which is `(Boat_Id,Timestamp)`. Creating this gives us two things:

1. Non-conforming records are rejected (a `Boat` can’t be in two places at once)
2. A B-Tree for efficiently locating prior records for each `Boat` using a method other than an analytic/windowing function
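Point 1 can be demonstrated in miniature with Python's built-in `sqlite3` standing in for Postgres (the behaviour is the same: a duplicate `(boat_id, ts)` pair violates the key):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE boat_data (
        boat_id     TEXT NOT NULL,
        ts          TEXT NOT NULL,
        waterway_id TEXT NOT NULL,
        PRIMARY KEY (boat_id, ts)
    )
""")
con.execute("INSERT INTO boat_data VALUES ('Boat_A', '2019-01-01 16:29:11', 'WW_01')")

# A second reading putting the same boat somewhere else at the
# same instant violates the primary key and is rejected:
try:
    con.execute("INSERT INTO boat_data VALUES ('Boat_A', '2019-01-01 16:29:11', 'WW_02')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
```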

#### Getting Passages

To determine if a passage has occurred, we need to know the last position of each `Boat`, which we get through a correlated subquery searching for the entry with the greatest `Timestamp` less than the current `Timestamp`. Since we are only interested in `Boats` that have moved between `Waterways`, we can exclude the others from our result set.

```sql
SELECT
    BD.Waterway_ID
    ,date_trunc('hour', BD.Timestamp) AS Timestamp
    ,COUNT(*) AS passage_count
FROM
    Boat_Data BD
    LEFT JOIN Boat_Data PriorBD
        ON  PriorBD.Boat_Id = BD.Boat_Id
        AND PriorBD.Timestamp =
            (
            SELECT MAX(Timestamp)
            FROM   Boat_Data
            WHERE  Boat_Id = BD.Boat_Id
               AND Timestamp < BD.Timestamp
            )
WHERE
    BD.Waterway_ID <> PriorBD.Waterway_Id
    OR PriorBD.Waterway_Id IS NULL
GROUP BY
    BD.Waterway_ID
    ,date_trunc('hour', BD.Timestamp);
```
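On the question's sample data, this self-join gives the same counts as Method 1. Here is a quick check with Python's `sqlite3` standing in for Postgres (`strftime` in place of `date_trunc`):

```python
import sqlite3

rows = [
    ("Boat_A", "2019-01-01 16:29:11", "WW_01"),
    ("Boat_A", "2019-01-01 17:03:04", "WW_02"),
    ("Boat_B", "2019-01-01 16:11:34", "WW_01"),
    ("Boat_B", "2019-01-01 16:13:45", "WW_01"),
    ("Boat_B", "2019-01-01 17:05:13", "WW_01"),
    ("Boat_C", "2019-01-01 16:03:00", "WW_01"),
    ("Boat_C", "2019-01-01 16:09:50", "WW_02"),
    ("Boat_C", "2019-01-01 16:16:22", "WW_01"),
    ("Boat_C", "2019-01-01 16:45:44", "WW_01"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE boat_data (boat_id TEXT, ts TEXT, waterway_id TEXT)")
con.executemany("INSERT INTO boat_data VALUES (?, ?, ?)", rows)

# Join each reading to the boat's immediately preceding reading;
# keep rows where the waterway changed (or there is no prior row).
result = con.execute("""
    SELECT bd.waterway_id, strftime('%Y-%m-%d %H:00', bd.ts) AS report_ts, count(*) AS n
    FROM boat_data bd
    LEFT JOIN boat_data prior
           ON  prior.boat_id = bd.boat_id
           AND prior.ts = (SELECT max(ts) FROM boat_data
                           WHERE boat_id = bd.boat_id AND ts < bd.ts)
    WHERE bd.waterway_id <> prior.waterway_id
       OR prior.waterway_id IS NULL
    GROUP BY 1, 2
    ORDER BY 1, 2
""").fetchall()

print(result)
# [('WW_01', '2019-01-01 16:00', 4),
#  ('WW_02', '2019-01-01 16:00', 1),
#  ('WW_02', '2019-01-01 17:00', 1)]
```

Note this sketch assumes timestamps are unique per boat, which is exactly what the `(boat_id, ts)` primary key above enforces.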

Alternately, you can use an analytic/windowing function as Erwin and Vérace have done. I offer this as a "second solution" because analytic/windowing functions force a sort in most instances¹. With larger amounts of data (or a different RDBMS), that may be more expensive than a self join against the proper primary key². As always, test.

```sql
SELECT
    BD.Waterway_ID
    ,date_trunc('hour', BD.Timestamp) AS Timestamp
    ,COUNT(*) AS passage_count
FROM
    (
    SELECT
        Boat_Id
        ,Timestamp
        ,Waterway_Id
        ,CASE
            WHEN Waterway_Id <> LAG(Waterway_Id, 1, '') OVER (PARTITION BY Boat_Id ORDER BY Timestamp) THEN 1
            ELSE 0
         END AS Passage_Ind
    FROM
        Boat_Data
    ) BD
WHERE
    BD.Passage_Ind = 1
GROUP BY
    BD.Waterway_ID
    ,date_trunc('hour', BD.Timestamp)
;
```

Modified fiddle here: http://sqlfiddle.com/#!17/2cede7/2

¹ In SQL Server (and probably some other commercial platforms) a windowing/analytic function will not force a sort if the `PARTITION BY` and `ORDER BY` clauses match the sort order of the clustered index. This is not the case in MySQL.

² More recent versions of Postgres (11 and later) support the `INCLUDE` clause, which adds specified non-key columns to the B-tree index. In this instance, you could include `Waterway_Id` so the entire query could be satisfied without touching the heap.

### Method 3

This is part of a class of problems known as Tabibito-san – well worth getting to know! This answer has been highly revised now that I think I’ve grasped your issue.

I changed your schema slightly: I removed the quoted identifiers, which are normally unnecessary and merely add complexity and make the queries less legible.

I also renamed the field `timestamp` to `bts` (boat timestamp), since it's not a good idea to use SQL keywords as column names: it makes the SQL harder to read and interferes with debugging.

I also only kept data for `boat_1` – easier to reason about. The data I used are available on the fiddle and at the bottom of this post.

You can find the fiddle here. (BTW, please always include your PostgreSQL version in your questions. It is unimportant for sqlfiddle.com (they only have 9.6), but if you use dbfiddle.uk (many more servers), it can be most helpful.)

Revised DDL:

```sql
CREATE TABLE boat_data
(boat_id int, bts timestamp, waterway_id varchar(9));
```

And then I ran the following query:

```sql
SELECT
    boat_id,
    MIN(bts) AS min_time,
    MAX(bts) AS max_time,
    waterway_id,
    MIN(rn) AS min_rn,
    MAX(rn) AS max_rn
FROM
    (
    SELECT boat_id, bts, waterway_id,
           ROW_NUMBER() OVER
           (
               PARTITION BY boat_id, waterway_id
               ORDER BY boat_id, waterway_id
           ) AS rn
    FROM boat_data
    ORDER BY boat_id, waterway_id
    ) AS tab
GROUP BY boat_id, waterway_id;
```

Result (snipped for brevity):

```
boat_id min_time    max_time    waterway_id min_rn  max_rn
1   2019-06-03T10:27:25Z    2019-06-03T10:28:45Z    OSDOK003    1   4
1   2019-06-03T10:29:26Z    2019-06-03T10:29:54Z    OSDOK005    1   4
1   2019-06-03T10:32:26Z    2019-06-03T10:32:26Z    OUDSC001    1   1
1   2019-06-03T10:32:45Z    2019-06-03T10:34:34Z    OUDSC002    1   8
1   2019-06-03T10:30:35Z    2019-06-03T10:30:54Z    OUDSC003    1   3
```

You probably won’t want all of this data – remove as appropriate!

There’s a list of the "passages" giving all of the detail about them – as I said, more than necessary perhaps?

• What the first line is telling you is that for `boat_1`, its first passage started on waterway `OSDOK003` at `2019-06-03T10:27:25Z` and finished at `2019-06-03T10:28:45Z` and there were 4 measurements taken during that passage.

• Then it went on to waterway `OSDOK005` at `2019-06-03T10:29:26Z` and finished at `2019-06-03T10:29:54Z` – also 4 measurements.

• Then there were 3 measurements on waterway `OUDSC003`.

• Followed by a single measurement on `OUDSC001`.

• And finally 8 measurements on `OUDSC002`.

I’ve "eye-balled" the data and this appears correct!

Now, you may have to take account of the date – in that case, just add `DATE(bts)` to the `SELECT` and the `GROUP BY`
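As a sketch of that per-day variant (using Python's `sqlite3` in place of Postgres, the `boat_1` data from below, and a simplified aggregation that produces the same per-passage summary as the query above, with `DATE(bts)` added):

```python
import sqlite3

rows = [
    (1, "2019-06-03 10:27:25", "OSDOK003"), (1, "2019-06-03 10:27:54", "OSDOK003"),
    (1, "2019-06-03 10:28:05", "OSDOK003"), (1, "2019-06-03 10:28:45", "OSDOK003"),
    (1, "2019-06-03 10:29:26", "OSDOK005"), (1, "2019-06-03 10:29:35", "OSDOK005"),
    (1, "2019-06-03 10:29:45", "OSDOK005"), (1, "2019-06-03 10:29:54", "OSDOK005"),
    (1, "2019-06-03 10:30:35", "OUDSC003"), (1, "2019-06-03 10:30:45", "OUDSC003"),
    (1, "2019-06-03 10:30:54", "OUDSC003"), (1, "2019-06-03 10:32:26", "OUDSC001"),
    (1, "2019-06-03 10:32:45", "OUDSC002"), (1, "2019-06-03 10:32:55", "OUDSC002"),
    (1, "2019-06-03 10:33:34", "OUDSC002"), (1, "2019-06-03 10:33:45", "OUDSC002"),
    (1, "2019-06-03 10:33:54", "OUDSC002"), (1, "2019-06-03 10:34:04", "OUDSC002"),
    (1, "2019-06-03 10:34:14", "OUDSC002"), (1, "2019-06-03 10:34:34", "OUDSC002"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE boat_data (boat_id INT, bts TEXT, waterway_id TEXT)")
con.executemany("INSERT INTO boat_data VALUES (?, ?, ?)", rows)

# DATE(bts) in both the SELECT and the GROUP BY splits the summary per day.
result = con.execute("""
    SELECT boat_id, DATE(bts) AS bdate, waterway_id,
           MIN(bts) AS min_time, MAX(bts) AS max_time,
           COUNT(*) AS measurements
    FROM boat_data
    GROUP BY boat_id, DATE(bts), waterway_id
    ORDER BY waterway_id
""").fetchall()

for row in result:
    print(row)
```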

I’ve left some "artefacts" at the bottom of the fiddle so that you can see (more or less in reverse order) where my thinking was going. PostgreSQL’s window functions are very powerful and well worth mastering – they will repay any effort ten times over – especially `ROW_NUMBER()`. Take a look at them, and also at `LAG`/`LEAD` (fiddle)…

---

Data for `boat_1` used in this answer.

```sql
INSERT INTO boat_data
(boat_id, bts, waterway_id)
VALUES
(1, '2019-06-03 10:27:25', 'OSDOK003'),
(1, '2019-06-03 10:27:54', 'OSDOK003'),
(1, '2019-06-03 10:28:05', 'OSDOK003'),
(1, '2019-06-03 10:28:45', 'OSDOK003'),
(1, '2019-06-03 10:29:26', 'OSDOK005'),
(1, '2019-06-03 10:29:35', 'OSDOK005'),
(1, '2019-06-03 10:29:45', 'OSDOK005'),
(1, '2019-06-03 10:29:54', 'OSDOK005'),
(1, '2019-06-03 10:30:35', 'OUDSC003'),
(1, '2019-06-03 10:30:45', 'OUDSC003'),
(1, '2019-06-03 10:30:54', 'OUDSC003'),
(1, '2019-06-03 10:32:26', 'OUDSC001'),
(1, '2019-06-03 10:32:45', 'OUDSC002'),
(1, '2019-06-03 10:32:55', 'OUDSC002'),
(1, '2019-06-03 10:33:34', 'OUDSC002'),
(1, '2019-06-03 10:33:45', 'OUDSC002'),
(1, '2019-06-03 10:33:54', 'OUDSC002'),
(1, '2019-06-03 10:34:04', 'OUDSC002'),
(1, '2019-06-03 10:34:14', 'OUDSC002'),
(1, '2019-06-03 10:34:34', 'OUDSC002');
```
