Getting the row before a row with a certain value in SQL

All we need is an easy explanation of the problem, so here it is.

I have a table like below where user actions are stored with a timestamp. My goal is to identify the action that happened before a specific action (named reference_action) and count the number of those actions to see which actions happens before the specific actions and how they are distributed.

I am aware of window functions like LAG() where I can get the row before a certain row but can’t figure out how to include a constraint like WHERE action_name = "reference_action".

The query engine is Presto and the tables are Hive tables but I’m mostly interested in the general SQL approach, therefore that shouldn’t matter much.

session action_name timestamp
1 "some_action" 1970-01-01 00:01:00
1 "some_action" 1970-01-01 00:02:00
1 "some_action" 1970-01-01 00:03:00
1 "desired_action1" 1970-01-01 00:04:00
1 "reference_action" 1970-01-01 00:05:00
1 "some_action" 1970-01-01 00:06:00
1 "some_action" 1970-01-01 00:07:00
2 "some_action" 1970-01-01 01:23:00
2 "some_action" 1970-01-01 02:34:00
2 "desired_action1" 1970-01-01 03:45:00
2 "reference_action" 1970-01-01 04:56:00
2 "some_action" 1970-01-01 05:58:00
3 "some_action" 1970-01-01 01:23:00
3 "some_action" 1970-01-01 02:34:00
3 "desired_action2" 1970-01-01 03:45:00
3 "reference_action" 1970-01-01 04:56:00
3 "some_action" 1970-01-01 05:58:00

The result should look like:

action count
"desired_action1" 2
"desired_action2" 1

There are two rows where "desired_action1" is directly followed by a row with "reference_action", when ordered by timestamp, hence the count being 2. The same logic applies for why the count is 1 for "desired_action2".

The goal is to know what a user did before he made a purchase (purchase = reference_action). To understand what he did before, I want to look up the action that happened before a purchase. Therefore I need to know the action_name in the row before a reference_action. desired_actions have to be counted, reference_actions are just the rows after the actions I want to count and used to determine which values should be counted.

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

The steps to do this are actually pretty simple in any SQL variant that supports LAG:

  • Pull out the previous action_name using LAG, in a derived table or CTE.
  • Filter on the outside to only rows which contain 'reference_action', these rows will also have the previous action from the first step. You want to do this on the outside, otherwise you won’t get the correct LAG values.
  • Group by the previous action, and count.
SELECT
    prevAction AS action,
    COUNT(*) AS [count]
FROM (
    SELECT *,
        LAG(action_name) OVER (ORDER BY timestamp) AS prevAction -- you may want a PARTITION also
    FROM YourTable
) AS t
WHERE action_name = 'reference_action'
GROUP BY prevAction;

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply