Get all rows where a specific column value occurs more than once (filter out single occurrences)

I am trying to filter all sessionIds that occur once out of an existing result set.

This query is being used in a web application and runs on a big dataset (~ 35 million rows), so I want to prevent having subqueries here.

I tried the query below (shown after the table definition and sample data). It filters correctly, but I now get only one row per sessionId, and I want every request and response:

CREATE TABLE `api_log` (
  `id` varchar(50) NOT NULL,
  `clientId` varchar(100) DEFAULT NULL,
  `inserted` int(11) DEFAULT NULL,
  `sessionId` mediumtext DEFAULT NULL,
  `stage` varchar(120) DEFAULT NULL,
  `request` longtext CHARACTER SET utf8mb4 DEFAULT NULL,
  `response` longtext CHARACTER SET utf8mb4 DEFAULT NULL,
  PRIMARY KEY (`id`)
);

INSERT INTO api_log
  VALUES
    ("1", "abc", 1621008484, "session1", "production", '{"key":"value"}', '{"key":"value"}'),
    ("2", "abc", 1621008494, "session2", "production", '{"key":"value"}', '{"key":"value"}'),
    ("3", "abc", 1621008584, "session1", "production", '{"key":"value"}', '{"key":"value"}'),
    ("4", "abc", 1621008684, "session2", "production", '{"key":"value"}', '{"key":"value"}'),
    ("5", "abc", 1621008784, "session3", "production", '{"key":"value"}', '{"key":"value"}'),
    ("6", "abc", 1621008884, "session4", "production", '{"key":"value"}', '{"key":"value"}'),
    ("7", "abc", 1621008984, "session5", "production", '{"key":"value"}', '{"key":"value"}'),
    ("8", "abc", 1621009084, "session6", "production", '{"key":"value"}', '{"key":"value"}'),
    ("9", "abc", 1621009184, "session7", "production", '{"key":"value"}', '{"key":"value"}'),
    ("10", "abc", 1621009284, "session8", "production", '{"key":"value"}', '{"key":"value"}');

SELECT
    `clientId`,
    `sessionId`,
    `inserted`,
    `stage`,
    `request`,
    `response`
FROM
    `api_log`
WHERE
    (stage = 'production') 
    AND (clientId = 'abc') 
    AND (
        `inserted` BETWEEN 1621008482 AND 1621009285
    )
GROUP BY
    `clientId`,
    `stage`,
    `sessionId`
HAVING
    COUNT(sessionId) > 1

Is there any trick to get all rows where a sessionId occurs more than once?

In this case, I get two rows, one for session1 and one for session2, but I am missing two more because both mentioned sessionIds have an additional row that should match.

SQL fiddle: http://sqlfiddle.com/#!9/f87bb54/3

How to solve:

Method 1

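If I understand your requirement correctly, this is the kind of problem where window aggregation shines. Use the window version of COUNT(*) on the filtered dataset to obtain the per-session count alongside the other columns, then filter on that count in an outer query to keep only the rows you want. A derived table is still needed because a window function result cannot be referenced in the WHERE clause of the same query level. Window functions require MySQL 8.0 or later, or MariaDB 10.2 or later. Your output can include any or all of the columns your table has: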

SELECT
  id
, clientId
, inserted
, sessionId
, stage
, request
, response
FROM
  (
    SELECT
      *
    , COUNT(*) OVER (PARTITION BY sessionId) AS sessionIdCounter
    FROM
      api_log
    WHERE (stage = 'production') 
      AND (clientId = 'abc') 
      AND (inserted BETWEEN 1621008482 AND 1621009285)
  ) AS derived
WHERE
  sessionIdCounter > 1
ORDER BY
  inserted ASC
;

You can try this solution at dbfiddle.uk.
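
For servers that lack window functions, a rough equivalent is to join the filtered rows against the list of sessionIds that occur more than once. This is a sketch of a different technique from the query above, not a tested drop-in replacement for the 35-million-row table. It still uses a derived table, and the aliases l and repeated are made up for illustration:

```sql
-- Fallback sketch without window functions.
-- The derived table lists every sessionId seen more than once
-- inside the same filter; the outer query returns all matching rows.
SELECT
  l.id
, l.clientId
, l.inserted
, l.sessionId
, l.stage
, l.request
, l.response
FROM
  api_log AS l
  JOIN (
    SELECT
      sessionId
    FROM
      api_log
    WHERE (stage = 'production')
      AND (clientId = 'abc')
      AND (inserted BETWEEN 1621008482 AND 1621009285)
    GROUP BY
      sessionId
    HAVING
      COUNT(*) > 1
  ) AS repeated
    ON repeated.sessionId = l.sessionId
-- Repeat the filter so rows outside the time range are not pulled in.
WHERE (l.stage = 'production')
  AND (l.clientId = 'abc')
  AND (l.inserted BETWEEN 1621008482 AND 1621009285)
ORDER BY
  l.inserted ASC
;
```

Against the sample data in the question this returns the four rows with ids 1, 2, 3 and 4, that is, both rows of session1 and both rows of session2.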

Method 2

The obvious first step would be to create an index on sessionId, but its datatype is MEDIUMTEXT, which can only be indexed by prefix, not in full.

So the approach is: normalize your data. Move the sessionId values into a separate table and refer to them by foreign key. The reference column will be compact (4 or 8 bytes) and indexed, so using it as the GROUP BY expression in the query will be faster.

The columns clientId and stage seem to be candidates for such normalization too.
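
The normalization described above might look roughly like this. It is only a sketch: the table api_session, the column sessionRef, and the index and constraint names are hypothetical, and it assumes every existing sessionId value fits in 191 characters, which should be checked before running anything like this on real data:

```sql
-- Hypothetical lookup table: one row per distinct session value.
-- VARCHAR(191) keeps a utf8mb4 unique index under the 767-byte key
-- limit of older InnoDB row formats (191 * 4 = 764 bytes).
CREATE TABLE api_session (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  sessionId VARCHAR(191) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uq_api_session_sessionId (sessionId)
);

-- Compact, indexed reference column on the log table (4 bytes).
ALTER TABLE api_log
  ADD COLUMN sessionRef INT UNSIGNED DEFAULT NULL,
  ADD KEY idx_api_log_sessionRef (sessionRef),
  ADD CONSTRAINT fk_api_log_session
    FOREIGN KEY (sessionRef) REFERENCES api_session (id);

-- Backfill: collect the distinct session values ...
INSERT INTO api_session (sessionId)
SELECT DISTINCT sessionId
FROM api_log
WHERE sessionId IS NOT NULL;

-- ... then point each log row at its session.
UPDATE api_log AS l
  JOIN api_session AS s
    ON s.sessionId = l.sessionId
SET l.sessionRef = s.id;
```

Once the reference column is populated, the query from Method 1 could use PARTITION BY sessionRef instead of PARTITION BY sessionId, and the original MEDIUMTEXT column could be dropped after the application has been switched over.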


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
