I am trying to filter out all sessionIds that occur only once from an existing result set.
This query is used in a web application and runs on a big dataset (~35 million rows), so I would like to avoid subqueries here.
I tried this, which provides a filtered result, except that I now only get one row for each sessionId (and I want every request and response):
CREATE TABLE `api_log` (
`id` varchar(50) NOT NULL,
`clientId` varchar(100) DEFAULT NULL,
`inserted` int(11) DEFAULT NULL,
`sessionId` mediumtext DEFAULT NULL,
`stage` varchar(120) DEFAULT NULL,
`request` longtext CHARACTER SET utf8mb4 DEFAULT NULL,
`response` longtext CHARACTER SET utf8mb4 DEFAULT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO api_log
VALUES
("1", "abc", 1621008484, "session1", "production", '{"key":"value"}', '{"key":"value"}'),
("2", "abc", 1621008494, "session2", "production", '{"key":"value"}', '{"key":"value"}'),
("3", "abc", 1621008584, "session1", "production", '{"key":"value"}', '{"key":"value"}'),
("4", "abc", 1621008684, "session2", "production", '{"key":"value"}', '{"key":"value"}'),
("5", "abc", 1621008784, "session3", "production", '{"key":"value"}', '{"key":"value"}'),
("6", "abc", 1621008884, "session4", "production", '{"key":"value"}', '{"key":"value"}'),
("7", "abc", 1621008984, "session5", "production", '{"key":"value"}', '{"key":"value"}'),
("8", "abc", 1621009084, "session6", "production", '{"key":"value"}', '{"key":"value"}'),
("9", "abc", 1621009184, "session7", "production", '{"key":"value"}', '{"key":"value"}'),
("10", "abc", 1621009284, "session8", "production", '{"key":"value"}', '{"key":"value"}');
SELECT
`clientId`,
`sessionId`,
`inserted`,
`stage`,
`request`,
`response`
FROM
`api_log`
WHERE
(stage = 'production')
AND (clientId = 'abc')
AND (
`inserted` BETWEEN 1621008482 AND 1621009285
)
GROUP BY
`clientId`,
`stage`,
`sessionId`
HAVING
COUNT(sessionId) > 1
Is there any trick to get all rows where a sessionId occurs more than once?
In this case, I get two rows, one for session1 and one for session2, but I am missing two more because both of those sessionIds have an additional row that should match.
SQL fiddle: http://sqlfiddle.com/#!9/f87bb54/3
How to solve:
Method 1
If I understand your requirement correctly, this is the kind of problem where window aggregation shines. Use the window version of the COUNT(*) function on the filtered dataset to obtain the counts alongside the other columns. Then filter on the count results to get only the rows you want. Your output can include any or all of the columns your table has:
SELECT
id
, clientId
, inserted
, sessionId
, stage
, request
, response
FROM
(
SELECT
*
, COUNT(*) OVER (PARTITION BY sessionId) AS sessionIdCounter
FROM
api_log
WHERE (stage = 'production')
AND (clientId = 'abc')
AND (inserted BETWEEN 1621008482 AND 1621009285)
) AS derived
WHERE
sessionIdCounter > 1
ORDER BY
inserted ASC
;
You can play with this solution at dbfiddle.uk.
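To see what the outer filter acts on, it can help to run the inner query by itself against the sample data from the question. The sketch below keeps only a few columns for readability. It assumes a server with window function support (MySQL 8.0 or later, MariaDB 10.2 or later), which the question does not state.

```sql
-- Inner query only: every matching row is kept, and each row carries
-- the number of rows that share its sessionId within the filtered set.
SELECT
    id
  , sessionId
  , inserted
  , COUNT(*) OVER (PARTITION BY sessionId) AS sessionIdCounter
FROM
    api_log
WHERE stage = 'production'
  AND clientId = 'abc'
  AND inserted BETWEEN 1621008482 AND 1621009285
ORDER BY
    inserted ASC
;
```

With the ten sample rows, the rows for session1 and session2 (ids 1 to 4) get sessionIdCounter = 2 and the remaining six rows get 1, so the outer sessionIdCounter > 1 filter returns four rows.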
Method 2
The first approach would be to create an index on sessionId. But its datatype is MEDIUMTEXT, which MySQL can only index through a prefix length…
So the approach is: normalize your data, move the sessionId values into a separate table and refer to them by foreign key. The reference column will be compact (4 or 8 bytes) and indexed, so using it in the query as the GROUP BY expression will be faster.
The columns clientId and stage seem to be candidates for such normalization too.
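A minimal sketch of that normalization could look like the following. The table name api_session, the column sessionRef, and the index and constraint names are hypothetical, and the sketch assumes every session identifier fits into VARCHAR(191); none of this comes from the original question, so adjust it to the real data.

```sql
-- Hypothetical lookup table: one row per distinct session identifier.
-- Assumes identifiers fit in VARCHAR(191); adjust to the real maximum length.
CREATE TABLE api_session (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    sessionId VARCHAR(191) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uq_api_session_sessionId (sessionId)
);

-- Populate the lookup table from the existing log rows.
INSERT INTO api_session (sessionId)
SELECT DISTINCT sessionId
FROM api_log
WHERE sessionId IS NOT NULL;

-- Add the compact (4-byte) reference column and backfill it.
ALTER TABLE api_log
    ADD COLUMN sessionRef INT UNSIGNED DEFAULT NULL;

UPDATE api_log AS l
JOIN api_session AS s ON s.sessionId = l.sessionId
SET l.sessionRef = s.id;

-- Index the reference column and tie it to the lookup table.
ALTER TABLE api_log
    ADD KEY idx_api_log_sessionRef (sessionRef),
    ADD CONSTRAINT fk_api_log_session
        FOREIGN KEY (sessionRef) REFERENCES api_session (id);
```

After this, the queries above could group or partition by sessionRef instead of sessionId, joining api_session only when the original identifier text is needed in the output. On a table of this size the backfill is a heavy one-time operation, so it would probably need to be batched or run during a maintenance window.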
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.