How to compact/group the results of a query by column values?

I have the following table (MySQL version >= 5.7.X, using DBeaver 21.1.0 to manage the db):

n_brand  population  actions_completed_by_unique_user
pepsico  pepsicoeur  1
pepsico  pepsicoeur  1
pepsico  pepsicoeur  1
pepsico  pepsicousa  1
pepsico  pepsicousa  2
pepsico  pepsicousa  2
pepsico  pepsicomex  0
pepsico  pepsicomex  2
ferrari  ferrarieur  1
ferrari  ferrarieur  1
ferrari  ferrariusa  0
ferrari  ferrarimex  1
ferrari  ferrarimex  1

I would like to have something like (I don’t necessarily require the grouping column to be named actions_completed_by_unique_user, would be a nice to have):

n_brand  population  actions_completed_by_unique_user
pepsico  pepsicoeur  3
pepsico  pepsicousa  5
pepsico  pepsicomex  2

My query is:

SELECT n_brand , population, actions_completed_by_unique_user
FROM agg_table
WHERE n_brand = 'pepsico'
AND actions_completed_by_unique_user > 0
GROUP BY population

But it only takes the first occurrence of each population and doesn’t sum the values. Is it possible to do this using SQL alone, or do I have to do it programmatically?

How to solve :

Method 1

To resolve this, I did the following. All the code below is available on the fiddle here.

I took another look at this and realised that my first answer (see edit history) had merely "handed you a fish", rather than "teaching you how to fish(*)"!

Overview:

You have two issues here. The first is easy enough to spot and deal with: aggregation, in this case SUM()-ing over and then GROUP-ing BY certain fields (see this and this), which is the "compact"-ing in the question. The second problem is more subtle:

  • MySQL in this case doesn’t perform GROUP BY correctly! "What??", I hear you say, "the most popular Open Source server can’t do a simple query??". And the answer is that, in this case, sadly, "No, it can’t!".

Part 1 – clarification of MySQL’s GROUP BY:

This latter issue is potentially both more harmful and confusing, and requires clarification before answering the question, so I will deal with it first.

The first thing to do is to create a test table and populate it with your data:

CREATE TABLE test
(
  n_brand VARCHAR (25) NOT NULL,
  population VARCHAR (25) NOT NULL,
  actions_u_user INTEGER NOT NULL
);

data:

INSERT INTO test VALUES
('pepsico', 'pepsicoeur',   1),
('pepsico', 'pepsicoeur',   1),
('pepsico', 'pepsicoeur',   1),
('pepsico', 'pepsicousa',   1),
('pepsico', 'pepsicousa',   2),
('pepsico', 'pepsicousa',   2),
('pepsico', 'pepsicomex',   0),
('pepsico', 'pepsicomex',   2),
('ferrari', 'ferrarieur',   1),
('ferrari', 'ferrarieur',   1),
('ferrari', 'ferrariusa',   0),
('ferrari', 'ferrarimex',   1),
('ferrari', 'ferrarimex',   1);

If I run (on dbfiddle):

SHOW VARIABLES LIKE '%sql_mode%';

I obtain:

Variable_name   Value
sql_mode    ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION

Note the ONLY_FULL_GROUP_BY (in this case it’s the first entry but that’s unimportant). Now, if you run the SHOW VARIABLES LIKE '%sql_mode%' command above on your own system it won’t have the ONLY_FULL_GROUP_BY bit. How do I know this?

Well, if I try to run your query on dbfiddle:

SELECT n_brand , population, actions_u_user
FROM test
WHERE n_brand = 'pepsico'
AND actions_u_user > 0
GROUP BY population

I get:

Expression #3 of SELECT list is not in GROUP BY clause and
contains nonaggregated column 'db_1748585530.test.actions_u_user'
which is not functionally dependent on columns in GROUP BY
clause; this is incompatible with sql_mode=only_full_group_by

This is because the server setting for sql_mode on dbfiddle.uk has been (correctly) set to include the ONLY_FULL_GROUP_BY option and therefore your query won’t work!
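
To check your own server without reading through the whole mode string, something along these lines should work. It is only a sketch, assuming MySQL 5.7 or later; the alias only_full_group_by_enabled is made up for illustration.

```sql
-- Show the sql_mode in effect for the current session.
SELECT @@SESSION.sql_mode;

-- FIND_IN_SET() returns the 1-based position of an item in a
-- comma-separated list, or 0 if the item is absent, so this yields
-- 1 when ONLY_FULL_GROUP_BY is enabled and 0 when it is not.
SELECT FIND_IN_SET('ONLY_FULL_GROUP_BY', @@SESSION.sql_mode) > 0
       AS only_full_group_by_enabled;
```

If the second query returns 0, the original GROUP BY population query will be accepted and will silently return arbitrary values.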

You’re possibly running a version of MySQL 5.7.x where x is < 5 (ONLY_FULL_GROUP_BY has been part of the default sql_mode since 5.7.5), or your server’s sql_mode has been changed from the default – see the article here and also check out the link to an article by the same author (Debunking GROUP BY myths). Read these after finishing this answer – they provide additional clarity on this surprisingly tricky issue!

This (not very useful) query will work:

SELECT n_brand , population, actions_u_user
FROM test
WHERE n_brand = 'pepsico'
AND actions_u_user > 0
GROUP BY n_brand, population, actions_u_user;

because you’ve GROUPed BY all of the fields in your SELECT.

Result:

n_brand     population  actions_u_user
pepsico     pepsicoeur  1
pepsico     pepsicomex  2
pepsico     pepsicousa  1
pepsico     pepsicousa  2

So, what’s this telling us?

Well, it’s telling us that we have one set of values pepsico pepsicoeur 1, one set pepsico pepsicomex 2 &c. – i.e. there is one row for each distinct combination of n_brand, population & actions_u_user.

This query:

SELECT DISTINCT n_brand, population, actions_u_user
FROM test
WHERE n_brand = 'pepsico'
  AND actions_u_user > 0
ORDER BY n_brand, population;

Will give the same result (probably cheaper) – see the fiddle.

You can run this, slightly more useful, query (note the SUM() aggregate):

SELECT n_brand , population, actions_u_user, SUM(actions_u_user) AS sum_p
FROM test
WHERE n_brand = 'pepsico'
AND actions_u_user > 0
GROUP BY n_brand, population, actions_u_user;

Result:

n_brand     population  actions_u_user  sum_p
pepsico     pepsicoeur               1      3
pepsico     pepsicomex               2      2
pepsico     pepsicousa               1      1
pepsico     pepsicousa               2      4

Slightly more useful – but the total sum of n_brand = pepsico with population = pepsicousa is 5 (4 + 1), so we’re partly on the way to answering the question (see below).

However, to see where the problem lies with NOT having ONLY_FULL_GROUP_BY in the sql_mode, let’s remove it!

SET sql_mode = 'STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';

So, it’s been removed! Now, we rerun your original query:

SELECT n_brand , population, actions_u_user
FROM test
WHERE n_brand = 'pepsico'
AND actions_u_user > 0
GROUP BY population;

and we obtain a "result" (this may or may not vary depending on your version of MySQL, the time of day, the phases of the moon and the direction the wind happens to be blowing!):

n_brand     population  actions_u_user
pepsico     pepsicoeur  1
pepsico     pepsicomex  2
pepsico     pepsicousa  1

"Great", you might think, "I have an answer!" – indeed you do – but WTF is the question? You have 3 records and, on inspection, they are actually the first record (by INSERTion) for each different population where n_brand = 'pepsico'.

It appears that MySQL puts an implicit PRIMARY KEY on the table (see here) and those are the first records by implicit PK which are returned for different values of population. But, hey, who knows?

God help you if you rely on this behaviour! It’s not documented (or rather it’s documented as being "undetermined")! If values are UPDATEd, what’s the result then? DELETEd? Result? Better by far to have a deterministic documented answer – which is what you’ll get if you include ONLY_FULL_GROUP_BY in your sql_mode.

I like to think of it as providing enough information to the server to provide rational, deterministic answers to SQL queries! If you’re ever in doubt, do what I do – and test on PostgreSQL, a sane RDBMS!
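
As an illustration of "providing enough information", the sketch below tells the server exactly which value to return per population instead of leaving the choice to it. It reuses the test table from above; the aliases smallest_value and largest_value are invented for this example.

```sql
-- Every selected column is either in GROUP BY or wrapped in an aggregate,
-- so the result is the same with or without ONLY_FULL_GROUP_BY.
SELECT
  population,
  MIN(actions_u_user) AS smallest_value,
  MAX(actions_u_user) AS largest_value
FROM test
WHERE n_brand = 'pepsico'
  AND actions_u_user > 0
GROUP BY population
ORDER BY population;
```

With the sample data this should return 1 and 1 for pepsicoeur, 2 and 2 for pepsicomex, and 1 and 2 for pepsicousa. If you genuinely don't care which value comes back, MySQL 5.7.5+ also offers ANY_VALUE(), which at least makes that choice explicit in the query.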

So, that’s that! Set your sql_mode to include ONLY_FULL_GROUP_BY and you won’t fall into this particular tar-pit!
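
One possible way to switch it on for the current session without retyping the other modes is sketched below. This assumes you only want a session-level change; making it permanent would normally mean setting sql_mode in the server configuration instead.

```sql
-- Append ONLY_FULL_GROUP_BY to whatever modes the session already has.
-- NULLIF() turns an empty mode string into NULL, which CONCAT_WS() skips,
-- so no stray leading comma is produced.
SET SESSION sql_mode =
  CONCAT_WS(',', NULLIF(@@SESSION.sql_mode, ''), 'ONLY_FULL_GROUP_BY');

-- Confirm the change.
SELECT @@SESSION.sql_mode;
```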

Part 2 – the answer to the question:

Let’s move on to (far more interesting) aggregation. The five major aggregation functions are:

  • AVG() – return the average value.
    
  • COUNT() – return the number of values.
    
  • MAX() – return the maximum value.
    
  • MIN() – return the minimum value.
    
  • SUM() – return the sum of all or distinct values.  
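
To see all five side by side, here is a small sketch against the test table, grouped by brand only. The column aliases are invented for illustration.

```sql
-- One output row per brand, with each of the five aggregates
-- computed over that brand's rows.
SELECT
  n_brand,
  COUNT(*)            AS row_count,
  SUM(actions_u_user) AS total,
  MIN(actions_u_user) AS smallest,
  MAX(actions_u_user) AS largest,
  AVG(actions_u_user) AS average
FROM test
GROUP BY n_brand
ORDER BY n_brand;
```

For the sample data, ferrari has 5 rows totalling 4 (minimum 0, maximum 1, average 0.8) and pepsico has 8 rows totalling 10 (minimum 0, maximum 2, average 1.25).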
    

(set our sql_mode back to its original setting – not critical for properly formulated queries, but why not? (See the fiddle)).

So, if we take the SUM() (since that’s the relevant one here) over all the values in the table, that’s all fine and dandy, i.e.

SELECT 
  SUM(actions_u_user) AS "The sum of all" 
FROM test;

Result:

The sum of all
            14

but if, as is the case now, you wish to get a SUM() over 6 different values of n_brand & population, you have to GROUP BY these values, otherwise what do these SUM()s correspond to?

So, we run this:

SELECT 
  n_brand AS "The Brand",      

  -- for presentation purposes. Multi-word aliases have to be quoted;
  -- double quotes are used here (backticks also work in MySQL).
  -- I quote aliases all the time anyway!

  population AS "Population", 
  SUM(actions_u_user) AS "The column name"  

  -- or AS col_name - no quotes necessary if there are no
  -- spaces in the column alias
    
FROM test
GROUP BY n_brand, population
ORDER BY n_brand, population;

Result:

The Brand   Population  The column name
ferrari     ferrarieur                2
ferrari     ferrarimex                2
ferrari     ferrariusa                0
pepsico     pepsicoeur                3
pepsico     pepsicomex                2
pepsico     pepsicousa                5
6 rows

So, now we have the SUM() of actions_u_user over the entire table for a given value of n_brand & population. Finally, we can refine this query to return the desired result set – we’re only interested in pepsico and in rows where actions_u_user is > 0 (rows with 0 add nothing to the SUM() anyway), hence:

SELECT 
  n_brand AS "The Brand",      

  -- for presentation purposes. Multi-word aliases have to be quoted;
  -- double quotes are used here (backticks also work in MySQL).

  population AS "Population", 
  SUM(actions_u_user) AS "The column name"  

  -- or AS col_name - no quotes necessary if there are no
  -- spaces in the column alias
    
FROM test

WHERE n_brand = 'pepsico'  -- added these two lines
  AND actions_u_user > 0

GROUP BY n_brand, population
ORDER BY n_brand, population;

Result:

The Brand   Population  The column name
  pepsico   pepsicoeur                3
  pepsico   pepsicomex                2
  pepsico   pepsicousa                5

Which is the desired result!
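
Note that the WHERE clause filters individual rows before they are grouped. If what you actually want is to keep only those groups whose total is greater than zero, that condition belongs in a HAVING clause. A sketch of that variant follows; the alias total_actions is made up for this example.

```sql
SELECT
  n_brand,
  population,
  SUM(actions_u_user) AS total_actions
FROM test
WHERE n_brand = 'pepsico'          -- row filter, applied before grouping
GROUP BY n_brand, population
HAVING SUM(actions_u_user) > 0     -- group filter, applied after aggregation
ORDER BY n_brand, population;
```

With this sample data both forms return the same three rows, because no value is negative. The two would only differ if a group contained negative values.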

I would urge you to study up on, and master, aggregate functions – they are the "bread and butter" of SQL and will lead on to stuff such as window functions (and other goodies) which are ever more powerful ways of moving your data up to the apex of the DIKW pyramid!

Finally, I would urge you to look here and here to see the perils of not having ONLY_FULL_GROUP_BY set in MySQL – I have seen gurus say that in the hands of an expert, it can be OK to unset it – personally, I wouldn’t… why bother for starters?

Queries can provide meaningful results without having to resort to this abomination… so I say no! YMMV! Or just switch to PostgreSQL! However, see the article by Roland Bouman (Debunking GROUP BY myths) cited above for an alternative viewpoint and a neat workaround. This will provide further explanation.

p.s. welcome to dba.se and +1 – this answer forced me to marshal my thoughts!

p.p.s. in future, when asking a question, could you please provide a working fiddle – it helps to have a single point of truth for a question and eliminates duplication of effort – help us to help you!

Method 2

I think you’re looking for:

SELECT n_brand, population, SUM(actions_completed_by_unique_user) as actions
  FROM agg_table
 WHERE n_brand = 'pepsico'
 GROUP BY n_brand, population

There’s no need for actions_completed_by_unique_user > 0 unless you have negative values. I also assume that n_brand is the same as platform.

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0.