Query to divide each column value into two categories: [Common, Not_Common]

I have a table with the structure and data below:

create table PTEST
(
  col_name  VARCHAR(50),
  col_value VARCHAR(50)
)

    COL_NAME    COL_VALUE
   -----------------------
     first       apple
     first       banana
     second      apple
     second      banana
     second      orange
     third       apple
     third       banana

**) What I want to do is divide each value in the col_value column into two categories: [Common, Not common].

**) A value is considered 'Common' if it appears for every col_name. So apple is common, since it appears for col_name = first, col_name = second and col_name = third. The same is true for banana. Orange is not common, since it appears only for col_name = second.

The desired output would be like this:

    COL_NAME   COL_VALUE   STATUS
   ---------------------------------
    first       apple       Common
    first       banana      Common
    second      banana      Common
    second      apple       Common
    second      orange      Not common
    third       apple       Common
    third       banana      Common

The query I wrote for this is:

select col_name,
       col_value,
       case
         when count_col = count_val then
          'Common'
         else
          'Not common'
       end STATUS
  from (select t.col_name,
               count(distinct t.col_name) over() count_col,
               t.col_value,
               count(t.col_value) over(partition by t.col_value) count_val
          from PTEST t)

I was wondering if there are better ways to do that.

Thanks in advance

How to solve :

Method 1

Two ways of doing this are shown below. All of the code is available on the fiddle for SQL Server (with plans) here, and there is a performance analysis at the end.

Table:

CREATE TABLE ptest
(
  col_name  VARCHAR (50) NOT NULL,
  col_value VARCHAR (50) NOT NULL,
  
  CONSTRAINT name_value_uq UNIQUE (col_name, col_value)
);

Populate it:

INSERT INTO ptest VALUES
('first',  'apple'),
('first',  'banana'),
('second', 'apple'),
('second', 'banana'),
('third',  'apple'),
('third',  'banana'),
('third',  'orange');

First way:

First, we want to find out how many times a fruit appears in the table overall.

SELECT 
  col_name,
  col_value,
  COUNT(col_value) OVER (PARTITION BY col_value) AS cnt
FROM
  ptest;

Result:

col_name    col_value   cnt 
   first        apple     3 
   first       banana     3 
  second        apple     3 
  second       banana     3 
   third        apple     3 
   third       banana     3 
   third       orange     1 
7 rows

There are various ways of finding the fruits that appear fewer times than the maximum of cnt (3). With this data, that maximum equals the number of distinct col_name values, which is your definition of common, so we can see at a glance that orange is uncommon.

So, I’m using CTEs to do it:

WITH cte1 AS
(
  SELECT 
    col_name,
    col_value,
    COUNT(col_value) OVER (PARTITION BY col_value) AS cnt
    -- COUNT(col_value) OVER (PARTITION BY col_name ORDER BY col_value)
  FROM
    ptest
),
cte2 AS
(
  SELECT MAX (cnt) AS mcnt FROM cte1
)
SELECT * FROM cte1 WHERE cnt < (SELECT mcnt FROM cte2);

Result:

col_name    col_value   cnt
   third       orange     1

Et voilà!

To get something closer to your own original query (which does not work on SQL Server, see fiddle), you can do this (again, in the fiddle):

WITH cte1 AS
(
  SELECT 
    col_name,
    col_value,
    COUNT(col_value) OVER (PARTITION BY col_value) AS cnt
    -- COUNT(col_value) OVER (PARTITION BY col_name ORDER BY col_value)
  FROM
    ptest
),
cte2 AS
(
  SELECT MAX (cnt) AS mcnt FROM cte1
)
SELECT 
  col_name,
  col_value,
  CASE
    WHEN cnt < (SELECT mcnt FROM cte2) THEN 'Not common'
    ELSE 'Common'
  END AS status
FROM cte1;

Same result: all seven rows are returned, and only the orange row is not marked 'Common'.
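
One caveat with comparing against MAX(cnt): it only matches the question's definition when at least one value appears under every col_name. Below is a minimal sketch of a variant that compares each value's count against the number of distinct col_name values instead. It assumes the ptest table and the UNIQUE (col_name, col_value) constraint defined above (without that constraint, duplicate rows would inflate cnt). The CTE name counted is made up for illustration.

```sql
WITH counted AS
(
  SELECT
    col_name,
    col_value,
    -- number of rows per value; equals the number of names it appears
    -- under, because (col_name, col_value) is unique
    COUNT(*) OVER (PARTITION BY col_value) AS cnt
  FROM
    ptest
)
SELECT
  col_name,
  col_value,
  CASE
    -- common means: the value appears under every distinct col_name
    WHEN cnt = (SELECT COUNT(DISTINCT col_name) FROM ptest) THEN 'Common'
    ELSE 'Not common'
  END AS status
FROM counted
ORDER BY col_name, col_value;
```

With the sample data this should label the orange row 'Not common' and the other six rows 'Common'. For a table holding only ('first', 'apple') and ('second', 'banana'), the MAX(cnt) version would call both values common, whereas this one would not.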

Second way:

You can also do it as follows if you’re running an antique server (or a version of MySQL before 8.0 🙂 ) that doesn’t have window functions:

SELECT * FROM
(
  SELECT
    col_value, COUNT(col_value) AS cnt
  FROM 
    ptest
  GROUP BY col_value
) AS t
WHERE cnt < 
(
  SELECT MAX(cnt) FROM 
  (
    SELECT
      col_value, COUNT(col_value) AS cnt
    FROM 
      ptest
    GROUP BY col_value
  ) AS u
);

Result:

col_value   cnt
   orange     1

Et voilà encore!!
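
The query above lists only the uncommon values. To produce the full three-column output the question asks for without window functions, one option is to join each row to its per-value count and to the overall number of names. This is a sketch, assuming the same ptest table; the aliases p, v and n are arbitrary.

```sql
SELECT
  p.col_name,
  p.col_value,
  CASE
    WHEN v.cnt = n.name_cnt THEN 'Common'
    ELSE 'Not common'
  END AS status
FROM
  ptest AS p
JOIN
(
  -- how many distinct names each value appears under
  SELECT
    col_value, COUNT(DISTINCT col_name) AS cnt
  FROM
    ptest
  GROUP BY col_value
) AS v
  ON v.col_value = p.col_value
CROSS JOIN
(
  -- how many distinct names exist overall (a single row)
  SELECT COUNT(DISTINCT col_name) AS name_cnt
  FROM ptest
) AS n
ORDER BY p.col_name, p.col_value;
```

Using COUNT(DISTINCT col_name) in the grouped subquery means this version does not depend on the unique constraint.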

You ask in the question:

I was wondering if there are better ways to do that.

So, I added to the bottom of the fiddle the following lines (documented here):

SET STATISTICS PROFILE ON;  
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

and finally

SET SHOWPLAN_ALL ON;

It appears impossible to obtain very fine-grained timings from db<>fiddle, but the plans are interesting.

The window function query produces the following plan (23 lines):

|--Nested Loops(Inner Join, WHERE:([Expr1003]<[Expr1008]))  1   2   1   Nested Loops    Inner Join  WHERE:([Expr1003]<[Expr1008])       7   0   2.926E-05   47  0.02971825  [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_name], [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value], [Expr1003]     PLAN_ROW    False   1
           |--Stream Aggregate(DEFINE:([Expr1008]=MAX([Expr1007]))) 1   3   2   Stream Aggregate    Aggregate       [Expr1008]=MAX([Expr1007])  1   0   4.7E-06 11  0.01484579  [Expr1008]      PLAN_ROW    False   1
           |    |--Nested Loops(Inner Join) 1   4   3   Nested Loops    Inner Join          7   0   0.0001227688    11  0.01484109  [Expr1007]      PLAN_ROW    False   1
           |         |--Table Spool 1   5   4   Table Spool Lazy Spool          3   0   0   36  0.01471354  [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     PLAN_ROW    False   1
           |         |    |--Segment    1   6   5   Segment Segment [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     7   0   1.5944E-05  36  0.0146976   [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value], [Segment1011]      PLAN_ROW    False   1
           |         |         |--Sort(ORDER BY:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value] ASC))  1   7   6   Sort    Sort    ORDER BY:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value] ASC)      7   0.01126126  0.0001306923    36  0.01468165  [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     PLAN_ROW    False   1
           |         |              |--Index Scan(OBJECT:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[name_value_uq])) 1   8   7   Index Scan  Index Scan  OBJECT:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[name_value_uq])    [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value] 7   0.003125    0.0001647   36  0.0032897   [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     PLAN_ROW    False   1
           |         |--Nested Loops(Inner Join, WHERE:((1)))   1   9   4   Nested Loops    Inner Join  WHERE:((1))     2.333333    0   1.5944E-06  36  3.1888E-06  [Expr1007]      PLAN_ROW    False   4
           |              |--Compute Scalar(DEFINE:([Expr1007]=CONVERT_IMPLICIT(int,[Expr1012],0))) 1   10  9   Compute Scalar  Compute Scalar  DEFINE:([Expr1007]=CONVERT_IMPLICIT(int,[Expr1012],0))  [Expr1007]=CONVERT_IMPLICIT(int,[Expr1012],0)   1   0   1.5944E-07  36  1.75384E-06 [Expr1007], [Expr1007]      PLAN_ROW    False   4
           |              |    |--Stream Aggregate(DEFINE:([Expr1012]=Count(*)))    1   11  10  Stream Aggregate    Aggregate       [Expr1012]=Count(*) 1   0   1.5944E-06  36  1.5944E-06  [Expr1012]      PLAN_ROW    False   4
           |              |         |--Table Spool  1   12  11  Table Spool Lazy Spool          2.333333    0   0   36  0   [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     PLAN_ROW    False   4
           |              |--Table Spool    1   13  9   Table Spool Lazy Spool          2.333333    0   0   36  0   [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     PLAN_ROW    False   4
           |--Nested Loops(Inner Join)  1   14  2   Nested Loops    Inner Join          7   0   0.0001227688    47  0.0148411   [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_name], [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value], [Expr1003]     PLAN_ROW    False   1
                |--Table Spool  1   15  14  Table Spool Lazy Spool          3   0   0   43  0.01471355  [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_name], [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     PLAN_ROW    False   1
                |    |--Segment 1   16  15  Segment Segment [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     7   0   1.5944E-05  43  0.0146976   [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_name], [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value], [Segment1013]      PLAN_ROW    False   1
                |         |--Sort(ORDER BY:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value] ASC))   1   17  16  Sort    Sort    ORDER BY:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value] ASC)      7   0.01126126  0.0001306993    43  0.01468166  [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_name], [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     PLAN_ROW    False   1
                |              |--Index Scan(OBJECT:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[name_value_uq]))  1   18  17  Index Scan  Index Scan  OBJECT:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[name_value_uq])    [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_name], [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value] 7   0.003125    0.0001647   43  0.0032897   [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_name], [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     PLAN_ROW    False   1
                |--Nested Loops(Inner Join, WHERE:((1)))    1   19  14  Nested Loops    Inner Join  WHERE:((1))     2.333333    0   1.5944E-06  43  3.1888E-06  [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_name], [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value], [Expr1003]     PLAN_ROW    False   4
                     |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1014],0)))  1   20  19  Compute Scalar  Compute Scalar  DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1014],0))  [Expr1003]=CONVERT_IMPLICIT(int,[Expr1014],0)   1   0   1.5944E-07  43  1.75384E-06 [Expr1003], [Expr1003]      PLAN_ROW    False   4
                     |    |--Stream Aggregate(DEFINE:([Expr1014]=Count(*))) 1   21  20  Stream Aggregate    Aggregate       [Expr1014]=Count(*) 1   0   1.5944E-06  43  1.5944E-06  [Expr1014]      PLAN_ROW    False   4
                     |         |--Table Spool   1   22  21  Table Spool Lazy Spool          2.333333    0   0   43  0   [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_name], [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     PLAN_ROW    False   4
                     |--Table Spool 1   23  19  Table Spool Lazy Spool          2.333333    0   0   43  0   [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_name], [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     PLAN_ROW    False   4
    23 rows
    SQL Server parse and compile time: 
       CPU time = 0 ms, elapsed time = 0 ms.

And the "old-fashioned" one produces this plan of 11 lines:

|--Nested Loops(Inner Join, WHERE:([Expr1003]<[Expr1008]))  1   2   1   Nested Loops    Inner Join  WHERE:([Expr1003]<[Expr1008])       3   0   1.254E-05   20  0.02939043  [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value], [Expr1003]     PLAN_ROW    False   1
       |--Stream Aggregate(DEFINE:([Expr1008]=MAX([Expr1007]))) 1   3   2   Stream Aggregate    Aggregate       [Expr1008]=MAX([Expr1007])  1   0   2.3E-06 11  0.01468965  [Expr1008]      PLAN_ROW    False   1
       |    |--Compute Scalar(DEFINE:([Expr1007]=CONVERT_IMPLICIT(int,[Expr1015],0)))   1   4   3   Compute Scalar  Compute Scalar  DEFINE:([Expr1007]=CONVERT_IMPLICIT(int,[Expr1015],0))  [Expr1007]=CONVERT_IMPLICIT(int,[Expr1015],0)   3   0   0   11  0.01468735  [Expr1007]      PLAN_ROW    False   1
       |         |--Stream Aggregate(GROUP BY:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]) DEFINE:([Expr1015]=Count(*)))   1   5   4   Stream Aggregate    Aggregate   GROUP BY:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value])  [Expr1015]=Count(*) 3   0   5.7E-06 11  0.01468735  [Expr1015]      PLAN_ROW    False   1
       |              |--Sort(ORDER BY:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value] ASC))   1   6   5   Sort    Sort    ORDER BY:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value] ASC)      7   0.01126126  0.0001306923    36  0.01468165  [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     PLAN_ROW    False   1
       |                   |--Index Scan(OBJECT:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[name_value_uq]))  1   7   6   Index Scan  Index Scan  OBJECT:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[name_value_uq])    [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value] 7   0.003125    0.0001647   36  0.0032897   [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     PLAN_ROW    False   1
       |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1016],0)))    1   8   2   Compute Scalar  Compute Scalar  DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1016],0))  [Expr1003]=CONVERT_IMPLICIT(int,[Expr1016],0)   3   0   0   20  0.01468733  [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value], [Expr1003]     PLAN_ROW    False   1
            |--Stream Aggregate(GROUP BY:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]) DEFINE:([Expr1016]=Count(*)))    1   9   8   Stream Aggregate    Aggregate   GROUP BY:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value])  [Expr1016]=Count(*) 3   0   5.7E-06 20  0.01468733  [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value], [Expr1016]     PLAN_ROW    False   1
                 |--Sort(ORDER BY:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value] ASC))    1   10  9   Sort    Sort    ORDER BY:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value] ASC)      7   0.01126126  0.0001306723    16  0.01468163  [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     PLAN_ROW    False   1
                      |--Index Scan(OBJECT:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[name_value_uq]))   1   11  10  Index Scan  Index Scan  OBJECT:([fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[name_value_uq])    [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value] 7   0.003125    0.0001647   16  0.0032897   [fiddle_404696dc8d6846e389dd3d04d0ec3512].[dbo].[ptest].[col_value]     PLAN_ROW    False   1
11 rows
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

Given that we lack explicit timings (and, in any case, testing with such a small amount of data is more or less meaningless), I would urge you to test any and all proposed solutions against your own tables and hardware. However, as a rough rule of thumb, longer plans tend to be slower, and window functions incur an overhead! From here:

As you can see, there is a big performance hit with window aggregates over traditional methods.
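
For testing on your own server, here is a minimal sketch of wrapping one of the queries in the timing and I/O settings. It assumes the ptest table from above and a client that displays the messages output (SSMS or sqlcmd, for example). SET SHOWPLAN_ALL is left out because it has to be the only statement in its batch and, while it is on, statements are not executed.

```sql
-- report CPU/elapsed time and logical reads for what follows
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- query under test: per-value counts without window functions
SELECT
  col_value, COUNT(col_value) AS cnt
FROM
  ptest
GROUP BY col_value;

-- switch the reporting off again
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

Running each candidate query this way against a realistically sized copy of your data should give a fairer comparison than the seven-row sample.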

In future, when asking questions of this nature, could you please provide the fiddle yourself – it provides a single source of truth and eliminates duplication of effort – help us to help you! 🙂

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
