SELECT DISTINCT with extra-space data in a single column (no duplicates)


Hello, I am having problems with the data in this column:

Los Angeles
Manhatttan  Beach
New York
Palo Alto
San Francisco
Takoma  Park -- this city may have the same problem

How can I filter out those values? Is there an easy way with TRIM? I did my research, but I only found long SQL statements that I don't understand. It may be a simple error.
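For what it's worth, TRIM (added in SQL Server 2017) only removes leading and trailing characters, so it cannot fix an internal double space:

```sql
-- TRIM strips the edges only; the double space inside survives.
SELECT TRIM('  Manhatttan  Beach  ');
-- returns 'Manhatttan  Beach'
```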

This is the query:

-- 3. Write a query that will list all the cities that have customers with a heading of Cities. Only
-- list each city once (no duplicates) and sort in descending alphabetical order.

select distinct customer_city as cities
FROM customers
ORDER BY customer_city ASC

-- "Los  Angeles" (with the double space) is left in the results as a duplicate

How to solve:


Method 1

As Erik said in the comments, you should fix the bad data rather than trying to query around it. If you absolutely cannot fix the data, the query below will get the distinct list of cities by replacing double spaces with a single space:

SELECT DISTINCT REPLACE(customer_city, '  ', ' ') AS cities
FROM customers
-- With DISTINCT, the ORDER BY must reference the select list;
-- DESC matches the assignment's "descending alphabetical order".
ORDER BY cities DESC

This is a very basic example. However, if the input data is not being validated, double spaces may not be the only kind of whitespace causing duplicates.

Prior to SQL Server 2017, you need to daisy-chain multiple REPLACE calls to replace multiple characters. For example, this code replaces double spaces and tab characters with a single space:

SELECT DISTINCT REPLACE(REPLACE(customer_city, '  ', ' '), CHAR(9), ' ') AS cities
FROM customers
ORDER BY cities DESC

In SQL Server 2017 and later, you can use the TRANSLATE function to swap every character you are searching for with a single placeholder character, then replace that placeholder with an empty string. Note that this removes all whitespace, including the single spaces between words, which is what makes the values compare as exact duplicates:

SELECT DISTINCT REPLACE(TRANSLATE(customer_city, CHAR(9) + CHAR(10) + CHAR(13) + CHAR(32), '####'), '#', '') AS cities
FROM customers
ORDER BY cities DESC

This means you don't have to repeat REPLACE for every character you want to strip; you just add another character code (+ CHAR(?)) to the TRANSLATE source string and another placeholder character (#) to the target string. As you can see, the TRANSLATE example handles four characters in roughly the same amount of code as the two-character REPLACE chain needed in prior versions.
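One caveat with the query above: because it strips every space, "Los Angeles" comes out as "LosAngeles". If the displayed names should keep single spaces between words, a hedged variant (a sketch, not tested against your data) is to translate only the other whitespace characters to ordinary spaces and then collapse doubles:

```sql
-- Map tab, line feed, and carriage return to a space, then collapse
-- double spaces. Repeat the outer REPLACE (or nest it again) if
-- triple-or-more spaces can occur in the data.
SELECT DISTINCT
    REPLACE(TRANSLATE(customer_city, CHAR(9) + CHAR(10) + CHAR(13), '   '), '  ', ' ') AS cities
FROM customers
ORDER BY cities DESC
```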

Method 2

As has already been said, correcting the data is the best way to deal with situations like this, if that is possible. You could either fix it in place, or, if for some reason you need to keep the errant values (perhaps they match values in another system yours is loosely coupled with, which has become dependent on them), maintain a shadow column with normalised data, as Akina suggests.
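If fixing in place is an option, a one-time cleanup could look like the following sketch (assuming double spaces are the only problem; run it repeatedly, or in a loop, if triple spaces can occur):

```sql
-- In-place cleanup: collapse double spaces to single spaces.
-- The WHERE clause limits the update to rows that actually need it.
UPDATE customers
SET customer_city = REPLACE(customer_city, '  ', ' ')
WHERE customer_city LIKE '%  %';
```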

If you are dealing with a large amount of data, applying a function inside the DISTINCT may be a bad idea: the distinct operation implies a sort, which for a significant number of rows could result in an expensive spool to disk. If you have an appropriate index on customer_city, the query planner might otherwise be able to use it and avoid the sort entirely. You can minimise the cost by taking the DISTINCT of the raw column first (which can use the index), then applying the function and de-duplicating again:

SELECT DISTINCT FunctionToNormaliseCity(customer_city) AS cities
    FROM (SELECT DISTINCT customer_city FROM customers) AS subq
ORDER BY FunctionToNormaliseCity(customer_city) DESC

Also note that sorting by the same values (the result of the function) helps avoid an extra sort after the DISTINCT sort-and-filter. Obviously, for small amounts of data this is overkill, and you should instead keep the simpler query to make the code easier to understand.
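For completeness, the index the paragraph above assumes could be created like this (the index name is illustrative):

```sql
-- A plain nonclustered index on customer_city lets the inner
-- SELECT DISTINCT be satisfied by an ordered index scan, avoiding a sort.
CREATE INDEX IX_customers_customer_city ON customers (customer_city);
```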


All methods were sourced from, or are licensed under, CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
