SQL Server PATINDEX issue/bug when using different case-sensitive collations

I have a function (which I found here years ago) that uses STUFF / PATINDEX to strip non-alphanumeric characters out of a string. On a case-insensitive collation it works fine. Recently I needed to use it on a database with a case-sensitive collation and found some odd behavior. If the %pattern% for PATINDEX specifies only lower case (e.g. %[^a-z0-9_-]%), then every upper-case letter is kept except "Z", which is removed, when the collation is Latin1_General_100_CS_AS. If the collation is SQL_Latin1_General_CP1_CS_AS, then upper-case "A"s are removed instead. Is this a bug, or did I miss something?

USE TestCollation
GO
PRINT '---------SourceString---------'
PRINT 'ABCDEFGHIJKLMNO_Z_A_PQRSTUVWXYZ-abcdefghijklmnopqrstuvwxyz'
GO
ALTER DATABASE TestCollation COLLATE SQL_Latin1_General_CP1_CI_AS;
GO
PRINT '---------SQL_Latin1_General_CP1_CI_AS---------'
GO

DECLARE @ExternalId VARCHAR(255) =
     'ABCDEFGHIJKLMNO_Z_A_PQRSTUVWXYZ-abcdefghijklmnopqrstuvwxyz'
DECLARE @return VARCHAR(255)
SET @return = @ExternalId
DECLARE @KeepValues AS VARCHAR(50)
SET @KeepValues = '%[^a-z0-9_-]%'
WHILE PATINDEX ( @KeepValues, @return ) > 0
BEGIN
    SET @return = STUFF ( @return, PATINDEX ( @KeepValues, @return ), 1, '' )
END
PRINT @return

go

ALTER DATABASE TestCollation COLLATE Latin1_General_100_CS_AS;
GO
PRINT '---------Latin1_General_100_CS_AS---------'
GO
DECLARE @ExternalId VARCHAR(255) =
     'ABCDEFGHIJKLMNO_Z_A_PQRSTUVWXYZ-abcdefghijklmnopqrstuvwxyz'
DECLARE @return VARCHAR(255)
SET @return = @ExternalId
DECLARE @KeepValues AS VARCHAR(50)
SET @KeepValues = '%[^a-z0-9_-]%'
WHILE PATINDEX ( @KeepValues, @return ) > 0
BEGIN
    SET @return = STUFF ( @return, PATINDEX ( @KeepValues, @return ), 1, '' )
END
PRINT @return
GO

ALTER DATABASE TestCollation COLLATE SQL_Latin1_General_CP1_CS_AS;
GO
PRINT '---------SQL_Latin1_General_CP1_CS_AS---------'
GO
DECLARE @ExternalId VARCHAR(255) =
     'ABCDEFGHIJKLMNO_Z_A_PQRSTUVWXYZ-abcdefghijklmnopqrstuvwxyz'
DECLARE @return VARCHAR(255)
SET @return = @ExternalId
DECLARE @KeepValues AS VARCHAR(50)
SET @KeepValues = '%[^a-z0-9_-]%'
WHILE PATINDEX ( @KeepValues, @return ) > 0
BEGIN
    SET @return = STUFF ( @return, PATINDEX ( @KeepValues, @return ), 1, '' )
END
PRINT @return

How to solve:

Method 1

This is not a bug. It is merely a difference in which case sorts first under a case-sensitive collation. While it might not appear that any sorting is being done, the [...] character-range wildcard used in both LIKE and PATINDEX does, in a sense, sort characters when applying a range such as a-z or 0-9: a range matches every character that sorts between its two endpoints under the collation in effect. So, the two options are:

  1. AaBbCc…Zz (a-z excludes A)
  2. aAbBcC…zZ (a-z excludes Z)

The SQL Server collations (i.e. those with names starting with SQL_) mostly use the first approach (upper case first), while the Windows collations (i.e. those with names not starting with SQL_) use the second (lower case first).

To illustrate the behavior:

SELECT * FROM (VALUES ('A'), ('a'), ('Z'), ('z')) tmp (val)
WHERE tmp.val LIKE '%[a-z]%' COLLATE SQL_Latin1_General_CP1_CS_AS
ORDER BY tmp.val COLLATE SQL_Latin1_General_CP1_CS_AS;
/*
a
Z
z
*/

SELECT * FROM (VALUES ('A'), ('a'), ('Z'), ('z')) tmp (val)
WHERE tmp.val LIKE '%[a-z]%' COLLATE Latin1_General_100_CS_AS
ORDER BY tmp.val COLLATE Latin1_General_100_CS_AS;
/*
a
A
z
*/

SELECT * FROM (VALUES ('A'), ('a'), ('Z'), ('z')) tmp (val)
WHERE tmp.val LIKE '%[a-z]%' COLLATE Latin1_General_100_BIN2
ORDER BY tmp.val COLLATE Latin1_General_100_BIN2;
/*
a
z
*/
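
If the goal really is to keep only lower-case letters, the binary-collation behavior shown in the last query can be applied to the stripping loop itself. The following is a minimal sketch, assuming a binary collation such as Latin1_General_100_BIN2 is acceptable for the pattern match (under it, ranges follow code-point order, so a-z covers only the 26 lower-case ASCII letters). The @pos variable is an illustrative addition; the other names are reused from the question.

```sql
-- Sketch: force a binary collation so that a-z means exactly a..z.
DECLARE @return VARCHAR(255) =
    'ABCDEFGHIJKLMNO_Z_A_PQRSTUVWXYZ-abcdefghijklmnopqrstuvwxyz';
DECLARE @KeepValues VARCHAR(50) = '%[^a-z0-9_-]%';

-- @pos (illustrative name) holds the position of the next unwanted character.
DECLARE @pos INT =
    PATINDEX(@KeepValues COLLATE Latin1_General_100_BIN2, @return);

WHILE @pos > 0
BEGIN
    -- Remove the single character found at @pos, then search again.
    SET @return = STUFF(@return, @pos, 1, '');
    SET @pos = PATINDEX(@KeepValues COLLATE Latin1_General_100_BIN2, @return);
END;

PRINT @return;
```

With the source string above, this should leave only the three underscores, the hyphen, and the lower-case alphabet, regardless of the database's default collation.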

FYI: You can avoid having to deal with the database’s default collation either by forcing a collation in the two calls to PATINDEX (here a case-insensitive one, so that a-z also covers the upper-case letters):

GO
DECLARE @ExternalId VARCHAR(255) =
    'ABCDEFGHIJKLMNO_Z_A_PQRSTUVWXYZ-abcdefghijklmnopqrstuvwxyz'
DECLARE @return VARCHAR(255)
SET @return = @ExternalId
DECLARE @KeepValues AS VARCHAR(50)
SET @KeepValues = '%[^a-z0-9_-]%'
WHILE PATINDEX ( @KeepValues COLLATE Latin1_General_100_CI_AS, @return ) > 0
BEGIN
    SET @return = STUFF ( @return,
                          PATINDEX ( @KeepValues COLLATE Latin1_General_100_CI_AS,
                                     @return ),
                          1,
                          '' )
END
PRINT @return
GO

or by adding A-Z to the pattern, so that the two ranges together cover both cases under either sort order:

SET @KeepValues = '%[^a-zA-Z0-9_-]%'
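
Since the question mentions that this logic lives in a function, both ideas can be combined so that the result no longer depends on the database default. This is only a sketch under stated assumptions: the function name dbo.StripNonAlphanumeric, its parameter, and the @Pos variable are hypothetical, and the choice of Latin1_General_100_BIN2 for the match is one option among several.

```sql
-- Hypothetical helper: name, parameter, and collation choice are assumptions.
CREATE FUNCTION dbo.StripNonAlphanumeric (@Input VARCHAR(255))
RETURNS VARCHAR(255)
AS
BEGIN
    -- Both cases are listed explicitly, and the match is forced to a
    -- binary collation so each range follows plain code-point order.
    DECLARE @Pattern VARCHAR(50) = '%[^a-zA-Z0-9_-]%';
    DECLARE @Pos INT =
        PATINDEX(@Pattern COLLATE Latin1_General_100_BIN2, @Input);

    WHILE @Pos > 0
    BEGIN
        -- Delete the offending character and look for the next one.
        SET @Input = STUFF(@Input, @Pos, 1, '');
        SET @Pos = PATINDEX(@Pattern COLLATE Latin1_General_100_BIN2, @Input);
    END;

    RETURN @Input;
END;
GO

-- Example call; '#', the space, and '!' should be stripped.
SELECT dbo.StripNonAlphanumeric('Ab#c D-1_2!') AS Cleaned;
```

GO is a batch separator understood by tools such as SSMS and sqlcmd, not a T-SQL statement; it is needed here because CREATE FUNCTION must be the first statement in its batch.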

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
