All we need is an easy explanation of the problem, so here it is.
Starting from SQL Server 2019, it supports UTF-8 as collation. However, according to the following queries:
SELECT COLLATIONPROPERTY('Arabic_100_CS_AS_KS_WS_SC_UTF8', 'CodePage') SELECT COLLATIONPROPERTY('Latin1_General_100_CS_AS_KS_WS_SC_UTF8', 'CodePage');
both return code page
65001 which is Unicode in Windows. Also all new
_UTF8 collations use code page
SELECT * FROM sys.fn_helpcollations() WHERE name LIKE '%_UTF8';
Are there any differences between using
Latin1_General_100_CS_AS_KS_WS_SC_UTF8 as collation?
How to solve :
I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.
_UTF8 collations use code page 65001 as that is the code page for UTF-8. You can even use 65001 in a DOS / Command Window via:
though not all programs and fonts will work seamlessly with it.
_UTF8 collations, the code page is not controlled by the culture (i.e.
Arabic) as it is for non-
_UTF8 collations because code pages indicate the specific 8-bit encoding used for
VARCHAR data (i.e. 8-bit character data). For non-Unicode 8-bit encodings, the culture is often tied to the code page which is the character set (e.g. Latin1 is code page Windows-1252 which has different characters in the 128-255 range than Windows-1255 which is the code page for Hebrew). But for UTF-8, it is the 8-bit encoding for the singular, all-encompassing character set that is Unicode.
As far as differences between
Latin1_General_100_CS_AS_KS_WS_SC_UTF8 go, it would really only be the culture-specific rules for sorting and comparison of various characters. Of course, those two languages do not really share any characters, but there can still be differences in how some code points are handled.
Looking through the "Windows Server 2008 Sorting Weight Table" file (which is what the version
_100_ collations are mostly based on, from what I’m told), I can’t find any sort/comparison differences between those two collations. So, they are likely the same in terms of behavior. However, they are not the same in the sense that they still have a different LCID (the locale/culture identifier) so converting their values to non-UTF8
VARCHAR could result in data loss/corruption, and any process/functionality looking at the collation to determine some other behavior might behave differently.
That being said, I did find an example of a difference in behavior for Arabic characters when using an Urdu collation as those collations do have a few modifications to the default sort weights (9 registered in the "Windows Server 2008 Sorting Weight Table" file).
Looking at the "Teh Marbuta" character (U+0629), it has a weight of 29 in the default table (i.e. the table used for US English / Latin1), which has a lower sort weight than the "Peheh" character (U+06A6), which has a default weight of 137. The 41 indicates which "script" the character is in, and these are both Arabic characters. However, the Urdu collations modify the sort weight of "Teh Marbuta" (U+0629) to be 183, which then has a higher sort weight than "Peheh" (U+06A6), still being 137.
-- Default 0x0629 41 29 2 2 ;Arabic Teh Marbuta -- ة 0x06a6 41 137 2 2 ;Arabic Peheh -- ڦ -- Urdu modifications 0x0629 41 183 2 2 ;Teh Marbuta -- ة
If we sort those two characters using
Arabic_100_CS_AS_KS_WS_SC_UTF8, we should get the default behavior. And, even if we use a
Yakut collation, which uses the Cyrillic script and has its own modifications to the default sort weights, it doesn’t modify either of these Arabic characters, hence they should behave the same as when using a
SELECT * FROM (VALUES (1, NCHAR(0x0629)), (2, NCHAR(0x06a6))) tmp(ID, TheChar) ORDER BY tmp.[TheChar] COLLATE Latin1_General_100_CS_AS_KS_WS_SC_UTF8 ASC SELECT * FROM (VALUES (1, NCHAR(0x0629)), (2, NCHAR(0x06a6))) tmp(ID, TheChar) ORDER BY tmp.[TheChar] COLLATE Arabic_100_CS_AS_KS_WS_SC_UTF8 ASC SELECT * FROM (VALUES (1, NCHAR(0x0629)), (2, NCHAR(0x06a6))) tmp(ID, TheChar) ORDER BY tmp.[TheChar] COLLATE Yakut_100_CS_AS_KS_WS_SC_UTF8 ASC
All three of the queries shown above return the following results:
ID TheChar 1 ة 2 ڦ
However, when we switch to an
Urdu collation, the order of those two characters does indeed change:
SELECT * FROM (VALUES (1, NCHAR(0x0629)), (2, NCHAR(0x06a6))) tmp(ID, TheChar) ORDER BY tmp.[TheChar] COLLATE Urdu_100_CS_AS_SC_UTF8 ASC
ID TheChar 2 ڦ 1 ة
Finally, keep in mind that while rare to run into this, collations can also affect upper/lower -case mappings. I believe this is confined to only the
Turkish collations, and only for the letters ‘i’ and ‘I’ (those cultures have a dotted upper-case ‘I’ and a dotless lower-case ‘i’), but still best to be aware of the potential:
SELECT UPPER(N'i' COLLATE Arabic_100_CS_AS_KS_WS_SC_UTF8) AS [Arabic], UPPER(N'i' COLLATE Turkish_100_CS_AS_KS_WS_SC_UTF8) AS [Turkish], UPPER(N'i' COLLATE Azeri_Cyrillic_100_CS_AS_KS_WS_SC_UTF8) AS [Azeri_Cyrillic], UPPER(N'i' COLLATE Azeri_Latin_100_CS_AS_KS_WS_SC_UTF8) AS [Azeri_Latin];
Arabic Turkish Azeri_Cyrillic Azeri_Latin I İ İ İ
Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂