Look for a list of specific characters using RegEx

All we need is an easy explanation of the problem, so here it is.

I’m trying to write a query (Postgres 8.2.15) which is checking whether the value in column1 consists of:

  1. English alphabet only (A-Z), there should be no specific characters with umlaut/ablaut coming from Spanish, Arabic, German etc.
  2. Single space, dot, dash, ampersand, apostrophe, brackets and pipe are allowed.

I have the below code which is working fine for space, dot and dash but when I insert some of the other symbols into the list the query is not producing the correct results:

select
    column1,
    case 
        when column1 !~ '^.*[^A-Za-z0-9 .-].*$'
        then 'ok' 
        else 'not ok' 
        end as "check",
from 
    table1

How can I expand the list with ampersand, apostrophe, brackets and pipe?

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

Your SQL is as follows:

when column1 !~ '^.*[^A-Za-z0-9 .-].*$'

This would be better written as

WHEN column1 !~ '[^A-Za-z0-9 .-]+'

So, only ASCII strings with the letters, digits and spaces, full stops (periods) and hyphens.

You can check out the long version below, but if you want to have ampersand, apostrophe, brackets and pipes to be allowed. – just use the (\) backslash escape character – you can change your pattern above to:

WHEN column1 !~  '[^A-Za-z0-9&\(\)\| \''.-]+'

The double apostrophe is to escape it further, since it’s the pattern delimiter.

I created a table as follows (all the code below is available on the fiddle here – it’s 9.5 – that’s the oldest version that I could find):

CREATE TABLE str_test
(
  str TEXT NOT NULL
);

and then entered some data as follows:

INSERT INTO str_test VALUES
('$ff asfd &*$'), 
('xyz asfd abc'), 
('abc 1234 agg'), 
('adf اشكرك agg'),            -- Arabic characters 
('The essential thing is'),
('afda sfsáásfd sdfs'),       -- á - a acute
('afda sfs谢谢你 sfd sdfs'),  -- Chinese characters
('adf (((( &*$'), 
('adf ||| &*$'),
('Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz'), -- German word
('afasdfsdfdadfdsf fasfsafsdafasdfasfaadsf');

Now, your original pattern is '^.*[^A-Za-z0-9 .-].*$' so, I’ll run it using the SUBSTRING() function. The strings which appear where will be false for the !~ pattern matching operator – it’s sometimes clearer if you spell out the string!

SELECT 
  SUBSTRING (str, '^.*[^A-Za-z0-9 .-].*$') FROM str_test;

Result:

substring
$ff asfd &*$
NULL
NULL
adf اشكرك agg
NULL
afda sfsáásfd sdfs
afda sfs谢谢你 sfd sdfs
adf (((( &*$
adf ||| &*$
Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
NULL
11 rows

So, what the pattern is doing is picking out all characters that are not alphnumerics (ASCII alphas) and then returning the entire string – this is because of the ^.* and the .*$ – everything before and after.

If you don’t want the pipes and brackets, you can do the following:

SELECT
  SUBSTRING (str, '^.*[^A-Za-z0-9&\* .\|\$\(-]') FROM str_test;

Result:

substring
NULL
NULL
NULL
adf اشكرك
NULL
afda sfsáá
afda sfs谢谢你
Rindfleischetikettierungsüberwachungsaufgabenü
NULL
11 rows

So, we can see that the only strings that match the pattern are now the ones containing the Arabic characters, the á (a acute character), the Chinese characters and the German umlauts.

The &, |, $, * and bracket characters are now eliminated – this is done by putting the escape backslash (\) character before them (apart from the & which is not a regular expression meta-character – or special character).

Now, there is the \w meta-character-class (also known as a shorthand class) – it stands for words –

SELECT
  SUBSTRING (str, '[^\w &\.\-\$\|\(\)\*]') FROM str_test;

Result:

substring
谢
12 rows

All NULL except for the Chinese characters – the Arabic and the a acute and the German are eliminated – there’s only the Chinese to be tackled. So, I investigated and found this page which suggested that \u4e00-\u9fa5 singled out Chinese characters

I decided to look at Japanese – this is the Japanese for thanks: 有り難う.

INSERT INTO str_test VALUES ('有り難う');

I also went here – and got the codes for Katekana/Hiragana…

So, the final code is

SELECT
  SUBSTRING (str, '[^\u3000\u3400-\u4DBF\u4E00-\u9FFFぁ-んァ-ン\w &\.\-\$\|\(\)\*]') FROM str_test;

Result:

substring
...
... all NULL
...
13 rows

Now, it’s not very clear from your question what, exactly, you wish to return as OK – you appear to not want very much to be OK.

Here is another fiddle with several trials of mine – look at them and make sure that you understand what’s going on – regular expressions are very powerful but they are also tricky – it can be difficult to figure out exactly what’s going on!

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply