All we need is an easy explanation of the problem, so here it is.
I’m trying to write a query (Postgres 8.2.15) which is checking whether the value in column1 consists of:
- English alphabet only (A-Z), there should be no specific characters with umlaut/ablaut coming from Spanish, Arabic, German etc.
- Single space, dot, dash, ampersand, apostrophe, brackets and pipe are allowed.
I have the below code which is working fine for space, dot and dash but when I insert some of the other symbols into the list the query is not producing the correct results:
select column1, case when column1 !~ '^.*[^A-Za-z0-9 .-].*$' then 'ok' else 'not ok' end as "check", from table1
How can I expand the list with ampersand, apostrophe, brackets and pipe?
How to solve :
I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.
Your SQL is as follows:
when column1 !~ '^.*[^A-Za-z0-9 .-].*$'
This would be better written as
WHEN column1 !~ '[^A-Za-z0-9 .-]+'
So, only ASCII strings with the letters, digits and spaces, full stops (periods) and hyphens.
You can check out the long version below, but if you want to have
ampersand, apostrophe, brackets and pipes to be allowed. – just use the (
\) backslash escape character – you can change your pattern above to:
WHEN column1 !~ '[^A-Za-z0-9&\(\)\| \''.-]+'
The double apostrophe is to escape it further, since it’s the pattern delimiter.
I created a table as follows (all the code below is available on the fiddle here – it’s 9.5 – that’s the oldest version that I could find):
CREATE TABLE str_test ( str TEXT NOT NULL );
and then entered some data as follows:
INSERT INTO str_test VALUES ('$ff asfd &*$'), ('xyz asfd abc'), ('abc 1234 agg'), ('adf اشكرك agg'), -- Arabic characters ('The essential thing is'), ('afda sfsáásfd sdfs'), -- á - a acute ('afda sfs谢谢你 sfd sdfs'), -- Chinese characters ('adf (((( &*$'), ('adf ||| &*$'), ('Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz'), -- German word ('afasdfsdfdadfdsf fasfsafsdafasdfasfaadsf');
Now, your original pattern is
'^.*[^A-Za-z0-9 .-].*$' so, I’ll run it using the
SUBSTRING() function. The strings which appear where will be
false for the
!~ pattern matching operator – it’s sometimes clearer if you spell out the string!
SELECT SUBSTRING (str, '^.*[^A-Za-z0-9 .-].*$') FROM str_test;
substring $ff asfd &*$ NULL NULL adf اشكرك agg NULL afda sfsáásfd sdfs afda sfs谢谢你 sfd sdfs adf (((( &*$ adf ||| &*$ Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz NULL 11 rows
So, what the pattern is doing is picking out all characters that are not alphnumerics (ASCII alphas) and then returning the entire string – this is because of the
^.* and the
.*$ – everything before and after.
If you don’t want the pipes and brackets, you can do the following:
SELECT SUBSTRING (str, '^.*[^A-Za-z0-9&\* .\|\$\(-]') FROM str_test;
substring NULL NULL NULL adf اشكرك NULL afda sfsáá afda sfs谢谢你 Rindfleischetikettierungsüberwachungsaufgabenü NULL 11 rows
So, we can see that the only strings that match the pattern are now the ones containing the Arabic characters, the
á (a acute character), the Chinese characters and the German umlauts.
* and bracket characters are now eliminated – this is done by putting the escape backslash (
\) character before them (apart from the
& which is not a regular expression meta-character – or special character).
Now, there is the
\w meta-character-class (also known as a shorthand class) – it stands for words –
SELECT SUBSTRING (str, '[^\w &\.\-\$\|\(\)\*]') FROM str_test;
substring 谢 12 rows
NULL except for the Chinese characters – the Arabic and the a acute and the German are eliminated – there’s only the Chinese to be tackled. So, I investigated and found this page which suggested that
\u4e00-\u9fa5 singled out Chinese characters
I decided to look at Japanese – this is the Japanese for thanks:
INSERT INTO str_test VALUES ('有り難う');
I also went here – and got the codes for Katekana/Hiragana…
So, the final code is
SELECT SUBSTRING (str, '[^\u3000\u3400-\u4DBF\u4E00-\u9FFFぁ-んァ-ン\w &\.\-\$\|\(\)\*]') FROM str_test;
substring ... ... all NULL ... 13 rows
Now, it’s not very clear from your question what, exactly, you wish to return as OK – you appear to not want very much to be OK.
Here is another fiddle with several trials of mine – look at them and make sure that you understand what’s going on – regular expressions are very powerful but they are also tricky – it can be difficult to figure out exactly what’s going on!
Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂