Extract a substring where the delimiter might appear more than once

I have a column of strings with this pattern <email> - <id>. Email is always the first string.

I would like to extract only the email address. The problem is that an email address can also contain hyphens, so I can’t be certain that the delimiter will occur only once.

So basically I would like to match .* until the last hyphen and extract this as email.

Well, it’s not exactly about administration; it’s about writing a query to extract data, so it’s in the data mining area. However, this forum is entirely database-related, so I think it’s more appropriate than Stack Overflow.

I only tried with SUBSTRING_INDEX() but I ended up getting bad results with it.

It’s a production system so I can’t really interfere with the design, hence the need to extract the info.

How to solve:

Method 1

There are a few possibilities here: Solution 1 uses standard MariaDB string functions, and Solution 2 makes use of regular expressions (regexes; excellent site here, quick start here). You can also use GENERATED columns to make your life easier.

Solution 1 (using ordinary MySQL/MariaDB string functions):

If you are sure that your data is clean, and every field starts with < + email + > - <.... more stuff..., you can do the following (all of the code for Solution 1 can be found on the fiddle here):

CREATE TABLE test_ter
(
  field VARCHAR (200) NOT NULL
);

data:

INSERT INTO test_ter VALUES
('<[email protected]> - <1345>'),
('<[email protected]> - <1345>'),
('<rubbish> - <[email protected]> - <1345>'),
('<more_rubbish> - <[email protected]> - <1345>'),
('<more stuff> - <[email protected] - <34343>');

Then, we run:

SELECT
  field, 
  INSTR(field, '> - <') AS instr, 
  POSITION('> - <' IN field) AS pos, 
  LOCATE('> - <', field) AS loc,
  SUBSTRING_INDEX(field, '> - <', 1) AS substr
FROM
  test_ter;

Result:

field                                        instr  pos  loc  substr
<[email protected]> - <1345>                      14   14   14  <[email protected]
<[email protected]> - <1345>                      14   14   14  <[email protected]
<rubbish> - <[email protected]> - <1345>           9    9    9  <rubbish
<more_rubbish> - <[email protected]> - <1345>     14   14   14  <more_rubbish
<more stuff> - <[email protected] - <34343>       12   12   12  <more stuff

We can see that SUBSTRING_INDEX() gets us closest to the answer we want – otherwise we’ll have to use more nested functions to obtain our desired result – see previous edits of this answer.

We combine SUBSTRING_INDEX() with the TRIM() function to obtain our answer:

SELECT 
  TRIM(LEADING '<' FROM SUBSTRING_INDEX(field, '> - <', 1))
FROM                                                                  
  test_ter;

Result:

TRIM(LEADING '<' FROM SUBSTRING_INDEX(field, '> - <', 1))
[email protected]
[email protected]
rubbish
more_rubbish
more stuff
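
To see why SUBSTRING_INDEX() with a count of 1 keeps everything before the first '> - <', the behaviour can be mimicked outside the database. The following is a minimal Python sketch, not MariaDB code: substring_index and trim_leading are hypothetical helpers that only approximate the SQL functions, and the sample row uses a made-up address.

```python
def substring_index(text: str, delim: str, count: int) -> str:
    """Approximate SUBSTRING_INDEX(text, delim, count) (an assumption, not the real implementation)."""
    if count == 0 or delim == "":
        return ""
    parts = text.split(delim)
    if count > 0:
        # Keep everything to the left of the count-th delimiter.
        return delim.join(parts[:count])
    # Negative count: keep everything to the right, counting from the end.
    return delim.join(parts[count:])


def trim_leading(text: str, prefix: str) -> str:
    """Approximate TRIM(LEADING prefix FROM text): strips every repeated leading copy."""
    while prefix and text.startswith(prefix):
        text = text[len(prefix):]
    return text


# Hypothetical row; the hyphen inside the address is not a problem because
# the delimiter being searched for is the longer string '> - <'.
field = "<user-name@example.com> - <1345>"
email = trim_leading(substring_index(field, "> - <", 1), "<")
print(email)                                 # user-name@example.com
print(substring_index(field, "> - <", -1))   # 1345>
```

A negative count works from the right-hand end of the string, which is what the "do it backwards" approach relies on.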

How much checking you need depends on how clean your input data is. I’m assuming there may be some invalid emails; the most basic check is that the string contains an @ sign.

Combining this with a GENERATED column lets you do quite a bit of checking before any bits hit the disk, as follows:

ALTER TABLE test_ter
ADD COLUMN email VARCHAR (200)
GENERATED ALWAYS AS
(
  CASE
    WHEN 
      INSTR(TRIM(LEADING '<' FROM SUBSTRING_INDEX(field, '> - <', 1)), '@') = 0 
        THEN NULL
    ELSE
      TRIM(LEADING '<' FROM SUBSTRING_INDEX(field, '> - <', 1))
  END
);

and to check: SELECT * FROM test_ter; – Result:

field                                          email
<[email protected]> - <1345>                        [email protected]
<[email protected]> - <1345>                        [email protected]
<rubbish> - <[email protected]> - <1345>            NULL
<more_rubbish> - <[email protected]> - <1345>   NULL
<more stuff> - <[email protected] - <34343>  NULL

So, we can see that records that don’t contain an email in the first < > pair are deemed to be NULL – but if you’re happy that your inputs are clean, then this is unnecessary.
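
The CASE expression in the generated column amounts to "take the first bracketed value, then discard it unless it contains an @". A rough Python equivalent, assuming the same '> - <' delimiter and using made-up rows (email_or_none is a hypothetical name), might look like this:

```python
from typing import Optional


def email_or_none(field: str) -> Optional[str]:
    # Like SUBSTRING_INDEX(field, '> - <', 1): keep the text before the first '> - <'.
    candidate = field.split("> - <", 1)[0]
    # Like TRIM(LEADING '<' FROM ...): drop any leading '<' characters.
    candidate = candidate.lstrip("<")
    # Like the CASE branch: no '@' means NULL (None in Python).
    return candidate if "@" in candidate else None


# Hypothetical sample rows, not the data from the fiddle.
rows = [
    "<first-last@example.com> - <1345>",
    "<rubbish> - <someone@example.com> - <1345>",
]
for row in rows:
    print(row, "->", email_or_none(row))
```

The second row yields None because only the first < > pair is inspected, which matches the NULL rows in the result above.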

You can also use an index on your GENERATED field to speed up searches if this is appropriate:

CREATE INDEX tt_email_ix ON test_ter (email);

Another answer raised the possibility that the < and the > were just placeholders and that your data is in the form [email protected] - stuff.... If so, all you require is something like

...
SUBSTRING_INDEX(field, ' ', 1) -- 1 space, or use 1 space and a hyphen: ' -'
...

This will truncate the string leaving only the email (see fiddle).

Solution 2 (using regexes):

You can do the following (all the code for Solution 2 can be found on the fiddle here):

CREATE TABLE test
(
  field VARCHAR (200) NOT NULL
);

Populate with some sample data:

INSERT INTO test VALUES
('<[email protected]> - <1345>'),
('<[email protected]> - <1345>'),
('<rubbish> - <[email protected]> - <1345>'),
('<more_rubbish> - <[email protected]> - <1345>');

and then run (using a regular expression – regex):

SELECT 
  REGEXP_SUBSTR
  (
    field, 
    '[A-Z][A-Z0-9._-]+@[A-Z0-9_-]+\.[A-Z]{2,4}'
  ) AS email
FROM 
  test;

Result:

email
[email protected]
[email protected]
[email protected]
[email protected]

Now, the simple regex that I’ve used for an email is [A-Z][A-Z0-9._-]+@[A-Z0-9_-]+\.[A-Z]{2,4}. You can make it as complex as you desire or require (see here); one regex solution linked to has 6,500 characters, which is perhaps overkill. A search will help you find your own compromise between a robust pattern and one that is simple enough for your needs.

Regex explained (an excellent site on regexes can be found here, quick start here):

  • [A-Z]

    must start with a single letter – i.e. A-Z or a-z. Not quite correct according to here – but this is just a simple first approximation. In MySQL/MariaDB, just [A-Z] will work with the non-case-sensitive collations which are the default.

  • [A-Z0-9._-]+

    the rest of the email before the @ sign – match the characters A-Z, a-z, 0-9 or ._- one or more times (the + “metacharacter” specifies this – see the quick start – metacharacters have a special meaning in regexes),

    the square brackets [ and ] enclose what are called character classes or character sets – see the quick start link above,

  • @ match the literal "at" sign,

  • [A-Z0-9_-]+ more letters, digits and _- for the site name,

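  • \. match the literal dot (full stop or period – i.e. . character). The . is escaped with the backslash (\) as the dot is also a metacharacter – unescaped it represents any single character – like underscore (_) in SQL. Note that inside a MySQL/MariaDB string literal a single backslash is normally consumed by the string parser, so writing \\. is the safer spelling,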

  • [A-Z]{2,4} the top-level domain – match the letters [A-Z] (and [a-z]) occurring 2, 3 or 4 times – e.g. .fr, .com or .info.

    The curly braces ({, }) specify the number of repetitions. If you just had {3}, that would mean exactly three occurrences of your desired pattern.
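
Putting the pieces above together, the pattern can be tried out with Python's re module as a quick sanity check. This is only a sketch: re.IGNORECASE stands in for the case-insensitive collation mentioned above, extract_email is a hypothetical helper, the sample strings are made up, and Python's regex engine is not guaranteed to behave identically to MariaDB's.

```python
import re
from typing import Optional

# letter, rest of local part, '@', site name, literal dot, 2-4 letters
EMAIL_RE = re.compile(r"[A-Z][A-Z0-9._-]+@[A-Z0-9_-]+\.[A-Z]{2,4}", re.IGNORECASE)


def extract_email(field: str) -> Optional[str]:
    """Return the first substring that looks like an email, or None."""
    match = EMAIL_RE.search(field)
    return match.group(0) if match else None


# Hypothetical inputs, not the rows from the fiddle.
print(extract_email("<rubbish> - <jane-doe@example.com> - <1345>"))  # jane-doe@example.com
print(extract_email("<ab@mail.example.com> - <9>"))                  # ab@mail.exam
print(extract_email("<no email here> - <1>"))                        # None
```

The second call shows one limitation of this simple pattern: because the site-name class contains no dot and the final part is capped at four letters, an address on a subdomain is cut short. Allowing a dot in the site-name class avoids that particular case.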

Be aware that regexes are expensive and depending on your table size and the length of your strings, your queries may be slow. You can reduce the query time cost at the expense of a bit of space on disk using GENERATED COLUMNs as follows:

CREATE TABLE test_bis
(
  field VARCHAR (200) NOT NULL,
  email VARCHAR (200) AS
  (
    REGEXP_SUBSTR
    (
      field, 
      '[A-Z][A-Z0-9._-]+@[A-Z0-9._-]+\.[A-Z]{2,4}'
    )
  ) PERSISTENT -- HDD cost,  also works with VIRTUAL - CPU cost.
);

After running the same INSERT (see fiddle), the result is:

field                                         email
<[email protected]> - <1345>                       [email protected]
<[email protected]> - <1345>                       [email protected]
<rubbish> - <[email protected]> - <1345>           [email protected]
<more_rubbish> - <[email protected]> - <1345>  [email protected]

You can index this PERSISTENT field to speed up searching:

CREATE INDEX fb_regex_email
ON test_bis (email);

As far as I can tell, MariaDB does not yet have functional (or expression) indexes (see PostgreSQL for example).

If you don’t wish to sacrifice HDD space, you can make the GENERATED column VIRTUAL instead, at the cost of CPU cycles. The choice is yours! I can’t test the index because the sample table is so small that MySQL just does a table scan anyway, index present or not.

I would just suggest that you test these solutions with your own hardware and your own data just to be sure that your performance is optimal for your requirements/constraints.

Method 2

Do it backwards: use the SUBSTRING_INDEX function to find the substring after the last hyphen (counting from the end of the string), then TRIM that part off the end of the value.

If the delimiter is strictly as shown (a hyphen with a space before and after), then use that three-character string as the delimiter. – Akina

Example

CREATE TABLE test
(
    field varchar (200) NOT NULL
);

INSERT INTO test VALUES
('[email protected] - 1234'),
('[email protected] - 5678'); 

SELECT
    TRIM(TRAILING 
        CONCAT(' - ', 
            SUBSTRING_INDEX(field, ' - ', -1))
        FROM field) AS email
FROM test;
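
The same "work from the end" idea can be checked outside the database. Below is a small Python sketch under the assumption that the delimiter is exactly ' - ' (space, hyphen, space); email_before_last_delim is a hypothetical helper and the sample values are made up.

```python
def email_before_last_delim(field: str, delim: str = " - ") -> str:
    """Return everything before the LAST occurrence of delim."""
    head, sep, _tail = field.rpartition(delim)
    # If the delimiter is absent, rpartition returns ('', '', field),
    # so fall back to the unchanged field.
    return head if sep else field


# Hypothetical values; hyphens inside the address are left alone
# because only the last ' - ' is treated as the separator.
print(email_before_last_delim("first-last@example.com - 1234"))  # first-last@example.com
print(email_before_last_delim("a - b@example.com - 5678"))       # a - b@example.com
print(email_before_last_delim("no-delimiter@example.com"))       # no-delimiter@example.com
```

One difference worth knowing: TRIM(TRAILING ...) removes every repeated copy of the suffix, so a value ending in ' - 7 - 7' would lose both copies in the SQL version, whereas this sketch removes only the last one.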

Method 3

“I have a column of strings with this pattern <email> - <id>.”

And therein lies your biggest problem.
You have two bits of Data in one field and that’s a fundamentally Bad Idea.

The first question you should ask before deciding how to store any Data is

How am I going to access this Data?

You really should have this in two, separate fields and then this "extraction" problem just "goes away".

Databases are really, really good at finding little bits of Data and putting them together.
They’re generally pretty rubbish at finding big bits of Data and pulling them apart.

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
