I have some data like this:
| metaphone | lag |
|---|---|
| FLKSW | [null] |
| PPS | FLKSW |
| PPS | PPS |
| PSP | PPS |
And I want to compare the string values in both columns on the following condition: they’re similar (assign some value, like 1) if they share at least 2 chars. Otherwise, they’re not similar.
So in the example, PPS and PSP would be similar.
How can this substring comparison be achieved?
I know one approach would be to extract substrings and manually compare them, but it feels hacky and I don’t know the maximum number of chars that can occur.
How to solve:
Method 1
> they’re similar … if they share at least 2 chars.
Unfortunately, there is no built-in "intersect" operator or function for strings or arrays. You can roll your own function to count overlapping characters:
CREATE FUNCTION f_count_overlapping_char(text, text)
  RETURNS int
  LANGUAGE sql PARALLEL SAFE IMMUTABLE STRICT AS
$func$
SELECT count(*)::int
FROM  (
   -- string_to_array(x, NULL) splits the string into single characters
   SELECT unnest(string_to_array($1, NULL))
   INTERSECT ALL
   SELECT unnest(string_to_array($2, NULL))
   ) sub;
$func$;
INTERSECT ALL includes duplicate matching characters, so PPS and PSP overlap in 3 characters (P twice, S once). To fold duplicates, use just INTERSECT instead; the same pair then overlaps in 2.
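To make the difference concrete, here is a minimal sketch of the duplicate-folding variant next to the original. The name f_count_overlapping_char_distinct is made up for illustration and is not part of the original answer; the only change is INTERSECT in place of INTERSECT ALL.

```sql
-- Hypothetical variant: each distinct shared character is counted once
CREATE FUNCTION f_count_overlapping_char_distinct(text, text)
  RETURNS int
  LANGUAGE sql PARALLEL SAFE IMMUTABLE STRICT AS
$func$
SELECT count(*)::int
FROM  (
   SELECT unnest(string_to_array($1, NULL))
   INTERSECT
   SELECT unnest(string_to_array($2, NULL))
   ) sub;
$func$;

-- Compare both functions on the pair from the question
SELECT f_count_overlapping_char('PPS', 'PSP')          AS with_duplicates  -- 3
     , f_count_overlapping_char_distinct('PPS', 'PSP') AS distinct_only;   -- 2
```

Which one fits depends on whether "share at least 2 chars" should count a repeated character twice.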
Then your query can be:
SELECT *, f_count_overlapping_char(t1.metaphone, t2.metaphone) AS overlap
FROM   tbl t1
JOIN   tbl t2 ON t1.id < t2.id
             AND f_count_overlapping_char(t1.metaphone, t2.metaphone) >= 2;
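If the two values already sit in the same row, as in the sample data from the question, no self-join is needed. The following sketch assumes a table tbl with the columns metaphone and lag shown in the question; the output column name is_similar is an illustrative choice.

```sql
-- Assumes tbl(metaphone text, lag text) as in the question's sample data
SELECT metaphone
     , lag
       -- The function is STRICT, so a NULL lag yields NULL,
       -- which the CASE expression maps to 0 (not similar)
     , CASE WHEN f_count_overlapping_char(metaphone, lag) >= 2
            THEN 1
            ELSE 0
       END AS is_similar
FROM   tbl;
```

For the four sample rows this flags PPS / PPS and PSP / PPS as similar, while FLKSW / [null] and PPS / FLKSW (one shared character, S) are not.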
But it’s expensive and does not scale well with more rows in the table: the self-join compares every pair of rows, which is O(N²). Depending on your actual objective, there are various superior alternatives, like trigram similarity provided by the additional module pg_trgm.
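As a rough sketch of the pg_trgm route, assuming the same tbl with id and metaphone columns and permission to install the extension: similarity() returns a value between 0 and 1, and the % operator is true when that value exceeds the current pg_trgm.similarity_threshold (0.3 by default). Note that trigram similarity is a different measure from "at least 2 shared characters", so results will not match the function above one to one. The index name is illustrative.

```sql
-- Install the additional module (once per database)
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Optional: a trigram index that the % operator can use
CREATE INDEX tbl_metaphone_trgm_idx ON tbl USING gin (metaphone gin_trgm_ops);

-- Pairs whose trigram similarity exceeds the configured threshold
SELECT t1.metaphone AS metaphone1
     , t2.metaphone AS metaphone2
     , similarity(t1.metaphone, t2.metaphone) AS sim
FROM   tbl t1
JOIN   tbl t2 ON t1.id < t2.id
             AND t1.metaphone % t2.metaphone;
```

Whether this is a better fit depends on the actual objective, since very short strings such as three-letter metaphone codes produce only a handful of trigrams.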
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.