Select rows where array of letters is contained in given array

All we need is an easy explanation of the problem, so here it is.

I want to search for the letters in the "spelling" (text[]) column:
a n n o y t

I need it to find the words: any, annoy, no, an, toy
But not find: derivatives of annoy (annoying, annoyance), only find no once.

I also need the query to NOT find annoy if even a single letter is missing (such as anoyt).

I am using PostgreSQL 13.5

ronshome=# SELECT reference, word, spelling FROM word_mash_dictionary
           WHERE word LIKE 'annoy';
 reference | word  |  spelling
-----------+-------+-------------
       420 | annoy | {a,n,n,o,y}
(1 row)

This is the table structure:

                                                      Table "public.word_mash_dictionary"
   Column    |  Type  | Collation | Nullable |                         Default                         | Storage  | Stats target | Description
-------------+--------+-----------+----------+---------------------------------------------------------+----------+--------------+-------------
 reference   | bigint |           | not null | nextval('word_mash_dictionary_reference_seq'::regclass) | plain    |              |
 word        | text   |           |          |                                                         | extended |              |
 spelling    | text[] |           |          |                                                         | extended |              |
 ignore      | bigint |           |          |                                                         | plain    |              |
 list_100    | bigint |           |          |                                                         | plain    |              |
 list_300    | bigint |           |          |                                                         | plain    |              |
 list_500    | bigint |           |          |                                                         | plain    |              |
 list_800    | bigint |           |          |                                                         | plain    |              |
 list_1000   | bigint |           |          |                                                         | plain    |              |
 list_2000   | bigint |           |          |                                                         | plain    |              |
 list_3000   | bigint |           |          |                                                         | plain    |              |
 list_5000   | bigint |           |          |                                                         | plain    |              |
 list_7000   | bigint |           |          |                                                         | plain    |              |
 list_10000  | bigint |           |          |                                                         | plain    |              |
 word_length | bigint |           |          |                                                         | plain    |              |

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

The "is contained" operator <@ for arrays mostly does it:

SELECT reference, word, spelling
FROM   word_mash_dictionary
WHERE  spelling <@ '{a,n,n,o,y,t}'::text[];

This can be supported with a GIN index on the array, which makes it fast for big tables. Like:

CREATE INDEX ON word_mash_dictionary USING gin (spelling);

However, one element in the search array covers any number of matches in spelling. So '{a,n,o,y}' would find '{a,n,n,o,y}' etc. False positives for words with repeated letters.

A set-operation with EXCEPT ALL would be exact (consider each copy of the same element separately). Wrapped into a custom function:

CREATE OR REPLACE FUNCTION f_arr_is_contained(arr1 text[], arr2 text[])
  RETURNS bool
  LANGUAGE plpgsql IMMUTABLE STRICT PARALLEL SAFE AS
$func$
DECLARE
BEGIN
   PERFORM unnest(arr1) EXCEPT ALL SELECT unnest(arr2);

   RETURN NOT FOUND;
END
$func$;

If every single letter is covered in the second term, no row is returned, and FOUND is false. So return NOT FOUND.

I chose LANGUAGE plpgsql because the function cannot be "inlined" anyway, so plpgsql might be faster. You can test the equivalent alternative with LANGUAGE plpgsql:

CREATE OR REPLACE FUNCTION f_arr_is_contained_sql(arr1 text[], arr2 text[])
  RETURNS bool
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS
'SELECT NOT EXISTS (SELECT unnest (arr1) EXCEPT ALL SELECT unnest (arr2))';

The function cannot use any indexes, though, which would lead to expensive sequential scans over the whole table.

Combine both to be fast and exact:

SELECT reference, word, spelling
FROM   word_mash_dictionary
WHERE  spelling <@ '{a,n,o,y,t}'::text[]
AND    f_arr_is_contained(spelling, '{a,n,o,y,t}'::text[]);

db<>fiddle here

The first predicate finds all matches (and possibly some false positives) quickly with index support; the second predicate weeds out the (few!) false positives.

Aside, word and spelling should probably be declared NOT NULL.

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply