PostgreSQL similarity operator – how to optimize / index?

I’m a newbie to both SQL and PostgreSQL. I’m trying to figure out how to get this type of query to utilize an index for the trigram operation. I’m using PostgreSQL 12.7 on x86_64-pc-linux-gnu.

The basic idea is we take a search phrase, split it up into distinct words and then see how many ‘similarity’ matches we get on the database column of search-ready names. The more words that have similarity to words found in the search-ready name, the higher the score. We also consider the overall search phrase against the original name, as a multiplier for a ‘boost’ to the weight.

The dpl_base table has 71,000 rows and looks like this:

[image: dpl_base table definition]

The dpl_codes table has 100 rows and looks like this:

[image: dpl_codes table definition]

So far I’ve tried both:

create index trgm_idx_gist_dpl_base on dpl_base using gist (denied_name_searchable, denied_name_original gist_trgm_ops);    
create index trgm_idx_gin_dpl_base on dpl_base using gin (denied_name_searchable, denied_name_original gin_trgm_ops);

Along with various other ‘standard’ indices. With or without the indices, EXPLAIN ANALYZE on the query gives exactly the same plan, so the indices seem to make no difference. The queries run pretty quickly, usually in less than 3 seconds. Maybe I’m chasing something I don’t need to… I’m just trying to learn how to properly index for a query of this design:

SET pg_trgm.similarity_threshold = 0.35;
SELECT
/* create weighting value for the distinct-word hits within the SEARCHABLE column */
/* multiply by the similarity value for the original search phrase, against the ORIGINAL column */
(
  ('BAD' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int + 
  ('ACTOR' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int
) 
* (-(DPLB.DENIED_NAME_ORIGINAL <-> 'Bad Actor') + 1) AS WEIGHT,
/* add in the remaining columns from our two tables */
DPLB.DENIED_NAME_ORIGINAL, DPLC.DENIAL_REASON
FROM DPL_BASE DPLB 
INNER JOIN DPL_CODES DPLC ON DPLB.DENIAL_CODE = DPLC.DENIAL_CODE 
WHERE
/* must have at least one hit from our distinct words, in the SEARCHABLE column */
( 
  ('Bad' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int + 
  ('Actor' % ANY(STRING_TO_ARRAY(UPPER(DPLB.DENIED_NAME_SEARCHABLE),' ')))::int
) > 0 
ORDER BY WEIGHT DESC, DPLB.DENIED_NAME_ORIGINAL ASC;

Here’s an example of a query plan. Any tips or suggestions about (a) correct way to index and/or (b) better query design or optimization – would be very appreciated.

|QUERY PLAN                                                                                                                                                                                                                                                                                                                                                                                       |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|Gather Merge  (cost=9448.31..11733.27 rows=19584 width=62) (actual time=525.228..529.633 rows=204 loops=1)                                                                                                                                                                                                                                                                                       |
|  Workers Planned: 2                                                                                                                                                                                                                                                                                                                                                                             |
|  Workers Launched: 2                                                                                                                                                                                                                                                                                                                                                                            |
|  ->  Sort  (cost=8448.29..8472.77 rows=9792 width=62) (actual time=519.954..520.104 rows=68 loops=3)                                                                                                                                                                                                                                                                                            |
|        Sort Key: (((((('YOUTH'::text % ANY (string_to_array(upper((dplb.denied_name_searchable)::text), ' '::text))))::integer + (('SOCIETY'::text % ANY (string_to_array(upper((dplb.denied_name_searchable)::text), ' '::text))))::integer))::double precision * ((- ((dplb.denied_name_original)::text <-> 'Youth Society'::text)) + '1'::double precision))) DESC, dplb.denied_name_original|
|        Sort Method: quicksort  Memory: 34kB                                                                                                                                                                                                                                                                                                                                                     |
|        Worker 0:  Sort Method: quicksort  Memory: 34kB                                                                                                                                                                                                                                                                                                                                          |
|        Worker 1:  Sort Method: quicksort  Memory: 34kB                                                                                                                                                                                                                                                                                                                                          |
|        ->  Hash Join  (cost=4.25..7799.21 rows=9792 width=62) (actual time=23.524..519.630 rows=68 loops=3)                                                                                                                                                                                                                                                                                     |
|              Hash Cond: (dplb.denial_code = dplc.denial_code)                                                                                                                                                                                                                                                                                                                                   |
|              ->  Parallel Seq Scan on dpl_base dplb  (cost=0.00..7229.60 rows=9792 width=70) (actual time=22.937..516.516 rows=68 loops=3)                                                                                                                                                                                                                                                      |
|                    Filter: (((('YOUTH'::text % ANY (string_to_array(upper((denied_name_searchable)::text), ' '::text))))::integer + (('SOCIETY'::text % ANY (string_to_array(upper((denied_name_searchable)::text), ' '::text))))::integer) > 0)                                                                                                                                                |
|                    Rows Removed by Filter: 23432                                                                                                                                                                                                                                                                                                                                                |
|              ->  Hash  (cost=3.00..3.00 rows=100 width=38) (actual time=0.401..0.407 rows=100 loops=3)                                                                                                                                                                                                                                                                                          |
|                    Buckets: 1024  Batches: 1  Memory Usage: 15kB                                                                                                                                                                                                                                                                                                                                |
|                    ->  Seq Scan on dpl_codes dplc  (cost=0.00..3.00 rows=100 width=38) (actual time=0.044..0.216 rows=100 loops=3)                                                                                                                                                                                                                                                              |
|Planning Time: 0.399 ms                                                                                                                                                                                                                                                                                                                                                                          |
|Execution Time: 530.078 ms    

How to solve :

Method 1

Turning Booleans into ints and then doing arithmetic on them is sure to screw up indexing.

(foo::int + bar::int) >0

Should be the same thing as:

foo or bar

only the latter has a much better chance of being indexed. Also,

'cat' % ANY(string_to_array('hot dog',' '))

Should be similar to, but not exactly the same as

'cat' <% 'hot dog'

But again, the latter has at least some chance of using an index. Note that <% is governed by pg_trgm.word_similarity_threshold rather than pg_trgm.similarity_threshold. Alternatively, decompose your table into a separate table that has one row for each element of string_to_array(upper((denied_name_searchable)::text), ' '::text), so that you don’t need to split the names on the fly.
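
The decomposition idea could look roughly like the sketch below. It is untested against the asker's schema and rests on assumptions: dpl_base is assumed to have a primary key column called id (the question does not show one), and the table and index names dpl_base_words and trgm_idx_gin_dpl_base_words are made up for illustration.

```sql
-- Assumes the pg_trgm extension is available.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- One row per word of each searchable name.
-- ASSUMPTION: dpl_base has a primary key column named "id".
CREATE TABLE dpl_base_words AS
SELECT b.id AS dpl_base_id,
       w.word
FROM dpl_base b
CROSS JOIN LATERAL
     unnest(string_to_array(upper(b.denied_name_searchable), ' ')) AS w(word)
WHERE w.word <> '';   -- skip empty strings produced by repeated spaces

-- Trigram index on the single-word column (index name is illustrative).
CREATE INDEX trgm_idx_gin_dpl_base_words
    ON dpl_base_words USING gin (word gin_trgm_ops);

-- Same threshold as in the question; % compares whole strings,
-- which are now individual words.
SET pg_trgm.similarity_threshold = 0.35;

SELECT DISTINCT b.denied_name_original
FROM dpl_base_words w
JOIN dpl_base b ON b.id = w.dpl_base_id
WHERE w.word % 'BAD'
   OR w.word % 'ACTOR';
```

The word table has to be kept in sync with dpl_base (for example by rebuilding it after loads), which is the price for not splitting the names in every query.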

Finally,

create index trgm_idx_gist_dpl_base on dpl_base using gist (denied_name_searchable, denied_name_original gist_trgm_ops);

the operator class (gist_trgm_ops) does not distribute over the comma. You need to specify it for each column. As written, only "denied_name_original" gets the trigram operator class, so that index can’t be used for trigram searching over "denied_name_searchable" at all. Also, there doesn’t seem to be any point in including "denied_name_original" in the index in the first place.
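
As a sketch of how these suggestions might fit together (not tested against the asker's data): give the searchable column its own trigram index with an explicit operator class, then filter with OR and <% so the planner can at least consider the index. The index name is made up, and the 0.35 word-similarity threshold is simply carried over from the question. Because <% is not exactly equivalent to % ANY(...), the result set may differ from the original query's.

```sql
-- Assumes the pg_trgm extension is available.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Operator class stated explicitly for the one column being searched.
-- (Index name is illustrative.)
CREATE INDEX trgm_idx_gin_dpl_base_searchable
    ON dpl_base USING gin (denied_name_searchable gin_trgm_ops);

-- <% has its own threshold setting, separate from similarity_threshold.
SET pg_trgm.word_similarity_threshold = 0.35;

EXPLAIN ANALYZE
SELECT
    (
      ('Bad'   <% dplb.denied_name_searchable)::int +
      ('Actor' <% dplb.denied_name_searchable)::int
    ) * (1 - (dplb.denied_name_original <-> 'Bad Actor')) AS weight,
    dplb.denied_name_original,
    dplc.denial_reason
FROM dpl_base dplb
INNER JOIN dpl_codes dplc ON dplb.denial_code = dplc.denial_code
-- Plain boolean OR instead of (a::int + b::int) > 0.
WHERE 'Bad'   <% dplb.denied_name_searchable
   OR 'Actor' <% dplb.denied_name_searchable
ORDER BY weight DESC, dplb.denied_name_original ASC;
```

The int arithmetic is kept only in the SELECT list, where it computes the weight and does not affect index use. If the plan still shows a sequential scan, that is not necessarily wrong: on a table of about 71,000 rows the planner may simply estimate the scan to be cheaper.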

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
