How to remove duplicates and sort in a subquery?

I have records with a "title" column that I am splitting up by space and performing a full text search with each word. I am storing the result in a materialized view.

This works, but I get duplicate results for various words and I need to sort the results by their ranking. I can do one or the other – not both. How do I do both?

My query:

SELECT
    asset.id,
    (
        select
            jsonb_agg(resultsForWord)
        FROM
            UNNEST(
                string_to_array(TRIM(regexp_replace(asset.title, '[^a-zA-Z+]', ' ', 'g')), ' ')
            ) as word
            INNER JOIN LATERAL 
            (
                SELECT
                    searchresult.id,
                    searchresult.title,
                    ts_rank(ts, to_tsquery ('english', word)) rank
                FROM
                    assets searchresult
                WHERE
                    searchresult.id != asset.id AND
                    ts_rank(ts, to_tsquery ('english', word)) > 0.5
                LIMIT 5
            ) AS resultsForWord ON 1=1
     ) results
FROM
    assets asset
WHERE asset.id = 'abc'
GROUP BY asset.id;

To filter out duplicates I just did

jsonb_agg(DISTINCT resultsForWord)

To order by rank I just did

jsonb_agg(resultsForWord ORDER BY rank DESC)

When I do both I get:

ERROR: in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list

Example data:

CREATE TABLE assets (
  id TEXT PRIMARY KEY,
  title TEXT,
  ts tsvector
   GENERATED ALWAYS AS (setweight(to_tsvector('english', coalesce(title, '')), 'A')) STORED
);

INSERT INTO assets (id, title) VALUES ('a', 'Hello world!'),
  ('b', 'Hello sir'),
  ('c', 'I am above the world'),
  ('d', 'World hello');

How to solve:

Method 1

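It seems you should flip the order of the join with UNNEST: start from the candidate rows and join each one to the words, so that every candidate row joins to at most one row (its best-ranking word). That way the same row cannot appear twice.

You can also remove the outer GROUP BY; it seems unnecessary.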

SELECT
    asset.id,
    (
        select
            jsonb_agg(results ORDER BY results.rank DESC)
        FROM (
            SELECT
                searchresult.id,
                searchresult.title,
                resultsForWord.rank
            FROM
                assets searchresult
            CROSS JOIN LATERAL 
            (
                SELECT ts_rank(ts, to_tsquery ('english', word)) rank
                FROM UNNEST(
                    string_to_array(TRIM(regexp_replace(asset.title, '[^a-zA-Z+]', ' ', 'g')), ' ')
                ) as word
                WHERE ts_rank(ts, to_tsquery ('english', word)) > 0.5
                ORDER BY rank DESC
                LIMIT 1
            ) AS resultsForWord
            WHERE
                searchresult.id != asset.id
            ORDER BY rank DESC
            LIMIT 5
        ) results
     ) results
FROM
    assets asset
WHERE asset.id = 'a';

Method 2

Since id is the PRIMARY KEY, there can only be a single match in the outer query with WHERE a.id = 'abc', so the outer GROUP BY is definitely not needed (as Charlie already suggested).

"Duplicate results" like you report can be introduced in multiple spots:

  1. Splitting up title produces duplicate words
  2. Multiple (distinct) words can match the same row

Remove dupes early.

This looks mighty convoluted:

unnest(string_to_array(trim(regexp_replace(a.title, '[^a-zA-Z+]', ' ', 'g')), ' '))

Consider regexp_split_to_table() instead:

regexp_split_to_table(a.title, '[^a-zA-Z]+')

(And I suggest you want '[^a-zA-Z]+' rather than '[^a-zA-Z+]'.)

The only shortcoming: it may produce leading or trailing empty strings, but those can be eliminated cheaply with a WHERE clause.
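To see the empty strings and the WHERE filter in action, here is a minimal sketch against a literal title. The title ' Hello, world!' is made up for illustration: it starts and ends with a non-letter, so the split yields an empty string at each end.

```sql
-- Split a literal title on runs of non-letters.
-- Without the WHERE clause this returns four rows: '', 'Hello', 'world', ''.
SELECT word
FROM   regexp_split_to_table(' Hello, world!', '[^a-zA-Z]+') AS word
WHERE  word <> '';  -- drop the leading and trailing empty strings
```

With the filter, only 'Hello' and 'world' remain. Because the pattern ends in +, the run ', ' counts as a single separator, so no empty string appears between the two words.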

So, I think, you rather want this query:

SELECT a.id
    , (  SELECT jsonb_agg(resultsforword)
         FROM  (
            SELECT *
            FROM  (
               SELECT DISTINCT ON (r.id)
                      r.id, r.title, r.rank
               FROM  (
                  SELECT DISTINCT word  -- remove duplicate words early
                  FROM   regexp_split_to_table(a.title, '[^a-zA-Z]+') word
                  WHERE  word <> ''  -- trim possible leading / trailing ''
                  ) w
               CROSS  JOIN LATERAL (
                  SELECT s.id, s.title
                       , ts_rank(s.ts, to_tsquery('english', w.word)) AS rank
                  FROM   assets s
                  WHERE  s.id <> a.id
                  AND    ts_rank(s.ts, to_tsquery('english', w.word)) > 0.5
                  ORDER  BY rank DESC
                  LIMIT  5                  -- max. 5 best matches per word
                  ) r
               ORDER  BY r.id, r.rank DESC  -- take best rank for each dupe result
               ) r
            ORDER  BY r.rank DESC, r.id     -- best rank overall, id as tiebreaker
            LIMIT  5                        -- max 5 overall 
            ) resultsforword
     ) AS results
FROM   assets a
WHERE  a.id = 'e';

This gets the 5 best matches for any of the words in the selected title. Looks more complex now, but operating with due diligence we have to:

  1. Extract words from selected title.
  2. Get the best (max.) 5 matches per word. This can match to the same row multiple times (with different rank).
  3. Get best rank for each resulting row – in case the same row matched on multiple words.
  4. Get 5 rows with best rank.
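Step 3 and the final ordering can be tried in isolation. The sketch below replaces the per-word matches with literal rows; the rank values are invented for illustration and are not real ts_rank output. Because duplicates are removed with DISTINCT ON in the subquery, the aggregate only needs ORDER BY and no DISTINCT, which is one way to avoid the "ORDER BY expressions must appear in argument list" error.

```sql
-- Row 'd' appears twice, as if it had matched two different words.
SELECT jsonb_agg(best ORDER BY best.rank DESC, best.id) AS results
FROM  (
   SELECT DISTINCT ON (m.id)
          m.id, m.title, m.rank
   FROM  (
      VALUES ('b', 'Hello sir',   0.61)   -- made-up ranks
           , ('d', 'World hello', 0.61)
           , ('d', 'World hello', 0.66)
      ) AS m(id, title, rank)
   ORDER  BY m.id, m.rank DESC            -- keep the best rank per id
   ) best;
```

The result should contain 'd' once, with its higher rank, followed by 'b'. A LIMIT on the overall result cannot go inside the aggregate call, so it needs its own subquery level, as in the full query above.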

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.