Unique array values for this string_to_array

This is a follow-up to an earlier question, based on these sample tables:

data_providers:
id  | field_map
---------------
201 | {"segments": "SEGMENT IDS", "full_name": "FULL NAME"}

leads:
id | data_provider_id | email | data
------------------------------------
1  | 201              | [email protected] | {"SEGMENT IDS": "id1,id1,id1,id2,id3", "FULL NAME": "John Doe"}
2  | 201              | [email protected] | {"FULL NAME": "Billy Bob"}

desired output:

data_provider_id | email | full_name | segment
----------------------------------------------
201              | [email protected] | John Doe  | id1
201              | [email protected] | John Doe  | id2
201              | [email protected] | John Doe  | id3
201              | [email protected] | Billy Bob | NULL

I have the following query:

SELECT
  leads.data_provider_id,
  leads.email,
  leads.data->>(p.field_map->>'full_name') AS full_name,
  segment
FROM leads
LEFT OUTER JOIN data_providers p ON p.id = leads.data_provider_id
LEFT JOIN LATERAL unnest(string_to_array(leads.data->>(p.field_map->>'segments'), ',')) AS segment ON true

This query does 2 particular things:

  1. It joins on the data_providers table to get the field_map column, which contains a JSONB mapping of CSV column headers. So something like {"segments": "SEGMENT IDS", "full_name": "FULL NAME"}

  2. Within the data JSONB column of leads, there is a key (which I discover through the field map above) that contains a comma-separated string of segment IDs (it comes in as a CSV and they chose to put multiple values within 1 row). I want to split it so each segment ID gets its own row (and obviously all other columns remain the same on every row).

I have 2 goals:

  1. If there is an empty string or the key doesn’t exist within the map, I want to return the row but just with NULL for the segment_id. I already got this working by changing CROSS JOIN to LEFT JOIN.

  2. I’m trying to remove duplicates in segment ids, so if someone enters ‘id1,id1’ it should only produce 1 row. I do this because there is a unique index on that column for the materialized view.
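A quick sanity check of why the LEFT JOIN covers goal 1, as a minimal sketch with literal values standing in for the leads.data column: both an empty string and a missing key end up producing zero rows from unnest(), which LEFT JOIN LATERAL ... ON true then pads with a single NULL row.

```sql
-- Empty string: string_to_array returns an empty array, so unnest yields no rows.
SELECT count(*) AS n
FROM   unnest(string_to_array('', ',')) AS segment;

-- Missing key: ->> returns NULL, string_to_array(NULL, ',') is NULL,
-- and unnest of a NULL array also yields no rows.
SELECT count(*) AS n
FROM   unnest(string_to_array('{"FULL NAME": "Billy Bob"}'::jsonb->>'SEGMENT IDS', ',')) AS segment;
```

Both statements should report n = 0. A CROSS JOIN LATERAL would drop the lead in these cases, which is why the LEFT JOIN is needed.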

I’m currently stuck on #2.

How to solve:

Method 1

Make it a subquery and throw in DISTINCT:

SELECT l.data_provider_id
     , l.email
     , l.data->>(p.field_map->>'full_name') AS full_name
     , s.segment
FROM   leads l
LEFT   JOIN data_providers p ON p.id = l.data_provider_id
LEFT   JOIN LATERAL (
   SELECT DISTINCT segment
   FROM   unnest(string_to_array(l.data->>(p.field_map->>'segments'), ',')) AS segment
   ) s ON true

The lookup key must match the key in field_map exactly: the sample field_map above holds ‘segments’, so looking up ‘segment’ would silently return NULL for every row.

You could even use this short syntax:

...
LEFT   JOIN LATERAL (
   SELECT DISTINCT unnest(string_to_array(l.data->>(p.field_map->>'segments'), ','))
   ) s(segment) ON true

(But the last one might make unsuspecting SQL purists cringe.)
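To try the subquery form end to end, here is a minimal sketch using temporary tables that mirror the sample data. The email addresses are placeholders, the provider id is assumed to be 201 so that the join matches, and the lookup key is 'segments' as in the sample field_map.

```sql
-- Hypothetical scratch tables mirroring the sample data; emails are placeholders.
CREATE TEMP TABLE data_providers (id int PRIMARY KEY, field_map jsonb);
CREATE TEMP TABLE leads (id int PRIMARY KEY, data_provider_id int, email text, data jsonb);

INSERT INTO data_providers VALUES
  (201, '{"segments": "SEGMENT IDS", "full_name": "FULL NAME"}');

INSERT INTO leads VALUES
  (1, 201, 'john@example.com',  '{"SEGMENT IDS": "id1,id1,id1,id2,id3", "FULL NAME": "John Doe"}'),
  (2, 201, 'billy@example.com', '{"FULL NAME": "Billy Bob"}');

SELECT l.data_provider_id
     , l.email
     , l.data->>(p.field_map->>'full_name') AS full_name
     , s.segment
FROM   leads l
LEFT   JOIN data_providers p ON p.id = l.data_provider_id
LEFT   JOIN LATERAL (
   -- DISTINCT collapses repeated segment ids per lead
   SELECT DISTINCT segment
   FROM   unnest(string_to_array(l.data->>(p.field_map->>'segments'), ',')) AS segment
   ) s ON true
ORDER  BY l.id, s.segment;
```

With these rows the query should return id1, id2 and id3 once each for John Doe, and a single row with a NULL segment for Billy Bob.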

The original order of array elements is not preserved. If you need that, number the elements with unnest() ... WITH ORDINALITY, use GROUP BY rather than DISTINCT, and aggregate the minimum ordinal position for each group of duplicates so you can order by it.
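One way to sketch the order-preserving variant, assuming PostgreSQL 9.4 or later for WITH ORDINALITY. The aliases u, ord and first_pos are illustrative names, and the second statement assumes the leads and data_providers tables from the question with the 'segments' key from the sample field_map.

```sql
-- Standalone: deduplicate while keeping first-occurrence order.
SELECT u.segment, min(u.ord) AS first_pos
FROM   unnest(string_to_array('id3,id1,id3,id2,id1', ','))
       WITH ORDINALITY AS u(segment, ord)
GROUP  BY u.segment
ORDER  BY first_pos;

-- The same idea inside the lateral subquery.
SELECT l.data_provider_id
     , l.email
     , l.data->>(p.field_map->>'full_name') AS full_name
     , s.segment
FROM   leads l
LEFT   JOIN data_providers p ON p.id = l.data_provider_id
LEFT   JOIN LATERAL (
   SELECT u.segment, min(u.ord) AS first_pos   -- first position of each distinct value
   FROM   unnest(string_to_array(l.data->>(p.field_map->>'segments'), ','))
          WITH ORDINALITY AS u(segment, ord)
   GROUP  BY u.segment
   ) s ON true
ORDER  BY l.id, s.first_pos;
```

The standalone statement should list id3, id1, id2 in that order, since those are the positions where each value first appears.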

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
