Postgres improvements in conversion functions for bulk load

All we need is an easy explanation of the problem, so here it is.

I regularly import data from httparchive.org. The data is a MySQL CSV export and I use pgloader, which handles the quirks of this export (\N for NULL), etc.
I also need to do some additional processing for normalisation:

  • split the url in protocol (http|https) and host parts
  • convert the string date "Mon DD YYYY" to date object

Currently, I have some triggers to do this when importing the data but I’m looking for ways to improve this, particularly to see whether some steps can be run in parallel.

I have the following CTE for extracting protocol and port:

with split as
(select regexp_match(url, '(https|http)://(.+)/' )as parts 
from urls )

Running locally this seems faster than tsdebug

This works well with a select but seems very slow as an update.

with split as
(select regexp_match(url, '(https|http)://(.+)/' )as parts 
from urls )
update urls
set 
protocol = parts[1],
host = parts[2] 
from split

An alternative, especially when working with a text source would be split the URL before it goes to Postgres.

The uncompressed CSV is 3526430884 bytes and takes around 20 minutes to import with no processing. But more than twice this with the processing. FWIW I have also tried using a Foreign Data Wrapper. But, even after solving various problems with the CSV (nulls, encoding), this leads to a memory error.

With some help, I’ve managed to run benchmarks and improve my triggers.

CREATE OR REPLACE FUNCTION public.extract_protocol()
 RETURNS trigger
 LANGUAGE plpgsql
AS $function$
DECLARE elements text [];
BEGIN
elements := regexp_match(NEW.url, '(https|http)://(.+)/');
NEW.protocol = elements[1];
NEW.host = elements[2];
RETURN NEW;
END;
$function$

This now runs faster than doing a subsquent update but neither are limiting factors. The bottleneck is now the overhead of the indices when inserting the cleaned up data into the main table. I think my only options there are to weigh up the cost of indices for the insert over disabling and then adding them.

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

Your UPDATE syntax generates a cross join of the table urls with the split result. Which is essentially a cross join of the table with itself.

You need to have some kind of join condition between the target table and the source. The obvious choice would be the primary key column of the table.

with split as (
  select pk_column, regexp_match(url, '(https|http)://(.+)/' ) as parts 
  from urls 
)
update urls
  set protocol = s.parts[1],
      host = s.parts[2] 
from split s 
where urls.pk_column = s.pk_column --<< here

I think your attempt to avoid evaluating the regex expression twice by using the CTE makes things slower, rather than faster. I would expect the overhead of joining the table with itself is far bigger than the overhead of evaluating the regex twice.

So I think, you should also try:

update urls
  set protocol = (regexp_match(url, '(https|http)://(.+)/' ))[1]
      host = (regexp_match(url, '(https|http)://(.+)/' ))[2]

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply