Optimize postgres function

I have the following function, which I want to use to update the status and distance columns.

CREATE OR REPLACE FUNCTION public.check_deviation_table(
    )
    RETURNS void
    LANGUAGE 'plpgsql'

    COST 100
    VOLATILE 
    
AS $BODY$
declare 
status integer;
geoms integer;
distance float;
selected_rotue_name text;
Begin
    update ping_data set distance=sub.distance from
    (select min(st_distance(st_setsrid(st_transform(rl.geom,3857),3857),st_setsrid(st_transform(ST_SetSRID(ST_MakePoint(pd.longitude,pd.latitude),4326),3857),3857))) as distance
    ,pd.id from ping_data pd inner join order_data od on od."orderNumber"=pd.ordernumber
    inner join route_line rl on od."pickupLocName"||'-'||od."deliveryLocName"=rl."route name"
    where pd.distance is null group by pd.id)sub
    where ping_data.id=sub.id;
    update ping_data set status='No deviation' where ping_data.status is null and ping_data.distance<='3000'::int::float;
    update ping_data set status='Deviated' where ping_data.status is null and ping_data.distance>'3000'::int::float;
End
$BODY$;

It works fine on a few thousand records, but when I run it on the whole ping_data table, which has 642810 records, it just keeps running. I have also created the following indexes on the tables involved, but to no avail.

create index btree_ping on ping_data using btree(id);
create index gist_route on route_line using gist(geom);

Please refer to the following details:

Postgres version: PostgreSQL 12.4, compiled by Visual C++ build 1914, 64-bit

PostGIS version: 3.0 USE_GEOS=1 USE_PROJ=1 USE_STATS=1

Machine details: Intel i5, 16GB RAM, Windows 10

The schemas of the tables involved are shared below:

CREATE TABLE public.route_line
(
    "route name" character varying COLLATE pg_catalog."default",
    geom geometry,
    routeid integer NOT NULL DEFAULT nextval('route_line_routeid_seq'::regclass),
    "pickup location" text COLLATE pg_catalog."default",
    "destination location" text COLLATE pg_catalog."default",
    CONSTRAINT route_line_pkey PRIMARY KEY (routeid)
);
CREATE TABLE public.ping_data
(
    latitude double precision,
    longitude double precision,
    pingdt character varying COLLATE pg_catalog."default",
    shipmentid integer,
    addedat timestamp with time zone DEFAULT now(),
    ordernumber bigint,
    id integer NOT NULL DEFAULT nextval('ping_data_id_seq'::regclass),
    ping_dt timestamp with time zone,
    status text COLLATE pg_catalog."default",
    distance double precision
);
CREATE TABLE public.order_data
(
    "assetNumber" text COLLATE pg_catalog."default",
    "createDt" timestamp with time zone,
    "deliveryLocName" text COLLATE pg_catalog."default",
    "orderNumber" bigint,
    "pickupLocName" text COLLATE pg_catalog."default",
    "releaseNumber" bigint,
    assigned character varying(1) COLLATE pg_catalog."default" DEFAULT '0'::character varying,
    city text COLLATE pg_catalog."default",
    transportername text COLLATE pg_catalog."default"
);

@ErwinBrandstetter please refer to the following EXPLAIN ANALYZE result for the update query you shared.

Update on ping_data p  (cost=2199987.43..6027475.29 rows=97076 width=162) (actual time=6014063.492..6014063.861 rows=0 loops=1)
  ->  Hash Join  (cost=2199987.43..6027475.29 rows=97076 width=162) (actual time=168649.223..5998253.555 rows=545483 loops=1)
        Hash Cond: (sub.id = p.id)
        ->  Subquery Scan on sub  (cost=2144323.35..5958823.04 rows=97076 width=48) (actual time=167552.663..5987759.840 rows=545483 loops=1)
              ->  GroupAggregate  (cost=2144323.35..5957852.28 rows=97076 width=12) (actual time=167552.647..5984534.784 rows=545483 loops=1)
                    Group Key: pd.id
                    ->  Sort  (cost=2144323.35..2144630.74 rows=122956 width=13212) (actual time=165047.817..177245.569 rows=891149 loops=1)
                          Sort Key: pd.id
                          Sort Method: external merge  Disk: 2274960kB
                          ->  Hash Join  (cost=3815.85..47337.28 rows=122956 width=13212) (actual time=1134.751..2381.401 rows=891149 loops=1)
                                Hash Cond: (pd.ordernumber = od.orderNumber)
                                ->  Seq Scan on ping_data pd  (cost=0.00..38132.37 rows=97076 width=28) (actual time=0.122..381.017 rows=642810 loops=1)
                                      Filter: (distance IS NULL)
                                ->  Hash  (cost=1451.69..1451.69 rows=1453 width=13200) (actual time=1134.316..1134.399 rows=1632 loops=1)
                                      Buckets: 512  Batches: 8  Memory Usage: 576kB
                                      ->  Hash Join  (cost=46.93..1451.69 rows=1453 width=13200) (actual time=159.161..1119.843 rows=1632 loops=1)
                                            Hash Cond: ((rl.route name)::text = ((od.pickupLocName || '-'::text) || od.deliveryLocName))
                                            ->  Seq Scan on route_line rl  (cost=0.00..958.54 rows=954 width=13210) (actual time=25.129..121.613 rows=954 loops=1)
                                            ->  Hash  (cost=31.97..31.97 rows=1197 width=51) (actual time=11.256..11.259 rows=1197 loops=1)
                                                  Buckets: 2048  Batches: 1  Memory Usage: 114kB
                                                  ->  Seq Scan on order_data od  (cost=0.00..31.97 rows=1197 width=51) (actual time=9.379..10.686 rows=1197 loops=1)
        ->  Hash  (cost=38132.37..38132.37 rows=645737 width=89) (actual time=1042.800..1042.801 rows=642810 loops=1)
              Buckets: 32768  Batches: 32  Memory Usage: 2704kB
              ->  Seq Scan on ping_data p  (cost=0.00..38132.37 rows=645737 width=89) (actual time=0.135..674.748 rows=642810 loops=1)
Planning Time: 155.156 ms
Execution Time: 6016315.903 ms

Can someone help me optimize this function?

How to solve:

Method 1

This should be equivalent to your original, assuming no row already has a distance while its status is still NULL:

CREATE OR REPLACE FUNCTION public.check_deviation_table()
  RETURNS void
  LANGUAGE sql AS
$func$
UPDATE ping_data p
SET    distance = sub.distance
     , status = COALESCE(p.status
                       , CASE WHEN sub.distance <= 3000 THEN 'No deviation'
                              WHEN sub.distance >  3000 THEN 'Deviated' END)
FROM  (
   SELECT min(st_distance(st_setsrid(st_transform(rl.geom,3857),3857)
                         ,st_setsrid(st_transform(st_setsrid(st_makepoint(pd.longitude, pd.latitude),4326),3857),3857))) AS distance
        , pd.id
   FROM   ping_data  pd
   JOIN   order_data od ON od."orderNumber" = pd.ordernumber
   JOIN   route_line rl ON od."pickupLocName" || '-' || od."deliveryLocName" = rl."route name"
   WHERE  pd.distance IS NULL
   GROUP  BY pd.id
   ) sub
WHERE  p.id = sub.id;
$func$;

Should be quite a bit faster already.

Remove all declared variables which serve no purpose. Then there’s nothing left that would require PL/pgSQL. Use a simpler SQL function, or just the bare query.

Most importantly, use one UPDATE instead of three. Your original would write two new row versions instead of just one. With several sequential scans instead of one. (Indexes are probably useless for this or make it even more expensive, but that depends …)
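
One way to check the effect before touching all 642810 rows is to time the single UPDATE on a slice, inside a transaction that is rolled back afterwards. This is only a minimal sketch: the `pd.id <= 10000` cutoff is an arbitrary assumption for illustration and not part of the original query. Keep in mind that EXPLAIN ANALYZE really executes the UPDATE, hence the ROLLBACK.

```sql
BEGIN;

-- EXPLAIN ANALYZE runs the statement for real, so keep it inside a
-- transaction that is rolled back at the end.
EXPLAIN (ANALYZE, BUFFERS)
UPDATE ping_data p
SET    distance = sub.distance
     , status   = COALESCE(p.status
                         , CASE WHEN sub.distance <= 3000 THEN 'No deviation'
                                WHEN sub.distance >  3000 THEN 'Deviated' END)
FROM  (
   SELECT min(st_distance(st_setsrid(st_transform(rl.geom,3857),3857)
                         ,st_setsrid(st_transform(st_setsrid(st_makepoint(pd.longitude, pd.latitude),4326),3857),3857))) AS distance
        , pd.id
   FROM   ping_data  pd
   JOIN   order_data od ON od."orderNumber" = pd.ordernumber
   JOIN   route_line rl ON od."pickupLocName" || '-' || od."deliveryLocName" = rl."route name"
   WHERE  pd.distance IS NULL
   AND    pd.id <= 10000   -- hypothetical cutoff, only to limit the test run
   GROUP  BY pd.id
   ) sub
WHERE  p.id = sub.id;

ROLLBACK;  -- discard the changes made by the test run
```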

There is probably more potential, but I’ll stop here without detailed information.
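
As one possible example of that remaining potential: in the plan posted in the question, almost all of the time is spent in the GroupAggregate node, where st_transform(rl.geom, 3857) is evaluated again for every joined row (891149 rows for only 954 routes). Transforming each route once and storing the result may reduce that work. The sketch below is untested against this data; the column name geom_3857 is hypothetical and not part of the original schema. The outer st_setsrid(..., 3857) calls are dropped because st_transform already returns geometry with SRID 3857.

```sql
-- Assumption: geom_3857 is a new, hypothetical column that holds each route
-- already transformed to EPSG:3857.
ALTER TABLE route_line ADD COLUMN geom_3857 geometry;

-- Transform every route once (954 rows) instead of once per joined ping row.
UPDATE route_line
SET    geom_3857 = st_transform(geom, 3857);

-- Same single UPDATE as before, reading the precomputed column.
UPDATE ping_data p
SET    distance = sub.distance
     , status   = COALESCE(p.status
                         , CASE WHEN sub.distance <= 3000 THEN 'No deviation'
                                WHEN sub.distance >  3000 THEN 'Deviated' END)
FROM  (
   SELECT pd.id
        , min(st_distance(rl.geom_3857
                        , st_transform(st_setsrid(st_makepoint(pd.longitude, pd.latitude), 4326), 3857))) AS distance
   FROM   ping_data  pd
   JOIN   order_data od ON od."orderNumber" = pd.ordernumber
   JOIN   route_line rl ON od."pickupLocName" || '-' || od."deliveryLocName" = rl."route name"
   WHERE  pd.distance IS NULL
   GROUP  BY pd.id
   ) sub
WHERE  p.id = sub.id;
```

If routes are added or changed later, geom_3857 would have to be refreshed as well, so this trades some maintenance for less repeated computation.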

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
