Postgres query using index never finishes

So I have the following 3 tables:

create table imdb_dev.dim_title
(
    id                 uuid        not null,
    tconst             varchar(10) not null,
    title_type_id      uuid        not null,
    primary_title      text,
    original_title     text,
    is_adult           boolean,
    start_year         integer,
    end_year           integer,
    runtime_in_minutes integer,
    constraint dim_title_pkey
        primary key (id),
    constraint dim_title_title_type_id_fkey
        foreign key (title_type_id) references imdb_dev.dim_title_type
);

create table imdb_dev.dim_title_type
(
    id         uuid not null,
    title_type text not null,
    constraint dim_title_type_pkey
        primary key (id)
);

create table staging.title_basics
(
    tconst         varchar(10) not null,
    titletype      varchar(20) not null,
    primarytitle   text,
    originaltitle  text,
    isadult        boolean,
    startyear      integer,
    endyear        integer,
    runtimeminutes integer,
    genres         text,
    constraint title_basics_pkey
        primary key (tconst, titletype)
)
    partition by LIST (titletype);

I want to select the rows that exist in title_basics but not yet in dim_title and insert them into dim_title. Both dim_title and title_basics have around 8 million rows, while dim_title_type is a mapping table with just 12 rows.

This is the query I’m using to select the rows I need to insert. For some reason it just keeps running and never finishes:

SELECT
    md5(tconst)::UUID AS id,
    tconst,
    tt.id AS title_type_id,
    tb.primarytitle AS primary_title,
    tb.originaltitle AS original_title,
    tb.isadult AS is_adult,
    tb.startyear AS start_year,
    tb.endyear AS end_year,
    tb.runtimeminutes AS runtime_in_minutes
FROM
    "funbro"."staging"."title_basics" tb
LEFT JOIN
    "funbro"."imdb_dev"."dim_title_type" tt ON tt.title_type = tb.titleType

WHERE md5(tconst)::UUID NOT IN (select id from "funbro"."imdb_dev"."dim_title")
;

I’m surely doing something wrong but I fail to see what that is. This is the query plan:

Gather  (cost=1001.27..533568642515.39 rows=4040707 width=97)
  Workers Planned: 2
  ->  Hash Left Join  (cost=1.27..533568237444.69 rows=1683628 width=97)
        Hash Cond: ((tb.titletype)::text = tt.title_type)
        ->  Parallel Append  (cost=0.00..533568197457.27 rows=1683627 width=73)
              ->  Parallel Seq Scan on title_basics_tvepisode tb_4  (cost=0.00..381512705366.11 rows=1231620 width=75)
                    Filter: (NOT (SubPlan 1))
                    SubPlan 1
                      ->  Materialize  (cost=0.00..289561.20 rows=8081413 width=16)
                            ->  Seq Scan on dim_title  (cost=0.00..209693.13 rows=8081413 width=16)
              ->  Parallel Seq Scan on title_basics_short tb_3  (cost=0.00..52946232635.92 rows=170924 width=63)
                    Filter: (NOT (SubPlan 1))
              ->  Parallel Seq Scan on title_basics_movie tb_1  (cost=0.00..37574087716.28 rows=121299 width=65)
                    Filter: (NOT (SubPlan 1))
              ->  Parallel Seq Scan on title_basics_video tb_7  (cost=0.00..20256814175.33 rows=65394 width=81)
                    Filter: (NOT (SubPlan 1))
              ->  Parallel Seq Scan on title_basics_tvseries tb_6  (cost=0.00..19194394831.39 rows=61965 width=68)
                    Filter: (NOT (SubPlan 1))
              ->  Parallel Seq Scan on title_basics_tvmovie tb_5  (cost=0.00..12033907879.19 rows=38848 width=77)
                    Filter: (NOT (SubPlan 1))
              ->  Parallel Seq Scan on title_basics_others tb_2  (cost=0.00..10050046434.90 rows=32444 width=79)
                    Filter: (NOT (SubPlan 1))
        ->  Hash  (cost=1.12..1.12 rows=12 width=25)
              ->  Seq Scan on dim_title_type tt  (cost=0.00..1.12 rows=12 width=25)

Postgres 13 is running on a machine with an SSD, and the Postgres configuration is the default one.

How to solve:

Method 1

Ok, figured out the problem.

At first I thought that adding an expression index on staging.title_basics, as suggested, would make the query planner use it, but after creating the index I still faced the same problem: Postgres kept doing a Seq Scan:

create index title_basics_id_uuid on staging.title_basics
(( md5(tconst)::UUID ));
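
A note on why this index cannot help, offered as a sketch rather than a tested fix. The query has to visit every row of title_basics anyway, so an index on title_basics gives the planner nothing to look up. The cost in the plan comes from Filter: (NOT (SubPlan 1)) over a Materialize node, which means the roughly 8 million ids from dim_title are rescanned for each row of title_basics. With the default work_mem of 4MB the planner cannot hash a subquery result of that size. Assuming the machine has memory to spare, raising work_mem for the session may let the planner switch to a hashed SubPlan. The 512MB figure below is an assumption, not a tuned value.

```sql
-- Assumption: 512MB is an illustrative value; pick one that fits the machine's RAM.
SET work_mem = '512MB';

-- Plain EXPLAIN only plans the query; it does not execute it.
EXPLAIN
SELECT tb.tconst
FROM staging.title_basics tb
WHERE md5(tb.tconst)::uuid NOT IN (SELECT id FROM imdb_dev.dim_title);

-- If the filter now reads "NOT (hashed SubPlan 1)", the subquery result
-- is hashed once instead of being rescanned for every outer row.

-- Return to the configured default for the rest of the session.
RESET work_mem;
```

This only works around the plan shape; rewriting the predicate avoids the problem without depending on memory settings.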

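The real problem is that NOT IN with a subquery in the WHERE clause is very inefficient here: Postgres does not turn it into an anti-join, so, as the plan shows, the subplan over dim_title is evaluated for every row of title_basics. Replacing it with a LEFT JOIN and a WHERE id IS NULL check makes the query finish in less than 1 minute. Reference: wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_NOT_IN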
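
The wiki page referenced above recommends NOT EXISTS as the replacement for NOT IN, and Postgres can plan it as an anti-join. The following is a minimal sketch of that variant, using the table and column names from the question; it has not been timed against this data set. Because dim_title.id is a primary key and therefore never NULL, it should select the same rows as the NOT IN version.

```sql
-- Sketch: same selection as the original query, with NOT IN replaced by NOT EXISTS.
SELECT
    md5(tb.tconst)::uuid AS id,
    tb.tconst,
    tt.id                AS title_type_id,
    tb.primarytitle      AS primary_title,
    tb.originaltitle     AS original_title,
    tb.isadult           AS is_adult,
    tb.startyear         AS start_year,
    tb.endyear           AS end_year,
    tb.runtimeminutes    AS runtime_in_minutes
FROM staging.title_basics tb
LEFT JOIN imdb_dev.dim_title_type tt
       ON tt.title_type = tb.titletype
-- Keep only staging rows whose derived id is not yet present in dim_title.
WHERE NOT EXISTS (
    SELECT 1
    FROM imdb_dev.dim_title t
    WHERE t.id = md5(tb.tconst)::uuid
);
```

Running EXPLAIN on this form is a quick way to check whether the plan contains an Anti Join node instead of a per-row SubPlan.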

Final query:

SELECT
    md5(tb.tconst)::UUID AS id,
    tb.tconst,
    tt.id AS title_type_id,
    tb.primarytitle AS primary_title,
    tb.originaltitle AS original_title,
    tb.isadult AS is_adult,
    tb.startyear AS start_year,
    tb.endyear AS end_year,
    tb.runtimeminutes AS runtime_in_minutes
FROM
    "funbro"."staging"."title_basics" tb
LEFT JOIN
    "funbro"."imdb_dev"."dim_title_type" tt ON tt.title_type = tb.titleType
LEFT JOIN
    imdb_dev.dim_title t ON md5(tb.tconst)::UUID = t.id
WHERE t.id IS NULL
;
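
Since the goal stated in the question is to insert the missing rows into dim_title, the select can be wrapped in an INSERT. This is a sketch under the assumption that the column list matches the dim_title definition shown in the question. Note that title_type_id is declared NOT NULL, so any staging row whose titletype has no match in dim_title_type would make the insert fail because of the LEFT JOIN.

```sql
-- Sketch: insert only the staging rows that are not yet in dim_title.
INSERT INTO imdb_dev.dim_title (
    id, tconst, title_type_id, primary_title, original_title,
    is_adult, start_year, end_year, runtime_in_minutes
)
SELECT
    md5(tb.tconst)::uuid,
    tb.tconst,            -- tconst is NOT NULL in dim_title, so it is selected here
    tt.id,
    tb.primarytitle,
    tb.originaltitle,
    tb.isadult,
    tb.startyear,
    tb.endyear,
    tb.runtimeminutes
FROM staging.title_basics tb
LEFT JOIN imdb_dev.dim_title_type tt
       ON tt.title_type = tb.titletype
LEFT JOIN imdb_dev.dim_title t
       ON t.id = md5(tb.tconst)::uuid
WHERE t.id IS NULL;
```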

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
