How can I make a NOT IN query more efficient in PostgreSQL?

I have two tables, article and article_content, in PostgreSQL. The article_id column in article_content refers to the id of the article table. The SQL currently looks like this:

select * 
from article_content ac 
where article_id not in(
   select id from article a 
)
limit 10

It should find the records in article_content that do not exist in article. This is the query plan:

Limit  (cost=1000.00..44996.68 rows=1 width=415)
  ->  Gather  (cost=1000.00..38955542254.16 rows=885420 width=415)
        Workers Planned: 2
        ->  Parallel Seq Scan on article_content ac  (cost=0.00..38955452712.16 rows=368925 width=415)
              Filter: (NOT (SubPlan 1))
              SubPlan 1
                ->  Materialize  (cost=0.00..101153.51 rows=1775167 width=8)
                      ->  Seq Scan on article a  (cost=0.00..85342.67 rows=1775167 width=8)

Both article and article_content have a very large number of rows, and it seems this query will never complete. What should I do to remove the article_content rows that do not exist in article?

How to solve:

Method 1

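Typically NOT EXISTS is more efficient, because PostgreSQL can plan it as an anti-join instead of evaluating the subquery result once per row, as the SubPlan filter in the plan above does: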

select ac.* 
from article_content ac 
where not exists (select *
                  from article a 
                  where a.id = ac.article_id)
limit 10;
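
To check whether the rewrite actually changes the plan, the query can be prefixed with EXPLAIN, which shows the estimated plan without running the query. This is only a sketch: the expectation that the SubPlan filter is replaced by an anti-join node (for example a Hash Anti Join) is an assumption about how the planner will treat these tables, not something shown in the original post.

```sql
-- Show the estimated plan only; the query itself is not executed.
-- Look for an "Anti Join" node in place of "Filter: (NOT (SubPlan 1))".
EXPLAIN
SELECT ac.*
FROM article_content ac
WHERE NOT EXISTS (SELECT 1
                  FROM article a
                  WHERE a.id = ac.article_id)
LIMIT 10;
```

The subquery selects a constant here because NOT EXISTS only tests whether a matching row exists; the select list does not affect the result.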

An index on article_content (article_id) should improve performance. I assume there is already an index on article (id).
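
Since the question asks how to remove the orphaned rows, the following sketch creates the index mentioned above and then applies the same NOT EXISTS condition in a DELETE. The index name article_content_article_id_idx is hypothetical, and wrapping the DELETE in a transaction is a precaution rather than part of the original answer.

```sql
-- Hypothetical index name; skipped if an index with this name already exists.
CREATE INDEX IF NOT EXISTS article_content_article_id_idx
    ON article_content (article_id);

-- Delete article_content rows whose article_id has no matching article.id.
-- Replace COMMIT with ROLLBACK to try the statement without keeping the change.
BEGIN;

DELETE FROM article_content ac
WHERE NOT EXISTS (SELECT 1
                  FROM article a
                  WHERE a.id = ac.article_id);

COMMIT;
```

On very large tables it may be worth running the SELECT version first to see how many rows would be affected before deleting them.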

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
