Table design and query optimization: query to find suitable work from a list of work items


I have a table with a jsonb column, as below:

        CREATE TABLE work (
            id SERIAL NOT NULL,
            work_data JSONB
        );

Sample data is as follows:

100 {"work_id": [7245, 3991, 3358, 1028]}

I created a GIN index for work_id as below:

CREATE INDEX idzworkdata ON work USING gin ((work_data -> 'work_id'));

The Postgres documentation says a GIN index works for the @> containment operator, but I need to find all the work records whose work_ids are contained in the set the user inputs, and for that I need the <@ operator.

The Postgres documentation, Section 8.14.4, says:

“The default GIN operator class for jsonb supports queries with the @>, ?, ?& and ?| operators. (For details of the semantics that these operators implement, see Table 9-41.)”
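For concreteness, the direction I need is the “contained by” check, which goes the other way around (plain jsonb semantics, nothing beyond the docs quoted above):

    -- left <@ right is true when every element of the left array
    -- also appears in the right array
    SELECT '[7245, 3991]'::jsonb <@ '[7245, 3991, 3358, 1028]'::jsonb;  -- true
    SELECT '[7245, 9999]'::jsonb <@ '[7245, 3991, 3358, 1028]'::jsonb;  -- false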

When I execute the following query:

select *
from work
where work_json -> 'skill' <@ '[3587, 3422, 7250, 458]'::jsonb;

Execution Plan:

Gather  (cost=1000.00..246319.01 rows=10000 width=114) (actual time=0.568..2647.415 rows=1 loops=1)                          
  Workers Planned: 2                                                                                                         
  Workers Launched: 2                                                                                                        
  ->  Parallel Seq Scan on work  (cost=0.00..244319.01 rows=4167 width=114) (actual time=1746.766..2627.820 rows=0 loops=3)  
        Filter: ((work_json -> 'skill'::text) <@ '[3587, 3422, 7250, 458]'::jsonb)                                           
        Rows Removed by Filter: 3333333                                                                                      
Planning Time: 1.456 ms                                                                                                      
Execution Time: 2647.470 ms

The query does not use the GIN index. Is there any workaround that would let me use the GIN index with the <@ operator?

Update 2:

An approach that is not Postgres-specific:

The query takes around 40 to 50 seconds, which is huge.

I have used two tables:

    CREATE TABLE public.work (
        id integer NOT NULL DEFAULT nextval('work_id_seq'::regclass),
        work_data_id integer[],
        work_json jsonb
    );

    CREATE TABLE public.work_data (
        work_data_id bigint,
        work_id bigint
    );


select work.id
from work
   inner join work_data on (work_data.work_id = work.id)
group by work.id
having sum(case when work_data.work_data_id in (2269,3805,828,9127) then 0 else 1 end) = 0;
Finalize GroupAggregate  (cost=3618094.30..6459924.90 rows=50000 width=4) (actual time=41891.301..64750.815 rows=1 loops=1)                                      
  Group Key: work.id
  Filter: (sum(CASE WHEN (work_data.work_data_id = ANY ('{2269,3805,828,9127}'::bigint[])) THEN 0 ELSE 1 END) = 0)                                               
  Rows Removed by Filter: 9999999                                                                                                                                
  ->  Gather Merge  (cost=3618094.30..6234924.88 rows=20000002 width=12) (actual time=41891.217..58887.351 rows=10000581 loops=1)                                
        Workers Planned: 2                                                                                                                                       
        Workers Launched: 2                                                                                                                                      
        ->  Partial GroupAggregate  (cost=3617094.28..3925428.38 rows=10000001 width=12) (actual time=41792.169..53183.859 rows=3333527 loops=3)                 
              Group Key: work.id
              ->  Sort  (cost=3617094.28..3658761.10 rows=16666727 width=12) (actual time=41792.125..45907.253 rows=13333333 loops=3)                            
                    Sort Key: work.id
                    Sort Method: external merge  Disk: 339000kB                                                                                                  
                    Worker 0:  Sort Method: external merge  Disk: 338992kB                                                                                       
                    Worker 1:  Sort Method: external merge  Disk: 339784kB                                                                                       
                    ->  Parallel Hash Join  (cost=291846.01..1048214.42 rows=16666727 width=12) (actual time=13844.982..23748.244 rows=13333333 loops=3)         
                          Hash Cond: (work_data.work_id = work.id)
                          ->  Parallel Seq Scan on work_data  (cost=0.00..382884.27 rows=16666727 width=16) (actual time=0.020..4094.341 rows=13333333 loops=3)  
                          ->  Parallel Hash  (cost=223485.67..223485.67 rows=4166667 width=4) (actual time=3345.351..3345.351 rows=3333334 loops=3)              
                                Buckets: 131072  Batches: 256  Memory Usage: 2592kB                                                                              
                                ->  Parallel Seq Scan on work  (cost=0.00..223485.67 rows=4166667 width=4) (actual time=0.182..1603.437 rows=3333334 loops=3)    
Planning Time: 1.544 ms                                                                                                                                          
Execution Time: 65503.341 ms 

NOTE: A little background: the work table has the details of each work item and the respective work_ids needed to perform it. Each user can perform a certain set of work_ids, which is a superset of any single work item's work_ids, so a user always has more work_ids. I tried normal join queries with the work table and the work-id list table as separate tables, but the query does a table scan and takes around 40 seconds, which is huge.
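For example (illustrative values only): a user who can perform {1,2,3,4} is eligible for a work item that needs {2,3}, because the work item's ids are contained in the user's:

    -- the work item's ids are a subset of (<@) the user's ids
    SELECT ARRAY[2,3] <@ ARRAY[1,2,3,4];  -- true: eligible
    SELECT ARRAY[2,9] <@ ARRAY[1,2,3,4];  -- false: 9 is not in the user's set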

How to solve it:


Method 1

You could use a helper function that converts a jsonb array to an integer array:

-- LANGUAGE and IMMUTABLE are required: a function used in an index expression must be IMMUTABLE
CREATE FUNCTION jsonarr2intarr(text) RETURNS int[]
   LANGUAGE sql IMMUTABLE AS
$$SELECT translate($1, '[]', '{}')::int[]$$;
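For example:

    SELECT jsonarr2intarr('[7245, 3991, 3358, 1028]');
    -- result: {7245,3991,3358,1028}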

This can be used with an index:

CREATE INDEX ON work USING gin (jsonarr2intarr(work_data ->> 'work_id'));

A modified query can make use of that index:

SELECT *
FROM work
WHERE jsonarr2intarr(work_data ->> 'work_id')
      <@ ARRAY[1,2,3,5,6,11,7245,3991,3358,1028];

                                                        QUERY PLAN                                                        
 Bitmap Heap Scan on work
   Recheck Cond: (jsonarr2intarr((work_data ->> 'work_id'::text)) <@ '{1,2,3,5,6,11,7245,3991,3358,1028}'::integer[])
   ->  Bitmap Index Scan on work_jsonarr2intarr_idx
         Index Cond: (jsonarr2intarr((work_data ->> 'work_id'::text)) <@ '{1,2,3,5,6,11,7245,3991,3358,1028}'::integer[])
(4 rows)

Method 2

The direction of containment you want is not well-supported by GIN indexes. While switching the direction might be a simple thing conceptually, it is a totally different type of optimization problem operationally. You could try the extension, but I would not have great hopes for it.
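If the extension meant here is intarray (an assumption on my part; it is not named above), its non-default GIN operator class gin__int_ops does index <@ on integer[], so a sketch against the Update 2 schema would look like:

    -- assumes the intarray extension; gin__int_ops supports <@ on int[]
    CREATE EXTENSION IF NOT EXISTS intarray;
    CREATE INDEX ON work USING gin (work_data_id gin__int_ops);
    SELECT id FROM work
    WHERE work_data_id <@ ARRAY[2269, 3805, 828, 9127];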

Why does it take so long to scan the table? How big is the table? Once a task has been completed, it doesn’t need to be completed again, right? So you could delete it from the work table, to keep it small.

40 seconds doesn’t seem very long to gather all the tasks a user is eligible to do. Once that list has been gathered, they can work from the local copy, only double-checking one row at a time that it still needs to be done. This should be fast.
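The per-row double check could be a sketch like this (the status column and its 'pending' value are hypothetical; the posted schema doesn't show how completion is tracked):

    -- hypothetical recheck; 'status' is not part of the posted schema
    SELECT id
    FROM work
    WHERE id = $1             -- the task picked from the cached list
      AND status = 'pending'  -- confirm it still needs doing
    FOR UPDATE;               -- lock it while the user takes it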

You also mention another way you tried to do it, but you didn’t give enough details on that alternative for us to know whether it was “fixable” or not.
