GIN index 'tids' array does not match table ctid array

I created the following table and inserted some values into it as follows:

CREATE TABLE query_all_lexeme (
    payload text,
    normalized tsvector GENERATED ALWAYS AS (to_tsvector('english', payload)) STORED
);

INSERT INTO query_all_lexeme (payload)
    VALUES ('fat cats ate rats');

INSERT INTO query_all_lexeme (payload)
    VALUES ('summarize the functions and operators that are provided for full text searching');

INSERT INTO query_all_lexeme (payload)
    VALUES ('Constructs a phrase query');

INSERT INTO query_all_lexeme (payload)
SELECT
    'Constructs a phrase query this is a test'
FROM
    generate_series(1, 10000);

INSERT INTO query_all_lexeme (payload)
SELECT
    'Constructs a phrase query this is a test'
FROM
    generate_series(1, 100);

Then I created a gin index:

CREATE INDEX query_all_lexeme_vector ON query_all_lexeme USING gin (normalized);
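
The page-inspection functions used below (get_raw_page, gin_metapage_info, gin_page_opaque_info, gin_leafpage_items) are provided by the pageinspect extension, so a step like the following is assumed to have been run first, typically as a superuser:

```sql
-- Assumed setup step, not shown in the original post.
CREATE EXTENSION IF NOT EXISTS pageinspect;
```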

Then I ran the query below to get the GIN index metapage info:

SELECT * FROM
    gin_metapage_info (get_raw_page ('query_all_lexeme_vector', 0)) \gx

Result:

+-[ RECORD 1 ]-----+------------+
| pending_head     | 4294967295 |
| pending_tail     | 4294967295 |
| tail_free_size   | 0          |
| n_pending_pages  | 0          |
| n_pending_tuples | 0          |
| n_total_pages    | 14         |
| n_entry_pages    | 1          |
| n_data_pages     | 12         |
| n_entries        | 15         |
| version          | 2          |
+------------------+------------+

Then I listed the compressed data leaf pages of the index:

WITH cte AS (
    SELECT
        flags,
        p
    FROM
        generate_series(1, 13) AS p,
        gin_page_opaque_info (get_raw_page ('query_all_lexeme_vector', p)))
SELECT
    array_agg(p)
FROM
    cte
WHERE
    flags::text = '{data,leaf,compressed}';

Result:

+----------------------+
|      array_agg       |
+----------------------+
| {3,4,6,7,9,10,12,13} |
+----------------------+

In the following query, I expected at least one row to have gin_tid_vs_table_tid = true. However, the column is false for every row.

WITH cte AS (
    SELECT
        (unnest(normalized)).lexeme AS elements
        , array_agg(ctid) AS ctids
    FROM
        query_all_lexeme
    GROUP BY
        1
)
SELECT
    elements
    , pg_typeof(ctids)
    , ctids = (
        SELECT
            tids
        FROM
            gin_leafpage_items (get_raw_page ('query_all_lexeme_vector' , 3))
        ORDER BY
            1
        LIMIT 1) AS gin_tid_vs_table_tid
FROM
    cte;

I have already run VACUUM ANALYZE. The data is now stable (only SELECTs are executed), so the tid values no longer change. Why, then, is gin_tid_vs_table_tid false for every row?
My reasoning is that a GIN index stores lexemes together with the tids (physical tuple locations) of the rows that contain them, so the tids array in the GIN index should equal array_agg(ctid) for the same lexeme.

How to solve:

Method 1

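Your logic would only hold if a single index entry were guaranteed to contain all the ctids for a given lexeme. There is no such guarantee, and there cannot be one: index tuples have a strictly bounded size, far too small to hold every possible ctid. Each row returned by gin_leafpage_items is just one compressed posting-list segment on that page, so the first row of page 3 holds only a subset of the ctids for its lexeme.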
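
To see this on the example index, a query along these lines lists each posting-list segment on the page together with the number of tids it holds. It is only a sketch, and it assumes the pageinspect extension is installed and that page 3 is still a compressed data leaf page, as in the question:

```sql
-- Each row is one compressed posting-list segment stored on data leaf page 3.
SELECT
    first_tid,
    nbytes,
    cardinality(tids) AS n_tids
FROM
    gin_leafpage_items (get_raw_page ('query_all_lexeme_vector', 3))
ORDER BY
    first_tid;
```

Comparing n_tids with the number of table rows that contain a frequent lexeme (over 10,000 in this data set) should show that no single segment covers them all.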

You could switch from array equality to the overlap (&&) or containment (@>) operators, for example by testing whether the table-side ctids contain the index-side tids.
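
As a rough sketch of the containment approach, each segment on the page can be matched against the lexemes whose table ctids contain it. This assumes pageinspect is installed and that page 3 is a compressed data leaf page, as in the question; the aliases q, u, g, and c are only for illustration:

```sql
WITH cte AS (
    -- Collect the table-side ctids for every lexeme.
    SELECT
        u.lexeme AS elements,
        array_agg(q.ctid) AS ctids
    FROM
        query_all_lexeme AS q,
        unnest(q.normalized) AS u
    GROUP BY
        u.lexeme
)
SELECT
    g.first_tid,
    cardinality(g.tids) AS n_tids,
    c.elements
FROM
    gin_leafpage_items (get_raw_page ('query_all_lexeme_vector', 3)) AS g
    -- A segment matches a lexeme when all of its tids appear among that lexeme's ctids.
    JOIN cte AS c ON c.ctids @> g.tids
ORDER BY
    g.first_tid, c.elements;
```

Because several lexemes in this data set occur in the same rows, one segment may match more than one lexeme.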

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.