Expression index on a citext column ignored, why?

All we need is an easy explanation of the problem, so here it is.

Running on RDS with about 32M rows.

PostgreSQL 11.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit

Also testing locally on macOS with about 8M rows.

PostgreSQL 11.5 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit

I’ve got a column named old_value that’s of type citext. I asked about this already, but posted way to many of my discovery steps along the way. Here’s a boiled down version that I’m hoping gets to the point.

Background

I’ve got a field change log table named record_changes_log_detail with 32M rows and growing that includes a citext field named old_value.

The data is very skeweed. Most values are less than a dozen characters, some are more than 5,000.

Postgres chokes on large values with an error about B-tree entries being limited to 2172 characters. So I believe that for a B-tree, I need to substring the source value.

My users primary interest is in an = search, a starts-with search, and, sometimes, a contains-this-substring search. So = string% and %string%

Goals

Create an index that supports those searches that the planner uses.

Tried and failed

A straight B-tree fails to build, in some cases, because of long values.

An expression B-tree like this builds, but is not used

CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON  record_changes_log_detail 
    USING btree (substring(old_value,1,1024));

Adding text_pattern_opts does not help.

CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON  record_changes_log_detail 
    USING btree (substring(old_value,1,1024) text_pattern_opts);

Tried and works partially

A hash index works, but only for equality. (Like it says on the tin.)

This is the closest I’ve gotten to success:

CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON record_changes_log_detail 
    USING btree (old_value citext_pattern_ops);

This works for quality, but not for LIKE. The release notes for PG 11 say it should work for LIKE:

https://www.postgresql.org/docs/11/release-11.html

By “work” I mean “the index is used.”

I was unable to substring succesfully with this approach.

What do people do in this situation with citext fields?

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

It is unusual to index such a long column entirely.

Three ideas:

  1. Modify the query like this:

    WHERE substring(old_value, 1, 100) LIKE substring(pattern, 1, 100)
      AND old_value LIKE pattern
    

    (pattern here would be the pattern string, something like 'string%'.)

    Then a b-tree index on substring(old_value, 1, 100) can be used (if the pattern doesn’t start with a wildcard character of course).

  2. Depending on the exact requirements (are you searching complete words or word prefixes in a natural language text or not), full text search may be a good solution.

  3. Another option are of course trigram indexes:

    CREATE INDEX ON record_changes_log_detail USING gin (old_value gin_trgm_ops);
    

    This requires the pg_trgm extension to be installed.

    Such an index will work also for search patterns that start with a wildcard. For good performance, enforce a minimum length on the search string.

Method 2

Please edit your question, rather than posting answers to it that don’t answer it.

If you create an index on the expression substring(old_value,1,1024), then that index can only get used if you query involves substring(old_value,1,1024).

While it is theoretically possible to prove that old_value='foo' implies that substring(old_value,1,1024)='foo' (and thus the contrapositive to that) if you have enough insight into the internals of substring, PostgreSQL makes no attempt to prove that. You need to write the query in a way that no such proof is needed.

Method 3

I’m back to close this question out. Following up on a suggestion from Laurenz Albe, I gave the Postgres tri-gram implementation a try. They rule!

DROP INDEX IF EXISTS record_changes_log_detail_old_value_ix_tgrm;
CREATE INDEX record_changes_log_detail_old_value_ix_tgrm
    ON record_changes_log_detail 
    USING gin (old_value gin_trgm_ops);

The secret here when you’re using citext is to cast your value to ::text, like so:

select * from record_changes_log_detail 
where old_value::text LIKE '%Gold Kerrison Neuro%';

Running that with explain analyze confirms that the index is used. I noticed that I have to use LIKE for an = search, but that’s okay.

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply