Running on RDS with about 32M rows.
PostgreSQL 11.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit
Also testing locally on macOS with about 8M rows.
PostgreSQL 11.5 on x86_64-apple-darwin16.7.0, compiled by Apple LLVM version 8.1.0 (clang-802.0.42), 64-bit
I’ve got a column named old_value that’s of type citext. I asked about this already, but posted way too many of my discovery steps along the way. Here’s a boiled-down version that I’m hoping gets to the point.
I’ve got a field change log table named record_changes_log_detail with 32M rows and growing that includes a citext field named old_value.
The data is very skewed. Most values are less than a dozen characters, some are more than 5,000.
Postgres chokes on large values with an error about B-tree entries being limited to 2712 bytes. So I believe that for a B-tree, I need to substring the source value.
My users’ primary interest is in an = search, a starts-with search, and, sometimes, a contains-this-substring search. So: =, string%, and %string%.
The goal: create an index that supports those searches and that the planner actually uses.
Tried and failed
A straight B-tree fails to build, in some cases, because of long values.
An expression B-tree like this builds, but is not used:
CREATE INDEX record_changes_log_detail_old_value_ix_btree ON record_changes_log_detail USING btree (substring(old_value,1,1024));
Adding text_pattern_ops does not help:
CREATE INDEX record_changes_log_detail_old_value_ix_btree ON record_changes_log_detail USING btree (substring(old_value,1,1024) text_pattern_ops);
Tried and works partially
A hash index works, but only for equality. (Like it says on the tin.)
This is the closest I’ve gotten to success:
CREATE INDEX record_changes_log_detail_old_value_ix_btree ON record_changes_log_detail USING btree (old_value citext_pattern_ops);
This works for equality, but not for LIKE. The release notes for PG 11 say it should work for LIKE:
By “work” I mean “the index is used.”
I was unable to substring successfully with this approach.
What do people do in this situation with citext fields?
How to solve:
It is unusual to index such a long column entirely.
Modify the query like this:
WHERE substring(old_value, 1, 100) LIKE substring(pattern, 1, 100) AND old_value LIKE pattern
Here, pattern is the pattern string.
Then a b-tree index on substring(old_value, 1, 100) can be used (if the pattern doesn’t start with a wildcard character, of course).
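The rewrite above could be sketched as follows. The index name and the search string are made up for illustration, and text_pattern_ops is my assumption for databases that don’t use the C collation. The sketch assumes prefix patterns only (a single wildcard at the end). As far as I can tell, substring() returns text rather than citext here, so the first condition would be case-sensitive.

```sql
-- Hypothetical index name. text_pattern_ops is an assumption so that
-- LIKE prefix searches can use the index under a non-C collation.
CREATE INDEX record_changes_log_detail_old_value_prefix_ix
    ON record_changes_log_detail
    USING btree (substring(old_value, 1, 100) text_pattern_ops);

-- The first condition repeats the indexed expression exactly, so the
-- planner can match it to the index. The second condition rechecks
-- the full, untruncated value.
SELECT *
FROM record_changes_log_detail
WHERE substring(old_value, 1, 100) LIKE substring('Gold Kerr%', 1, 100)
  AND old_value LIKE 'Gold Kerr%';
```

Running the SELECT under EXPLAIN should show whether the planner picks the index on your data.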
Depending on the exact requirements (are you searching complete words or word prefixes in a natural language text or not), full text search may be a good solution.
Another option is, of course, a trigram index:
CREATE INDEX ON record_changes_log_detail USING gin (old_value gin_trgm_ops);
This requires the pg_trgm extension to be installed.
Such an index will work also for search patterns that start with a wildcard. For good performance, enforce a minimum length on the search string.
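A small sketch of the setup step, plus one way to see why short search strings perform poorly: show_trgm (a function that ships with pg_trgm) lists the trigrams extracted from a string, and a very short string yields only a few of them for the index to filter on. The sample strings are placeholders.

```sql
-- Install the extension once per database (needs sufficient privileges).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Compare how many trigrams a short and a longer search string produce.
SELECT show_trgm('Go')                  AS short_string_trigrams,
       show_trgm('Gold Kerrison Neuro') AS longer_string_trigrams;
```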
Please edit your question, rather than posting answers to it that don’t answer it.
If you create an index on the expression substring(old_value,1,1024), then that index can only get used if your query involves that same expression.
While it is theoretically possible to prove that old_value='foo' implies that substring(old_value,1,1024)='foo' (and thus the contrapositive of that) if you have enough insight into the internals of substring, PostgreSQL makes no attempt to prove that. You need to write the query in a way that no such proof is needed.
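For an equality search, a query written so that no such proof is needed might look like the sketch below. It assumes the expression index on substring(old_value,1,1024) from the question exists, and 'foo' is just a placeholder value. Note that the first comparison is on text, so as far as I know it is case-sensitive, unlike the citext comparison on the second line.

```sql
-- The first condition uses the indexed expression verbatim, which lets
-- the planner consider the expression index. The second condition keeps
-- the original semantics for values longer than 1024 characters.
SELECT *
FROM record_changes_log_detail
WHERE substring(old_value, 1, 1024) = 'foo'
  AND old_value = 'foo';
```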
I’m back to close this question out. Following up on a suggestion from Laurenz Albe, I gave the Postgres tri-gram implementation a try. They rule!
DROP INDEX IF EXISTS record_changes_log_detail_old_value_ix_tgrm;
CREATE INDEX record_changes_log_detail_old_value_ix_tgrm ON record_changes_log_detail USING gin (old_value gin_trgm_ops);
The secret here when you’re using citext is to cast your value to ::text, like so:
select * from record_changes_log_detail where old_value::text LIKE '%Gold Kerrison Neuro%';
Running that with explain analyze confirms that the index is used. I noticed that I have to use LIKE for an = search, but that’s okay.
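For anyone following along, the check and the equality-style search described above could look roughly like this. The search strings are just examples. A LIKE pattern with no wildcards matches the whole value, but any %, _ or backslash in the search value would need escaping. The cast to text makes LIKE case-sensitive; pg_trgm indexes also support ILIKE if case-insensitive matching is wanted.

```sql
-- Confirm that the trigram index is used for a contains-search.
EXPLAIN ANALYZE
SELECT *
FROM record_changes_log_detail
WHERE old_value::text LIKE '%Gold Kerrison Neuro%';

-- Equality expressed as LIKE without wildcards, so the same index applies.
SELECT *
FROM record_changes_log_detail
WHERE old_value::text LIKE 'Gold Kerrison Neuro';

-- Case-insensitive variant of the contains-search.
SELECT *
FROM record_changes_log_detail
WHERE old_value::text ILIKE '%gold kerrison neuro%';
```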