Postgres docs state the following about Index-Only Scans and Covering-Indexes:
if you commonly run queries like
SELECT y FROM tab WHERE x = 'key';
the traditional approach to speeding up such queries would be to
create an index on x only. However, an index defined as
CREATE INDEX tab_x_y ON tab(x) INCLUDE (y);
could handle these queries as index-only scans, because y can be
obtained from the index without visiting the heap.
Because column y is not part of the index’s search key, it does not
have to be of a data type that the index can handle; it’s merely
stored in the index and is not interpreted by the index machinery.
Also, if the index is a unique index, that is
CREATE UNIQUE INDEX tab_x_y ON tab(x) INCLUDE (y);
the uniqueness condition applies to just column x, not to the
combination of x and y. (An INCLUDE clause can also be written in
UNIQUE and PRIMARY KEY constraints, providing alternative syntax for
setting up an index like this.)
Question 1: If the data type of y can be used in an index and there is no uniqueness requirement, is there any advantage to using CREATE INDEX tab_x_y ON tab(x) INCLUDE (y) over CREATE INDEX tab_x_y ON tab(x, y) for queries like SELECT y FROM tab WHERE x = 'key';?
It’s wise to be conservative about adding non-key payload columns to
an index, especially wide columns. If an index tuple exceeds the
maximum size allowed for the index type, data insertion will fail. In
any case, non-key columns duplicate data from the index’s table and
bloat the size of the index, thus potentially slowing searches.
Question 2: Can someone explain with an example what "wide columns" means?
Question 3: Can someone explain the statement below in the context of INCLUDE? If INCLUDE supports index-only scans, then y also has to be stored in the index. How, then, does the statement below not hold for INCLUDE?
In any case, non-key columns duplicate data from the index’s table and
bloat the size of the index
How to solve :
Rule of thumb 1: If you never use an index column for filtering or sorting (or joining, or to enforce uniqueness), you might as well move it to the INCLUDE clause. Nothing lost, something gained.
Rule of thumb 2: INCLUDE columns only ever make sense if you actually get index-only scans from them. And in some cases not even then.
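A minimal sketch of how one might check both rules of thumb. It assumes a hypothetical table tab(x text, y int) shaped like the one in the question, filled with generated data. Whether the planner actually chooses an index-only scan depends on statistics and on the visibility map, so the plan shown by EXPLAIN is not guaranteed.

```sql
-- Hypothetical table matching the question's example.
CREATE TABLE tab (x text, y int);
INSERT INTO tab
SELECT 'key' || g::text, g
FROM   generate_series(1, 100000) AS g;

-- y is never used for filtering or sorting, so it goes into INCLUDE.
CREATE INDEX tab_x_y ON tab (x) INCLUDE (y);

-- VACUUM sets visibility-map bits, which index-only scans rely on.
VACUUM ANALYZE tab;

-- Look for "Index Only Scan using tab_x_y" in the resulting plan.
EXPLAIN SELECT y FROM tab WHERE x = 'key42';
```

If the plan shows a plain index scan or a sequential scan instead, the INCLUDE column is not paying for itself for this query.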
Answer 1: The INCLUDE feature is predominantly useful for the two excluded cases: enforcing uniqueness on only some of the columns, or storing a column whose data type is not allowed in the index otherwise. But there are still minor benefits for other cases. The manual explains further down:
Suffix truncation always removes non-key columns from upper B-Tree levels. As payload columns, they are never used to guide index scans.
The truncation process also removes one or more trailing key column(s)
when the remaining prefix of key column(s) happens to be sufficient to
describe tuples on the lowest B-Tree level. In practice, covering
indexes without an INCLUDE clause often avoid storing columns that
are effectively payload in the upper levels. However, explicitly
defining payload columns as non-key columns reliably keeps the tuples
in upper levels small.
Now, B-tree indexes are only a few levels deep. But the upper levels are the ones that have to be read all the time, so keeping those small helps the most, and even a small effect is amplified. The benefit is biggest for large cardinalities (multiple index levels) and little duplication (suffix truncation can’t make up for not moving a payload column to the INCLUDE clause).
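One way to look at the upper levels is sketched below. It assumes the pgstattuple extension can be installed (this usually needs elevated privileges) and uses a hypothetical table tab_demo with generated data. The difference between the two indexes may well be small, and it depends on the data distribution and the Postgres version.

```sql
-- Assumption: the contrib extension pgstattuple is available.
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Hypothetical demo table: x repeats, y is a 32-character payload.
CREATE TABLE tab_demo (x int, y text);
INSERT INTO tab_demo
SELECT g % 1000, md5(g::text)
FROM   generate_series(1, 1000000) AS g;

-- Same columns, once with y as payload and once with y as key column.
CREATE INDEX tab_demo_x_incl_y ON tab_demo (x) INCLUDE (y);
CREATE INDEX tab_demo_x_y_key  ON tab_demo (x, y);

-- internal_pages counts the pages above the leaf level.
SELECT 'INCLUDE (y)' AS variant, tree_level, internal_pages, leaf_pages
FROM   pgstatindex('tab_demo_x_incl_y')
UNION ALL
SELECT 'key (x, y)', tree_level, internal_pages, leaf_pages
FROM   pgstatindex('tab_demo_x_y_key');
```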
Plus, there is a case with expression indexes. Postgres is currently (Postgres 15) not smart enough to choose an index-only scan unless the involved column itself is included in the index. The manual again:
If an index-only scan seems sufficiently worthwhile, this can be
worked around by adding x as an included column, for example
CREATE INDEX tab_f_x ON tab (f(x)) INCLUDE (x);
Unless x itself is also involved in the query, INCLUDE (x) only serves as an awkward hint for the query planner, while only f(x) is actually used.
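A sketch of that workaround, using the built-in lower() as a stand-in for f and a hypothetical table tab_expr. The index name and the data are made up for illustration, and the planner may still choose a different plan.

```sql
-- Hypothetical table; lower() stands in for f() from the manual.
CREATE TABLE tab_expr (x text);
INSERT INTO tab_expr
SELECT 'Key' || g::text
FROM   generate_series(1, 100000) AS g;

-- INCLUDE (x) is only there so the planner can consider an index-only scan.
CREATE INDEX tab_expr_lower_x ON tab_expr (lower(x)) INCLUDE (x);

VACUUM ANALYZE tab_expr;

-- The query only needs lower(x), which is the index key.
EXPLAIN SELECT lower(x) FROM tab_expr WHERE lower(x) = 'key42';
```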
Answer 2: "Wide" columns are big columns, columns that occupy a lot of storage "on disk" (often not a "disk" nowadays) or in RAM. On-disk storage governs how many data pages have to be visited to satisfy a query, which is commonly the most important factor for performance. What counts is the internal representation, not the text representation you see. Test with pg_column_size(), but be aware that the size of data in "packed" format ("on disk") can be more compact than in RAM. And there are various overheads. See:
- Measure the size of a PostgreSQL table row
- Why *not* ERROR: index row size xxxx exceeds maximum 2712 for index "foo"?
- What is the overhead for varchar(n)?
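A short sketch of measuring column width with pg_column_size(), followed by what a too-wide payload can look like. The table wide_demo and its index are hypothetical. The final INSERT is expected to be rejected with an "index row size ... exceeds ... maximum" error, because the generated string is far larger than a B-tree index tuple may be, but the exact outcome depends on how well the value compresses.

```sql
-- Compare the storage size of a few values of different types.
SELECT pg_column_size(1::int)            AS int4_bytes,
       pg_column_size(1::bigint)         AS int8_bytes,
       pg_column_size(now())             AS timestamptz_bytes,
       pg_column_size(repeat('x', 1000)) AS text_1000_bytes;

-- Hypothetical table with a potentially wide payload column.
CREATE TABLE wide_demo (x int, y text);
CREATE INDEX wide_demo_x_y ON wide_demo (x) INCLUDE (y);

-- A narrow value fits into the index without trouble.
INSERT INTO wide_demo VALUES (1, 'short');

-- About 32,000 characters of poorly compressible text:
-- this insert is likely to fail because of the index, not the table.
INSERT INTO wide_demo
SELECT 2, string_agg(md5(g::text), '')
FROM   generate_series(1, 1000) AS g;
```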
Answer 3: Write costs and bloated size apply to INCLUDE (y) as well as to regular index columns. The question is whether to add INCLUDE columns at all, which are always logically optional. (Regular index columns are often not optional.) Also, see answer 1.
The difference between covering and including isn’t when selecting, it is when inserting and updating. The INCLUDEd columns do not have to be kept in a stable order so if you update those columns (without changing their size if variable, or updating the others covered by the index) things do not need to be reordered. This can reduce page splits or other extra writes, making the operation more efficient and reducing internal fragmentation.
I assume wide columns means strings of any significant size or variable size. Other column types are generally smaller and of fixed size.
Just what it says: the value is copied into the index so the index is larger, and if there is an index scan or partial scan more pages will probably need to be accessed.
Note: I’m mainly an MS SQL Server person, so there may be Postgres-specific edge cases around this.
In addition to Erwin’s great answer, there is an additional advantage to using the INCLUDE syntax: documentation.
Imagine that you decide that you need an index on columns (a, b) of table tab. Now you find that there is already an index on (a, c). In this situation you have two options:
- simply go ahead and create another index
- if you know for sure that column c was only added to the index to support an index-only scan and is never used as a search condition, you can drop the old index and create a new one on (a, b, c), thus saving an index
Now it is usually difficult to determine that an index column is never used as a search condition, unless it appears in the INCLUDE clause. In that case, you don’t have to think twice and can replace the index on (a) INCLUDE (c) with one on (a, b) INCLUDE (c).
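That replacement could look roughly like the sketch below. The table tab_abc and both index names are hypothetical. Creating the new index before dropping the old one is just one possible order, chosen here so that queries on a are never left without an index.

```sql
-- Hypothetical table and pre-existing index with c as payload.
CREATE TABLE tab_abc (a int, b int, c int);
CREATE INDEX tab_abc_a_incl_c ON tab_abc (a) INCLUDE (c);

-- c sits in INCLUDE, so it cannot be used as a search key.
-- That makes it safe to add b as a key column and keep c as payload.
CREATE INDEX tab_abc_a_b_incl_c ON tab_abc (a, b) INCLUDE (c);

-- The old index is now redundant.
DROP INDEX tab_abc_a_incl_c;
```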