Advantage of using INCLUDE versus adding the column as an index key for a covering index


The Postgres docs state the following about index-only scans and covering indexes:

if you commonly run queries like

SELECT y FROM tab WHERE x = 'key';

the traditional approach to speeding up such queries would be to
create an index on x only. However, an index defined as

CREATE INDEX tab_x_y ON tab(x) INCLUDE (y);

could handle these queries as index-only scans, because y can be
obtained from the index without visiting the heap.

Because column y is not part of the index’s search key, it does not
have to be of a data type that the index can handle; it’s merely
stored in the index and is not interpreted by the index machinery.
Also, if the index is a unique index, that is

CREATE UNIQUE INDEX tab_x_y ON tab(x) INCLUDE (y);

the uniqueness condition applies to just column x, not to the
combination of x and y. (An INCLUDE clause can also be written in
UNIQUE and PRIMARY KEY constraints, providing alternative syntax for
setting up an index like this.)

Question 1: If the data type of y can be indexed and there is no uniqueness requirement, is there any advantage to using CREATE INDEX tab_x_y ON tab(x) INCLUDE (y) over CREATE INDEX tab_x_y ON tab(x, y) for queries like SELECT y FROM tab WHERE x = 'key';?

It’s wise to be conservative about adding non-key payload columns to
an index, especially wide columns. If an index tuple exceeds the
maximum size allowed for the index type, data insertion will fail. In
any case, non-key columns duplicate data from the index’s table and
bloat the size of the index, thus potentially slowing searches.

Question 2: Can someone explain with an example what wide columns mean?

Question 3: Can someone explain the statement below in the context of INCLUDE (y)? If INCLUDE supports index-only scans, then y must also be stored in the index. So how would the statement below not apply to INCLUDE (y)?

In any case, non-key columns duplicate data from the index’s table and
bloat the size of the index


Method 1

Rule of thumb 1: If you never use an index column for filtering or sorting (or joining, or to enforce uniqueness), you might as well move it to the INCLUDE clause. Nothing lost, something gained.

Rule of thumb 2: INCLUDE columns only ever make sense if you actually get index-only scans from them. And in some cases not even then.

Answer 1: The INCLUDE feature is predominantly useful for the two cases your question excludes: enforcing uniqueness on only some of the columns, or carrying a column whose data type the index could not otherwise handle. But there are still minor benefits in other cases. The manual explains further down:

Suffix truncation always removes non-key columns from upper B-Tree levels. As payload columns, they are never used to guide index scans.
The truncation process also removes one or more trailing key column(s)
when the remaining prefix of key column(s) happens to be sufficient to
describe tuples on the lowest B-Tree level. In practice, covering
indexes without an INCLUDE clause often avoid storing columns that
are effectively payload in the upper levels. However, explicitly
defining payload columns as non-key columns reliably keeps the tuples
in upper levels small.

Now, B-tree indexes are only a few levels deep. But the upper levels are the ones that have to be read all the time, so keeping those small helps the most, and even a small saving per tuple adds up. The benefit is biggest for large cardinalities (multiple index levels) combined with little duplication in the key columns, because in that case suffix truncation cannot make up for not moving a payload column to the INCLUDE part.
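To see the two variants side by side, here is a minimal sketch. The table contents, the index names tab_x_y_key and tab_x_y_incl, and the row count are assumptions made up for illustration. It is meant to be run in a scratch database, and the sizes and plans you get depend on your data and Postgres version.

```sql
-- Hypothetical demo table; names and row count are illustrative only.
CREATE TABLE tab (x text, y text);

INSERT INTO tab (x, y)
SELECT 'key' || g, md5(g::text)
FROM generate_series(1, 100000) AS g;

-- Variant A: y is a trailing key column.
CREATE INDEX tab_x_y_key ON tab (x, y);

-- Variant B: y is a non-key payload column (needs Postgres 11 or later).
CREATE INDEX tab_x_y_incl ON tab (x) INCLUDE (y);

-- Index-only scans rely on the visibility map, which VACUUM maintains.
VACUUM ANALYZE tab;

-- Compare the on-disk size of both variants.
SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS index_size
FROM pg_class
WHERE relname IN ('tab_x_y_key', 'tab_x_y_incl');

-- Either index can serve this query as an index-only scan.
EXPLAIN SELECT y FROM tab WHERE x = 'key42';
```

With unique values in x, as here, any size difference is likely to be small. The point is only to have a way to measure it on your own data.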

Plus, there is a special case with expression indexes. Postgres is currently (Postgres 15) not smart enough to choose an index-only scan unless the involved column itself is included in the index. The manual again:

If an index-only scan seems sufficiently worthwhile, this can be
worked around by adding x as an included column, for example

CREATE INDEX tab_f_x ON tab (f(x)) INCLUDE (x);

Unless plain x is also involved in the query, INCLUDE (x) only serves as an awkward hint for the query planner, while only f(x) is actually used.
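A concrete version of that workaround might look like the sketch below. The table users, its columns, and the index name are hypothetical, and lower() stands in for f(). Whether the planner actually picks an index-only scan depends on table size, statistics, and the visibility map.

```sql
-- Hypothetical table and data for illustration.
CREATE TABLE users (id bigint, email text);

INSERT INTO users (id, email)
SELECT g, 'User' || g || '@example.com'
FROM generate_series(1, 100000) AS g;

-- Expression index that also carries the plain column as payload,
-- following the workaround quoted from the manual.
CREATE INDEX users_lower_email_incl ON users (lower(email)) INCLUDE (email);

-- Index-only scans rely on the visibility map, which VACUUM maintains.
VACUUM ANALYZE users;

-- The query only uses lower(email); the included column is there
-- so that the planner considers an index-only scan at all.
EXPLAIN SELECT lower(email) FROM users WHERE lower(email) = 'user42@example.com';
```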

Answer 2: "Wide" columns are big columns, columns that occupy a lot of storage "on disk" (often not a "disk" nowadays) or in RAM. On-disk storage governs how many data pages have to be visited to satisfy a query, which is commonly the most important factor for performance. What counts is the internal representation, not the text representation you see. Test with pg_column_size(), but be aware that the size of data in "packed" format ("on disk") can be more compact than in RAM. And there are various overheads.
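As a quick way to get a feel for column width, you can call pg_column_size() on a few sample values. This is only a sketch. The values are arbitrary, and sizes reported for constants reflect the in-memory representation, which can differ from what ends up stored in a table or index.

```sql
-- Fixed-width types report the same size for every value;
-- variable-length types such as text grow with their content.
SELECT pg_column_size(1::int4)           AS int4_bytes,
       pg_column_size(1::int8)           AS int8_bytes,
       pg_column_size(now())             AS timestamptz_bytes,
       pg_column_size('abc'::text)       AS short_text_bytes,
       pg_column_size(repeat('x', 1000)) AS long_text_bytes;
```

For stored data, run the same function against an actual column, for example SELECT avg(pg_column_size(y)) FROM tab;.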

Answer 3: Write costs and increased size apply to INCLUDE (y) just as they do to regular index columns. The question is whether to add INCLUDE columns at all, since they are always logically optional. (Regular index columns are often not optional.) Also, see answer 1.

Method 2

  1. The difference between covering and including isn’t in selecting, it is in inserting and updating. The INCLUDEd columns do not have to be kept in a stable order, so if you update those columns (without changing their size, if variable, and without updating the other columns covered by the index), entries do not need to be reordered. This can reduce page splits or other extra writes, making the operation more efficient and reducing internal fragmentation.

  2. I assume wide columns means strings of any significant size or variable size. Other column types are generally smaller and of fixed size.

  3. Just what it says: the value is copied into the index so the index is larger, and if there is an index scan or partial scan more pages will probably need to be accessed.

Note: I’m mainly an MS SQL Server person, so there may be Postgres-specific edge cases around this.

Method 3

In addition to Erwin’s great answer, there is an additional advantage to using the INCLUDE syntax: documentation.

Imagine that you decide that you need an index on columns (a, b) of table tab. Now you find that there is already an index on (a, c). In this situation you have two options:

  • simply go ahead and create another index

  • if you know for sure that column c was only added to the index to support an index-only scan and is never used as a search condition, you can drop the old index and create a new one on (a, b, c), thus saving an index

Now it is usually difficult to determine that an index column is never used as a search condition, unless it appears in the INCLUDE clause. In that case, you don’t have to think twice and can replace the index on (a) INCLUDE (c) with one on (a, b) INCLUDE (c).
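The replacement described above could be sketched like this. The table name tab_abc, the column types, and the index names are hypothetical, and on a busy production table you would probably prefer CREATE INDEX CONCURRENTLY outside a transaction block.

```sql
-- Hypothetical starting point: an index whose INCLUDE clause documents
-- that c is payload only and never used as a search condition.
CREATE TABLE tab_abc (a int, b int, c int);
CREATE INDEX tab_abc_a_incl_c ON tab_abc (a) INCLUDE (c);

-- Later, an index on (a, b) is needed. Because c is known to be payload,
-- one wider index can replace the old one instead of adding a second.
BEGIN;
CREATE INDEX tab_abc_a_b_incl_c ON tab_abc (a, b) INCLUDE (c);
DROP INDEX tab_abc_a_incl_c;
COMMIT;
```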


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
