Why is the default page size for databases so small?

In PostgreSQL and SQL Server, the default page size is 8 KB; in MySQL it is 16 KB; and in IBM DB2 and Oracle it is only 4 KB.

Why are these page sizes so small?

Is there a historical or memory usage reason?

How to solve:

Method 1

I’m running Linux (Fedora 34, 64 bit, two cores, four CPUs, 32 GB RAM – PostgreSQL 13.3).

If I run stat -f some_random_file as follows:

$ stat -f blah.txt
  File: "blah.txt"
    ID: f1b798b1610e7067 Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 322411548  Free: 316122834  Available: 299727775
Inodes: Total: 81960960   Free: 81739842
$

Note: Block size: 4096 = 4096 bytes = 32768 bits.

Now, even if you have a file that is two bytes long ("Hi") – it will still occupy 4096 bytes on disk – it’s basically the minimum I/O that can be performed by the OS. The OS takes stuff off disk as 4K "chunks" and spits them back on in 4K chunks – see here for a quick overview. You might like to test on your own system.
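
A quick way to see this for yourself (hi.txt is just an example name; the exact figures depend on your filesystem – the output below is what a typical ext4 setup would report):

$ printf 'Hi' > hi.txt
$ stat -c 'apparent size: %s bytes, allocated: %b blocks of %B bytes' hi.txt
apparent size: 2 bytes, allocated: 8 blocks of 512 bytes
$ du -h hi.txt
4.0K    hi.txt

So two bytes of content still tie up 8 × 512 = 4096 bytes of actual disk space.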

The disk itself has its own "atomic" unit. With HDDs, this was normally 512 bytes, but see the link above – "and at the hardware level the old drives used 512B sectors while new devices often write data in larger chunks (often 4kB or even 8kB)". See here for HDDs and here for SSDs. (Thanks to @RonJohn for his comment).
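
If you want to check what your own drive reports, the kernel exposes the sector sizes under /sys (sda is just an assumed device name, and the values shown are what a typical "512e" drive would report – substitute your own device):

$ cat /sys/block/sda/queue/logical_block_size    # sector size the drive advertises to the OS
512
$ cat /sys/block/sda/queue/physical_block_size   # the drive's real internal sector size
4096
$ lsblk -o NAME,LOG-SEC,PHY-SEC                  # the same information for every block device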

Likewise, the database will read data in and out in blocks (also called pages – the terminology can be confusing) – if you change one solitary bit in a record, the database will still have to read the entire page the record is on and write the entire page back onto disk once the modification is completed.

On PostgreSQL, you have the default block size of 8K.

test_1=# SELECT name, setting, short_desc, extra_desc FROM pg_settings WHERE name like '%block%' or short_desc LIKE '%block%';

      name      | setting |                  short_desc                  | extra_desc
----------------+---------+----------------------------------------------+------------
 block_size     | 8192    | Shows the size of a disk block.              |
 wal_block_size | 8192    | Shows the block size in the write ahead log. |
(2 rows)

test_1=#
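
Note that in PostgreSQL this is a compile-time constant rather than a postgresql.conf setting – a rough sketch of how a non-default block size would be chosen when building from source (accepted values are 1, 2, 4, 8, 16 or 32 kB):

$ ./configure --with-blocksize=16    # bake a 16 kB page size into this build
$ make && make install
$ # any cluster initialized with this build will then report block_size = 16384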

It’s important that there not be too big a gap between the HDD, the OS and the RDBMS "atomic unit" sizes – otherwise, you run the risk of torn pages – from the link:

Avoiding Torn Pages

For the Postgres file layout, Postgres reads and writes data to disk
8kb at a time. Most operating systems make use of a smaller page size,
such as 4kb. If Postgres is running on one of these operating systems,
an interesting edge case can occur. Since Postgres writes to disk in
units of 8kb and the OS writes to disk in units of 4kb, if the power
went out at just the right time, it is possible that only 4kb of an
8kb write Postgres was performing were written to disk. This edge case
is sometimes referred to as “torn pages”. Postgres does have a way of
working around torn pages, but it does increase the amount of I/O
Postgres needs to perform.

Also, see here:

Partial Writes / Torn Pages

So what are full page writes about? As the comment in postgresql.conf
says it’s a way to recover from partial page writes – PostgreSQL uses
8kB pages (by default), but other parts of the stack use different
chunk sizes. Linux filesystems typically use 4kB pages (it’s possible
to use smaller pages, but 4kB is the max on x86), and at the hardware
level the old drives used 512B sectors while new devices often write
data in larger chunks (often 4kB or even 8kB).

So when PostgreSQL writes the 8kB page, the other layers of the
storage stack may break this into smaller chunks, managed separately.
This presents a problem regarding write atomicity. The 8kB PostgreSQL
page may be split into two 4kB filesystem pages, and then into 512B
sectors. Now, what if the server crashes (power failure, kernel bug,
…)?
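
The workaround referred to above is PostgreSQL's full_page_writes setting, which is on by default; a quick check on your own instance (the output below is what a stock install normally shows):

$ psql -c "SHOW full_page_writes;"
 full_page_writes
------------------
 on
(1 row)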

As with so much related to computer science, it’s a matter of trade-offs and compromises – here’s a benchmark of PostgreSQL run against the same system just changing the block size – from the post:

Samsung SSD 840, 500 GB                          TPS (txns/second)

blocksize=2k                                     147.9
blocksize=4k                                     141.7
blocksize=8k                                     133.9
blocksize=16k                                    127.2
blocksize=1MB                                     42.5

So, you can see that a naïve, "make the db block size as big as possible" approach doesn’t work very well. All I will say about this is that database benchmarks are a total quagmire… for some applications 1 MB may well be suitable – although straying beyond 16 KB would require considerable justification. Systems’ default parameters are just that – defaults – chosen to be reasonably good under the widest range of circumstances…
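
Benchmarks like the one above are typically produced with pgbench; a minimal sketch you could repeat against builds with different block sizes (the database name, scale factor, client count and duration here are arbitrary choices, not the ones used in the quoted post):

$ createdb blocksize_test
$ pgbench -i -s 50 blocksize_test          # initialize with scale factor 50
$ pgbench -c 4 -j 2 -T 60 blocksize_test   # 4 clients, 2 threads, 60 seconds; prints a TPS figure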

Re. the historical part of the question – yes, a lot of it relates to history when disks came in 512 byte sectors… HDDs, despite advances in speed and incremental improvements in the composition of the rust, are essentially unchanged since the first HDDs – HDD performance has fallen way behind that of CPUs and RAM… capacity has increased, speed not so much (see here) – hence the birth of the "NoSQL" bandwagon movement (but that’s another day’s work 🙂 )!

There’s a lot happening in the area these days…

If you are interested – and have the time – I’ve perused it a couple of times, but it’s a bit above my pay grade… there’s an article here on Linux I/O and how it’s being revolutionised by io_uring (see wiki – and links therein).

Intel are also providing an open source toolkit, the SPDK (the Storage Performance Development Kit) which appears (at least to my untrained eye) to be some sort of way of allowing user space processes to directly access hardware without going through the kernel… here’s an interesting take on how this could be applied to databases.

And, also arriving on the scene, is byte-addressable storage… for reasons best known to hardware designers, SSDs (at least some of them) also have blocks and pages… They are not a panacea (check out TLC SSD and normal HDD write speed – only a 30% gain).

However, on the (far?) horizon, there’s Persistent Memory – from the post:

8-Byte Atomicity

Persistent Memory (such as Intel Optane DC Persistent Memory) natively
operates byte-by-byte rather than in blocks of data like conventional
storage. Data is persisted in chunks of 8-bytes at most in Persistent
Memory (again, using the default behavior). For applications (like
databases) that are based on a BLOCK construct, the 8-byte atomicity
of Persistent Memory can be an issue. Writing 8,192 bytes of data (an
8K block) will get persisted in 1,024 chunks of 8-bytes each. Power
failure or other abnormal condition can leave blocks of data
"fractured" or "torn" into pieces, with portions of the block
containing old data and other portions with new data. Applications
(such as databases) need to be changed to tolerate this sort of
fracturing or tearing of blocks. Otherwise, these are effectively
corrupted blocks of data. Continue reading for more information on the
topic of corruption and especially how this relates to Oracle
databases.

So, we can see how problems such as torn pages can still occur with these systems – but they do offer the possibility – when database programmers catch up – of having the block size = 8 bytes (not 8 KB) – if you want to change a BIGINT’s value, all you have to do is read 8 bytes and write 8 bytes…

Maybe if we get down to this level, or even to the granularity of a single byte, the whole notion of pages will go out the window for the disk, the OS and the RDBMS? I’m sure that there’ll still be file systems – just not sure how they’ll work.

This is a fascinating area (+1 for the question!), especially for database geeks.

Method 2

I’m gonna answer from my experience with SQL Server, although I believe the reason might be the same for the other RDBMSs you mentioned.

If you check the Pages and Extents Architecture Guide doc, you’ll find that:

Disk I/O operations are performed at the page level. That is, SQL
Server reads or writes whole data pages.

That means when you request data, it will be loaded into memory by page and not by row. With that in mind, consider the following image as a representation of a page:

(Image: a representation of a SQL Server data page.)

A page can contain empty space, and if the default size were, say, 1 GB (in order to hold more data per page), a newly created page could have nearly 1 GB of empty space – just a few such pages would quickly fill your server’s memory with mostly empty space.

Another point related to memory is that as long as you can keep a page in memory (see Page Life Expectancy (PLE) in SQL Server), you don’t have to spend time reading it from disk every time that data is requested. If server memory is consumed quickly by a few huge pages, pages will have to be purged from memory more often to make room for the newly requested ones that SQL Server needs to work with.
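
If you want to watch this on your own instance, the Buffer Manager counters are exposed through the sys.dm_os_performance_counters DMV – a rough sketch (connection options such as -S and -U are omitted and would depend on your setup):

$ sqlcmd -Q "SELECT object_name, cntr_value AS ple_seconds
             FROM sys.dm_os_performance_counters
             WHERE counter_name = 'Page life expectancy';"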

The basic fundamental of Microsoft SQL Server for querying data is to
load the necessary data pages into the buffer pool from the disk and
return it to the client. If the buffer does not have enough space to
process it, then the oldest data pages will be eradicated to make some
room as required for new pages.

Those are the basic reasons why the pages are as small as you say.

Method 3

Small is a subjective term in this context. The bigger the Page Size setting is in a database, the more data is stored on a Page, and therefore the more data that needs to be loaded any time a given Page has to be read off disk. You can think of Pages as the unit of measure for how the data is physically stored on disk, and disk is generally the slowest hardware component of a server.

For example, if a query you run only needs to return 4 KB of data, but your Page Size is set to 1 GB, that means you’ll need to wait for an entire 1 GB of data to be loaded off disk to serve just 4 KB of data. Likely, that’s not going to be great performance-wise.

Also, that’s only under the assumption your 4 KB of data is consecutively stored on the same Page, which will be dependent on your data and the predicates of your query. If your data was spread across 4 Pages, for example, now 4 GB of data needs to be loaded off disk to serve only 4 KB of data.

For reference, 4 KB of data is roughly 1,000 rows for a single integer or datetime column. So even if we’re talking about a dataset that is 10 columns wide with an average data size of an integer data type, that’s still 100 rows of data that 4 KB can hold.

So the Page Size is chosen not to be too big, such that wasted I/O is spent loading more data off disk than necessary to serve queries, but conversely not too small either, otherwise you run into a performance bottleneck due to the increased number of operations needed to load many Pages for a small amount of data. 4 KB to 16 KB has been in the realm of reasonable among databases, which is why it’s the default. You can always adjust it if you find your database’s workload and use cases support changing it, but it’s usually not necessary.
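
For what it’s worth, each engine will tell you what it is actually using, so you can verify before deciding whether a change is justified – a couple of examples (connection options omitted; SQL Server is not shown because its 8 KB page size is fixed and cannot be changed):

$ mysql -e "SHOW VARIABLES LIKE 'innodb_page_size';"   # MySQL/InnoDB: 16384 by default,
                                                       #   fixed when the data directory is initialized
$ psql -c "SHOW block_size;"                           # PostgreSQL: 8192 by default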


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
