Is partition alignment to SSD erase block size pointless?


Many people seem to have the idea (1, 2, 3, 4, 5) that aligning the start of your SSD partitions at a multiple of the SSD erase block size is somehow beneficial. I do not see the benefit; consider the following partitioning (please suspend your disbelief about the 16K erase blocks; they are likely to be much larger in practice, and so are the partitions):

Partitions:      [    1   ]              [        2        ]
Logical blocks:  [ 4K ][ 4K ][ 4K ][ 4K ][ 4K ][ 4K ][ 4K ][ 4K ]
Physical blocks: [ 4K ][ 4K ][ 4K ][ 4K ][ 4K ][ 4K ][ 4K ][ 4K ]
Erase blocks:    [          16K         ][          16K         ]

Now if logical block K corresponded to physical block K for any K (e.g. if there were no wear-levelling done by the SSD controller), then there might be some theoretical merit to this. Suppose for example that partition 2 in the above figure starts one logical / physical block earlier. Then any write at the beginning of partition 2 will cause the erasure of the first erase block as will any write to partition 1, which will cause additional wear to that particular erase block.

With wear-levelling, however, there is no set correspondence between logical and physical blocks (e.g. the logical block K can correspond to an arbitrary physical block L), so the erase-block alignment should be completely meaningless. Alignment to block size should be sufficient, so that pages (for swapping) and filesystem blocks (for data) written out to the partition do not occupy more blocks on the SSD than necessary.
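The no-remapping case from the figure can be sketched in a few lines. This is only an illustration under the question's own assumptions (4K logical blocks, 16K erase blocks, logical block K stored at physical block K); the function name and the exact partition extents are made up for the example.

```python
def erase_blocks_touched(start_lb, length_lb, lb_per_erase_block):
    """Erase-block indices covered by a run of logical blocks,
    ASSUMING logical block K sits at physical block K (no remapping)."""
    first = start_lb // lb_per_erase_block
    last = (start_lb + length_lb - 1) // lb_per_erase_block
    return set(range(first, last + 1))


LB_PER_EB = 4  # 16K erase block / 4K logical block, as in the figure

partition1 = erase_blocks_touched(0, 2, LB_PER_EB)      # logical blocks 0-1
aligned_p2 = erase_blocks_touched(4, 4, LB_PER_EB)      # logical blocks 4-7
misaligned_p2 = erase_blocks_touched(3, 4, LB_PER_EB)   # starts one block earlier

print(partition1 & aligned_p2)     # set(): no erase block is shared
print(partition1 & misaligned_p2)  # {0}: both partitions wear erase block 0
```

Once the controller remaps logical blocks, the identity assumption in the docstring no longer holds, which is exactly the point of the question.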



Method 1

This question is very hard, especially in view of the fact that SSD technology
is in constant evolution, and especially since modern operating systems are
constantly improving their handling of SSD.

In addition, I’m not sure that your problem is with Wear leveling.
It should rather be with SSD optimizations designed to avoid block erases.

Let us first get our terms right :

  • An SSD block or Erase block is the unit that the SSD can erase in one atomic operation. It can be as large as 4 MB
    (but 128 KB or 256 KB are more common).
    An SSD cannot write to a block without erasing it first.
  • An SSD page is the smallest atomic unit that the SSD software can track.
    A block contains multiple pages, each usually up to 4 KB in size.
    The SSD keeps a mapping per page of where the OS thinks it is located
    on the disk (the SSD writes pages wherever it prefers although the OS will
    think in terms of a sequential disk).
  • A sector is the smallest element that the operating system thinks a hard disk
    can write in one operation. The OS will also think in terms of disk cylinders
    and tracks, even if they do not apply to SSD.
    The OS will usually inform the SSD when a sector becomes free
    (TRIM).
    Smart SSD firmware will usually announce to the OS its page-size as the sector-size where possible.

It is clear that the SSD firmware would prefer always writing to empty blocks,
as they are already erased. Otherwise, to add a page to a block that contains
data will require the sequence of read-block/store-page/erase-block/write-block.
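To make that sequence concrete, here is a toy model. It is a sketch, not how any real firmware is written: the `Block` class, the four-page block size and the `overwrite_in_place` helper are all invented for illustration.

```python
PAGES_PER_BLOCK = 4   # toy value; real blocks hold far more pages
ERASED = None         # marker for an erased (writable) page


class Block:
    def __init__(self):
        self.pages = [ERASED] * PAGES_PER_BLOCK
        self.erase_count = 0

    def erase(self):
        self.pages = [ERASED] * PAGES_PER_BLOCK
        self.erase_count += 1

    def program(self, index, data):
        # A page can only be programmed while it is in the erased state.
        if self.pages[index] is not ERASED:
            raise ValueError("page must be erased before it is written")
        self.pages[index] = data


def overwrite_in_place(block, index, data):
    """read-block / store-page / erase-block / write-block."""
    snapshot = list(block.pages)          # read-block
    snapshot[index] = data                # store-page (in the copy)
    block.erase()                         # erase-block
    programmed = 0
    for i, page in enumerate(snapshot):   # write-block
        if page is not ERASED:
            block.program(i, page)
            programmed += 1
    return programmed


block = Block()
for i, data in enumerate(["a", "b", "c"]):
    block.program(i, data)

pages_programmed = overwrite_in_place(block, 1, "B")
print(pages_programmed, block.pages, block.erase_count)
```

Changing a single page here costs one erase and three page programs, which is why the firmware would rather write the new page into a block that is already erased.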

Too liberal application of the above will cause pages to be dispersed all over
the SSD and most blocks to become partially empty, so the SSD may soon run out
of empty blocks. To avoid that, the SSD will continuously do
Garbage collection in the background, consolidating partially-written
blocks and ensuring enough empty blocks are available.
This operation may look like this:

[Figure: garbage collection diagram]

Garbage collection introduces another factor, Write amplification,
meaning that one OS write to the SSD may need more than one physical write
on the SSD.
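Write amplification is usually expressed as a ratio of physical writes to host writes. A minimal sketch of that arithmetic follows; the page counts are made-up numbers, not measurements from any drive.

```python
def write_amplification(host_pages_written, gc_pages_copied):
    """Physical page writes divided by the page writes the OS asked for."""
    if host_pages_written <= 0:
        raise ValueError("host_pages_written must be positive")
    return (host_pages_written + gc_pages_copied) / host_pages_written


# Hypothetical workload: the OS writes 100 pages, and garbage collection
# has to copy 50 still-valid pages to consolidate blocks along the way.
print(write_amplification(100, 50))  # 1.5
print(write_amplification(100, 0))   # 1.0, the ideal case
```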

As an SSD block can only be erased and written a certain number of times before
it dies, Wear leveling is designed to distribute block writes uniformly
across the SSD so no block is written much more than others.
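The idea can be sketched with a deliberately naive policy: always send the next rewrite to the block with the fewest erases. Real controllers use more elaborate schemes; `pick_block` and `simulate` are hypothetical names used only here.

```python
def pick_block(erase_counts):
    """Choose the block with the fewest erases (ties go to the lowest index)."""
    return min(range(len(erase_counts)), key=lambda b: (erase_counts[b], b))


def simulate(writes, blocks, leveled):
    erase_counts = [0] * blocks
    for _ in range(writes):
        target = pick_block(erase_counts) if leveled else 0
        erase_counts[target] += 1   # each rewrite costs one erase of the target
    return erase_counts


print(simulate(12, 4, leveled=False))  # [12, 0, 0, 0]
print(simulate(12, 4, leveled=True))   # [3, 3, 3, 3]
```

Without leveling, a hot logical location keeps hammering the same physical block; with it, the same twelve rewrites are spread evenly.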

The question of partition alignment

From the above, it looks like the mechanism that allows the SSD to map pages
to any physical location, regardless of where the OS thinks they are stored,
voids the need for partition alignment. Since the page is not written where
the OS thinks it is written, it no longer matters where the OS
thinks it writes the data.

However, this ignores the fact that the OS itself attempts to optimize
disk accesses. For classical hard disk it will attempt to minimize head
movements by allocating data accordingly on different tracks.
Clever SSD firmware should manipulate the fictional cylinder and tracks
information that it reports to the OS so that track-size will equal
block-size, and page-size will equal sector-size.

When the view the OS has of the SSD is somewhat more in line with reality,
the optimizations done by the OS may spare the SSD some page remapping
and garbage collection, which will reduce Write amplification and
increase the lifetime of the SSD.

It should be noted that too much fragmentation of SSD (meaning too much
mapping of pages) increases the amount of work done by the SSD.
The 2009 article
Long-term performance analysis of Intel Mainstream SSDs
indicated that if the drive is abused for too long with a mixture of small and large writes, it can get into a state where the performance degradation is permanent, and that with Wear leveling this condition may extend to more
of the drive.
This condition is the reason why many SSD owners see performance degrade
over time.

My final advice is to align partitions to respect the erase-block layout.
The OS will assume that a partition is well-aligned with respect to the disk,
and its decisions on the placement of files might then be made more
intelligently. As always, individual idiosyncrasies of OS driver
versus SSD firmware may invalidate such concerns, but better to play it safe.

Method 2

Is partition alignment to SSD erase block size pointless?

Based on the advice from 2008 quoted below, it seems that whether aligning partitions to block-size boundaries matters depends on the make, model, and controller algorithms of any given SSD device.

For older-generation SSD devices that do not use smart technology and algorithms, doing this may be more important, but you should check the SSD manufacturer’s technical guide.

Aligning Filesystems to an SSD’s Erase Block Size

Aligning your file system on an erase block boundary is critical on
first generation SSD’s, but the Intel X25-M is supposed to have
smarter algorithms that allow it to reduce the effect of
write-amplification. The details are a little bit vague, but
presumably there is a mapping table which maps sectors (at some
internal sector size — we don’t know for sure whether it’s 512 bytes
or some larger size) to individual erase blocks.


Fast forward to 2014: based on the advice from then quoted below, reads and writes are aligned on page size and erases are done while the SSD is idle, so erase-block alignment does not matter and is pointless.

Really still need EBS (Erase Block Size) for partitioning a SSD?

Reads/writes are aligned on page size, and erases happen in the
background when the SSD is idle. As such, erase block size does not
matter. Page size matters, but it’s always small so just do
MiB-alignment. The filesystem will break it down to something around
4k anyway. It’s not possible to align to 1536KiB, not with hundreds of
small files in a filesystem. Even if you were to put your partitions
on 1536KiB boundaries and set your filesystem up with a 1536KiB raid
stride, you would not notice any difference. Reads/writes are done on
page level, and redistributed anyway by a flash translation layer. So
no matter how much you try to align from the outside, the SSD will
mangle it to its own purposes anyway, so what’s the point?


Conclusion

I think wear-leveling is one of the smarter, newer SSD algorithms that the 2008 post mentioned. Now that this and other related SSD and partitioning technologies are common, such alignment is more likely to be pointless.

I think some of the talk around the Internet on this topic may be correct for the SSD technology it was written about, and thus may simply not apply to other devices.

This comes down to understanding the specs and features of your SSD device(s) and of whatever partitioning tools you use, so that you can judge for your own environment whether or not the alignment is a pointless operation.



Method 3

The answer by Drink is excellent. But I’d like to help explain this further.

If you search for this topic on the internet, you’ll find a huge heaping pile of "guides" (even recent ones) based on the old, misguided belief that you can somehow "tell the filesystem to align/stripe its data inside the SSD’s erase block boundaries".

That’s a very, very, very silly belief.

The fact is that SSDs place each sector wherever they feel like. Each erase block contains lots of sectors. The laymen who wrote those "guides" incorrectly thought that SSDs place sectors sequentially (like hard disks do) within those erase blocks. They concluded that you benefit from forcing the filesystem to place as much related data as possible in each physical erase block, so that a future erase gets rid of a whole file (therefore a whole erase block) at once, and so that future file writes replace the contents of the entire erase block in place via a single "erase all of the sectors in this erase block, and then write all of the sectors in this erase block", without having to keep some sectors from the erase block and rewrite them elsewhere. Basically, the idea of "stride/stripe" was to hint to the filesystem to group related data together within physical SSD erase blocks.

The idea sounds nice in theory. But that’s not how SSDs work anymore.

That’s really, really not how any of this works.

SSDs don’t care about your "sequential sectors" and will write the data into any physical sector that is empty anywhere on the NAND flash chips.

When you tell a SSD to write to sector 1, it might place the data in physical sector 21349239. But it will pretend to the OS that it’s in "sector 1". Furthermore, when your operating system tells the SSD to write sequential sectors (1, 2, 3, 4, …), the SSD will spread the sectors across its NAND chips anywhere that there are individual empty sectors. It won’t place them sequentially across "erase blocks" whatsoever!

That’s because the SSD uses wear leveling algorithms to decide which physical sector ANYWHERE ON THE NAND FLASH CHIPS to place the given "virtual sector" in. It then stores a lookup table saying "virtual sector (what the OS thinks it is) = Stored at physical location X". So when you write a "sequential bunch of sectors" from your OS, your data ALWAYS ends up scattered randomly all over the entire SSD due to wear leveling.
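A toy version of that lookup table is sketched below. Everything in it is an assumption made for illustration: the `TinyFTL` name, the 16-page "drive", and especially the shuffled free list, which merely stands in for whatever placement policy a real controller uses.

```python
import random


class TinyFTL:
    """Toy flash translation layer: virtual sector -> physical page."""

    def __init__(self, physical_pages, seed=0):
        self.free = list(range(physical_pages))
        # Stand-in for the controller's placement policy (an assumption).
        random.Random(seed).shuffle(self.free)
        self.table = {}     # virtual sector -> physical page
        self.stale = set()  # physical pages holding outdated data

    def write(self, virtual_sector):
        if not self.free:
            raise RuntimeError("no free pages; garbage collection needed")
        old = self.table.get(virtual_sector)
        if old is not None:
            self.stale.add(old)   # the old copy is not overwritten in place
        self.table[virtual_sector] = self.free.pop()
        return self.table[virtual_sector]


ftl = TinyFTL(physical_pages=16)
placed = [ftl.write(sector) for sector in (1, 2, 3, 4)]
first_home = ftl.table[1]
ftl.write(1)   # rewriting "sector 1" lands on a different physical page
print(placed, first_home, ftl.table[1], ftl.stale)
```

The OS keeps addressing "sector 1" throughout; only the table knows that its data moved and that the old page is now garbage.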

Furthermore, whenever a sector dies (is worn out completely and can’t reliably hold data anymore), the SSD will move it somewhere else on the flash and remap it too.

You simply cannot know where the SSD will place your sectors. It is NOT sequential whatsoever.

In 2009, back when these "stride" discussions were gaining traction, Ted Ts’o, one of the primary engineers of the Linux kernel filesystems (particularly ext4), wrote a blog post theorizing that all of this was pointless because the SSD places its data wherever it wants. Unfortunately not enough people listened to him. Here’s his outdated blog post about the subject (don’t follow any of its advice; it’s all very outdated and all tools he’s talking about have been updated to do MiB alignment these days): https://tytso.livejournal.com/2009/02/20/

The interesting thing about that blog post is that it tells us the origin of these misguided beliefs about "SSD erase block boundaries". It talks about very old, first-generation SSDs that DIDN’T do wear-leveling and DIDN’T map "virtual sectors (what the OS sees)" to random hardware sectors. In early SSDs, the sectors WERE truly sequential, which is why the stride/stripe advice was born. Modern SSDs are way more advanced. And anyone following the old "stride/stripe" advice from misguided laymen will just be wasting their time.

The blog post further talked about Intel’s SSDs at the time being more advanced and having a virtual sector remapping table but lacking TRIM. He said that this virtual remapping was already on the verge of killing the need for stride/stripe. And he then theorized (correctly) that if TRIM is implemented, there will be ZERO need for "aligning along erase blocks" anymore.

And that’s where we are TODAY, with modern SSDs.

Modern SSDs use virtual remapping (sectors are NOT contiguous/sequential on the hardware), and they use TRIM to free up sectors. So there’s ZERO BENEFIT to trying to align to "erase blocks" because sequential OS sectors NO LONGER MAP onto sequential positions inside an erase block on the hardware. As I said, the sectors you write from your OS will end up randomly all over the disk on modern wear-leveling SSDs.

Trying to align filesystem data across a modern SSD’s erase blocks is futile and will NEVER work. So don’t waste your time.

The ONLY things that matter are:

1. Sector Alignment.

Align your partitions (start offset and size), LUKS payload offset, LVM payload (1st PE) offset, etc, all at 1 MiB boundaries. This ensures that they align perfectly with any physical sector size regardless of whether the SSD uses 512 byte, 4096, 8192, 16384 or any other power-of-2 physical sector sizes.

This is because 1 MiB (1 048 576 bytes) is evenly divisible by all sector sizes used by SSDs. Therefore, as long as you align your partitions and data structures (payloads of LUKS and LVM) on these boundaries, all virtual sectors that the OS operates on will be properly aligned with physical sectors.
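The divisibility claim is easy to check. The sketch below uses the sector sizes listed above; `is_aligned` is a hypothetical helper, and the start sector 63 example is the classic legacy MS-DOS partition offset, included only for contrast.

```python
MIB = 1024 * 1024
SECTOR_SIZES = [512, 4096, 8192, 16384]   # sizes named in the answer

for size in SECTOR_SIZES:
    print(size, MIB % size == 0)          # True for every size listed


def is_aligned(offset_bytes, alignment=MIB):
    """True if a byte offset falls exactly on an alignment boundary."""
    return offset_bytes % alignment == 0


# A partition starting at 512-byte sector 2048 begins exactly at 1 MiB.
print(is_aligned(2048 * 512))        # True
# The legacy start sector 63 is not even 4096-byte aligned.
print(is_aligned(63 * 512, 4096))    # False
```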

The purpose of doing this is to ensure that writes to virtual sectors (what the OS reads/writes from) will be aligned with physical sectors on the SSD so that the data from a single virtual sector only occupies a single physical sector, rather than being spread out across multiple sectors (which would then always require the SSD to read two or more sectors, modify the overlapping portion of them that the OS virtual sector is at, and then write all of the modified sectors to a new location on the SSD).

If you align this at 1 MiB boundaries, your OS sectors will always be perfectly aligned with physical sectors regardless of their size. If you don’t, then the SSD will slow down massively and get heavy write amplification due to the aforementioned process of always modifying multiple physical sectors per write instead of just a single sector.
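The cost of misalignment can be counted directly. This sketch assumes a 4096-byte physical sector and a 4096-byte write from the OS; the function name is made up for the example.

```python
def physical_sectors_touched(offset, length, physical_size):
    """How many physical sectors a write of `length` bytes at `offset` spans."""
    first = offset // physical_size
    last = (offset + length - 1) // physical_size
    return last - first + 1


PHYSICAL = 4096   # assumed physical sector size

# Write at the start of a partition that begins on a 1 MiB boundary:
print(physical_sectors_touched(1024 * 1024, 4096, PHYSICAL))  # 1
# Same write in a partition that begins at 512-byte sector 63:
print(physical_sectors_touched(63 * 512, 4096, PHYSICAL))     # 2
```

In the misaligned case every 4096-byte write straddles two physical sectors, so the drive has to read, modify and rewrite both.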

Luckily all modern tools (like GParted, cryptsetup and LVM) do 1 MiB alignment by default. That is all you need.

2. TRIM.

This is extremely important. It’s insanely important. It’s stupidly important. It’s as important as the alignment itself.

TRIM is what tells the SSD which sectors it can discard. If you run without TRIM (such as LUKS and LVM with default settings, which won’t forward discards), then your SSD’s internal lookup table will fill up completely, thinking that every sector matters and contains "living data". Therefore, the SSD won’t be able to "garbage collect" anymore. It will think that it is completely full even if your drive is only partially full. And writes will be an extremely slow operation, if it even works at all anymore. Without TRIM, there will be extreme write-amplification because all of that dead data will continue being kept and juggled around/written to new locations by the SSD as it struggles to write data (involving lots of read-modify-write of all sectors in entire erase blocks), because the SSD thinks that the entire drive is full of important data.

By contrast, with TRIM enabled, your SSD will be told which data is dead. The SSD controller will then take all living SSD sectors from multiple erase blocks, merge them all together into a new block with living data (such as merging three 30% full blocks where the remaining 70% is garbage, into a single 90% full block with 10% empty blocks ready for future writes), and then it erases all of the "garbage blocks", so that their internal sectors are completely empty and ready for fresh writes. This garbage collection goes on continuously (whenever the SSD is idle) in modern SSD controllers, constantly moving and compacting blocks and checking the health of living sectors and optimizing the NAND flash for future writes.
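The "three 30% full blocks" example works out as follows. This is just the arithmetic, with an assumed 100 pages per block so the percentages read directly as page counts; `compact` is a hypothetical name.

```python
def compact(live_pages_per_block, pages_per_block):
    """Merge the live pages of several partly stale blocks.

    Returns (blocks_needed, blocks_freed, pages_copied)."""
    live = sum(live_pages_per_block)
    blocks_needed = -(-live // pages_per_block)   # ceiling division
    blocks_freed = len(live_pages_per_block) - blocks_needed
    return blocks_needed, blocks_freed, live


needed, freed, copied = compact([30, 30, 30], pages_per_block=100)
fill = copied / (needed * 100)
print(needed, freed, copied, fill)  # 1 2 90 0.9
```

Three source blocks get erased and one destination block is consumed, so the drive gains two empty blocks at the price of 90 extra page writes. Without TRIM the controller would have to treat the stale 70% as live and copy it too.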

And this goes back to the original point about how pointless it is to believe that you can use "filesystem stride" to align sectors along SSD erase blocks. Remember, SSDs can write into any still-empty sector in any "erase block" anywhere on the physical storage. And SSDs will constantly move those sectors around into new erase blocks to optimize the drive performance.

The only thing you should think about when using TRIM is that you should NOT use "instant discards" (i.e. mounting ext4 with the "discard" flag). Because constantly telling the SSD about dead data instantly whenever sectors become free, will cause it to excessively shuffle/compact/garbage collect small amounts of data all the time. Instead, you should be using a weekly timer that does a big "all at once" TRIM of ALL unused blocks. This is how it’s done by default on modern operating systems. For example, Windows uses a weekly timer (you can see it by opening the built-in "Optimize Drives" application), and Linux systemd-based operating systems basically all contain "fstrim.timer", which causes "fstrim.service" to execute weekly. Which in turn executes /usr/sbin/fstrim --fstab --verbose --quiet. The fstrim command informs storage devices (physical and virtual) about unused blocks. In fact, this doesn’t just help the SSD. It also helps things like LVM thin provisioning (where multiple volumes share the same physical pool of space) to return blocks to the free pool. So TRIM is just an overall health boost for every part of your system!
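Before relying on the weekly fstrim run, it can be worth checking that the block device (or the LUKS/LVM layer on top of it) actually advertises discard support. On Linux the kernel exposes this in sysfs as `queue/discard_granularity`, where 0 means discards are not supported. The sketch below only reads that file; the helper names and the `sda` device name are assumptions for the example.

```python
from pathlib import Path


def discard_granularity(device, sysfs_root="/sys/block"):
    """Return the device's discard granularity in bytes, or None if unknown."""
    path = Path(sysfs_root) / device / "queue" / "discard_granularity"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def supports_discard(device, sysfs_root="/sys/block"):
    """True only when the kernel reports a non-zero discard granularity."""
    return bool(discard_granularity(device, sysfs_root))


# "sda" is a placeholder; substitute your own device (e.g. a dm-N node).
print(supports_discard("sda"))
```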

Lastly, a small sidenote regarding encryption. If you’re using LUKS (encryption), you may have seen passionate rants saying that you shouldn’t use TRIM because it’s "insecure". That’s not true. The only thing that attackers can derive from seeing empty (zeroed) sectors on an encrypted drive, is that they will know how much data you have on there (such as if it’s 50% full), which really doesn’t help them at all. It also tells them the virtual locations of the used sectors which lets them pretty accurately guess which filesystem is used if the drive is nearly empty, since things like filesystem metadata storage offsets will be visible. But it won’t let them get into your data. And guess what? Unless you’re the government, I can assure you that you will be happier with a healthy, fast, TRIM’d drive that lives a long, very healthy life. So enable TRIM for your sake! Besides, do you really think that you as a private person would ever encounter an attacker that tries to analyze free (trimmed) sectors to guess things about your computer? That paranoid threat model makes zero sense for most of us. It’s far more important that you use a secure passphrase so that random people don’t just type "password" and unlock your system! 😉

Alright, that’s everything you ever wanted to know about the subject. The short executive summary is:

  • Aligning by "erase blocks" via stride/stripe is COMPLETELY IMPOSSIBLE on modern SSDs, on anything manufactured since around the year 2010. So don’t even attempt it. It’s based on ancient information for first-generation SSDs which used to store sectors sequentially and lacked TRIM. None of that is true anymore. All SSDs nowadays use wear leveling to store sectors randomly all over the NAND anywhere that there are empty sectors and use a virtual mapping table to keep track of where data is truly stored on the disk. And they rely on TRIM to get rid of wasteful storage/write amplification.
  • But you MUST align all data containers (such as partition start+size, LUKS payload and LVM’s 1st Physical Extent) on 1 MiB boundaries to ensure that your virtual storage layer sectors all line up with physical sectors, to ensure that your writes only occupy single sectors. All modern tools (such as GParted, cryptsetup and LVM) now do this universally perfect 1 MiB alignment by default.
  • Enable TRIM. It’s extremely important since the SSD requires knowledge about dead/useless sectors for proper garbage collection, performance, and a long and healthy life.
  • Bonus: As for filesystem block size (regardless of what filesystem you use), you should be using 4096 bytes regardless of physical sector size, because this strikes the ultimate performance/space efficiency balance for modern filesizes. The larger the filesystem blocks, the less overhead in the Kernel’s interrupts (since each block has to go through an interrupt and a queue). But larger than 4096 doesn’t make sense because of diminishing returns in the performance, while increasing wastefulness for small files. So this is why all modern filesystems default to 4096 byte blocks.
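The "wastefulness for small files" part of the last bullet can be put into numbers. The sketch below assumes every file occupies a whole number of filesystem blocks (it ignores inline data, tail packing and metadata), and the file sizes are invented.

```python
def allocated_bytes(file_size, block_size):
    """Space consumed when a file must occupy whole filesystem blocks."""
    blocks = -(-file_size // block_size)   # ceiling division
    return blocks * block_size


def slack(file_sizes, block_size):
    """Total bytes allocated but not used by file contents."""
    return sum(allocated_bytes(size, block_size) - size for size in file_sizes)


sizes = [100, 1500, 5000, 70000]   # made-up file sizes in bytes

print(slack(sizes, 4096))    # 13512
print(slack(sizes, 16384))   # 54472
```

For these four files, going from 4096-byte to 16384-byte blocks roughly quadruples the wasted space.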

Enjoy!

Method 4

Having a 4096 byte block may be beneficial for performance with regard to the computer hardware/OS. However, since many SSDs have a physical NAND page size of 2048 bytes, for instance, setting a file system block (cluster) to 4096 bytes will result in a great deal of unnecessary manipulation of the second physical NAND page whenever the tail of a file write is 2048 bytes or less (for example, writes of 6144, 10240, 14336, etc. bytes). While it may be possible to have firmware adequately handle such situations, this would still place additional requirements on the storage device. A 4096 byte file system block size would also place 2x the demand on 2048 byte physical NAND pages, as well as increase how often the SSD must trigger the cycle that relocates data and clears erase blocks. In my opinion, since computer hardware has become blisteringly fast, it would be most beneficial for the storage to be ready to receive data efficiently versus saving interrupt cycles at the computer. Matching file system block (cluster) size to physical NAND page size relieves demand on the SSD storage in several ways and is more conducive to an efficient SSD state. From my understanding of the principles involved, in the majority of cases, though certainly not in all, it is most efficient for the OS file system block (cluster) size to match the physical NAND page size.
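The argument above can be illustrated by counting NAND pages. The sketch takes this answer's 2048-byte NAND page as given and assumes the filesystem always writes whole blocks and the drive programs every page of each block; real firmware may behave differently.

```python
def nand_pages_written(file_size, fs_block, nand_page):
    """NAND pages programmed when a file is written in whole fs blocks.

    Assumes fs_block is a multiple of nand_page."""
    fs_blocks = -(-file_size // fs_block)   # ceiling division
    return fs_blocks * (fs_block // nand_page)


NAND_PAGE = 2048   # page size assumed by this answer

for size in (2048, 6144, 10240):
    print(size,
          nand_pages_written(size, 4096, NAND_PAGE),
          nand_pages_written(size, 2048, NAND_PAGE))
# 2048  -> 2 pages with 4096-byte blocks, 1 page with 2048-byte blocks
# 6144  -> 4 vs 3
# 10240 -> 6 vs 5
```

Under these assumptions each such file costs exactly one extra page with the larger cluster size.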

I have found your comment to be thought provoking and worth constant consideration. I am interested in hearing your viewpoint regarding my conclusions about cluster size in relation to SSD alignment ‘protocols’ and file system arrangement.


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
