ZFS: {zpool,dataset} tuning
#freebsd #zfs #finetuning
There are plenty of tuning knobs in zfs: some at the zpool level, some being dataset specific. Let’s start with the basics!
Definitions
Most of us came from traditional storage systems, and had to wrap our minds around new concepts introduced by zfs. Let’s break it down:
zpool (RAID array + volume manager)
A zpool combines multiple physical disks into a single storage pool, handling redundancy, caching, and data integrity at the block level. It’s comparable to a:
- RAID array, but more flexible and self-healing;
- volume manager, dynamically allocating storage without requiring fixed partitions;
- storage backend, on top of which ZFS datasets (filesystems) are created;
Instead of manually partitioning disks or setting up traditional RAID, zfs automatically distributes data across the zpool.
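For instance, pooling two disks into a mirrored vdev is a one-liner (a minimal sketch; the zstorage name and the da0/da1 devices are just an example):
zpool create zstorage mirror da0 da1
zpool status zstorage    # shows the resulting layout and health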
dataset (filesystem on steroids)
A dataset is like a traditional filesystem (such as ext4, ufs or xfs), but with ZFS-specific benefits:
- Independent settings for most zfs tunables (quota, encryption, nfs, etc.);
- Snapshots & cloning, enabling instant backups and rollbacks;
- No fixed partitions: datasets grow and shrink dynamically within the zpool;
Datasets can be nested within each other to inherit properties from their parents, and creating datasets for each specific use quickly becomes second nature!
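As an illustration (dataset names are hypothetical), a property set on a parent is inherited by its children unless overridden:
zfs create -o compression=lz4 zstorage/home
zfs create zstorage/home/alice    # inherits compression=lz4 from zstorage/home
zfs get -o name,property,value,source compression zstorage/home/alice
The SOURCE column should read “inherited from zstorage/home”.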
Tuning zpools
Lying at the foundation of our ZFS storage, zpool options often need to be set upon creation. There are quite a few of them, as one can see by running zpool get all! Let’s look at those that truly matter in most cases.
TRIM
When building a zpool out of SSD drives (like zroot, hosting the base system, in most cases), one may want to:
zpool set autotrim="on" zroot
This enables automatic TRIM operations, which:
- helps optimize SSD performance by clearing unused blocks, improving garbage collection, and reducing write amplification;
- enhances SSD lifespan and ensures better wear leveling;
TRIM is particularly beneficial for SSD-backed zpools with frequent write/delete operations. Potential downsides may include increased I/O overhead and compatibility issues with older models, but the benefits are widely considered to outweigh the risks for modern SSDs.
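For zpools where continuous trimming is not desired, a manual pass can be launched and monitored instead, for example:
zpool trim zroot         # start a manual TRIM of all vdevs in the pool
zpool status -t zroot    # -t adds per-vdev TRIM status and progress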
Bit shift (ashift)
ashift is an important zpool property that can only be set upon zpool creation, and cannot be changed later. It determines block size alignment for the zpool, which directly affects performance and efficiency, since misalignment can lead to increased write amplification.
Checking drives
Most SSDs and hard drives use 4KB physical sectors these days, so we want to align ZFS blocks to these 4KB boundaries, provided our disks are indeed 4KB! Let’s make sure by querying their SMART registry, which provides detailed health & status information for disks, including temperature, read/write error rates, and… sector size!
Using smartmontools (assuming /dev/da[0-3] in this scenario):
for disk in $(seq 0 3)
do
    echo -n "da${disk}: "                          # label matching the device name
    smartctl -a /dev/da"${disk}" | grep -i size    # report logical/physical sector sizes
done
The output confirms the disks do have 512-byte logical sectors and 4KB physical sectors:
da0: Sector Sizes: 512 bytes logical, 4096 bytes physical
da1: Sector Sizes: 512 bytes logical, 4096 bytes physical
da2: Sector Sizes: 512 bytes logical, 4096 bytes physical
da3: Sector Sizes: 512 bytes logical, 4096 bytes physical
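As a cross-check on FreeBSD, geom reports the same information, the 4KB physical sector usually surfacing as the stripe size (a quick sketch for da0):
geom disk list da0 | grep -E 'Sectorsize|Stripesize'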
Disk scenarios
Here are recommended values for most scenarios:
Scenario | Physical sectors | Setting | Reason |
---|---|---|---|
Modern disks | 4KB | ashift=12 | because 2^12 = 4096 |
Some SSDs | 8KB | ashift=13 | because 2^13 = 8192 |
Old HDDs | 512-byte | ashift=9 | because 2^9 = 512 |
When in doubt, the higher the better, according to the ZFS Tuning and Optimization guide from High Availability: « Setting the ZFS block size too low can result in a significant performance degradation referred to as read/write amplification. For example, if the ZFS block size were set to 512 bytes, but the underlying device sector size is 4KiB, then writing 512-byte blocks means having to write the first sector, then read back the 4KiB sector, modify it with the next 512-byte block, write it out again to a new 4KiB sector, and so on. Aligning the ZFS block size with the device sector size avoids this read/write penalty. »
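In practice, this boils down to passing ashift explicitly at creation time rather than trusting autodetection; a minimal sketch (pool name, raidz2 layout and da0-da3 devices are illustrative):
zpool create -o ashift=12 zstorage raidz2 da0 da1 da2 da3    # 2^12 = 4096-byte alignment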
Checking zpools
For existing zpools, let’s check what values were used with zpool list -v -o ashift. By default, FreeBSD uses ashift=0 within the installer, meaning autodetect. To verify what the autodetection has set, let’s use zdb:
zdb -C | grep -E '(name|ashift)'
And… relief (the autodetected values are good! Otherwise, we’d need to rebuild the zpool from scratch with zpool create -o ashift=12):
name: 'zroot'
hostname: 'testserver'
ashift: 12
name: 'zstorage'
hostname: 'testserver'
ashift: 12
Note: if/when using TrueNAS Core, we need to invoke zdb with an alternative path for zpool.cache: zdb -U /data/zfs/zpool.cache
Description
One can add a description to a zpool:
# zpool set comment="UNIX operating system" zroot
# zpool set comment="all your warez are belong to us" zstorage
# zpool get comment
NAME PROPERTY VALUE SOURCE
zstorage comment all your warez are belong to us local
zroot comment UNIX operating system local
Why would we wanna do this? Just for the lolz (it may well be helpful for documentation, identification, or troubleshooting in larger environments, but come on, let’s do it because we can!).
Tuning datasets
The more interesting, workload-specific tunables live at the dataset level!
Access time (atime)
By default, access time is updated every time a file is accessed, which means even a pure read (say, during a backup) results in a write to disk. This adds unnecessary work, impacting performance.
zfs set atime="off" zstorage
zfs set atime="off" zroot
Note: on Linux, we can set relatime=on.
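Verifying the change is straightforward; the commands below are a sketch (relatime being the middle ground mentioned in the note above):
zfs get atime zroot zstorage    # should now report "off" for both pools
zfs set relatime=on zroot       # Linux alternative: only update atime when it is stale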
Record size (recordsize=n)
recordsize defines the maximum size of a logical block in a dataset, and can be set independently for each dataset to a value ranging from 512 bytes to 1 megabyte (see the aforementioned doc).
Programster’s Blog’s article “ZFS Record Size” explains why recordsize=1M may have advantages: « Having larger blocks means there are fewer blocks that need to be written in order to write a large piece of data, such as a movie. Fewer blocks means less metadata to write/manage. Many of ZFS’s features, like caching, compression, checksums, and de-duplication, work on a block level, so larger block sizes are likely to reduce their overheads. For example, when copying a 700 MiB film using a 1 MiB block size, ZFS will only need to check 700 checksums for changes. With a 16 KiB block size, it would have to check 44,800 checksums for differences. »
Jumbo sizes
However, recordsize can go up to… 16 megabytes!
22:18 <veg> OMG, recordsize can be set up to 16M now?
22:39 <PMT> It always could
22:39 <PMT> it just had a tunable stopping it from going above 1
22:39 <PMT> rather, always meaning "since large_blocks went in"
22:39 <PMT> because they were worried it might break strangely
22:39 <PMT> but places have been using it and haven't made it go boom so
22:39 <PMT> the tunable was changed to default to "the biggest number"
As hinted at on #openzfs, 16M record sizes require the zpool to have feature@large_blocks, and vfs.zfs.max_recordsize set to 16777216 (16M) instead of the default 1048576 (1M):
# zpool get feature@large_blocks zstorage
NAME PROPERTY VALUE SOURCE
zstorage feature@large_blocks active local
# sysctl vfs.zfs.max_recordsize="16777216"
vfs.zfs.max_recordsize: 1048576 -> 16777216
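To keep that setting across reboots on FreeBSD, and to actually put jumbo records to use on a dataset, something along these lines should do (zstorage/media is a hypothetical dataset):
echo 'vfs.zfs.max_recordsize=16777216' >> /etc/sysctl.conf    # persist the sysctl across reboots
zfs set recordsize=16M zstorage/media                         # only newly written files get 16M records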
Using a 16 MB block size introduces several challenges & potential downsides though, including increased latency due to the processing of entire blocks for compression and checksums, inefficient memory use as blocks are cached entirely in RAM, and a higher risk of data loss if a block is damaged, not to mention that network bottlenecks often overshadow storage performance gains.
That said, interesting results have occasionally been reported with record sizes up to 4M for sequential reads of large media files.
Use cases
In summary, one would be advised to go for (see the sketch after this list):
- recordsize=128k by default;
- recordsize=1M up to recordsize=4M for media (video/audio);
- recordsize=16k for fragmented files (torrents), before moving the files to a much higher recordsize=1M dataset for final storage, the mv command acting as a defragmentation tool;
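Translated into commands, a minimal sketch (dataset names and mountpoints are illustrative):
zfs create -o recordsize=1M zstorage/media         # large sequential media files
zfs create -o recordsize=16k zstorage/torrents     # heavily fragmented, random writes
zfs create -o recordsize=1M zstorage/library       # final destination for completed downloads
mv /storage/torrents/movie.mkv /storage/library/   # the cross-dataset copy rewrites the file into 1M records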
Note: the recordsize property may be updated at any time, but it only affects newly written files; files already copied onto the dataset before the change keep their original record size. So, for the change to take effect on existing data, one must create a new dataset and move the files there. No zfs send | zfs recv either!
Testing storage efficiency
Let’s use recordsize=1M on a new dataset, transfer files via rsync -avP from a recordsize=128k archive dataset, and measure disk space consumption for the same data:
# du -sh archive/audio audio
222G archive/audio # recordsize=128k
207G audio # recordsize=1M
15G of storage space has been saved, which translates to a 7.2% space gain in this test involving a mix of FLAC & MP3 audio files for the most part.
For a better understanding, read:
- Tuning Recordsize in OpenZFS, by Klara;
- Workload Tuning: Basic Concepts (Dataset recordsize, Larger record sizes) & Workload Tuning: BitTorrent;
Compression
Although it is disabled by default, compression proves to be useful in most scenarios, with zfs being smart enough to skip data that is already compressed and wouldn’t benefit from it.
zfs set compression="lz4" dataset
Recommended: lz4
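To gauge how much is actually being saved, the compressratio property keeps a running tally per dataset, e.g.:
zfs get compression,compressratio zstorage    # achieved compression ratio for the dataset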
Read OpenZFS: Understanding Transparent Compression for more details.
References
Further reading: