ZFS: {zpool,dataset} tuning
#freebsd #zfs #finetuning
There are plenty of tuning knobs in zfs: some at the zpool level, some being dataset specific. Let’s start with the basics!
Definitions
Most of us came from traditional storage systems, and had to wrap our minds around new concepts introduced by zfs. Let’s break it down:
zpool (RAID array + volume manager)
A zpool combines multiple physical disks into a single storage pool, handling redundancy, caching, and data integrity at the block level. It’s comparable to a:
- RAID array, but more flexible and self-healing;
- volume manager, dynamically allocating storage without requiring fixed partitions;
- storage backend, on top of which ZFS datasets (filesystems) are created;
Instead of manually partitioning disks or setting up traditional RAID, zfs automatically distributes data across the zpool.
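For instance, pooling two disks into a mirrored vdev is a one-liner (a minimal sketch; the zstorage name and the da0/da1 devices are just an example):
zpool create zstorage mirror da0 da1
zpool status zstorage    # shows the resulting layout and health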
dataset (filesystem on steroids)
A dataset is like a traditional filesystem (such as ext4, ufs or xfs), but with ZFS-specific benefits:
- Independent settings for most zfs tunables (quota, encryption, nfs, etc.);
- Snapshots & cloning, enabling instant backups and rollbacks;
- No fixed partitions: datasets grow and shrink dynamically within the zpool;
Datasets can be nested within each other to inherit properties from their parents, and creating datasets for each specific use quickly becomes second nature!
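As an illustration (dataset names are hypothetical), a property set on a parent is inherited by its children unless overridden:
zfs create -o compression=lz4 zstorage/home
zfs create zstorage/home/alice    # inherits compression=lz4 from zstorage/home
zfs get -o name,property,value,source compression zstorage/home/alice
The SOURCE column should read “inherited from zstorage/home”.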
Tuning zpools
Lying at the foundation of our ZFS storage, zpool options often need to be set upon creation. There are quite a few of them, as one can see by running zpool get all! Let’s look at those that truly matter in most cases.
TRIM
When building a zpool out of SSD drives (like zroot, hosting the base system, in most cases), one may want to:
zpool set autotrim="on" zroot
This enables automatic TRIM operations, which:
- helps optimize SSD performance by clearing unused blocks, improving garbage collection, and reducing write amplification;
- enhances SSD lifespan and ensures better wear leveling;
TRIM is particularly beneficial for SSD-backed zpools with frequent write/delete operations. Potential downsides may include increased I/O overhead and compatibility issues with older models, but the benefits are widely considered to outweigh the risks for modern SSDs.
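For zpools where continuous trimming is not desired, a manual pass can be launched and monitored instead, for example:
zpool trim zroot         # start a manual TRIM of all vdevs in the pool
zpool status -t zroot    # -t adds per-vdev TRIM status and progress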
Bit shift (ashift)
ashift is an important zpool property that can only be set upon zpool creation, and cannot be changed later. It determines block size alignment for the zpool, which directly affects performance and efficiency, since misalignment can lead to increased write amplification.
Checking drives
Most SSDs and hard drives use 4KB physical sectors these days, so we want to align ZFS blocks to these 4KB boundaries, provided our disks are indeed 4KB! Let’s make sure by querying their SMART registry, which provides detailed health & status information for disks, including temperature, read/write error rates, and… sector size!
Using smartmontools (assuming /dev/da[0-3] in this scenario):
for disk in $(seq 0 3)
do
    echo -n "da${disk}: "                          # label matching the device name
    smartctl -a /dev/da"${disk}" | grep -i size    # report logical/physical sector sizes
done
The output confirms the disks do have 512-byte logical sectors and 4KB physical sectors:
da0: Sector Sizes: 512 bytes logical, 4096 bytes physical
da1: Sector Sizes: 512 bytes logical, 4096 bytes physical
da2: Sector Sizes: 512 bytes logical, 4096 bytes physical
da3: Sector Sizes: 512 bytes logical, 4096 bytes physical
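As a cross-check on FreeBSD, geom reports the same information, the 4KB physical sector usually surfacing as the stripe size (a quick sketch for da0):
geom disk list da0 | grep -E 'Sectorsize|Stripesize'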
Disk scenarios
Here are recommended values for most scenarios:
Scenario | Physical sectors | Setting | Reason |
---|---|---|---|
Modern disks | 4KB | ashift=12 | because 2^12 = 4096 |
Some SSDs | 8KB | ashift=13 | because 2^13 = 8192 |
Old HDDs | 512-byte | ashift=9 | because 2^9 = 512 |
When in doubt, the higher the better, according to the ZFS Tuning and Optimization guide from High Availability: « Setting the ZFS block size too low can result in a significant performance degradation referred to as read/write amplification. For example, if the ZFS block size were set to 512 bytes, but the underlying device sector size is 4KiB, then writing 512-byte blocks means having to write the first sector, then read back the 4KiB sector, modify it with the next 512-byte block, write it out again to a new 4KiB sector, and so on. Aligning the ZFS block size with the device sector size avoids this read/write penalty. »
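In practice, this boils down to passing ashift explicitly at creation time rather than trusting autodetection; a minimal sketch (pool name, raidz2 layout and da0-da3 devices are illustrative):
zpool create -o ashift=12 zstorage raidz2 da0 da1 da2 da3    # 2^12 = 4096-byte alignment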
Checking zpools
For existing zpools, let’s check what values were used with zpool list -v -o ashift. By default, FreeBSD uses ashift=0 within the installer, meaning autodetect. To verify what the autodetection has set, let’s use zdb:
zdb -C | grep -E '(name|ashift)'
And… relief (the autodetected values are good! Otherwise, we’d need to rebuild the zpool from scratch with zpool create -o ashift=12):
name: 'zroot'
hostname: 'testserver'
ashift: 12
name: 'zstorage'
hostname: 'testserver'
ashift: 12
Note: if/when using TrueNAS Core, we need to invoke zdb with an alternative path for zpool.cache: zdb -U /data/zfs/zpool.cache
Description
One can add a description to a zpool:
# zpool set comment="UNIX operating system" zroot
# zpool set comment="all your warez are belong to us" zstorage
# zpool get comment
NAME PROPERTY VALUE SOURCE
zstorage comment all your warez are belong to us local
zroot comment UNIX operating system local
Why would we wanna do this? Just for the lolz (it may well be helpful for documentation, identification, or troubleshooting in larger environments, but come on, let’s do it because we can!).
Tuning datasets
The more interesting, workload-specific tunables live at the dataset level!
Access time (atime)
By default, access time is updated every time a file is accessed, which means even a pure read (say, during a backup) results in a write to disk. This adds unnecessary work, impacting performance.
zfs set atime="off" zstorage
zfs set atime="off" zroot
Note: on Linux, we can set relatime=on.
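Verifying the change is straightforward; the commands below are a sketch (relatime being the middle ground mentioned in the note above):
zfs get atime zroot zstorage    # should now report "off" for both pools
zfs set relatime=on zroot       # Linux alternative: only update atime when it is stale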
Record size (recordsize=n)
recordsize defines the maximum size of a logical block in a dataset, and can be set independently for each dataset to a value ranging from 512 bytes to 1 megabyte (see the aforementioned doc).
Programster’s Blog’s article “ZFS Record Size” explains why recordsize=1M may have advantages: « Having larger blocks means there are fewer blocks that need to be written in order to write a large piece of data, such as a movie. Fewer blocks means less metadata to write/manage. Many of ZFS’s features, like caching, compression, checksums, and de-duplication, work on a block level, so larger block sizes are likely to reduce their overheads. For example, when copying a 700 MiB film using a 1 MiB block size, ZFS will only need to check 700 checksums for changes. With a 16 KiB block size, it would have to check 44,800 checksums for differences. »
Jumbo sizes
However, recordsize can go up to… 16 megabytes!
22:18 <veg> OMG, recordsize can be set up to 16M now?
22:39 <PMT> It always could
22:39 <PMT> it just had a tunable stopping it from going above 1
22:39 <PMT> rather, always meaning "since large_blocks went in"
22:39 <PMT> because they were worried it might break strangely
22:39 <PMT> but places have been using it and haven't made it go boom so
22:39 <PMT> the tunable was changed to default to "the biggest number"
As hinted at on #openzfs, 16M record sizes require the zpool to have feature@large_blocks, and vfs.zfs.max_recordsize set to 16777216 (16M) instead of the default 1048576 (1M):
# zpool get feature@large_blocks zstorage
NAME PROPERTY VALUE SOURCE
zstorage feature@large_blocks active local
# sysctl vfs.zfs.max_recordsize="16777216"
vfs.zfs.max_recordsize: 1048576 -> 16777216
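To keep that setting across reboots on FreeBSD, and to actually put jumbo records to use on a dataset, something along these lines should do (zstorage/media is a hypothetical dataset):
echo 'vfs.zfs.max_recordsize=16777216' >> /etc/sysctl.conf    # persist the sysctl across reboots
zfs set recordsize=16M zstorage/media                         # only newly written files get 16M records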
Using a 16 MB block size introduces several challenges & potential downsides though, including increased latency due to the processing of entire blocks for compression and checksums, inefficient memory use as blocks are cached entirely in RAM, and a higher risk of data loss if a block is damaged, not to mention that network bottlenecks often overshadow storage performance gains.
That said, interesting results have occasionally been reported with record sizes up to 4M for sequential reads of large media files.
Use cases
In summary, one would be advised to go for (see the sketch after this list):
- recordsize=128k by default;
- recordsize=1M up to recordsize=4M for media (video/audio);
- recordsize=16k for fragmented files (torrents), before moving the files to a much higher recordsize=1M dataset for final storage, the mv command acting as a defragmentation tool;
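Translated into commands, a minimal sketch (dataset names and mountpoints are illustrative):
zfs create -o recordsize=1M zstorage/media         # large sequential media files
zfs create -o recordsize=16k zstorage/torrents     # heavily fragmented, random writes
zfs create -o recordsize=1M zstorage/library       # final destination for completed downloads
mv /storage/torrents/movie.mkv /storage/library/   # the cross-dataset copy rewrites the file into 1M records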
Note: the recordsize property may be updated at any time, but it only affects newly written files; files already copied onto the dataset before the change keep their original record size. So, for the change to take effect on existing data, one must create a new dataset and move the files there. No zfs send | zfs recv either!
Testing storage efficiency
Let’s use recordsize=1M on a new dataset, transfer files via rsync -avP from a recordsize=128k archive dataset, and measure disk space consumption for the same data:
# du -sh archive/audio audio
222G archive/audio # recordsize=128k
207G audio # recordsize=1M
15G of storage space has been saved, which translates to a 7.2% space gain in this test involving a mix of FLAC & MP3 audio files for the most part.
For a better understanding, read:
- Tuning Recordsize in OpenZFS, by Klara;
- Workload Tuning: Basic Concepts (Dataset recordsize, Larger record sizes) & Workload Tuning: BitTorrent;
Compression
Although it is disabled by default, compression proves to be useful in most scenarios, with zfs being smart enough to skip data that is already compressed and wouldn’t benefit from it.
zfs set compression="lz4" dataset
Recommended: lz4
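To gauge how much is actually being saved, the compressratio property keeps a running tally per dataset, e.g.:
zfs get compression,compressratio zstorage    # achieved compression ratio for the dataset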
Read OpenZFS: Understanding Transparent Compression for more details.
References
Further reading: