ZFS: pool layout
#freebsd #zfs #design
Which zfs pool layout makes the most sense? Let’s review several options!
Goals
Using ZFS for NAS appliances storing large media files is a very common scenario. In this specific context, we may want to get:
- great storage capacity;
- good resilience;
- decent performance;
…in that order!
We’ll try & figure out proper storage designs for 8 large hard disks in this specific context.
Concepts
You may well want to read choosing the right ZFS pool layout in addition to getting familiar with ZFS concepts.
This article focuses on some of the RAID-like properties of zfs, and does not dive into its specific data corruption prevention & data integrity mechanisms, which greatly complement the general mitigations provided by traditional RAID models. Note this article is not an introduction to RAID technology: familiarity with RAID vocabulary & core concepts is assumed for it to make sense.
Pool types
Here’s a quick terminology roundup:
- mirror layouts include functionality typically found in RAID1;
- stripe layouts implement what people would generally call RAID0;
- raidz layouts provide features usually referred to as RAID5 (with raidz1) and RAID6 (with raidz2), while expanding functionality beyond traditional RAID limitations;
- dRAID offers a very different route, with paradigm-shifting storage concepts; however:
  - it doesn’t make much sense to deploy it without hot spare devices & a large number of physical disks;
  - the code is still pretty young (it was added to OpenZFS in 2021);
  - dRAID deployments are quite uncommon at the moment, so there are fewer eyes, real-world scenarios & pooled experience to draw from, though this may change;
  - it adds quite a few layers of complexity to the already pretty complex software stack that is ZFS;
  - as a result, we shall stick to more conservative layouts for the purpose of this article, but dRAID definitely deserved to be mentioned!
It’s about VDEVs
ZFS is made of:
- datasets (similar to filesystems and/or mountpoints) that rely on…
- zpools (similar to RAID arrays) that are built from…
- VDEVs (one or more physical disks).
Rule of thumb: the more VDEVs the better for IOPS performance, since ZFS addresses VDEVs in a parallel fashion. More on that in understanding ZFS vdev Types if you wanna deepen your understanding of that important aspect!
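To make that hierarchy concrete, here is a minimal sketch; the pool name, dataset name & device names are made up for the example and will differ on your system:

```sh
# one zpool built from a single mirror VDEV of two disks
zpool create tank mirror da0 da1

# one dataset (filesystem) living on that pool
zfs create tank/media

# inspect the resulting layout: pool -> VDEV(s) -> disks
zpool status tank
zfs list tank/media
```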
Layouts
Back to our 8 hard disks scenario, with a media storage use in mind, at least 3 options can be considered:
#1 — 4 x mirror VDEV of 2 disks
This is quite similar to a traditional RAID10:
- amounts to 4 VDEVs;
- has 2 disks per VDEV, holding identical copies of the data (mirroring rather than parity);
- offers the best input/output performance, due to the high number of VDEVs;
- comes with reduced storage: usable space is only half the total size of the disks;
- ain’t the safest: the pool can survive the loss of 4 disks, as long as no two of them are in the same VDEV, but it is very vulnerable to the loss of 2 disks within the same VDEV (in which case the whole pool may be unrecoverable);
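For illustration, creating such a pool could look like this (pool & device names are examples only):

```sh
# 4 mirror VDEVs of 2 disks each (RAID10-like)
# with 8 x 22 TB disks: roughly 4 x 22 = 88 TB of raw usable space
zpool create tank \
  mirror da0 da1 \
  mirror da2 da3 \
  mirror da4 da5 \
  mirror da6 da7
```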
#2 — 2 x RAIDz1 VDEV of 4 disks
Not a typical RAID setup, this can be thought of as a 2 x RAID5 kind of design:
- amounts to 2 VDEVs;
- has 3 data disks & 1 parity disk per VDEV;
- allows for more parallelism thanks to its 2 VDEVs, hence better IOPS & fast reads/writes;
- is only partially safer: the pool can survive the loss of 2 disks, as long as they are not in the same VDEV, but it is very vulnerable to the loss of 2 disks within the same VDEV (in which case the entire pool is lost);
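Again for illustration, with the same example names:

```sh
# 2 raidz1 VDEVs of 4 disks each (2 x RAID5-like)
# with 8 x 22 TB disks: roughly 2 x 3 x 22 = 132 TB of raw usable space
zpool create tank \
  raidz1 da0 da1 da2 da3 \
  raidz1 da4 da5 da6 da7
```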
#3 — 1 x RAIDz2 VDEV of 8 disks
This can be thought of as a traditional RAID6, and:
- amounts to a single VDEV;
- has 6 data disks & 2 parity disks;
- is the safest of the three: the pool can survive the loss of any 2 disks;
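And the corresponding sketch, still with example names:

```sh
# a single raidz2 VDEV of 8 disks (RAID6-like)
# with 8 x 22 TB disks: roughly 6 x 22 = 132 TB of raw usable space
zpool create tank raidz2 da0 da1 da2 da3 da4 da5 da6 da7
```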
Conclusions
Option 1 (4 x striped mirrors) does not seem to fit our media-oriented NAS use case very well: it is the fastest, but the weakest on storage space & on surviving failure combinations. It’s worth mentioning for other high-performance scenarios, and can be leveraged very efficiently with SSD or NVMe drives to drive a number of appliances.
Option 2 (2 x RAIDz1) improves upon option 1, and may initially seem to be the best of both worlds (high storage efficiency, and faster than RAIDz2), but its specific vulnerability increases the risk of catastrophic failure twofold: in case one disk fails, a second disk failure can be tolerated if and only if it is in the other VDEV.
Our setup of 8 large disks implies pretty high resilvering times (resilvering being the ZFS term for rebuilding), and if we account for delays such as:
- detection of the problem;
- removal of the drive;
- paperwork for RMA if applicable;
- ordering of a replacement drive;
- drive delivery time;
- drive insertion into the pool;
- and finally… resilvering;
…that leaves a lot of room for a second drive to fail. In which case, do we really want to be left hoping that the second failure happens in the right place, i.e. on 1 of the 4 disks of the other VDEV?
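For reference, the final step of that list, replacing the failed disk and resilvering, looks roughly like this on the command line (pool & device names are hypothetical):

```sh
# replace the failed disk with the newly inserted one and start resilvering
zpool replace tank da3 da8

# watch resilvering progress
zpool status tank
```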
Option 3 (1 x RAIDz2) tolerates the failure of any other drive in that scenario, i.e. any 1 of the 7 remaining disks. Given our initial goals, we have to be mindful of the fact that NAS usage does not require crazy write performance; storage capacity and resilience, however, are key. So, it seems like the single RAIDz2 approach fits our use case best!
Let’s hope this helps you guys & gals make a decision, when/if confronted with a similar scenario in the near future!
Bonus
More reading
Rest assured that pool layout is a daunting question for many:
- Which would you do? 5 way Raidz2 vs draid1 4:1
- Analysis Paralysis, 2 x Raidz1 vdevs vs. 1 x Raidz2 vdev
- 2x RAIDZ1 or 1x RAIDZ2 ?
Questions & answers
Feedback from the #freebsd IRC channel on Libera:
18:11 veg| for an 8 x 22tb disks NAS storing media, would you go with a zpool
of 2 x RAIDz1 VDEVs (containing 4 disks each) or 1 x RAIDz2 VDEV
of 8 disks?
18:11 veg| performance wise, the more VDEVs the better, sure, but wouldn't
using 2 RAIDz1 VDEVs reduce redundancy by only allowing a single
drive failure per VDEV?
18:55 rwp| Both of your proposals allow for two failing devices but
the first one only if the second failure is from the other
VDEV while the second allows any two.
18:56 rwp| Both are good configurations. It depends upon your own
judgement as to which is better for you.
18:56 rwp| If you expect to be able to react to the first device fail by
replacing the redundancy of the failure before a second one in
the set of four fails then it is fine.
18:57 rwp| That would be typical in a corporate enterprise datacenter
situation with spares handy and someone on call to make the
replacement every day.
18:58 rwp| The raidz2 configuration allows any two devices to fail.
Allows a little more slack in your schedule if one fails and
you don't get to it before a second one fails then things are
still okay if you then replace the failed devices and restore
redundancy.
18:58 rwp| At the cost of somewhat less performance than the striped VDEV
configuration of the first proposal.
18:59 rwp| In my home setup I decided I might be away on vacation for a
couple of weeks and I did not need that performance. YMMV.
19:00 rwp| Also remember that if one has an identical collection of
storage devices and they are all running identical hours in
identical environments then systematic type failures are more
likely to cluster together.
19:01 rwp| Twice in my career I have had two sibling spinning disks die
within a few days of each other. One I caught okay.
19:01 rwp| The other failed before the client decided to replace the
redundancy and I had to do a full restore from backup.
19:26 rwp| For the one that I caught I copied the data to a different NAS
and switched over to it. But left the original running since
it was remote.
19:26 rwp| And then two days later the second drive in the mirror failed
causing the loss of the entire array. But the data had
already been moved. So all okay.
19:27 rwp| For the one that needed full recovery from backup the entire
tale was just snafu because that client needed to be convinced
to do something about it causing time to drag on. Could have
saved it. But not after the second drive failed too.
19:28 rwp| Experiences like those are why I am a huge fan of raidz2/raid6
which gives a little more safety. Sometimes just enough more. :-)
19:31 meena| I wonder how drives from different vendors that have same
specs… fail
19:32 rwp| Failures modes from different vendors should be completely
decoupled. Which is what you want.
19:34 rwp| For my own stuff I often split lots of drives. I mean I have
two drives that are identical and been running mirrored for a
year.
19:34 rwp| I buy two more identical drives. I then split those lots up
so that each system has one new drive and one experienced
drive.
19:34 rwp| Hoping that being a year apart that any failures will not be
coupled failures.
19:38 rwp| Corporations though often have SLAs with service vendors and
those vendors will often require arrays to have identical
drives and identical firmware. Because they don't want the
client to be complaining about weird issues that might be due
to drive firmware or behavior.
19:38 rwp| But they might have a 4-hour service agreement to replace any
failures very quickly. Which makes up for the potential problems.
19:47 rwp| Since veg mentioned 22TB disks I will mention in passing that
I sure hope they are not SMR Shingled Magnetic Recording.
Because those are completely unsuitable for purpose in a RAID.
19:48 rwp| If I were _given_ that much SMR storage I would probably still
use the drives. But not in any raid. I would use them only
as singles.