ZFS: pool layout
#freebsd #zfs #design
Which zfs pool layout makes the most sense? Let’s review several options!
Goals
Using ZFS for NAS appliances storing large media files is a very common scenario. In this specific context, we may want to get:
- great storage capacity;
- good resilience;
- decent performance;
…in that order!
We’ll try & figure out proper storage designs for 8 large hard disks in this specific context.
Concepts
You may well want to read choosing the right ZFS pool layout in addition to getting familiar with ZFS concepts.
This article focuses on some of the RAID-like properties of zfs, and does not dive into its specific data corruption prevention & data integrity mechanisms, which greatly complement the general mitigations provided by traditional RAID models. Note this article is not an introduction to RAID technology: familiarity with RAID vocabulary & core concepts is assumed for it to make sense.
Pool types
Here’s a quick terminology roundup:
- mirror layouts include functionality typically found in RAID1;
- stripe layouts implement what people would generally call RAID0;
- raidz layouts provide features usually referred to as RAID5 (with raidz1) and RAID6 (with raidz2), while expanding functionality beyond traditional RAID limitations;
- dRAID offers a very different route, with paradigm-shifting storage concepts; however:
  - it doesn’t make much sense to deploy it without hot spare devices & a large number of physical disks;
  - the code is still pretty young (it was added to OpenZFS in 2021);
  - dRAID deployments are quite uncommon at the moment, so there are fewer eyes, real-world scenarios & pooled experience to draw from, though this may change;
  - it adds quite a few layers of complexity to the already pretty complex software stack that is ZFS;
  - as a result, we shall stick to more conservative layouts for the purpose of this article, but dRAID definitely deserved to be mentioned!
It’s about VDEVs
ZFS is made of:
- datasets (similar to filesystems and/or mountpoints) that rely on…
- zpools (similar to RAID arrays) that are built from…
- VDEVs (one or more physical disks).
Rule of thumb: the more VDEVs the better for IOPS performance, since ZFS addresses VDEVs in a parallel fashion. More on that in understanding ZFS vdev Types if you wanna deepen your understanding of that important aspect!
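To make that hierarchy concrete, here is a minimal sketch; the pool name, dataset name & device names are made up for the example and will differ on your system:

```sh
# one zpool built from a single mirror VDEV of two disks
zpool create tank mirror da0 da1

# one dataset (filesystem) living on that pool
zfs create tank/media

# inspect the resulting layout: pool -> VDEV(s) -> disks
zpool status tank
zfs list tank/media
```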
Layouts
Back to our 8 hard disks scenario, with a media storage use in mind, at least 3 options can be considered:
#1 — 4 x mirror VDEV of 2 disks
This is quite similar to a traditional RAID10:
- amounts to 4 VDEVs;
- has 2 disks per VDEV, holding identical copies of the data (mirroring rather than parity);
- offers the best input/output performance, due to the high number of VDEVs;
- comes with reduced storage: usable space is only half the total size of the disks;
- ain’t the safest: the pool can survive the loss of 4 disks, as long as no two of them are in the same VDEV, but it is very vulnerable to the loss of 2 disks within the same VDEV (in which case the whole pool may be unrecoverable);
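For illustration, creating such a pool could look like this (pool & device names are examples only):

```sh
# 4 mirror VDEVs of 2 disks each (RAID10-like)
# with 8 x 22 TB disks: roughly 4 x 22 = 88 TB of raw usable space
zpool create tank \
  mirror da0 da1 \
  mirror da2 da3 \
  mirror da4 da5 \
  mirror da6 da7
```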
#2 — 2 x RAIDz1 VDEV of 4 disks
Not a typical RAID setup, this can be thought of as a 2 x RAID5 kind of design:
- amounts to 2 VDEVs;
- has 3 data disks & 1 parity disk per VDEV;
- allows for more parallelism thanks to its 2 VDEVs, hence better IOPS & fast reads/writes;
- is only partially safer: the pool can survive the loss of 2 disks, as long as they are not in the same VDEV, but it is very vulnerable to the loss of 2 disks within the same VDEV (in which case the entire pool is lost);
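Again for illustration, with the same example names:

```sh
# 2 raidz1 VDEVs of 4 disks each (2 x RAID5-like)
# with 8 x 22 TB disks: roughly 2 x 3 x 22 = 132 TB of raw usable space
zpool create tank \
  raidz1 da0 da1 da2 da3 \
  raidz1 da4 da5 da6 da7
```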
#3 — 1 x RAIDz2 VDEV of 8 disks
This can be thought of as a traditional RAID6, and:
- amounts to a single VDEV;
- has 6 data disks & 2 parity disks;
- is the safest of the three: the pool can survive the loss of any 2 disks;
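And the corresponding sketch, still with example names:

```sh
# a single raidz2 VDEV of 8 disks (RAID6-like)
# with 8 x 22 TB disks: roughly 6 x 22 = 132 TB of raw usable space
zpool create tank raidz2 da0 da1 da2 da3 da4 da5 da6 da7
```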
Conclusions
Option 1 (4 x striped mirrors) does not seem to fit our media-oriented NAS use case very well: it is the fastest, but the weakest on storage space & on surviving failure combinations. It’s worth mentioning for other high-performance scenarios, and can be leveraged very efficiently with SSD or NVMe drives to drive a number of appliances.
Option 2 (2 x RAIDz1) improves upon option 1, and may initially seem to be the best of both worlds (high storage efficiency, and faster than RAIDz2), but its specific vulnerability increases the risk of catastrophic failure twofold: in case one disk fails, a second disk failure can be tolerated if and only if it is in the other VDEV.
Our setup of 8 large disks implies pretty high resilvering times (resilvering being the ZFS term for rebuilding), and if we account for delays such as:
- detection of the problem;
- removal of the drive;
- paperwork for RMA if applicable;
- ordering of a replacement drive;
- drive delivery time;
- drive insertion into the pool;
- and finally… resilvering;
…that leaves a lot of room for a second drive to fail. In which case, do we really want to be left hoping that the second failure happens in the right place, i.e. on 1 of the 4 disks of the other VDEV?
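For reference, the final step of that list, replacing the failed disk and resilvering, looks roughly like this on the command line (pool & device names are hypothetical):

```sh
# replace the failed disk with the newly inserted one and start resilvering
zpool replace tank da3 da8

# watch resilvering progress
zpool status tank
```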
Option 3 (1 x RAIDz2) tolerates the failure of any other drive in that scenario, i.e. any 1 of the 7 remaining disks. Given our initial goals, we have to be mindful of the fact that NAS usage does not require crazy write performance; storage capacity and resilience, however, are key. So, it seems like the single RAIDz2 approach fits our use case best!
Let’s hope this helps you guys & gals make a decision, when/if confronted with a similar scenario in the near future!
Bonus
More reading
Rest assured that pool layout is a daunting question for many:
- Which would you do? 5 way Raidz2 vs draid1 4:1
- Analysis Paralysis, 2 x Raidz1 vdevs vs. 1 x Raidz2 vdev
- 2x RAIDZ1 or 1x RAIDZ2 ?
Questions & answers
Feedback from the #freebsd IRC channel on Libera:
18:11 veg| for an 8 x 22tb disks NAS storing media, would you go with a zpool
of 2 x RAIDz1 VDEVs (containing 4 disks each) or 1 x RAIDz2 VDEV
of 8 disks?
18:11 veg| performance wise, the more VDEVs the better, sure, but wouldn't
using 2 RAIDz1 VDEVs reduce redundancy by only allowing a single
drive failure per VDEV?
18:55 rwp| Both of your proposals allow for two failing devices but
the first one only if the second failure is from the other
VDEV while the second allows any two.
18:56 rwp| Both are good configurations. It depends upon your own
judgement as to which is better for you.
18:56 rwp| If you expect to be able to react to the first device fail by
replacing the redundancy of the failure before a second one in
the set of four fails then it is fine.
18:57 rwp| That would be typical in a corporate enterprise datacenter
situation with spares handy and someone on call to make the
replacement every day.
18:58 rwp| The raidz2 configuration allows any two devices to fail.
Allows a little more slack in your schedule if one fails and
you don't get to it before a second one fails then things are
still okay if you then replace the failed devices and restore
redundancy.
18:58 rwp| At the cost of somewhat less performance than the striped VDEV
configuration of the first proposal.
18:59 rwp| In my home setup I decided I might be away on vacation for a
couple of weeks and I did not need that performance. YMMV.
19:00 rwp| Also remember that if one has an identical collection of
storage devices and they are all running identical hours in
identical environments then systematic type failures are more
likely to cluster together.
19:01 rwp| Twice in my career I have had two sibling spinning disks die
within a few days of each other. One I caught okay.
19:01 rwp| The other failed before the client decided to replace the
redundancy and I had to do a full restore from backup.
19:26 rwp| For the one that I caught I copied the data to a different NAS
and switched over to it. But left the original running since
it was remote.
19:26 rwp| And then two days later the second drive in the mirror failed
causing the loss of the entire array. But the data had
already been moved. So all okay.
19:27 rwp| For the one that needed full recovery from backup the entire
tale was just snafu because that client needed to be convinced
to do something about it causing time to drag on. Could have
saved it. But not after the second drive failed too.
19:28 rwp| Experiences like those are why I am a huge fan of raidz2/raid6
which gives a little more safety. Sometimes just enough more. :-)
19:31 meena| I wonder how drives from different vendors that have same
specs… fail
19:32 rwp| Failures modes from different vendors should be completely
decoupled. Which is what you want.
19:34 rwp| For my own stuff I often split lots of drives. I mean I have
two drives that are identical and been running mirrored for a
year.
19:34 rwp| I buy two more identical drives. I then split those lots up
so that each system has one new drive and one experienced
drive.
19:34 rwp| Hoping that being a year apart that any failures will not be
coupled failures.
19:38 rwp| Corporations though often have SLAs with service vendors and
those vendors will often require arrays to have identical
drives and identical firmware. Because they don't want the
client to be complaining about weird issues that might be due
to drive firmware or behavior.
19:38 rwp| But they might have a 4-hour service agreement to replace any
failures very quickly. Which makes up for the potential problems.
19:47 rwp| Since veg mentioned 22TB disks I will mention in passing that
I sure hope they are not SMR Shingled Magnetic Recording.
Because those are completely unsuitable for purpose in a RAID.
19:48 rwp| If I were _given_ that much SMR storage I would probably still
use the drives. But not in any raid. I would use them only
as singles.