The Z File System (ZFS) was created by Matthew Ahrens and Jeff Bonwick in 2001. ZFS was designed to be a next-generation file system for Sun Microsystems’ OpenSolaris. In 2008, ZFS was ported to FreeBSD, and the same year a project was started to port ZFS to Linux. However, since ZFS is licensed under the Common Development and Distribution License, which is incompatible with the GNU General Public License, it cannot be included in the Linux kernel itself. To get around this problem, most Linux distros offer methods to install ZFS separately.
zfs destroy has a dry-run option, and can be used to delete sequences of snapshots. So you can see before-hand how much space will be reclaimed by deleting a sequence of snapshots, like this:
root@box:~# zfs destroy -nv pool/dataset@snap4%snap8
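If the dry run's output looks right, running the same command without -n performs the actual deletion of the snapshot range (same hypothetical pool and snapshot names as above):

root@box:~# zfs destroy -v pool/dataset@snap4%snap8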
Quick and dirty cheat sheet for anyone getting ready to set up a new ZFS pool. Here are all the settings you’ll want to think about, and the values I think you’ll probably want to use.
I am not generally a fan of tuning things unless you need to, but unfortunately a lot of the ZFS defaults aren’t optimal for most workloads.
I tend to be pretty firm on how disks relate to vdevs, and vdevs relate to pools… but once you veer down deeper into the direct on-disk storage, I get a little hazier. So here’s an attempt to remedy that, with citations, for my benefit (and yours!) down the line. //
The zpool is the topmost unit of storage under ZFS. A zpool is a single, overarching storage system consisting of one or more vdevs. Writes are distributed among the vdevs according to how much FREE space each vdev has available – you may hear urban myths about ZFS distributing them according to the performance level of the disk, such that “faster disks end up with more writes”, but they’re just that – urban myths. //
Also note that the pool’s performance scales with the number of vdevs, not the number of disks within the vdevs. If you have a single 12 disk wide RAIDZ2 vdev in your pool, expect to see roughly the IOPS profile of a single disk, not of ten!
There is absolutely no parity or redundancy at the pool level. If you lose any vdev, you’ve lost the entire pool, plain and simple. Even if you “didn’t write to anything on that vdev yet” – the pool has altered and distributed its metadata accordingly once the vdev was added; if you lose that vdev “with nothing on it” you’ve still lost the pool.
It’s important to realize that the zpool is not a RAID0; in conventional terms, it’s a JBOD – and a fairly unusual one, at that. //
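As a concrete sketch (disk names here are placeholders of my own), a pool named tank built from two mirror vdevs looks like this; writes are distributed across both mirrors, and the pool's IOPS scale with the two vdevs rather than the four disks:

root@box:~# zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd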
Even if you’ve done your homework and are absolutely certain that your disks use 512B hardware sectors, I strongly advise considering setting ashift=12 or even ashift=13 – because, remember, it’s immutable per vdev, and vdevs cannot be removed from pools. If you ever need to replace a 512B sector disk in a vdev with a 4K or 8K sector disk, you’ll be screwed if that vdev is ashift=9.
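For example, forcing 4K sectors at vdev creation time (pool and device names are again placeholders):

root@box:~# zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb

The same -o ashift=12 applies when adding a new vdev to an existing pool with zpool add.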
It seems like a stupid question, if you’re not an IT professional – and maybe even if you are – how much storage does it take to store 1TB of data? Unfortunately, it’s not a stupid question in the vein of “what weighs more, a pound of feathers or a pound of bricks”, and the answer isn’t “one terabyte” either. I’m going to try to break down all the various things that make the answer harder – and unhappier – in easy steps. Not everybody will need all of these things, so I’ll try to lay it out in a reasonably likely order from “affects everybody” to “only affects mission-critical business data with real RTO and RPO defined”. //
TL;DR: If you have 280GiB of existing data, you need 1TB of local capacity. //
8:1 rule of thumb
Based on the same calculations and with a healthy dose of rounding, we come up with another really handy, useful, memorable rule of thumb: when buying, you need eight times as much raw storage in production as the amount of data you have now.
So if you’ve got 1TiB of data, buy servers with 8TB of disks – whether it’s two 4TB disks in a single mirror, or four 2TB disks in two mirrors, or whatever, your rule of thumb is 8:1. Per system, so if you maintain hotspare and DR systems, you’ll need to do that twice more – but it’s still 8:1 in raw storage per machine.
This comes up far too often, so rather than continuing to explain it over and over again, I’m going to try to do a really good job of it once and link to it here.
Learn to get the most out of your ZFS filesystem in our new series on storage fundamentals.
OpenZFS founding developer Matthew Ahrens opened a PR for one of the most sought-after features in ZFS history—RAIDz expansion—last week. The new feature allows a ZFS user to expand the size of a single RAIDz vdev. For example, you can use the new feature to turn a three-disk RAIDz1 into a four-, five-, or six-disk RAIDz1. OpenZFS is a complex filesystem, and things are necessarily going to get a bit chewy explaining how the feature works. So if you're a ZFS newbie, you may want to refer back to our comprehensive ZFS 101 introduction.
I always used to sweat, and sweat bullets, when it came time to replace a failed disk in ZFS. It happened infrequently enough that I never did remember the syntax quite right in between issues, and the last thing you want to do with production hardware and data is fumble around in a cold sweat – you want to know what the correct syntax for everything is ahead of time, and be practiced and confident.
Wonderfully, there’s no reason for that anymore – particularly on Linux, which boasts a really excellent set of tools for simulating storage hardware quickly and easily. So today, we’re going to walk through setting up a pool, destroying one of the disks in it, and recovering from the failure. The important part here isn’t really the syntax for the replacement, though… it’s learning how to set up the simulation in the first place, which allows you to test lots of things properly when your butt isn’t on the line!
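Here's a minimal sketch of that kind of simulation, using sparse files as stand-in disks (all paths and the pool name are placeholders of my own, and the corruption step is just one way to fake a failure):

root@box:~# truncate -s 1G /tmp/disk0.img /tmp/disk1.img /tmp/disk2.img
root@box:~# zpool create testpool mirror /tmp/disk0.img /tmp/disk1.img
root@box:~# dd if=/dev/urandom of=/tmp/disk1.img bs=1M count=64 seek=256 conv=notrunc
root@box:~# zpool scrub testpool
root@box:~# zpool status testpool
root@box:~# zpool replace testpool /tmp/disk1.img /tmp/disk2.img

The scrub surfaces checksum errors on the damaged backing file, and zpool replace resilvers onto the fresh file exactly as it would onto a new physical disk.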
Sanoid is a policy-driven snapshot management tool for ZFS filesystems. When combined with the Linux KVM hypervisor, you can use it to make your systems functionally immortal.
(Sanoid rollback demo, in real time: rolling back a full-scale cryptomalware infection in seconds!)
More prosaically, you can use Sanoid to create, automatically thin, and monitor snapshots and pool health from a single eminently human-readable TOML config file at /etc/sanoid/sanoid.conf. (Sanoid also requires a "defaults" file located at /etc/sanoid/sanoid.defaults.conf, which is not user-editable.) //
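As a rough sketch of what that config looks like, here's a hypothetical dataset paired with a retention template (the dataset name and retention counts are purely illustrative):

[data/images]
  use_template = production

[template_production]
  hourly = 36
  daily = 30
  monthly = 3
  yearly = 0
  autosnap = yes
  autoprune = yes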
Sanoid also includes a replication tool, syncoid, which facilitates the asynchronous incremental replication of ZFS filesystems.
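A minimal invocation (host and dataset names are placeholders) replicates a local dataset to a remote pool, sending incrementally on subsequent runs:

root@box:~# syncoid pool/dataset root@backuphost:backuppool/dataset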
We tested WD Red SMR v CMR drives to see if there was indeed a significant impact with the change. We found SMR can put data at risk 13-16x longer than CMR //
The performance results achieved by the WD Red WD40EFAX surprised me; my only personal experience with SMR drives prior to this point was with Seagate’s Archive line. Based on my time with those drives, I was expecting much poorer results. Instead, individually, the WD Red SMR drives are essentially functional. They work aggressively in the background to mitigate their own limitations, and the performance of the drive seemed to recover relatively quickly if given even brief periods of inactivity. For single-drive installations, the WD40EFAX will likely function without issue.
However, the WD40EFAX is not a consumer desktop-focused drive. Instead, it is a WD Red drive with NAS branding all over it. When that NAS readiness was put to the test the drive performed spectacularly badly. The RAIDZ results were so poor that, in my mind, they overshadow the otherwise decent performance of the drive. //
The WD40EFAX is demonstrably a worse drive than the CMR-based WD40EFRX, and assuming you have a choice in your purchase, the CMR drive is the superior product. Given the significant performance and capability differential between the CMR WD Red and the SMR model, they should be separate brands or product lines rather than just different model numbers. In online product catalogs, keeping the same branding means the SMR drive shows up as a “newer model” at many retailers, and many buyers will simply purchase the newer model expecting it to be better, as previous generations have been. That is not a recipe for success.
We discuss other WD Red DM-SMR performance experiences, share that WD-HGST knew using DM-SMR was not good for ZFS, and how this could have passed testing
News emerged earlier this week that Western Digital was producing NAS hard drives using SMR technology -- which results in slower performance in some types of applications -- without disclosing that fact to customers in marketing materials or specification sheets. After a bit of continued prodding, storage industry sage Chris Mellor secured statements from both Seagate and Toshiba that confirmed that those companies, too, are selling drives using the slow SMR technology without informing their customers. The latter two even use the tech in hard drives destined for desktop PCs. //
It's important to understand that there are different methods of recording data to a hard drive, and of the productized methods, shingled magnetic recording (SMR) is by far the slowest. //
As such, these drives are mainly intended for write-once-read-many (WORM) applications, like archival and cold data storage, and certainly not as boot drives for mainstream PC users. //
the industry developed SMR to boost hard drive capacity within the same footprint. The tactic revolves around writing data tracks over one another in a 'shingled' arrangement. //
For WD, that consisted of working the SMR models into its WD Red line of drives, but only the lower-capacity 2TB to 6TB models. Slower SMR drives do make some measure of sense in this type of application, provided the NAS is used for bulk data storage. Still, compatibility issues have cropped up in RAID and ZFS applications that users have attributed to the unique performance characteristics of the drives.
Toshiba tells Blocks & Files that it is also selling SMR drives without listing them on spec sheets, but does so within its P300 series of desktop drives. Seagate also disclosed that it uses the tech in four models, including its Desktop HDD 5TB, without advertising that fact. However, Seagate, like others, does correctly label several of its archival hard drives as using SMR tech, making the lack of disclosure on mainstream models a bit puzzling.
rsync.net accounts have full support for borg backup
borg creates and maintains encrypted, remote backups.
- Your data is encrypted with keys that only you hold
- rsync.net cannot see your data.
- Backups are fast, bandwidth efficient and compressed/deduplicated.
- borg is fully open source and is in active, current development
Specific borg Features
- You may access the account with any tool that runs over SSH - not just borg.
- You may create and maintain an unlimited number of borg repositories.
- You have full control over your authorized_keys file to restrict IP and command access - or to enforce append-only mode.
- You may configure custom alerts to generate email, SMS, or Pushover warnings - or call a webhook. Or all of the above.
- You may set your account to be immutable (read-only) and accessible only by SSH key (disabled passwords).
- We support legacy borg versions for backward compatibility - currently 0.29 and 1.x branches.
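For the curious, basic borg usage against an SSH-reachable repository looks roughly like this; the host, repository path, and backup source below are placeholders of my own, not rsync.net-specific settings:

borg init --encryption=repokey ssh://user@host/./backups/main
borg create --stats --compression lz4 ssh://user@host/./backups/main::home-{now} /home
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 ssh://user@host/./backups/main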
Special Pricing for borg Accounts
Special "borg accounts" are available at a very deep discount for technically proficient users.
- Free ZFS filesystem snapshots are not included since you'll be doing versioning and retention with borg.
- We will not configure subaccounts, or additional logins, for these borg accounts.
- There is NO borg specific technical support or integration engineering. You're here because you're an expert.
- Choose any location
- NO Charges for ingress/egress
- Unlimited borg Repositories
- Start with 100 GB for $18/year
- $0.015/GB/Month
Open Standards
Common Sense
We give you an empty UNIX filesystem that you can access with any SSH tool
Our platform is built on ZFS which provides unparalleled data security and fault tolerance
rsync.net can scale to Petabyte size and Gigabit speeds
rsync / sftp / scp / borg / rclone / restic / git-annex
- Secure Offsite Backup Since 2001
- Five Global Locations in US, Europe and Asia
- SSAE16, PCI and HIPAA Compliant - We Sign BAAs
- No Contracts, Licenses, Setup or Per-Seat Costs
- Unlimited Technical Support from UNIX Engineers
- Free Monitoring and Configurable Alerts
- Two Factor Auth available
- Physical Data Delivery Available
- Web Based Management Console
OpenZFS on Windows
OpenZFS on Windows port, developed in the openzfsonwindows/ZFSin repository on GitHub.
I have a very large external drive that I want to use for backups. Some of the backups are of Windows partitions that need to be accessible from Windows; others are backups of some Linux partitions.
...
By default, a single copy of user data is stored, and two copies of file system metadata are stored. By increasing copies, you adjust this behavior such that copies copies of user data (within that file system) are stored, and copies plus one copies of file system metadata (within that file system) are stored. For best effect, if you want to set copies to a value greater than one, you should do so when you create the pool using zpool create -O copies=N, to ensure that additional copies of all root file system metadata are stored. //
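A quick sketch of both approaches, with placeholder pool and dataset names: setting copies at pool creation so the root file system metadata gets the extra copies too, or turning it on later for an individual dataset:

root@box:~# zpool create -O copies=2 tank mirror /dev/sda /dev/sdb
root@box:~# zfs set copies=2 tank/important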
Under normal read operations, extra copies only consume storage space. When a read error occurs, if there are redundant, valid copies of the data, those redundant copies can be used as an alternative in order to satisfy the read request and transparently rewrite the broken copy. //
However, during writes, all copies must be updated to ensure that they are kept in sync. Because ZFS aims to place copies far away from each other, this introduces additional seeking. Also don't forget about its Merkle tree design, with metadata blocks placed some physical distance away from the data blocks (to guard against, for example, a single write failure corrupting both the checksum and the data). I believe that ZFS aims to place copies at least 1/8 of the vdev away from each other, and the metadata block containing the checksum for a data block is always placed some distance away from the data block.
Consequently, setting copies greater than 1 does not significantly help or hurt performance while reading, but reduces performance while writing in relation to the number of copies requested and the IOPS (I/O operations per second) performance of the underlying storage.
As friendly an online advertisement as you'll find.
In mid-August, the first commercially available ZFS cloud replication target became available at rsync.net. Who cares, right? As the service itself states, "If you're not sure what this means, our product is Not For You."
Of course, this product is for someone—and to those would-be users, this really will matter. Fully appreciating the new rsync.net (spoiler alert: it's pretty impressive!) means first having a grasp on basic data transfer technologies. And while ZFS replication techniques are burgeoning today, you must actually begin by examining the technology that ZFS is slowly supplanting. //
Yep—it took the same old 1.7 seconds for ZFS to re-sync, no matter whether we touched a 1GB file, touched an 8GB file, or even moved an 8GB file from one place to another. In the last test, that's almost three full orders of magnitude faster than rsync: 1.7 seconds versus 1,479.3 seconds. Poor rsync never stood a chance.
rsync has a lot of trouble with these. The tool can save you network bandwidth when synchronizing a huge file with only a few changes, but it can't save you disk bandwidth, since rsync needs to read through and tokenize the entire file on both ends before it can even begin moving data across the wire. This was enough to be painful, even on our little 8GB test file. On a two terabyte VM image, it turns into a complete non-starter. I can (and do!) sync a two terabyte VM image daily (across a 5mbps Internet connection) usually in well under an hour. Rsync would need about seven hours just to tokenize those files before it even began actually synchronizing them... and it would render the entire system practically unusable while it did, since it would be greedily reading from the disks at maximum speed in order to do so.
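For comparison, the ZFS side of that workflow boils down to incremental snapshot replication, which in its simplest form (dataset and host names are placeholders) looks like:

root@box:~# zfs snapshot pool/images@today
root@box:~# zfs send -i pool/images@yesterday pool/images@today | ssh backuphost zfs receive backup/images

Only the blocks that changed between the two snapshots ever get read or sent, which is why the re-sync time stays flat no matter how large the files are.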
The moral of the story? Replication definitely matters.