Setting up a ZFS pool involves a number of permanent decisions that will affect the performance, cost, and reliability of your data storage systems, so you really want to understand all the options at your disposal in order to make the right choices from the beginning.
The "write hole" effect can occur if a power failure happens during a write. It affects all array types, including but not limited to RAID5, RAID6, and RAID1. When it occurs, it is impossible to determine which of the data blocks or parity blocks were written to the disks and which were not, so the parity no longer matches the rest of the data in the stripe. Worse, you cannot determine with confidence which data is incorrect - the parity or one of the data blocks.
Write hole in RAID5
The "write hole" is widely recognized as affecting RAID5, and most discussions of the effect refer to RAID5. It is important to know, however, that other array types are affected as well.
If the user data was not written completely, the filesystem usually corrects the errors during reboot by replaying its transaction log. If the filesystem does not support journaling, the errors will still be corrected during the next consistency check (CHKDSK or fsck).
If the parity (in RAID5) or the mirror copy (in RAID1) was not written correctly, the problem goes unnoticed until one of the array member disks fails. When a disk fails, you replace it and start a RAID rebuild, and one of the blocks is then recovered incorrectly. If a RAID recovery is needed because of a controller failure, a parity mismatch does not matter.
A mismatch of parity or mirrored data can be repaired without user intervention if, at some later point, a full stripe is written on the RAID5, or the same data block is written again on the RAID1. In that case the old (incorrect) parity is not used; new (correct) parity is calculated and written instead. New parity data is also written if you force a resynchronization of the array (an option available on many RAID controllers and NAS devices).
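Many RAID implementations expose exactly this resynchronization. As a hedged illustration (array name assumed), on Linux mdraid a repair pass can be forced like this:
# recompute parity from the data blocks and rewrite any mismatches (md0 assumed)
echo repair > /sys/block/md0/md/sync_action
# watch the progress of the repair pass
cat /proc/mdstat
//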
How to avoid the "write hole"?
In order to completely avoid the write hole, you need to provide write atomicity. Operations which cannot be interrupted in the middle are called "atomic": an atomic operation either completes fully or does not happen at all. If an atomic operation is interrupted for external reasons (e.g. a power failure), the system is guaranteed to remain in either its original or its final state.
In a system which consists of several independent devices, natural atomicity does not exist. Variations in the characteristics of mechanical hard drives and the particularities of the data bus make it impossible to provide the required synchronization. In such cases, transactions are typically used instead. A transaction is a group of operations for which atomicity is provided artificially. However, providing transactional atomicity carries expensive overhead. Hence, transactions are not used in RAIDs.
One more option to avoid the write hole is to use ZFS, which is a hybrid of a filesystem and a RAID. ZFS uses "copy-on-write" to provide write atomicity. However, this technique requires a special type of RAID (RAID-Z) which cannot be reduced to a combination of the common RAID types (RAID 0, RAID 1, or RAID 5).
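For illustration, here is a minimal sketch of creating a single-parity RAID-Z vdev (pool and disk names assumed):
# full-stripe copy-on-write means a power failure can never leave
# a stripe with half-updated parity
zpool create tank raidz1 da0 da1 da2 da3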
ZFS offers an impressive number of features, even putting aside its hybrid nature (both a filesystem and a volume manager -- zvol), covered in detail on Wikipedia. One of the most fundamental points to keep in mind about ZFS is that it targets legendary reliability in terms of preserving data integrity. ZFS uses several techniques to detect and repair (self-heal) corrupted data. Simply speaking, it makes aggressive use of checksums and relies on data redundancy; the price to pay is a bit more CPU processing power. However, the Wikipedia article about ZFS also mentions that it is strongly discouraged to run ZFS on top of a classic RAID array, as it then cannot control the data redundancy, ruining most of its benefits.
Further Observations on ZFS Metadata Special Device
Last quarter we discussed metadata special devices.
By default, adding a special device to a zpool causes all (new) pool metadata to be written to the device. Presumably this device is substantially faster than spinning disk vdevs in the pool.
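As a sketch, assuming a pool named tank and two fast flash devices, a mirrored special vdev would be added like this:
# mirror the special vdev: losing it unrecoverably loses the pool
zpool add tank special mirror nvd0 nvd1
//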
Finally, the permissions and ownership regime is re-enforced every five minutes, just in case something ever gets chowned/chmodded incorrectly. Yes, we really are gratuitously chmodding and chowning ourselves every five minutes.
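A minimal sketch of such a periodic re-enforcement job, with hypothetical paths, owner, and modes (the actual rsync.net job is not shown):
# /etc/crontab entry: re-apply ownership and permissions every five minutes
*/5 * * * * root chown -R user:user /mnt/tank/home/user && chmod -R o-rwx /mnt/tank/home/user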
I am known as a strong ZFS Boot Environment supporter … and not without a reason. I have stated the reasons ‘why’ many times but most (or all) of them are condensed here – https://is.gd/BECTL – in my presentation about it.
The upcoming FreeBSD 13.0-RELEASE looks very promising. In many tests it is almost TWICE as fast as the 12.2-RELEASE. Ouch!
Having 12.2-RELEASE installed, I wanted to try 13.0-BETA* to see whether the things that are important to me - like working suspend/resume, for example - work as advertised on the newer version. It is the perfect task to achieve using ZFS Boot Environments.
In the example below we will create an entirely new ZFS Boot Environment with a clone of our current 12.2-RELEASE system and upgrade it there (in the BE) to the 13.0-BETA3 version … and only one reboot will be required - not three as in the typical freebsd-update(8) upgrade procedure.
I assume that you have FreeBSD 12.2-RELEASE installed with ZFS (the default FreeBSD ZFS install) and that it is installed in UEFI or UEFI+BIOS mode.
Here are the steps that will be needed.
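A condensed sketch of those steps, assuming bectl(8) and a BE named 13.0-BETA3 (the post walks through the full detail; freebsd-update(8) runs its install phase more than once in practice):
# clone the running 12.2-RELEASE system into a new BE and mount it
bectl create 13.0-BETA3
bectl mount 13.0-BETA3 /tmp/BE
# upgrade the mounted BE in place, without touching the live system
freebsd-update -b /tmp/BE -r 13.0-BETA3 upgrade
freebsd-update -b /tmp/BE install
# boot into the upgraded BE - the single reboot of the procedure
bectl activate 13.0-BETA3
shutdown -r now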
u/ABC_AlwaysBeCoding (OP)
....
A few days in I started to see some unexplained panics/freezes. Some Steam games would fail validation... Something was wrong. I ran btrfsck and... it reported problems. I couldn't repair them without booting off another disk so I found my Manjaro USB key I made and booted off that again and attempted to --repair the btrfs drive.
FAIL. It was unable to do it. Unrecoverable errors. The partition was basically hosed! WTF? Luckily I only had "game data" on it and it was all synced with the cloud!
Initially I blamed BTRFS. I was mad. I had heard Ubuntu 19.10 now has experimental ZFS-on-root support out of the box. I was intrigued. I said "fuck it", made an installer key and wiped my disk and ran with it.
Things went fine until... I was installing some big game and... things just stopped. But not like "hang" stopped, I could still move windows around and keyboard input still appeared on the screen, which told me the CPU and GPU weren't the cause... fuck, it's something with the drive. It only cleared on reboot. I was afraid my data or filesystem got corrupt... turns out the former did not happen and the latter is virtually impossible with ZFS (yessss), so I tried again. Again, under heavy usage things eventually just froze. I kept trying different things now that I could duplicate it. I had the idea of forcing heavy activity via manually triggering a scrub and then watch zpool status in another window.
And then sonofabitch, I finally saw it. The drive that had just reported as ONLINE during the scrub, suddenly reported "SUSPENDED: One or more devices are faulted in response to IO failures." zpool clear did not clear it. But now I KNEW it was a drive (or enclosure, or cable, or mobo) issue. I immediately suspected the drive cable (as one does, who has been doing this sort of thing for a while), finally found another USB-A 3.1 to USB-C cable from Plugable that looked solidly built, and substituted it.
BOOM. ALL PROBLEMS WENT AWAY.
So OK, the enclosure guys (really cool enclosure, btw) sent me a bad cable with it. It happens. But...
SOLD... on ZFS. //
u/atemu12
The things btrfs check reported might've been benign warnings, false positives and/or fixed on mount.
--repair the btrfs drive.
RTFM
Warning
Do not use --repair unless you are advised to do so by a developer or an experienced user, and then only after having accepted that no fsck can successfully repair all types of filesystem corruption. E.g. some other software or hardware bugs can fatally damage a volume.
Not reading the manual results in:
Unrecoverable errors. The partition was basically hosed! //
u/ABC_AlwaysBeCoding (OP)
Fair enough. Thing is, --repair may be dangerous but it seems to be impossible to corrupt a ZFS filesystem because there isn't even a way to run any sort of --repair short of a resilvering/scrub. But yeah, I was inexperienced with both of these filesystems, as you can tell.
Btrfs will try to continue regardless of error
This is absolutely the wrong strategy. Once corruption starts, it usually spreads. I think ZFS does the right thing here, even if it is inconvenient (and to be fair, if I had a mirror, it probably would have just continued as well... I think? Maybe I'll keep the bad cable to do resilience testing).
BTRFS basically uses the Golang strategy (ignore errors, be squishy and nondeterministic, just keep going), ZFS uses the Erlang/Elixir strategy (fail fast, fail hard, be brittle and deterministic, restart cleanly) and I am most definitely in the latter camp based on 20+ years in the industry getting paid to program while doing sysadmin for fun or necessity.
Regular old fashioned ZFS has filesystems and snapshots. Recent versions of ZFS add a third object, called bookmarks. Bookmarks are described like this in the zfs manpage (for the 'zfs bookmark' command):
Creates a bookmark of the given snapshot. Bookmarks mark the point in time when the snapshot was created, and can be used as the incremental source for a zfs send command.
ZFS on Linux has an additional explanation here:
A bookmark is like a snapshot, a read-only copy of a file system or volume. Bookmarks can be created extremely quickly, compared to snapshots, and they consume no additional space within the pool. Bookmarks can also have arbitrary names, much like snapshots.
Unlike snapshots, bookmarks can not be accessed through the filesystem in any way. From a storage standpoint a bookmark just provides a way to reference when a snapshot was created as a distinct object. [...]
Whenever you create a zfs send stream, that stream is created as the delta between two snapshots. (That's the only way to do it as ZFS is currently implemented.) In order to apply that stream to a different dataset, the target dataset must contain the starting snapshot of the stream; if it doesn't, there is no common point of reference for the two. When you destroy the @snap0 snapshot on the source dataset, you create a situation that is impossible for ZFS to reconcile.
The way to do what you are asking is to keep one snapshot in common between both datasets at all times, and use that common snapshot as the starting point for the next send stream.
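Put together, a sketch of that workflow with hypothetical dataset names:
# send the initial snapshot, then keep only a bookmark of it locally
zfs snapshot tank/data@snap0
zfs send tank/data@snap0 | zfs recv backup/data
zfs bookmark tank/data@snap0 tank/data#snap0
zfs destroy tank/data@snap0
# later, the bookmark still works as the incremental source
zfs snapshot tank/data@snap1
zfs send -i tank/data#snap0 tank/data@snap1 | zfs recv backup/data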
To recursively set prop for pool/fs/root and its descendants (with the exception of a few non-inherited props, e.g. mountpoint):
zfs set prop=val pool/fs/root
To recursively reset prop - restoring it to the inherited value (if set "higher" in the tree) or the default otherwise, or removing it entirely (if a user prop):
zfs inherit -r prop pool/fs/root
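A concrete instance with a real property (dataset names assumed):
# descendants of tank/data pick up the compression setting by inheritance
zfs set compression=lz4 tank/data
# any child that overrode mountpoint reverts to the inherited/default value
zfs inherit -r mountpoint tank/data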
//
You can delete a user property (those with a colon, e.g. com.sun:auto-snapshot:yearly) with the zfs inherit command, e.g
zfs inherit com.sun:auto-snapshot:yearly <dataset>
From RTFM zfs(8): Use the "zfs inherit" command to clear a user property. If the property is not defined in any parent dataset, it is removed entirely. Property values are limited to 1024 characters. //
Inheriting nonexistence - that's fun. I think a parent should be able to force behavior on its lineage, as well as trusting the child to imitate its behavior. Maybe the developers are training us to be better parents.
Examples of data problems include the following:
Pool or file system space is missing
Transient I/O errors due to a bad disk or controller
On-disk data corruption due to cosmic rays
Driver bugs resulting in data being transferred to or from the wrong location
A user overwriting portions of the physical device by accident
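ZFS detects most of these through end-to-end checksums; a quick way to surface them (pool name assumed):
# verify every allocated block against its checksum, repairing from
# redundancy where possible
zpool scrub tank
# list any files affected by unrecoverable errors
zpool status -v tank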
The zpool configuration at rsync.net is simple and straightforward: we create 12 disk raidz-3 vdevs (which can sustain 3 disk failures without data loss) and concatenate them into zpools.
This allows for a nice, even, five vdevs per 60 drive JBOD. 12x16TB disks in one vdev is still a number we (and our retained consultants at Klara Systems[1]) are comfortable with.
We also employ three SSD cache drives per zpool - one single drive for L2ARC and a mirrored pair of drives for the SLOG. This is a very standard configuration and anyone working professionally with ZFS will recognize it.
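For a rough sketch of what building one such zpool might look like (device names hypothetical, only two of the five raidz3 vdevs shown):
zpool create tank \
  raidz3 da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11 \
  raidz3 da12 da13 da14 da15 da16 da17 da18 da19 da20 da21 da22 da23 \
  cache nvd0 \
  log mirror nvd1 nvd2
//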
We maintain a full complement of checksum commands that you can run, over ssh, to verify files:
ssh user@rsync.net rmd160 some/file
The full list of checksum commands is:
md5 sha1 sha224 sha256 sha384 sha512 sha512t256 rmd160 skein256 skein512 skein1024 cksum
... and note the last one in the list, 'cksum' ...
cksum is interesting because if your use-case is not security or collision detection but, rather, simple verification of file transfer and/or integrity, cksum is much, much faster than the "serious" checksum tools. You should consider cksum if you have simple integrity checks to do on millions of files, etc.
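For example, verifying a single transferred file end to end (path hypothetical):
# CRC and byte count, computed locally and then remotely
cksum backups/archive.tar
ssh user@rsync.net cksum backups/archive.tar
# matching output on both sides indicates an intact transfer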
Q: How can I identify which /dev is which physical hard drive? Which drive is which?!?
A: This is an especially good thing to know when you are trying to replace a failed drive in a SoftRAID array! As of Version 0.7.4020 the serial number is displayed on the WebGUI page Disks > Management > HDD Management if the drive reports it.
Run this one-liner in bash:
for i in $(sysctl -n kern.disks); do printf "%s\t%s\n" "$i" "$(smartctl -a /dev/"$i" | grep "Serial Number")"; done | sort
This will output a list of all disks' assigned /dev's and serial numbers.
If you want to do it with a script or from the CLI, remember that the disk cannot be mounted or otherwise be in use. If you have a few disks to wipe, it will save time in the long run to use a script like the one shown below. This script zeroes the first and last 8192 sectors of a drive (4096 KiB on 512-byte-sector drives), ensuring that any partitioning or metadata is gone so you can reuse the drive. Warning: all other data on the drive will become inaccessible!
#!/bin/sh
echo "What disk do you want"
echo "to wipe? For example - ada1 :"
read -r disk
echo "OK, in 10 seconds I will destroy all data on $disk!"
echo "Press CTRL+C to abort!"
sleep 10
diskinfo ${disk} | while read -r disk sectorsize size sectors other
do
    # Delete the MBR, primary GPT, ZFS labels (L0/L1) and any other partition table.
    /bin/dd if=/dev/zero of=/dev/${disk} bs=${sectorsize} count=8192
    # Delete GEOM metadata and the secondary GPT (ZFS labels L2/L3).
    /bin/dd if=/dev/zero of=/dev/${disk} bs=${sectorsize} oseek=$(expr ${sectors} - 8192) count=8192
done
How to safely remove the rest of a GPT?
The disk holds actual data (part of a ZFS pool); I do not want to destroy that data.
GEOM: da6: the primary GPT table is corrupt or invalid.
GEOM: da6: using the secondary instead -- recovery strongly advised.
//
You need to zero out the backup GPT header. GEOM locates that header using (mediasize / sectorsize) - 1. I think mediasize/sectorsize is exactly what diskinfo -v displays as "mediasize in sectors", so that number minus 1 would be lastsector in:
dd if=/dev/zero of=/dev/da6 bs=<sectorsize> oseek=<lastsector> count=1
//
When the primary header is not valid, gpart destroy will not touch the first two sectors. In your case you can wipe only the last sector, as Ian suggested, but use gpart destroy -F da6 instead of dd. //
You need to use gpart destroy -F on the CORRUPTED GPT; this command will wipe the last sector, where the GPT backup header is located. Since the GPT is in the CORRUPT state, the primary header will not be overwritten by this command. When both the primary and backup headers and tables are valid, gpart destroy overwrites the PMBR and both the primary and backup headers.
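Afterwards you can confirm that nothing remains (device name from the thread):
# with the GPT destroyed, gpart should no longer report a table on da6
gpart show da6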
Olivier Cochard-Labbé, the original founder of FreeNAS, shares with us a presentation outlining the history and beginnings of what would become the world's most popular storage operating system.
Slideshow explaining VDev, zpool, ZIL and L2ARC and other newbie mistakes!
I've put together a PowerPoint presentation (and PDF) that gives some useful info for newbies to FreeNAS. I decided to create this slideshow because in the last 5 months I've been on this forum I've seen a lot of people confused about vdevs, zpools, ZILs, L2ARCs, etc. Hopefully we can put to rest a lot of the confusion once and for all.
We get a large number of duplicate threads with the same questions asked every other day. Personally, I get them so often that I decided to stop answering them; a better use of my time was to create this presentation. I literally read every thread and every post on the forum. So if I don't answer, either the answer is already in the thread, or it can be found in this presentation or the FreeNAS manual. Answering every third thread with "Consult the manual" gets a little old after a while, and I have better uses for my time.
This presentation also contains a lot of information explained in a little more detail for new users. It covers many common errors newbies make and can save you some heartache. If you are brand new to FreeBSD, I recommend reading the manual cover to cover. There are a lot of recommendations throughout the manual, and they are typically there because they are an error trap for many people.
I've saved this as a PowerPoint file because I have some animations in the slideshow. I'm not sure what other formats would work. If you would like this in another format that supports animations, please let me know and I'll see what I can do. Currently I provide this as a PowerPoint presentation and a PDF. The PDF has no animations, therefore the PowerPoint is preferred.
Yet RAID is not only about availability. Its other advantages are important too - and, for many users, possibly more important.
- Performance. Striping data across multiple drives can dramatically increase bandwidth for large file apps like video editing.
- Capacity. Putting 4-12 drives in a RAID gives a virtual disk that is much larger than any single drive.
- Management. After the often painful setup process - and until something breaks - RAID arrays are simpler to manage than individual disks. //
Many are still using small RAID5 arrays built from drives with 10^-14 error rates - me too! - and RAID5 seems to work fine. But adjustments should be made to account for the unchanged error rates.
- Always maintain a minimum of 2 copies of any data stored on a RAID - 1 on the RAID and 1 elsewhere.
- When there is a drive failure, pull any un-backed-up data - the latest documents that aren't backed up yet - off the RAID before replacing the failed drive.
Since RAID arrays are more complex than individual drives, they are more likely to fail. But until they do they are more convenient, faster and larger than any single drive.
RAID 5 protects against a single disk failure. You can recover all your data if a single disk breaks. The problem: once a disk breaks, there is another increasingly common failure lurking. And in 2009 it is highly certain it will find you.
Disks fail
While disks are incredibly reliable devices, they do fail. Our best data - from CMU and Google - finds that over 3% of drives fail each year in the first three years of drive life, and then failure rates start rising fast.
With 7 brand new disks, you have ~20% chance of seeing a disk failure each year. Factor in the rising failure rate with age and over 4 years you are almost certain to see a disk failure during the life of those disks.
But you're protected by RAID 5, right? Not in 2009.
Reads fail
SATA drives are commonly specified with an unrecoverable read error rate (URE) of 1 in 10^14 bits read. Which means that once every 100,000,000,000,000 bits, the disk will very politely tell you that, so sorry, but I really, truly can't read that sector back to you.
One hundred trillion bits is about 12 terabytes. Sound like a lot? Not in 2009. //
So now what? The obvious answer, and the one that storage marketers have begun trumpeting, is RAID 6, which protects your data against 2 failures. Which is all well and good, until you consider this: as drives increase in size, any drive failure will always be accompanied by a read error. So RAID 6 will give you no more protection than RAID 5 does now, but you'll pay more anyway for extra disk capacity and slower write performance.
Gee, paying more for less! I can hardly wait! //
Originally the developers of RAID suggested RAID 6 as a means of protecting against 2 disk failures. As we now know, a single disk failure means a second disk failure is much more likely - see the CMU pdf Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? for details - or check out my synopsis in Everything You Know About Disks Is Wrong. RAID 5 protection is a little dodgy today due to this effect and RAID 6 - in a few years - won't be able to help.
Finally, I recalculated the AFR for 7 drives using the 3.1% AFR from the CMU paper, using the formula suggested by a couple of readers - 1 - 0.969^(number of disks) - and got 19.8%. So I changed the ~23% number to ~20%.
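The arithmetic is easy to check, e.g. with bc(1):
# probability that at least one of 7 disks fails in a year at 3.1% AFR
echo '1 - 0.969^7' | bc -l
# prints ~.1978, i.e. roughly 19.8%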
Comments welcome, of course. I revisited this piece in 2013 in Has RAID5 stopped working? Now that we have 6TB drives - some with the same 10^14 URE - the problem is worse than ever.
As you can see in the chart above, btrfs-raid1 differed pretty drastically from its conventional analogue. To understand how, let's think about a hypothetical collection of "mutt" drives of mismatched sizes. If we have one 8T disk, three 4T disks, and a 2T disk, it's difficult to make a useful conventional RAID array from them—for example, a RAID5 or RAID6 would need to treat them all as 2T disks (producing only 8T raw storage before parity).
However, btrfs-raid1 offers a very interesting premise. Since it doesn't actually marry disks together in pairs, it can use the entire collection of disks without waste. Any time a block is written to the btrfs-raid1, it's written identically to two separate disks—any two separate disks. Since there are no fixed pairings, btrfs-raid1 is free to simply fill all the disks at the same rough rate proportional to their free capacity.
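Creating such an array is a one-liner (device names hypothetical):
# two copies of both data (-d) and metadata (-m), spread across any
# pair of the five mismatched drives
mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
//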
As any storage administrator worth their salt will tell you, RAID is primarily about uptime. Although it may keep your data safe, that's not its real job—the job of RAID is to minimize the number of instances in which you have to take the system down for extended periods of time to restore from proper backup.
Once you understand that fact, the way btrfs-raid handles hardware failure looks downright nuts. What happens if we yank a disk from our btrfs-raid1 array above? //
Btrfs' refusal to mount degraded, automatic mounting of stale disks, and lack of automatic stale disk repair/recovery do not add up to a sane way to manage a "redundant" storage system. //
Believe it or not, we've still only scratched the surface of btrfs problems. Similar problems and papercuts lurk in the way it manages snapshots, replication, compression, and more. Once we get through that, there's performance to talk about—which in many cases can be orders of magnitude slower than either ZFS or mdraid in reasonable, common real-world conditions and configurations.
Output:
# zpool status
...
scan: scrub in progress since Sun Jul 25 16:07:49 2021
403M scanned at 100M/s, 68.4M issued at 10.0M/s, 405M total
0B repaired, 16.91% done, 00:00:04 to go
Where:
Metadata which references 403M of file data has been scanned at 100M/s, and 68.4M of that file data has been scrubbed sequentially at 10.0M/s.
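To reproduce this kind of output yourself (pool name assumed):
# start a scrub, then check on its progress
zpool scrub tank
zpool status tank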