I honestly don't know for sure either, but I see the same thing from time to time. I have one pool that's a uniform collection of Hitachi disks off an internal LSI2008-based controller that never has errors, and a pool that's a mishmash of WD (some 512, some 4K) & Seagate 2TB drives in an external case attached to a SAS expander by way of an LSI1068-based controller with external cabling, and go figure, it almost always has some small quantity of data that gets repaired with no device-level errors (running this scrub right now):
If you work with storage applications or storage hardware there's a good chance you've heard of ZFS. ZFS is essentially a software implementation of RAID, but in my experience it's the most reliable software RAID I've worked with.
Traditional RAID instead of ZFS
Comparison to standard RAID
Over the years I've worked with several implementations of hardware RAID and for the most part they are pretty equal. However, most hardware RAID implementations I've seen (mine included) aren't really done well. Before I move on to ZFS RAID, I'm going to cover the basic problems I've come across with hardware RAID setups which contributed to my switch to ZFS. In the list below, RAID means "hardware RAID".
- RAID controllers are typically more expensive than HBAs (Host Bus Adapters)
- Many RAID users do not set their cache settings properly, on top of the fact that most cards do not come with a BBU; lots of admins get frustrated with throughput and force write-back caching without a BBU
- RAID controllers rarely keep up with drive capacity
- Sometimes the implementation is proprietary, which can make your setup less scalable (limited RAID sets, inability to mix/match nested RAID, or difficulty expanding existing sets)
- Most user interfaces I have worked with for hardware RAID were poor; i.e. option ROMs on the card that can't see full disk names, or OS-specific utilities that are buggy or only available for select OS installs
- I've yet to see a RAID card that allows you to perform a scan for errors like the ZFS scrub. I'm not saying they don't exist, I just haven't seen them
On your quest for data integrity, using OpenZFS is unavoidable. In fact, it would be quite unfortunate if you are using anything but ZFS for storing your valuable data. However, a lot of people are reluctant to try it out. The reasoning goes that, as an enterprise-grade filesystem with a wide range of built-in features, ZFS must be difficult to use and administer. Nothing could be further from the truth. Using ZFS is as easy as it gets. With a handful of terms, and even fewer commands, you are ready to use ZFS anywhere – from the enterprise to your home/office NAS.
In the words of the creators of ZFS: “We want to make adding storage to your system as easy as adding new RAM sticks.”
The following sections describe how to identify the type of data corruption and how to repair the data, if possible.
- Identifying the Type of Data Corruption
- Repairing a Corrupted File or Directory
- Repairing ZFS Storage Pool-Wide Damage
Checksum errors can occur transiently on individual disks or across multiple disks. The most likely culprits are bit rot or transient storage subsystem errors - oddities like signal loss due to solar flares and so on.
With ZFS, they are not of much concern, but some degree of preventative maintenance is necessary to keep failures from accumulating.
From time to time you may see zpool status output similar to this:
  NAME        STATE     READ WRITE CKSUM
  zones       ONLINE       0     0     0
    mirror-0  ONLINE       0     0     0
      c1t0d0  ONLINE       0     0    23
      c1t1d0  ONLINE       0     0     0
Note the "23" in the CKSUM column.
If this number is large or growing rapidly, the drive is likely in a "pre-failure" state and will fail soon; in the meantime it is (in this case) potentially compromising the redundancy of the vdev.
One thing to note is that occasional checksum errors on individual drives are normal and expected behavior (if not optimal). So are many errors on a single drive that is about to fail. Many checksum failures across multiple drives, however, can be indicative of a significant storage subsystem problem: a damaged cable, a faulty HBA, or even power problems. If you notice this, consider contacting Support for assistance with identification.
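If you just see a handful of checksum errors like the example above, a reasonable workflow (a minimal sketch; the pool and device names are taken from the sample output above) is to scrub, confirm everything repaired cleanly, and then reset the counters so that any new errors stand out:

# start a full scrub and check on its progress
zpool scrub zones
zpool status -v zones

# once the scrub completes with no new errors, clear the error counters
zpool clear zones c1t0d0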
This document is written for administrators and those who have familiarity with computing hardware platforms and storage concepts such as RAID. If you're already versed in the general failure process, you can skip ahead to replacing a drive and repairing the pool.
Degrees of verbosity
When a drive fails or has errors, a great deal of logging data is available on SmartOS. We can drill down in more detail to help us find the underlying cause of disk failure. From top to bottom, the commands below present the cause of a disk failure in increasing verbosity:
zpool status
iostat -en
iostat -En
fmadm faulty
fmdump -et {n}days
fmdump -eVt {n}days
The zpool status command will present us with a high-level view of pool health.
iostat will present us with high-level error counts and specifics about the devices in question.
fmadm faulty will tell us more specifically which event led to the disk failure. (fmadm can also be used to clear transitory faults; this, however, is outside the scope of this document. Refer to the fmadm man page for more information.)
fmdump is more specific still, presenting us with a log of the last {n} days of fault events. This information is often extraneous when simply replacing faulted disks, but if the problem is more complex than a single disk failure, it is extremely useful in isolating the root cause.
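As a rough worked example of the ladder above (the 7-day window is arbitrary; substitute whatever period you are investigating):

zpool status          # high-level pool health
iostat -en            # per-device error counts
iostat -En            # per-device error details
fmadm faulty          # resources the fault manager currently considers faulted
fmdump -et 7days      # fault events from the last 7 days
fmdump -eVt 7days     # the same events, in full detail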
There's a nice lengthy post on the zfs-discuss mailing list that covers how scrub works on a raidz vdev. The gist of it is:
- read data block, compute checksum
- read checksum from disk, compare to computed checksum
- compute parity for block, compute checksum for parity block
- read checksum for parity block from disk, compare to computed checksum
If the computed checksums don't match the checksums on disk, then the data block is rebuilt from parity, the checksums are compared again, and the repaired data is written over the top of the bad data block.
There are only two cases where ZFS will overwrite an existing block. The first is when updating the vdev labels at the beginning and end of each device. The second is when repairing a damaged block on disk. The key here is that the block is rewritten with its original contents, meaning its checksum, which is immutable, does not need to change. This is the only valid data which can be written to this block.
To sum it up:
Read error during normal operation = data gets reconstructed from parity and we continue
Read error during scrub = data gets reconstructed from parity and additionally fixed by overwriting the bad data on the disk? The rewrite then relies on the hard disk's own sector-remapping capability.
This is how many modern file system backup programs work. On day 1 you make an rsync copy of your entire file system:
backup@backup_server> DAY1=`date +%Y%m%d%H%M%S`
backup@backup_server> rsync -av -e ssh earl@192.168.1.20:/home/earl/ /var/backups/$DAY1/
On day 2 you make a hard link copy of the backup, then a fresh rsync:
backup@backup_server> DAY2=`date +%Y%m%d%H%M%S`
backup@backup_server> cp -al /var/backups/$DAY1 /var/backups/$DAY2
backup@backup_server> rsync -av -e ssh --delete earl@192.168.1.20:/home/earl/ /var/backups/$DAY2/
“cp -al” makes a hard link copy of the entire /home/earl/ directory structure from the previous day, then rsync runs against the copy of the tree. If a file remains unchanged then rsync does nothing — the file remains a hard link. However, if the file’s contents changed, then rsync will create a new copy of the file in the target directory. If a file was deleted from /home/earl then rsync deletes the hard link from that day’s copy.
In this way, the $DAY1 directory has a snapshot of the /home/earl tree as it existed on day 1, and the $DAY2 directory has a snapshot of the /home/earl tree as it existed on day 2, but only the files that changed take up additional disk space. If you need to find a file as it existed at some point in time you can look at that day’s tree. If you need to restore yesterday’s backup you can rsync the tree from yesterday, but you don’t have to store a copy of all of the data from each day, you only use additional disk space for files that changed or were added.
I use this technique to keep 90 daily backups of a 500GB file system on a 1TB drive.
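To expire the oldest trees automatically, something like the following can run after the daily rsync (a minimal sketch, assuming GNU head/xargs on the backup server and the date-stamped directory names used above; the 90-day retention is just an example):

backup@backup_server> cd /var/backups
backup@backup_server> ls -1d 20* | sort | head -n -90 | xargs -r rm -rf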
One caveat: the hard links do use up inodes. If you're using a file system such as ext3 or ext4, which has a set number of inodes, you should allocate extra inodes on the backup volume when you create it. If you're using a file system that can dynamically add inodes, such as zfs or btrfs, then you don't need to worry about this.
bsandor said:
My guess is, I should not try to recycle drives like this, and if I do, maybe there's something I should do to the disk before re-using it in another ZFS system.
zpool labelclear device is likely what you are looking for.
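For example (the device path below is just a placeholder; triple-check you have the right disk, since -f forces the label to be cleared):

zpool labelclear -f /dev/da0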
Q:
I have a NAS running FreeBSD 10.1 with 3 disks: ada2 is the boot device, ada0 and ada1 are a ZFS mirror.
dmesg shows this for the ZFS mirror:
GEOM: ada0: the primary GPT table is corrupt or invalid.
How can I recover the primary GPT tables for ada0 and ada1?
A:
Please post the output of zpool status. If ZFS uses the whole disks there won't be a GPT table (or any other partition scheme) at all.
A:
yes, ZFS uses the whole disks.
A:
In that case both ada0 and ada1 do not have a partition table.
You created the pool directly on the devices ada0 and ada1. This means the pool was created using the entire devices (ignoring partitions) rather than partitions on them (partitioning the disks is what lets you, for example, reserve space for booting the OS), and of course the existing GPT information was overwritten.
You should have created the ZFS pool like this:
(for striped) zpool create tank ada0p3 ada1p3
(for mirror) zpool create tank mirror ada0p3 ada1p3
Or, if you set a label with gpart add ... -l zdisk0 ada0, like this:
(for mirror) zpool create tank mirror /dev/gpt/zdisk0 /dev/gpt/zdisk1
To see which case applies to you, post the output of:
zpool status -v
A:
Because disks can vary in exact size, ZFS leaves some space unused at the end of the disk. (I don't know of an easy way to find out how much. Somebody pointed me at the source once, but I can't find that now. I think it would have to be at least a megabyte to allow for disk variance, but that is an estimate.)
The backup copy of the GPT is stored at the very end of the disk. The boot code tries to verify GPT tables, and is likely finding that leftover backup GPT at the end of the disk.
The trick is clearing that backup GPT without damaging the ZFS data. Do not attempt to do that without a full, verified backup of that ZFS mirror. After that, use diskinfo -v ada0 to get the mediasize in sectors. The standard backup GPT is 33 blocks long, so erasing the last 33 blocks on the disk with dd(1) should be enough to avoid the error without interfering with the ZFS data. dd(1) does not have a way to say "the last n blocks", so the seek= option has to be used to seek to (mediasize in blocks - 33).
WARNING: make a full, verified backup of everything on the disk first!
....
Repeat the procedure for ada1. Do not just reuse the same dd(1) command because the two disks might not have identical block counts.
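A minimal sketch of that procedure, assuming 512-byte sectors and that the fourth field of diskinfo's one-line output is the sector count (verify both against diskinfo -v before running anything, and keep that verified backup handy):

# DANGEROUS: writes directly to the disk; double-check every number first
SECTORS=$(diskinfo ada0 | awk '{print $4}')    # mediasize in sectors (verify!)
dd if=/dev/zero of=/dev/ada0 bs=512 seek=$((SECTORS - 33)) count=33
# if GEOM refuses to write to an in-use disk, export the pool first
# (or look into kern.geom.debugflags)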
In the future, the easy way is to erase GPT metadata before reusing the disk. That can be done with gpart destroy (see gpart(8)).
https://forums.freebsd.org/threads/gpt-table-corrupt.52102/post-292341
At the very least make sure there's no partition scheme on the disks before adding them. If there is a gpart destroy adaX should clear it.
One of my most popular blog articles is the one about the "Hidden Cost of using ZFS for your home NAS". To summarise its key argument:
Expanding ZFS-based storage can be relatively expensive / inefficient.
For example, if you run a ZFS pool based on a single 3-disk RAIDZ vdev (RAID5 equivalent), the only way to expand the pool is to add another 3-disk RAIDZ vdev.
You can't just add a single disk to the existing 3-disk RAIDZ vdev to create a 4-disk RAIDZ vdev because vdevs can't be expanded.
The impact of this limitation is that you have to buy all storage upfront even if you don't need the space for years to come.
Otherwise, by expanding with additional vdevs you lose capacity to parity you may not really want/need, which also limits the maximum usable capacity of your NAS.
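To make the limitation concrete, here is a hedged sketch (pool and device names are made up): growing a pool built from a single 3-disk RAIDZ1 vdev has, until now, meant adding a second whole vdev.

# original pool: a single 3-wide RAIDZ1 vdev
zpool create tank raidz ada0 ada1 ada2

# the only way to grow it: bolt on another (ideally identical) vdev
zpool add tank raidz ada3 ada4 ada5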
RAIDZ vdev expansion
Fortunately, this limitation of ZFS is being addressed!
ZFS co-founder Matthew Ahrens created a pull request around June 11, 2021, detailing a new ZFS feature that would allow RAIDZ vdev expansion.
Finally, ZFS users will be able to expand their storage by adding just a single drive at a time. This feature will make it possible to expand storage as you go, which is especially interesting for budget-conscious home users.
Jim Salter has written a good article about this on Ars Technica.
https://arstechnica.com/gadgets/2021/06/raidz-expansion-code-lands-in-openzfs-master/
There is still a caveat
Existing data will be redistributed or rebalanced over all drives, including the freshly added drive. However, the data that was already stored on the vdev will not be restriped after the vdev is expanded. This means that this data is stored with the older, less efficient parity-to-data ratio.
I think Matthew Ahrens explains it best in his own words:
After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity). However, the RAIDZ vdev's "assumed parity ratio" does not change, so slightly less space than is expected may be reported for newly-written blocks, according to zfs list, df, ls -s, and similar tools.
So, if you add a new drive to a RAIDZ vdev, you'll notice that after expansion, you will have less capacity available than you would theoretically expect.
The overhead or 'lost capacity' can be recovered by rewriting existing data after the vdev has been expanded, because the data will then be written with the more efficient parity-to-data ratio of the larger vdev.
Rewriting all data may take quite some time, and you may opt to postpone this step until the vdev has been expanded a couple of times, so that the parity-to-data ratio is 'good enough' for significant storage gains to be had by rewriting the data.
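There is no built-in 'rebalance' command; rewriting just means copying the data so that ZFS allocates fresh blocks at the new ratio. One hedged way to do that (dataset names are placeholders, and note that this temporarily needs space for two copies):

zfs snapshot tank/data@rewrite
zfs send tank/data@rewrite | zfs recv tank/data_new
zfs destroy -r tank/data
zfs rename tank/data_new tank/data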
Release timeline
According to the Ars Technica article by Jim Salter, this feature will probably become available in August 2022, so we need to have some patience.
RAID type - Supported RAID levels are:
- Mirror (two-way mirror - RAID1 / RAID10 equivalent);
- RAID-Z1 (single parity with variable stripe width);
- RAID-Z2 (double parity with variable stripe width);
- RAID-Z3 (triple parity with variable stripe width).
Drive capacity - we expect this number to be in gigabytes (powers of 10), in line with the way disk capacity is marked by the manufacturers. This number will be converted to tebibytes (powers of 2). The results will be presented in both tebibytes (TiB) and terabytes (TB). Note: 1 TB = 1000 GB = 1000000000000 B and 1 TiB = 1024 GiB = 1099511627776 B
Single drive cost - monetary cost/price of a single drive; used to calculate the Total cost and the Cost per TiB. The parameter is optional and has no impact on capacity calculations.
Number of RAID groups - the number of top-level vdevs in the pool.
Number of drives per RAID group - the number of drives per vdev.
Slop space allocation - 1/32 of the capacity of the pool or at least 128 MiB, but never more than half the pool size. [1,2]
% free space limit - recommended free pool space required to ensure best performance. Usually around 20%. [4]
Outputs:
Total raw storage capacity - the sum of physical size of all drives in the pool.
Zpool storage capacity - calculated as the difference between the total raw storage capacity and the loss for drive partitioning and metaslab allocation, but without taking into account parity and padding. This number should be reasonably close to the SIZE value reported by the zpool list command.
Reservation for parity and padding - calculated as described by Matthew Ahrens in [3]. (Not applicable to Mirror vdevs.)
Zpool usable storage capacity - calculated as the difference between the zpool storage capacity and the reservation for parity and padding.
Slop space allocation - see the Inputs section above for description.
ZFS usable storage capacity - calculated as the difference between the zpool usable storage capacity and the slop space allocation value. This number should be reasonably close to the sum of the USED and AVAIL values reported by the zfs list command.
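As a worked example of the slop-space rule from the Inputs section (a sketch that simply follows the rule as stated above, for a hypothetical 48 TiB zpool):

SIZE=$((48 * 1024 * 1024 * 1024 * 1024))   # hypothetical zpool capacity: 48 TiB in bytes
SLOP=$((SIZE / 32))                        # 1/32 of the pool
MIN=$((128 * 1024 * 1024))                 # ...but at least 128 MiB
HALF=$((SIZE / 2))                         # ...and never more than half the pool
[ "$SLOP" -lt "$MIN" ] && SLOP=$MIN
[ "$SLOP" -gt "$HALF" ] && SLOP=$HALF
echo "$SLOP"                               # 1649267441664 bytes = 1.5 TiB for this example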
When creating a pool, use disks with the same blocksize. The link between the ZFS "blocksize" and the disk blocksize is the ashift parameter (which cannot be modified after pool creation). Either the value is set to "0" for automatic blocksize detection, or it is set manually: ashift=12 for 4K-blocksize disks, or ashift=9 for 512b-blocksize disks. Mixing disks with different blocksizes in the same pool can lead to caveats like performance degradation or inefficient space utilization.
You can interact more effectively with ZFS using the zdb tool. To check the vdev ashift used for the zpool, check the ZFS MOS configuration:
zdb -C <pool_name>
or
zdb -e <pool_name>
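Since ashift is fixed per vdev at creation time, it is worth setting it explicitly when the pool is created. A hedged example for 4K-sector disks on OpenZFS (pool and device names are placeholders):

# create the pool with 4K alignment
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb

# confirm what was actually used
zdb -C tank | grep ashift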
Q: How to delete all but last [n] ZFS snapshots?
Does anyone have any good ways / scripts they use to manage the number of snapshots stored on their ZFS systems? Ideally, I'd like a script that iterates through all the snapshots for a given ZFS filesystem and deletes all but the last n snapshots for that filesystem.
E.g. I've got two filesystems, one called tank and another called sastank. Snapshots are named with the date on which they were created: sastank@AutoD-2011-12-13, so a simple sort command should list them in order. I'm looking to keep the last two weeks' worth of daily snapshots on tank, but only the last two days' worth of snapshots on sastank.
A:
zfs list -t snapshot -o name | grep ^tank@Auto | tac | tail -n +16 | xargs -n 1 zfs destroy -r
- Output the list of the snapshot (names only) with zfs list -t snapshot -o name
- Filter to keep only the ones that match tank@Auto with grep ^tank@Auto
- Reverse the list (previously sorted from oldest to newest) with tac
- Skip the 15 most recent snapshots and keep everything from the 16th entry onward (the ones to delete) with tail -n +16
- Then destroy each of them with xargs -n 1 zfs destroy -r
Deleting snapshots in reverse order is supposedly more efficient. Alternatively, sort in reverse order of creation:
zfs list -t snapshot -o name -S creation | grep ^tank@Auto | tail -n +16 | xargs -n 1 zfs destroy -vr
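Applied to the retention policy in the question (keep the newest 14 AutoD snapshots on tank and the newest 2 on sastank), a hedged wrapper might look like this, assuming GNU xargs and the snapshot naming shown above:

for spec in tank:14 sastank:2; do
    fs=${spec%%:*}; keep=${spec##*:}
    zfs list -H -t snapshot -o name -S creation -d1 "$fs" \
        | grep "^$fs@AutoD-" \
        | tail -n +$((keep + 1)) \
        | xargs -r -n 1 zfs destroy -r
done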
A:
A more general case: getting the most recent snapshot based on creation date, not by name.
zfs list -H -t snapshot -o name -S creation | head -1
Scoped to a specific filesystem named TestOne:
zfs list -H -t snapshot -o name -S creation -d1 TestOne | head -1
-H: No header so that first line is a snapshot name
-t snapshot: List snapshots (list can list other things like pools and volumes)
-o name: Display the snapshot name property.
-S creation: Capital S denotes descending sort, based on creation time. This puts most recent snapshot as the first line.
-d1 TestOne: Limits the depth to 1, which includes children. That seems confusing, but it's because, as far as this command is concerned, snapshots of TestOne are its children. This will NOT list snapshots of volumes within TestOne such as TestOne/SubVol@someSnapshot.
| head -1: Pipe to head and only return the first line.
Thanks for the -d1. That was the key to the question "How do I get all snapshots for a given dataset?"
Managing ZFS Properties
Dataset properties are managed through the zfs command's set, inherit, and get subcommands.
- Setting ZFS Properties
- Inheriting ZFS Properties
- Querying ZFS Properties
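For example (the dataset name and the compression property are just placeholders for illustration):

zfs set compression=lz4 tank/home       # set a property explicitly
zfs get compression tank/home           # query it and see where the value comes from
zfs inherit compression tank/home       # revert to the value inherited from the parent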
zrep is, by design, fairly simple. It is based around zfs properties. Therefore, when in doubt, check the properties. zrep gives an easy way to do this.
Here is an example of a set of properties used when zrep is replicating to a filesystem on the same host. Note that src-host and dest-host are the same:
# ./zrep list -v scratch/zreptest
scratch/zreptest:
zrep:src-fs scratch/zreptest
zrep:src-host hamachi
zrep:dest-fs scratch/zreptest_2
zrep:master yes
zrep:savecount 5
zrep:dest-host hamachi
For the destination filesystem (in this case scratch/zreptest_2), the properties would be mostly the same. The only difference would be, instead of "zrep:master yes", you would see "readonly on"
ZFS property overview
Fixing zfs properties
To manually change a zfs property, use
zfs set zrep:something=newval file/system
When it comes to removing a property, zfs is a little odd. The standard (and, in fact, only) way of removing a property in zfs is as follows:
zfs inherit zrep:master file/system
This tells zfs to set that property to be "inherited" from the parent filesystem. Since the parent usually does not have any zrep properties, this effectively removes the property from a filesystem or snapshot.
It is the late 1990s and the computer server world is dominated by enterprise UNIX operating systems – all competing with each other. Windows 2000 is not out yet and Windows NT 4 is essentially a toy that lesser mortals run on their Intel PCs, which they laughingly call 'servers'. Your company has a commercial UNIX and it's called Solaris. Your UNIX is very popular and is a leading platform. Your UNIX, however, has some major deficiencies when it comes to storage.
IRIX – a competing proprietary UNIX – has the fantastic XFS file system which vastly outperforms your own file system, which is still UFS ("Unix File System" – originally developed in the early 1980s) and doesn't even have journalling – until Solaris 7 at least (in November 1998). IRIX had XFS baked into it from 1994. IRIX also had a great volume manager – whereas Solaris' 'SVM' was generally regarded as terrible and was an add-on product that didn't appear as part of Solaris itself until Solaris 8 in 2000.
ZFS – and sadly btrfs – are both rooted in a 1990s monolithic model of servers and storage. btrfs hasn't caught on in Linux for a variety of reasons, but most of all it's because it simply isn't needed. XFS runs rings around both in terms of performance and scales to massive volume sizes. LVM supports XFS by adding COW snapshots and clones, and even clustering if you so wanted. I believe the interesting direction in file systems is actually things like Gluster and Ceph – file systems designed with the future in mind, rather than for a server model we're not running any more.
Interesting to compare the comments to the disparaging statements in the article.
ZFS combines the hardware and software layers, bringing volume, disk & partition management together in one application.
It is the only production-ready journaled CoW file system with data integrity management.
Btrfs is not production-ready.
Described as "the last word in filesystems", ZFS is stable, fast, secure, and future-proof. Because it is licensed under the CDDL, which is incompatible with the GPL, ZFS cannot be distributed along with the Linux kernel. This restriction, however, does not prevent a native Linux kernel module from being developed and distributed by a third party, as is the case with zfsonlinux.org (ZOL).
ZOL is a project funded by the Lawrence Livermore National Laboratory to develop a native Linux kernel module for its massive storage requirements and supercomputers.
The Zettabyte File System
by J. Bonwick et al.