u/ABC_AlwaysBeCoding (OP)
....
A few days in I started to see some unexplained panics/freezes. Some Steam games would fail validation... Something was wrong. I ran btrfsck and... it reported problems. I couldn't repair them without booting off another disk, so I dug out the Manjaro USB key I had made, booted off that, and attempted to --repair the btrfs drive.
FAIL. It was unable to do it. Unrecoverable errors. The partition was basically hosed! WTF? Luckily I only had "game data" on it and it was all synced with the cloud!
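(For reference, a sketch of that sequence from the live USB, with a hypothetical device name; btrfsck is kept around as an alias for btrfs check:)

    # read-only check first, with the filesystem unmounted
    sudo btrfs check /dev/sdX2
    # the destructive repair attempt (see the warning quoted below)
    sudo btrfs check --repair /dev/sdX2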
Initially I blamed BTRFS. I was mad. I had heard that Ubuntu 19.10 now ships experimental ZFS-on-root support out of the box. I was intrigued. I said "fuck it", made an installer key, wiped my disk, and ran with it.
Things went fine until... I was installing some big game and... things just stopped. But not "hang" stopped: I could still move windows around and keyboard input still appeared on the screen, which told me the CPU and GPU weren't the cause... fuck, it's something with the drive. It only cleared on reboot. I was afraid my data or filesystem had gotten corrupted... turns out the former did not happen and the latter is virtually impossible with ZFS (yessss), so I tried again. Again, under heavy usage things eventually just froze. Now that I could reproduce it, I kept trying different things, and hit on the idea of forcing heavy activity by manually triggering a scrub while watching zpool status in another window.
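(A minimal sketch of that reproduce loop, assuming a hypothetical pool named rpool:)

    # force heavy, checksum-verified I/O across the whole pool
    sudo zpool scrub rpool
    # in another terminal, poll the pool state every second
    watch -n 1 zpool status rpool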
And then, sonofabitch, I finally saw it. The pool that had just reported ONLINE during the scrub suddenly reported "SUSPENDED: One or more devices are faulted in response to IO failures." zpool clear did not clear it. But now I KNEW it was a drive (or enclosure, or cable, or mobo) issue. I immediately suspected the cable (as one does, after doing this sort of thing for a while), finally found another USB-A 3.1 to USB-C cable from Plugable that looked solidly built, and substituted it.
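(For reference, the recovery attempt and the retest, again assuming the hypothetical rpool name:)

    # try to clear the error state; in this case it did not work
    sudo zpool clear rpool
    # after swapping the cable, re-run the same stress test
    sudo zpool scrub rpool
    watch -n 1 zpool status rpool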
BOOM. ALL PROBLEMS WENT AWAY.
So OK, the enclosure guys (really cool enclosure, btw) sent me a bad cable with it. It happens. But...
SOLD... on ZFS. //
u/atemu12
The things btrfs check reported might've been benign warnings, false positives and/or fixed on mount.
--repair the btrfs drive.
RTFM
Warning
Do not use --repair unless you are advised to do so by a developer or an experienced user, and then only after having accepted that no fsck can successfully repair all types of filesystem corruption. E.g. some other software or hardware bugs can fatally damage a volume.
Not reading the manual results in:
Unrecoverable errors. The partition was basically hosed! //
u/ABC_AlwaysBeCoding (OP)
Fair enough. Thing is, --repair may be dangerous, but it seems nearly impossible to corrupt a ZFS filesystem with a repair tool, because there isn't even a way to run any sort of --repair short of a resilver/scrub. But yeah, I was inexperienced with both of these filesystems, as you can tell.
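(The asymmetry in repair tooling, sketched with the same hypothetical names:)

    # btrfs: an offline repair tool exists, and can make things worse
    sudo btrfs check --repair /dev/sdX2
    # ZFS: no offline fsck at all; the closest thing is an online scrub,
    # which only rewrites blocks that fail checksum verification
    sudo zpool scrub rpool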
Btrfs will try to continue regardless of error
This is absolutely the wrong strategy. Once corruption starts, it usually spreads. I think ZFS does the right thing here, even if it is inconvenient (and to be fair, if I had a mirror, it probably would have just continued as well... I think? Maybe I'll keep the bad cable to do resilience testing).
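(A sketch of the mirror setup that would give ZFS a second copy to heal from, with hypothetical pool/device names:)

    # create a two-way mirror from scratch
    sudo zpool create tank mirror /dev/sda /dev/sdb
    # or attach a second device to an existing single-disk pool
    sudo zpool attach tank /dev/sda /dev/sdb
    # a faulted device then degrades the pool instead of suspending it,
    # and scrub repairs bad blocks from the good copy
    zpool status -x tank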
BTRFS basically uses the Golang strategy (ignore errors, be squishy and nondeterministic, just keep going), while ZFS uses the Erlang/Elixir strategy (fail fast, fail hard, be brittle and deterministic, restart cleanly), and I am most definitely in the latter camp, based on 20+ years in the industry getting paid to program while doing sysadmin for fun or necessity.