Hard drive bit rot

ZFS has proven that a wide variety of chipset bugs, firmware bugs, actual mechanical failure, etc. are still present and actively corrupting our data. This applies to HDDs and flash alike. Worse, in most cases this corruption appears randomly over time, so your proposal to verify the written data immediately is useless.

Prior to the widespread deployment of this new generation of check-summing filesystems, I made the same faulty assumption you made: that data isn’t subject to bit rot and will reproduce what was written.

ZFS or BTRFS will disabuse you of these notions very quickly. (Be sure to turn on idle scrubbing).
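If you have never set that up, kicking off a scrub by hand is a one-liner on either filesystem; the pool name and mount point below are placeholders, not anything from the article:

    zpool scrub tank                  # ZFS: verify every block in the pool "tank"
    zpool status tank                 # shows scrub progress and any errors found
    btrfs scrub start /mnt/data       # btrfs: scrub the filesystem mounted at /mnt/data
    btrfs scrub status /mnt/data      # report results when it finishes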

It also appears that the per-bit error rate is roughly constant while storage densities keep increasing, so the number of bit errors per drive per month is increasing as well.
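For a rough sense of scale (these figures are mine, not the commenter's): consumer drives are commonly specified at an unrecoverable read error rate on the order of one per 10^14 bits, and 10^14 bits is only about 12.5 TB, so simply reading a large modern drive end to end already puts you in the neighborhood of one expected error.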

Microsoft needs to move ReFS down to consumer products ASAP. BTRFS needs to become the Linux default FS. Apple needs to get with the program already and adopt a modern filesystem.

And, from the same article …

First off, make sure you have a separate backup storage volume that doesn’t get touched by normal applications and which keeps history. Backup doesn’t protect you very much if accidental deletes or application bugs corrupt all your copies within one backup cycle. Use an appropriate backup tool to manage this, where appropriateness depends on your skill and willingness to tinker. You could use something as simple as an rsync --link-dest job, or rsync --inplace in combination with filesystem snapshots, or some backup suite that will store history in its own format.
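As a minimal sketch of the --link-dest approach (paths and the naming scheme are invented for illustration), each run creates a dated directory in which unchanged files are hard links into the previous run, so keeping history costs little space:

    #!/bin/sh
    # Hypothetical locations; adjust to your layout.
    SRC=/home
    DEST=/backup/home
    TODAY=$(date +%Y-%m-%d)

    # Files identical to the previous run become hard links rather than copies.
    rsync -a --delete --link-dest="$DEST/latest" "$SRC/" "$DEST/$TODAY/"

    # Point "latest" at the run we just finished.
    ln -sfn "$DEST/$TODAY" "$DEST/latest"

On the very first run there is no "latest" to link against; rsync warns and simply copies everything, and later runs store only what changed.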

For bit-rot protection of the stored backup data, make a backup volume using zfs or btrfs with at least two disks in a mirroring configuration (where the filesystem manages the duplicate data, not a separate raid layer). Set it to periodically scrub itself, perhaps weekly. It will validate checksums on individual file extents. If one copy of a file extent cannot be read successfully, it will rewrite it using the other valid mirror. This rewrite will allow the disk’s block remapping to relocate a bad block and keep going. The ability to validate checksums is the value add beyond normal raid, where the typical raid system only notices a problem when the disk starts reporting errors.
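A minimal version of that setup on ZFS might look like the following (pool name and device paths are placeholders; on btrfs, mkfs.btrfs -d raid1 -m raid1 over two devices plays the same role):

    # Create a two-disk mirror where ZFS itself manages the redundant copies.
    zpool create backup mirror \
        /dev/disk/by-id/ata-DISK_SERIAL_A \
        /dev/disk/by-id/ata-DISK_SERIAL_B

    # A dataset to hold the backup data.
    zfs create backup/archive

    # Read every block, verify checksums, repair from the good mirror copy.
    zpool scrub backup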

Monitor overall disk health and preemptively replace drives that start to show many errors, just as with regular raid. Some people consider the first block remapping event to be a failure sign, but you may replace a lot of disks this way. Others will wait to see if it starts having many such events within days or weeks before considering the disk bad.
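On Linux, smartmontools is the usual way to watch for this; attribute names vary by vendor, but something along these lines shows the counters people key off (device path is a placeholder):

    smartctl -H /dev/sda      # the drive's own overall health verdict
    smartctl -A /dev/sda | grep -Ei 'reallocated|pending|uncorrect'
    # Watch Reallocated_Sector_Ct, Current_Pending_Sector and
    # Offline_Uncorrectable; a count that keeps climbing is the warning sign.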

And this …

I’ve used ZFS under Linux for 5 years now for exactly this sort of thing. I picked ZFS because I was putting photos and other things on it for long-term storage that I wasn’t likely to be looking at actively, so I wouldn’t be able to detect bit rot until it was far too late. ZFS has detected and corrected numerous device corruption or unreadability issues over the years, via monthly “zpool scrub” operations.
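Scheduling that is straightforward; a cron entry along these lines (pool name, path and timing are just examples) runs a scrub early on the first of each month, and zpool status reports what the last scrub found and repaired:

    # e.g. /etc/cron.d/zfs-scrub (assumed location)
    0 3 1 * * root /usr/sbin/zpool scrub tank

    # Later, review the result:
    #   zpool status -v tank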

And, less valuable/relevant …

git annex [branchable.com] is an open source project that lets you distribute files around various media (including external HDs, Amazon S3, SSH-connected computers, etc.). It has an fsck command [branchable.com] for checking that your data still matches its checksums.

There’s a GUI [branchable.com] that makes it a lot like Dropbox: you just add files to a folder and they are synced.
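For reference, the basic git-annex flow looks roughly like this (repository path and file names are made up):

    cd /mnt/photos
    git init
    git annex init "photo archive"
    git annex add 2019/            # checksums the files and moves them into the annex
    git commit -m "add 2019 photos"
    git annex fsck                 # later: re-verify content against the recorded checksums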

And this, which is very useful …

I have a pair of 4TB disks that I keep cloned with rsync. Periodically I verify the contents using rsync -c, which forces rsync to do a full checksum on the files. A few times a year this will identify a file that is actually corrupt and I’ll manually recover it from the good copy.
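A dry-run variant of that check (mount points are placeholders) lists mismatching files without changing anything, so you can decide which copy to trust:

    # -a archive, -n dry run, -c compare by checksum, -i itemize differences
    rsync -anci /mnt/disk1/ /mnt/disk2/

Files whose checksums differ show up in the itemized output; you then restore the bad one from whichever copy is still good, as described above.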
