I decided to put some old drives into my new NAS and configure them as a raidz ZFS pool with just enough space to hold my data. Then, after copying the data over, I would pull the drives from the old NAS and put them into the new one, along with a couple of brand new drives, giving me far more space than I originally had.
I already knew the old, "spare" drives were unreliable: that's why they were sitting in a filing cabinet rather than being used. So I expected data corruption, but hoped that ZFS would be able to cope with it. And it did! Even though I was getting I/O errors on two of my old drives, ZFS was repairing the volume well enough that I was able to copy my data across and verify (using "rsync -c") that it hadn't been corrupted.
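For anyone who wants to see what a checksum-based verification amounts to, here is a minimal Python sketch of the idea behind "rsync -c": compare file contents by digest rather than trusting size and modification time. It is an illustration only, not what rsync does internally. SHA-256 is my own choice of hash, and the names file_digest and compare_trees are made up for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(src: Path, dst: Path) -> list[str]:
    """List files under src that are missing from dst or differ in content."""
    problems = []
    for src_file in sorted(p for p in src.rglob("*") if p.is_file()):
        rel = src_file.relative_to(src)
        dst_file = dst / rel
        if not dst_file.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif file_digest(src_file) != file_digest(dst_file):
            problems.append(f"differs: {rel.as_posix()}")
    return problems
```

An empty list from compare_trees(Path(old), Path(new)) means every file in the old tree has a byte-identical counterpart in the new one. Files that exist only in the destination are not reported.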
That's pretty excellent in itself. But that's not the end of it.
With some trepidation, I pulled the drives from my old NAS, replaced the faulty drives in my new NAS (without any data loss), and used those I had left over, together with the two new drives, to increase the size of my storage pools. I filled each ZFS pool with random data, and made sure a ZFS scrub completed with no errors. I had my new NAS working perfectly, so I pretty much forgot about it for a while, and started worrying about how Squeezebox Server (32-bit and built with an antique version of Perl) was going to run on my new NAS box.
A month or so later I noticed that a "zpool status" command was telling me that one or more of my drives had errored and that I should be thinking about replacing them. But all my ZFS volumes were "ONLINE", not "DEGRADED", and there were no reports in the dmesg. Strangely, the errors were on the two new drives, not on the old drives, and they were all in the "CKSUM" column, nothing in "READ" or "WRITE". There were 21 of them between the two 2TB drives, 17 on one and 4 on the other.
After much head scratching I figured out what was going on. And I felt very smug about using ZFS.
Although my two new drives had passed their SMART tests, there were errors on them so subtle they were managing to get past the drive's onboard error correction, and bad data was being passed to the CPU without any I/O error being signalled. In theory, this can't happen.
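But the corruption was being noticed, and corrected, by that clever feature of ZFS which automatically writes a checksum for each block in the pool, verifies it whenever the block is read, and rebuilds any block that fails the check from the pool's redundancy.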
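The mechanism is easier to see in a toy model. The Python sketch below is not how ZFS is implemented: it assumes a simple two-copy mirror rather than raidz parity, uses SHA-256 as a stand-in hash, and the ToyMirror class and its fields are invented for illustration. What it does share with ZFS is the principle: the checksum is kept apart from the data it covers, so a drive that silently returns wrong bytes gets caught, counted, and repaired from a good copy.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class ToyMirror:
    """Two copies of every block, plus a checksum stored apart from the data."""

    def __init__(self) -> None:
        self.copies = [{}, {}]      # two pretend drives: block id -> bytes
        self.checksums = {}         # block id -> expected checksum
        self.cksum_errors = [0, 0]  # per-drive count, like the CKSUM column

    def write(self, block_id: int, data: bytes) -> None:
        self.checksums[block_id] = checksum(data)
        for drive in self.copies:
            drive[block_id] = data

    def read(self, block_id: int) -> bytes:
        expected = self.checksums[block_id]
        good = None
        bad_drives = []
        for i, drive in enumerate(self.copies):
            data = drive[block_id]
            if checksum(data) == expected:
                good = data
            else:
                # The drive reported no I/O error, but the bytes are wrong.
                self.cksum_errors[i] += 1
                bad_drives.append(i)
        if good is None:
            raise IOError(f"block {block_id}: no valid copy left")
        for i in bad_drives:
            self.copies[i][block_id] = good  # self-heal from the good copy
        return good

    def scrub(self) -> None:
        """Read and verify every block, repairing any bad copies found."""
        for block_id in self.checksums:
            self.read(block_id)
```

Corrupting one copy by hand (for example pool.copies[1][0] = b"garbage") and then calling pool.read(0) returns the original data, bumps the error count for the second drive, and leaves both copies matching again. A later scrub finds nothing new to fix.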
I scrubbed the zpool. The errors went away, and I haven't seen any more of them. Two conclusions can be drawn from this: that brand new drives aren't necessarily better than old ones at storing your data, and that ZFS is utterly spiffing.
Author: Ian Pallfreeman: ip@xenopsyche.com