
Do not let the perfect be the enemy of the good

16 March 2011, 04:12 UTC

I agree with some of your conclusions, but disagree with many of your arguments.

To me, that really means that a system administrator cannot abdicate their responsibility - they have to be involved in the decision at some point. In the RAID6 case, it makes sense to me to have the scrub report the errors, and then root can decide to take the array offline at an appropriate time for repair. Depending on the system, this might be a minor umount+mdadm command, or a reboot-to-single-user-with-md_mod.start_ro=1 event with major downtime. But at least no applications will be holding the data in memory (or fiddling with the buffer cache via mmap) while the repair happens.
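For concreteness, the minor-downtime case might look something like this - a sketch only, using the existing md sync_action mechanism, with /dev/md1 and the /srv/data mount point made up for illustration:

    # stop the applications, then take the filesystem away from them
    umount /srv/data

    # run the repair pass while nothing can dirty the array
    echo repair > /sys/block/md1/md/sync_action

    # wait for the pass to finish before handing the data back
    while [ "$(cat /sys/block/md1/md/sync_action)" != "idle" ]; do sleep 60; done
    mount /dev/md1 /srv/data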

Yes, no form of RAID is an adequate replacement for a UPS and a well-configured nut. And with UPS prices below $100 for 600VA, it is cheaper to have a UPS than to have RAID. It would be nice if the kernel RAID implementation could recognise this situation, much as ext3 does, and report it to the system administrator: Warning: raid array md1 crashed - automatic repair might make things either better or worse.

Statistically, the smart approach will do better with inconsistent P+Q. As you said, the chance of aligned-double-error is rather small. For a six-disk RAID6, the smart approach should get this wrong significantly less than half the time. The simple approach should get this wrong about two-thirds of the time. Some people might think that getting it wrong more often than getting it right is worse than useless.
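For what it is worth, my rough arithmetic behind those two figures, assuming exactly one corrupt block per inconsistent stripe, equally likely to sit on any of the six devices (four data blocks plus P and Q):

    P(simple approach gets it wrong) = P(the corrupt block is a data block,
                                         so P+Q get recomputed from bad data)
                                     = 4/6 ≈ 2/3

    P(smart approach gets it wrong) <= P(a second corrupt block lands in the
                                         same stripe), which is far below 1/2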

Disks generally come with a quoted uncorrectable bit-error rate (BER). With large disks these days, multiplying that rate by the number of bits on the disk gives an expected error count that is a sizeable fraction of one per full pass. This does not mean that an uncorrected error will occur (the actual probability involves an exponential rather than the simple product I just used), but it does mean that an uncorrected error becomes more likely than not once the disk has been fully read and written a few times.
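To put a number on it - using the 1-in-10^14-bits figure typically quoted for consumer drives and a 2 TB disk, both assumptions on my part:

    bits on a 2 TB disk           = 2e12 bytes x 8 = 1.6e13 bits
    expected errors per full read = 1.6e13 x 1e-14 = 0.16
    P(at least one error)         = 1 - (1 - 1e-14)^(1.6e13)
                                  ≈ 1 - e^(-0.16) ≈ 15%

so a handful of complete read/write passes is enough to push the odds of at least one uncorrected error past even.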

But these checks are already rolled into the BER, and we still have errors. If the numbers reported in http://www.zdnet.com/blog/storage/data-corruption-is-worse-than-you-know/191 are accurate (they claim to be based upon CERN research, but I have not personally followed this up), then having a method to recover from such errors becomes pressing as soon as you reach terabyte-scale storage - which is now the default for most desktops!

I think that root should be given the chance to choose between them. Option 3 is what will be true for most people most of the time, assuming that they have a UPS and are not a bleeding-edge kernel hacker who likes to crash their system for educational purposes.

It would be nice if this could be done with zero downtime, but that might well be impractical. If the check routine is run on a regular basis (the Debian mdadm package schedules it from cron - monthly by default, if I recall correctly), then administrators should get enough warning of creeping single-bit or single-sector corruption (which the CERN statistics above found to be roughly equally likely) that they can run a recalculate_data during a scheduled maintenance window (for servers) or on the next reboot (for home machines).
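For reference, the plumbing for that regular check already lives in sysfs; a minimal sketch of triggering one by hand on /dev/md1 (device name picked purely for illustration):

    # report-only pass: counts inconsistent stripes without changing anything
    echo check > /sys/block/md1/md/sync_action
    cat /proc/mdstat                        # watch progress
    cat /sys/block/md1/md/mismatch_cnt      # non-zero means the scrub found inconsistencies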

It might also be nice to rename the existing repair option to recalculate_parity, since in many cases it will actually hide creeping corruption rather than repair it. But that is an argument for another time and thread.



