I agree with some of your conclusions, but disagree with many of your arguments.
- I acknowledge that the extra complexity of a "smart" recovery can sometimes be justified. However it should not be automatic, nor a default
To me, that really means that a system administrator cannot abdicate their responsibility - they have to be involved in the decision at some point. In the RAID6 case, it makes sense to me to have the scrub report the errors, and then root can decide to take the array off-line at an appropriate time for repair. Depending on the system, this might be a minor umount+mdadm command, or a reboot-to-single-user-and-md_mod.start_ro=1 event with major downtime. But at least no applications will be holding the data in memory (or fiddling with the buffer cache via mmap) while the repair happens.
- If the machine crashes or loses power while data is being written to the array it is not possible to know
Yes, no form of RAID is an adequate replacement for a UPS and a well-configured NUT (Network UPS Tools). And with UPS prices below $100 for 600VA, it is cheaper to have a UPS than to have RAID. It would be nice if the kernel RAID implementation could recognise this situation, much like ext3 recognises it, and report it to the system administrator: Warning: raid array md1 was crashed - automatic repair might make things either better or worse.
- RAID6 with inconsistent P and Q ... there is a small possibility that the identified block isn't wrong ... The probability of this is rather small, but it is non-zero. This suggests that in the "unclean array" case, there is no reason to expect that the smart approach will do any better than the simple approach.
Statistically, the smart approach will do better with inconsistent P+Q. As you said, the chance of aligned-double-error is rather small. For a six-disk RAID6, the smart approach should get this wrong significantly less than half the time. The simple approach should get this wrong about two-thirds of the time. Some people might think that getting it wrong more often than getting it right is worse than useless.
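The two-thirds figure can be checked with a few lines of arithmetic. This is only a sketch under an assumption of mine: exactly one block in the stripe is corrupt, and it is equally likely to be on any disk. The simple approach recomputes P and Q from the data blocks, so it is only right when the bad block was P or Q. The function name is made up for illustration.

```python
from fractions import Fraction

def simple_repair_wrong_fraction(total_disks):
    """Fraction of single-block errors the simple approach gets wrong.

    Assumption: one corrupt block per stripe, uniformly distributed
    over all disks. RAID6 has two parity blocks (P and Q) per stripe;
    the rest are data. Recomputing parity from data "repairs" the
    stripe correctly only if the corrupt block was P or Q.
    """
    data_disks = total_disks - 2
    return Fraction(data_disks, total_disks)

# Six-disk RAID6: four data blocks, P and Q.
print(simple_repair_wrong_fraction(6))  # 2/3
```

With more disks the picture gets worse for the simple approach, since the share of data blocks in each stripe grows.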
- The probability that the data changed while it was on disk should be incredibly small as modern storage devices have sophisticated checks which allow such corruptions to be easily detected.
Disks generally have an expected Uncorrected Bit-Error-Rate. With large disks these days, the BER times the size of the disk gives values greater than 1. This does not mean that an uncorrected error will occur (the probability has a few more exponentiations than I just used), but it does mean that an uncorrected error is more likely than not, once the disk has been fully read and written one or more times.
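To make the "few more exponentiations" concrete, here is a rough sketch of both numbers: the naive product (expected error count) and the probability of at least one uncorrected error, 1 - (1 - BER)^bits. It assumes bit errors are independent, which real disks do not guarantee. The BER of 1e-14 and the 2 TB size are placeholders I picked for illustration, not vendor data.

```python
import math

def expected_errors(ber, size_bytes, passes=1):
    """Naive product: BER times the number of bits transferred."""
    return ber * size_bytes * 8 * passes

def prob_at_least_one(ber, size_bytes, passes=1):
    """P(at least one uncorrected error) = 1 - (1 - ber)**bits.

    Assumes independent bit errors. log1p/expm1 keep the result
    accurate when ber is tiny and bits is huge.
    """
    bits = size_bytes * 8 * passes
    return -math.expm1(bits * math.log1p(-ber))

# Placeholder figures: BER 1e-14, 2 TB disk.
for passes in (1, 10):
    print(passes,
          expected_errors(1e-14, 2e12, passes),
          prob_at_least_one(1e-14, 2e12, passes))
```

With these placeholder figures a single full pass gives an expected count of 0.16 and a probability of roughly 0.15; after ten passes the expected count is 1.6 and the probability is roughly 0.80. Once the product passes about 0.7, an error is more likely than not.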
- The probability that the data changed while it was on disk should be incredibly small as modern storage devices have sophisticated checks which allow such corruptions to be easily detected.
But these checks are already rolled into the BER, and we still have errors. If the numbers reported in http://www.zdnet.com/blog/storage/data-corruption-is-worse-than-you-know/191 are accurate (they claim to be based upon CERN research, but I have not personally followed this up), then having a method to recover from such errors becomes pressing as soon as you get to terabyte-scale storage - which is now the default for most desktops!
- 3/ assume most blocks are correct and try to find a single update that can fix the inconsistency (the smart approach).
Each of these can be 'best' in different circumstances. Without any knowledge of a particular situation it is not possible to choose between them.
I think that root should be given the chance to choose between them. Option 3 is what will be true for most of the people most of the time, assuming that they have a UPS and they are not a bleeding-edge kernel hacker who likes to crash their system for educational purposes.
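For anyone who has not seen how option 3 can work, here is a minimal sketch of single-error location using the standard RAID6 P/Q syndromes over GF(2^8) (polynomial 0x11d, generator 2). It works on one byte position across the stripe; a real repair would need every mismatching byte in the block to point at the same disk before trusting the answer. This is my illustration, not the md code, and the function names are made up.

```python
# GF(2^8) exp/log tables for polynomial 0x11d with generator g = 2.
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]

def gf_mul(a, b):
    """Multiply two elements of GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def syndromes(data):
    """Return (P, Q) for one byte position across the data disks.

    P is the XOR of the data bytes; Q is the XOR of g**i * D_i.
    """
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(EXP[i], d)
    return p, q

def locate(data, p_stored, q_stored):
    """Guess which single block is wrong, assuming at most one is.

    Returns "clean", "P", "Q", a data disk index, or "multiple"
    when no single-block explanation fits.
    """
    p_calc, q_calc = syndromes(data)
    dp = p_calc ^ p_stored
    dq = q_calc ^ q_stored
    if dp == 0 and dq == 0:
        return "clean"
    if dq == 0:
        return "P"      # only P disagrees: P itself is suspect
    if dp == 0:
        return "Q"      # only Q disagrees: Q itself is suspect
    # An error e on data disk z gives dp = e and dq = g**z * e,
    # so z = log(dq) - log(dp) (mod 255).
    z = (LOG[dq] - LOG[dp]) % 255
    return z if z < len(data) else "multiple"

data = [0x11, 0x22, 0x33, 0x44]
p, q = syndromes(data)
bad = list(data)
bad[2] ^= 0x5A                # corrupt data disk 2
print(locate(bad, p, q))      # 2
```

Once the disk index z is known, XORing dp into that data byte restores it. The "multiple" result is the case where root really does need to decide.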
It would be nice if this could be done with zero downtime, but this might well be impractical. If the check routine is done on a regular basis (I think Debian runs it once a month by default), then administrators may get enough warning of creeping single-bit or single-sector corruption (which were found to be equally likely in the CERN stats above) that they can do a recalculate_data during a scheduled maintenance window (for servers) or on their next reboot (for home machines).
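The warning part is already scriptable. After a check pass, md reports the number of mismatched sectors in the sysfs file `md/mismatch_cnt` under the array's block device directory. Here is a small sketch of a monitor hook that reads it. The function names and the `sysfs_root` parameter are mine, the latter there only so the code can be tried against a fake directory tree.

```python
from pathlib import Path

def read_mismatch_cnt(md_name, sysfs_root="/sys/block"):
    """Return the mismatch count md recorded for the last check/repair."""
    path = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())

def needs_attention(md_name, sysfs_root="/sys/block"):
    """True if the last scrub found any inconsistent sectors."""
    return read_mismatch_cnt(md_name, sysfs_root) > 0

if __name__ == "__main__":
    # "md1" is only an example array name.
    try:
        if needs_attention("md1"):
            print("md1: scrub found mismatches - schedule a repair window")
    except FileNotFoundError:
        print("md1: no such array on this machine")
```

A cron job could run something like this after each scheduled check and mail root, leaving the actual repair decision to a human.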
It might also be nice to rename the existing repair option to recalculate_parity, since in many cases it will actually hide creeping corruption rather than repair it. But that is an argument for another time and thread.
