Thank you for your thoughtful and careful comments.
There are a couple that I would like to respond to.
Firstly, you say: "Statistically, the smart approach will do better with inconsistent P+Q."
You can only say something is statistically true if you have statistics. I suspect you don't. If you had a credible threat model you could simulate it and generate statistics. However, my main point here is that neither you nor I have a credible threat model.
Given a credible model, something could be done. Without that, the best we can do is report anything unusual that is found. I am completely supportive of informative error reports and of giving the sysadmin control to do whatever they want. I am not supportive of having any automatic action taken without a model against which I can justify that action.
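To make concrete what a P+Q-based "smart" repair has to infer, here is a minimal Python sketch. It assumes the usual RAID6 construction (P is the XOR of the data blocks, Q is a syndrome over GF(2^8) with polynomial 0x11d and generator 2). The function names are made up for illustration; this is not the md code.

```python
# GF(2^8) log/antilog tables, polynomial 0x11d, generator 2 (assumed).
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def syndromes(data_blocks):
    """Compute P (plain XOR) and Q (XOR of g^i * D_i) for equal-sized blocks."""
    n = len(data_blocks[0])
    p, q = bytearray(n), bytearray(n)
    for i, blk in enumerate(data_blocks):
        coeff = EXP[i]  # g^i
        for j, byte in enumerate(blk):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)


def locate_single_error(data_blocks, stored_p, stored_q):
    """Guess which block is wrong, ASSUMING at most one block is wrong.

    Returns None (consistent), "P", "Q", a data block index, or "unknown".
    """
    p, q = syndromes(data_blocks)
    dp = bytes(a ^ b for a, b in zip(p, stored_p))
    dq = bytes(a ^ b for a, b in zip(q, stored_q))
    if not any(dp) and not any(dq):
        return None
    if not any(dp):
        return "Q"
    if not any(dq):
        return "P"
    candidates = set()
    for a, b in zip(dp, dq):
        if a == 0 and b == 0:
            continue
        if a == 0 or b == 0:
            return "unknown"
        # For a single bad data block z: dq = g^z * dp, so z = log(dq) - log(dp).
        candidates.add((LOG[b] - LOG[a]) % 255)
    if len(candidates) == 1:
        z = candidates.pop()
        if z < len(data_blocks):
            return z
    return "unknown"
```

Note what happens when the single-error assumption is false: if blocks 0 and 1 are both damaged by the same bit pattern, the P mismatch cancels out and this logic points at Q. The inference is only as good as the assumption behind it, which is why I want a model before any automatic action.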
Secondly, you refer to BER and the Znet column.
This does provide statistics - quite interesting statistics. But those statistics do not provide a model against which to justify an automatic repair action. All they tell me is that it is good to keep checksums and backups, and I completely agree, but that is not a RAID-level issue.
"BER" is the rate of getting bit errors off the surface. Some of them are correctable. The vast majority of the rest are detectable. i.e. if you get an uncorrectable bit error, then you will know about it because the drive will report an error. And that is exactly the sort of situation that RAID already handles quite well - that is what it was designed for.
In the context of smart rebuild, we are only interested in bit errors that are undetectable, i.e. there were enough error bits in the one sector (or disk block) that the ECC couldn't even detect the error. While that is possible, it is exceptionally rare. It is very likely that you will discard a drive because it returns too many read errors long before you get even one undetectable error.
The errors discovered by the CERN study were almost certainly not disk surface errors.
The 512-byte sector errors were probably because the data was written to the wrong place, or read from the wrong place (and if you don't know which, there is nothing you can do to 'fix' it). The single-bit errors were probably introduced during one of the many copies between the original and the media, or between the media and the ALU which performed the check. There must have been some bus or storage which didn't provide proper ECC. I believe parallel SCSI doesn't provide any checks on address information and was susceptible to this. I think SATA includes a CRC on each packet sent over the wire, so it is less susceptible.
In either case, there is no certainty that anything that RAID6 could do would have helped. Maybe it would help sometimes, but not often.
For the errors reported by CERN, an application-level checksum/CRC/hash is the best approach - that and replacing the 3% of nodes on which they detected problems.
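As a rough illustration of what I mean by an application-level check, here is a small Python sketch. The helper names and the choice of SHA-256 are my assumptions, not anything taken from the CERN study: the application records a digest when it writes the data and compares it when it reads the data back.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path, expected_hex):
    """True if the file still matches the digest recorded at write time."""
    return sha256_of(path) == expected_hex
```

A mismatch tells you the data changed somewhere between the application and the media and back, whichever bus or buffer was responsible. It does not tell you how to repair it; that is what the backups are for.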
