I frequently see comments, particularly on the linux-raid mailing list, to the effect that md should be more clever when recovering from an inconsistent stripe in an array.
In particular, it is suggested that for a RAID1 with more than 2 devices a vote should be held: if one version of the content occurs more often than the others (e.g. two devices have the same content and the third differs) then the majority should rule, and the most common content should be copied over the less common content.
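To make the voting idea concrete, here is a minimal sketch in Python. It is illustrative only - md works on pages of data, not Python byte strings, and `majority_vote` is a name of my own invention, not anything in md:

```python
from collections import Counter

def majority_vote(copies):
    """Return the most common block content among the mirror copies,
    or None when there is no strict majority (e.g. a 2-way tie)."""
    counts = Counter(copies).most_common()
    if len(counts) == 1:
        return counts[0][0]          # all copies agree
    if counts[0][1] > counts[1][1]:  # a strict winner exists
        return counts[0][0]
    return None                      # tie: the vote is inconclusive

# 3-way mirror where one leg differs: the vote picks the pair.
print(majority_vote([b"new", b"new", b"old"]))  # b'new'
# 2-way mirror (or an even split): no majority exists.
print(majority_vote([b"new", b"old"]))          # None
```

Note that with only 2 devices, or with an even split, the vote tells us nothing - which already hints at the limits of the approach.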
Similarly with RAID6: if the P and Q blocks don't match the data blocks, it may be possible to find exactly one data block which can be corrected so as to make both P and Q match - so we could change just one data block instead of two "parity" blocks to achieve consistency.
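The single-block identification for RAID6 rests on syndrome arithmetic in GF(2^8): if exactly one data block D_z is wrong, the P syndrome equals the error and the Q syndrome equals g^z times that error, so their quotient reveals z. The following sketch uses one byte per block for brevity; it is an illustration of the idea, not md's actual code:

```python
# GF(2^8) arithmetic over the polynomial 0x11d, the field Linux RAID6 uses.
def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return r

# exp/log tables for the generator g = 2
EXP = [0] * 255
LOG = {}
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x = gf_mul(x, 2)

def find_bad_data_block(data, p, q):
    """data: one byte per data block (for brevity); p, q: stored parity.
    Return the index z of the single data block which, if rewritten,
    would reconcile both P and Q, or None if no such block exists."""
    p_syn, q_syn = p, q
    for i, d in enumerate(data):
        p_syn ^= d                  # P syndrome: P xor sum(D_i)
        q_syn ^= gf_mul(d, EXP[i])  # Q syndrome: Q xor sum(g^i * D_i)
    if p_syn == 0 and q_syn == 0:
        return None                 # stripe is consistent
    if p_syn == 0 or q_syn == 0:
        return None                 # a single bad data block affects both
    z = (LOG[q_syn] - LOG[p_syn]) % 255   # q_syn / p_syn = g^z
    return z if z < len(data) else None

# Build a consistent 4-block stripe, then corrupt one data block.
data = [0x12, 0x34, 0x56, 0x78]
p = q = 0
for i, d in enumerate(data):
    p ^= d
    q ^= gf_mul(d, EXP[i])
data[2] ^= 0x01
print(find_bad_data_block(data, p, q))  # 2
```

If the computed z falls outside the range of actual data blocks, no single data-block change can explain the mismatch.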
I will call this the "smart recovery" approach.
The assertion is that smart recovery will not only make the stripe consistent, but will also make it "correct".
I do not agree with these comments. It is my position that if there is an inconsistency that needs to be corrected then it should be corrected in a simple, predictable way, and that any extra complexity is unjustified. For RAID1, that means copying the first block over all the others. For RAID6, that means calculating new P and Q blocks based on the data. This is the "simple recovery" approach.
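For contrast, the simple approach is almost trivially small. A sketch, again one byte per block, with helper names of my own choosing:

```python
def simple_recover_raid1(copies):
    # "Simple recovery" for RAID1: the first device is authoritative,
    # so its content overwrites every other leg.
    return [copies[0]] * len(copies)

def recompute_p(data):
    # For parity RAID the analogue is: trust the data and recompute
    # parity.  Shown for the XOR (P) block; RAID6's Q block is the
    # corresponding sum weighted by powers of a generator in GF(2^8).
    p = 0
    for d in data:
        p ^= d
    return p

print(simple_recover_raid1([b"X", b"Y", b"X"]))  # [b'X', b'X', b'X']
print(hex(recompute_p([0x12, 0x34, 0x56])))      # 0x70
```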
This note is an attempt to justify this position, both to myself and to you, my loyal reader.
There are two situations where md might discover that a stripe is inconsistent and so need to fix it.
The first is after an unclean shutdown. If the machine crashes or loses power while data is being written to the array it is not possible to know, on restart, which devices received new data and which did not. If a full stripe was being written then none, all, or any other subset of the devices may have been written while the others still hold old data. Further, on different stripes the arrangement of "old" and "new" could be different.
Thus on inspecting a RAID1 stripe and finding that not all blocks are the same, a vote will only report which block is most common, not which was newest or oldest. Similarly a RAID6 with inconsistent P and Q could well not be able to identify a single block which is "wrong", and even if it could, there is a small possibility that the identified block isn't wrong, but the other blocks are all inconsistent in such a way as to accidentally point to it. The probability of this is rather small, but it is non-zero.
This suggests that in the "unclean array" case, there is no reason to expect that the smart approach will do any better than the simple approach.
There is another issue here. After an unclean shutdown the filesystem should expect the data to possibly be inconsistent - some old and some new. It will normally take action to resolve any inconsistencies based on what it finds.
While the filesystem can expect blocks to have changed since before a crash, it will not expect blocks to change again while the array is running normally. This means that if md changes anything during resync after the unclean shutdown, that change should not be visible to the filesystem.
So if md were to use a majority vote while resyncing a 3 device RAID1, it would have to do the same voting while reading any non-resynced stripe. Similarly it would need to recover any RAID6 stripe before allowing a read from it.
This would be significantly more complex than just reading from the first device of an unclean RAID1 or just reading the data blocks from an unclean RAID6.
So for the unclean array case it is clearly a win to use the simple approach -- less complexity for the same net result.
Now for the other of the two situations. This is where the array is thought to be clean and is being scanned by a regular "scrubbing" process. Clearly the simplest way to deal with inconsistencies here is to use the same simple and predictable algorithm given above. The question is: is it possible to reliably do better, or at least to sometimes do better and never do worse?
When scrubbing an array that is thought to be clean we should not normally find any inconsistent stripes. We should certainly expect occasionally to find blocks that cannot be read, as the media surface storing those blocks could have degraded. Such sectors will be reported by the device as media read errors, and finding them is a big part of why we run the scrub in the first place. Assuming the array is not degraded we can generate the missing data from the other devices and write it back. There is no question about how to generate this data: it is defined by the RAID level being implemented.
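For example, regenerating a block that reports a read error under the P (XOR) parity is just the defining equation run backwards. One byte per block again, purely illustrative:

```python
def rebuild_from_p(data, p, missing):
    # XOR the parity with every surviving data block; what remains
    # is exactly the content of the unreadable block.
    d = p
    for i, blk in enumerate(data):
        if i != missing:
            d ^= blk
    return d

data = [0x12, 0x34, 0x56]
p = data[0] ^ data[1] ^ data[2]          # 0x70
print(hex(rebuild_from_p(data, p, 1)))   # 0x34
```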
But suppose that a scrub of a clean array finds a stripe that is inexplicably inconsistent. What is the best thing to do in that case?
There are a couple of issues we need to consider before we can answer this. The first is to ask "How can this possibly happen?" and of course there are lots of ways, all hopefully very unlikely.
The probability that the data changed while it was on disk should be incredibly small as modern storage devices have sophisticated checks which allow such corruptions to be easily detected. That leaves two possibilities. Either the wrong data was written to the media, or the data that we received from the read is different from what was on the media. i.e. a faulty write or a faulty read.
In each case it could be that the actual data was wrong, or it could be that the address as processed by the storage device was wrong - bad data or bad address.
The corruption (whether of data or address, whether reading or writing) could have happened in the device, on the bus leading to the device, in the controller which manages the device, on the bus between controller and main memory, or in main memory itself. It could be due to a hardware glitch (transient error), a hardware fault (repeatable error) or a software malfunction.
Or, finally, it could be that the array was not really 'clean' - an error, either in software or in system administration - might have resulted in the 'clean' flag being set on an array which quite reasonably had inconsistencies in it. As a special case of this, there is an artifact of the Linux md/raid1 implementation which can sometimes allow different data to be written to the different legs of a mirror, particularly when the array is being used as a swap device. If this happens and is not promptly corrected (before the array is marked clean) then the data will never be read back, so it doesn't matter what data is where. This only causes confusion due to excessive warnings about mismatches; it does not imply a lack of reliability.
There is one last variable that should be considered, and that is the possibility that the "bad" data has already been read by a filesystem or application software in such a way that reverting the "bad" data back to "good" data would actually cause more problems, not fewer.
With this taxonomy of possible problems we can start suggesting possible approaches.
If a stripe is inconsistent due to a faulty read, then clearly there is nothing to be gained by writing anything back -- that cannot fix anything and could conceivably make things worse.
If the stripe is inconsistent due to a faulty write, then we cannot in general trust the hardware to do the correct thing with any future write that we submit. So trying to correct the inconsistency automatically would not be wise.
If the stripe is inconsistent due to undetected change on the media, then applying the "smart" approach which - where possible - only writes to a single device in each stripe would be sensible.
If the stripe is inconsistent because the array is not really clean but was incorrectly marked so, then the simple and predictable approach is probably preferred as discussed above.
If the admin of the machine discovers that there is a hardware error and rectifies it then it might be appropriate to correct any stripes that are subsequently found to be inconsistent. If this hardware error only affected one device, that device should be failed and re-initialised. If however the hardware error turned out to affect various devices randomly, but with a sufficiently low probability that it is unlikely for one stripe to have received multiple errors; and if there is no evidence that bad data might have been read and processed already, then it would make sense to use the "smart" recovery method where possible.
To summarise the above, when automatically scrubbing an array, the best thing to do when an inconsistency is found is to report it, but not to attempt to correct it. Any correction should only be at the explicit request of the admin (or other expert system).
The options that could make sense for correction are:
1/ assume a particular device is most likely to be at fault and regenerate the data on that device
2/ assume the parity blocks are most likely to be at fault and regenerate them. This does not have a clear analogue for RAID1, though the closest is "assume the first block is correct and copy it".
3/ assume most blocks are correct and try to find a single update that can fix the inconsistency (the smart approach).
Each of these can be 'best' in different circumstances. Without any knowledge of a particular situation it is not possible to choose between them. Option 2 has a slight advantage for RAID6 in that it will not cause a bad data block - which might have been read - to change a second time. It also has the very real advantage that it is simple and has to be implemented anyway.
Option 1 is trivial for an admin to choose simply by failing that device. This then removes the possibility for a stripe to be inconsistent and this whole discussion no longer applies.
The question then remains, is option 3 something that is worth implementing? Will the conditions that make it the preferred option happen often enough that it could be used, and would a typical admin be able to detect such a condition and thus choose to use it? These are not easy to answer.
Related is the question of how to implement it. There are a few options.
The first is to integrate the smart block comparison into the already existing resync code with an option to enable either "simple" or "smart". However, as mentioned above, this should really be combined with the ability to read and validate whole stripes before serving a read or write request against them. Otherwise there is a very real chance that a data block will appear to change and then change back again.
If such a scheme were implemented (which could be valuable despite its high cost), then including "smart" recovery as an option would make sense.
The next option is to run a normal scrub and log all the stripes which were inconsistent. This list could then be processed in user-space by some program which analyses all the inconsistencies to determine whether it is always just one block per stripe, and whether it is always the same device etc. It could then - possibly with sysadmin confirmation - request the RAID code to reconstruct individual blocks on individual devices as required.
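Such an analysis program could be quite small. A sketch, assuming the scrub log has already been parsed into (stripe, suspect-device) pairs - that log format is an assumption of mine, not an existing md interface:

```python
from collections import Counter

def analyse_mismatch_log(entries):
    """entries: (stripe_number, suspect_device) pairs, assumed to have
    been parsed from a scrub log (a hypothetical format).  Report
    whether every inconsistency implicates the same single device,
    which would justify per-block 'smart' reconstruction there."""
    devices = Counter(dev for _, dev in entries)
    if len(devices) == 1:
        (dev, n), = devices.items()
        return f"all {n} inconsistencies implicate {dev}"
    return f"inconsistencies spread across {len(devices)} devices; no single suspect"

log = [(1024, "sdb"), (2048, "sdb"), (909312, "sdb")]
print(analyse_mismatch_log(log))  # all 3 inconsistencies implicate sdb
```

A real tool would of course want sysadmin confirmation before requesting any reconstruction, as argued above.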
Finally, it could be that once such errors are detected, it is best to stop using the array altogether until things are sorted out. In that case the "smart" recovery could all be done by a user-space program without bothering the in-kernel RAID code.
I think my preferred approach would be the second option. It seems to strike the right balance between being easy to work with, reasonably safe, and reasonably simple.
This would require that the scrub process provides a reliable log of inconsistent stripes, and that the RAID code has the option to mark individual blocks as needing recovery. The former is something that we really want anyway, and the latter should fit well with the planned addition of a per-device bad-block-list. It could even be done entirely in user-space by suspending IO to the affected stripe (md already supports that), making the required update, then resuming IO.
Doing most of the work in userspace would allow the affected stripe to be mapped through the filesystem back to a particular file so higher level considerations could be applied to choosing the correct content.
So to conclude the conclusion, I acknowledge that the extra complexity of a "smart" recovery can sometimes be justified. However it should not be automatic, nor a default. It should only be used when an informed sysadmin (or other expert system) has made an explicit choice.