Smart or simple RAID recovery??

11 February 2010, 05:03 UTC

I frequently see comments, particularly on the linux-raid mailing list, to the effect that md should be more clever when recovering from an inconsistent stripe in an array.

In particular, it is suggested that for a RAID1 with more than 2 devices, a vote should be held and, if one version of the content occurs more often than the others (e.g. two devices have the same content and the third is different), then the majority should rule and the most common content be copied over the less common content.
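For concreteness, such a vote might look like the following rough Python sketch (purely illustrative; the function name is invented and nothing like it exists in md):

    from collections import Counter

    def majority_vote(legs):
        """Given the same logical block as read from each leg of a RAID1
        mirror, return the content held by the most legs, or None if
        there is no clear winner."""
        ranked = Counter(legs).most_common()
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            return ranked[0][0]        # unanimous, or a strict plurality
        return None                    # a tie: the vote is inconclusive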

Similarly with RAID6, if the P and Q blocks don't match the data blocks, it may be possible to find exactly one data block which can be corrected so as to make both P and Q match - so we could change just one data block, instead of two "parity" blocks, to achieve consistency.
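The arithmetic behind that search can also be sketched. The following Python is purely illustrative (it uses the conventional GF(2^8) generator 0x02 and polynomial 0x11d; the function names are invented and none of this code is from md):

    # Illustrative GF(2^8) arithmetic with generator 0x02 and polynomial
    # 0x11d, the conventional choices for RAID6.
    def _build_tables():
        exp, log = [0] * 512, [0] * 256
        x = 1
        for i in range(255):
            exp[i] = x
            log[x] = i
            x <<= 1
            if x & 0x100:
                x ^= 0x11d
        for i in range(255, 512):
            exp[i] = exp[i - 255]      # wrap so gf_mul never needs a modulo
        return exp, log

    EXP, LOG = _build_tables()

    def gf_mul(a, b):
        return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

    def pq(data):
        """P and Q syndromes of one byte column: P = xor(D_i), Q = xor(g^i * D_i)."""
        p = q = 0
        for i, d in enumerate(data):
            p ^= d
            q ^= gf_mul(EXP[i], d)
        return p, q

    def locate_single_bad_data_byte(data, p_disk, q_disk):
        """If rewriting exactly one data byte would make both P and Q match,
        return (index, corrected value); otherwise return None (the column is
        consistent, or P/Q themselves may be the bad blocks)."""
        p_calc, q_calc = pq(data)
        dp, dq = p_calc ^ p_disk, q_calc ^ q_disk
        if dp == 0 and dq == 0:
            return None                        # consistent
        if dp == 0 or dq == 0:
            return None                        # only P or only Q mismatches
        z = (LOG[dq] - LOG[dp]) % 255          # solve g^z = dq / dp
        if z >= len(data):
            return None                        # points outside the data blocks
        return z, data[z] ^ dp                 # dp is the error that was xor-ed in

A real repair would run this for every byte offset in the stripe and only act if every mismatching offset implicated the same device.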

I will call this approach the "Smart recovery" approach.

The assertion is that smart recovery will not only make the stripe consistent, but will also make it "correct".

I do not agree with these comments. It is my position that if there is an inconsistency that needs to be corrected then it should be corrected in a simple predictable way and that any extra complexity is unjustified. For RAID1, that means copying the first block over all the others. For RAID6, that means calculating new P and Q blocks based on the data. This is the "simple recovery" approach.
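For contrast, the simple approach needs almost no code at all. An equally illustrative sketch, reusing the pq() helper from the RAID6 sketch above:

    def simple_repair_raid1(legs):
        """Simple RAID1 repair: the first leg's content overwrites all the others."""
        return [legs[0] for _ in legs]

    def simple_repair_raid6(data):
        """Simple RAID6 repair: recompute P and Q (per byte column) from the
        data blocks exactly as they now stand."""
        return pq(data)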

This note is an attempt to justify this position, both to myself and to you, my loyal reader.

# Resyncing an unclean array.
# Scrubbing a clean array.
# Repair Options
# Conclusion
# Comments...

Resyncing an unclean array.

There are two situations where md might discover that a stripe is inconsistent and so need to fix it.

The first is after an unclean shutdown. If the machine crashes or loses power while data is being written to the array, it is not possible to know, on restart, which devices received new data and which did not. If a full stripe was being written, then none, all, or any subset in between of the devices may have been written while the others still hold old data. Further, on different stripes the arrangement of "old" and "new" could be different.

Thus, on inspecting a RAID1 stripe and finding that not all blocks are the same, a vote will only report which block is most common, not which was newest or oldest. Similarly, a RAID6 with inconsistent P and Q could well not be able to identify a single block which is "wrong", and even if it could, there is a small possibility that the identified block isn't wrong but that the other blocks are all inconsistent in such a way as to accidentally point to it. The probability of this is rather small, but it is non-zero.

This suggests that in the "unclean array" case, there is no reason to expect that the smart approach will do any better than the simple approach.

There is another issue here. After an unclean shutdown the filesystem should expect the data to possibly be inconsistent - some old and some new. It will normally take action to resolve any inconsistencies based on what it finds.

While the filesystem can expect blocks to have changed since before a crash, it will not expect blocks to change again while the array is running normally. This means that if md changes anything during resync after the unclean shutdown, that change should not be visible to the filesystem.

So if md were to use a majority vote while resyncing a 3-device RAID1, it would have to do the same voting while reading any non-resynced stripe. Similarly it would need to recover any RAID6 stripe before allowing a read from it.

This would be significantly more complex than just reading from the first device of an unclean RAID1 or just reading the data blocks from an unclean RAID6.

So for the unclean array case it is clearly a win to use the simple approach -- less complexity for the same net result.

Scrubbing a clean array.

Now for the other of the two situations. This is where the array is thought to be clean and is being scanned by a regular "scrubbing" process. Clearly the simplest way to deal with inconsistencies here is to use the same simple and predictable algorithm given above. The question is: is it possible to reliably do better, or at least to sometimes do better and never do worse?

When scrubbing an array that is thought to be clean we should not normally find any inconsistent stripes. We can certainly expect to occasionally find blocks that cannot be read, as the media surface storing those blocks could have degraded. Such sectors will be reported by the device as media read errors. Finding these is a big part of why we run the scrub in the first place. Assuming the array is not degraded, we can generate the missing data from the other devices and write it back. There is no question about how to generate this data: that is defined by the RAID level being implemented.
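To illustrate why there is no ambiguity here, a sketch of the RAID5 case: the missing block is just the XOR of every block that could still be read (for RAID1 it is simply a copy from a surviving mirror, and RAID6 uses the P/Q algebra sketched earlier). Again this is illustrative Python, not md code:

    def rebuild_missing_raid5_block(readable_blocks):
        """Rebuild the one unreadable block of a RAID5 stripe by XOR-ing
        together every block (data and parity) that could still be read."""
        result = bytearray(len(readable_blocks[0]))
        for block in readable_blocks:
            for i, byte in enumerate(block):
                result[i] ^= byte
        return bytes(result)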

But suppose that a scrub of a clean array finds a stripe that is inexplicably inconsistent. What is the best thing to do in that case?

There are a couple of issues we need to consider before we can answer this. The first is to ask "How can this possibly happen?" and of course there are lots of ways, all hopefully very unlikely.

The probability that the data changed while it was on disk should be incredibly small, as modern storage devices have sophisticated checks which allow such corruptions to be easily detected. That leaves two possibilities: either the wrong data was written to the media, or the data that we received from the read is different from what was on the media - i.e. a faulty write or a faulty read.

In each case it could be that the actual data was wrong, or it could be that the address as processed by the storage device was wrong - bad data or bad address.

The corruption (whether of data or address, whether reading or writing) could have happened in the device, on the bus leading to the device, in the controller which manages the device, on the bus between controller and main memory, or in main memory itself. It could be due to a hardware glitch (transient error), a hardware fault (repeatable error) or software malfunction.

Or, finally, it could be that the array was not really 'clean' - an error, either in software or in system administration - might have resulted in the 'clean' flag being set on an array which quite reasonably had inconsistencies in it. As a special case of this, there is an artifact of the Linux md/raid1 implementation which can sometimes allow different data to be written to the different legs of a mirror, particularly when the array is being used as a swap device. If this happens and is not promptly corrected (before the array is marked clean) then the data will never be read back so it doesn't matter what data is where. This only causes confusion due to excessive warnings about mismatches, it does not imply a lack of reliability.

There is one last variable that should be considered, and that is the possibility that the "bad" data has already been read by a filesystem or application software in such a way that reverting the "bad" data back to "good" data would actually cause more problems, not less.

With this taxonomy of possible problems we can start suggesting possible approaches.

If the stripe is inconsistent due to a faulty read, then clearly there is nothing to be gained by writing anything back -- that cannot fix anything and could conceivably make things worse.

If the stripe is inconsistent due to a faulty write, then we cannot in general trust the hardware to do the correct thing with any future write that we submit. So trying to correct the inconsistency automatically would not be wise.

If the stripe is inconsistent due to undetected change on the media, then applying the "smart" approach which - where possible - only writes to a single device in each stripe would be sensible.

If the stripe is inconsistent because the array is not really clean but was incorrectly marked so, then the simple and predictable approach is probably preferred as discussed above.

If the admin of the machine discovers that there is a hardware error and rectifies it, then it might be appropriate to correct any stripes that are subsequently found to be inconsistent. If this hardware error only affected one device, that device should be failed and re-initialised. If, however, the hardware error turned out to affect various devices randomly, but with a sufficiently low probability that it is unlikely for one stripe to have received multiple errors, and if there is no evidence that bad data might have been read and processed already, then it would make sense to use the "smart" recovery method where possible.

Repair Options

To summarise the above, when automatically scrubbing an array, the best thing to do when an inconsistency is found is to report it, but not to attempt to correct it. Any correction should only be at the explicit request of the admin (or other expert system).

The options that could make sense for correction are:
1/ assume a particular device is most likely to be at fault and regenerate the data on that device
2/ assume the parity blocks are most likely to be at fault and regenerate them. This does not have a clear analogue for RAID1, though the closest is "assume the first block is correct and copy it".
3/ assume most blocks are correct and try to find a single update that can fix the inconsistency (the smart approach).

Each of these can be 'best' in different circumstances. Without any knowledge of a particular situation it is not possible to choose between them. Option 2 has a slight advantage for RAID6 in that it will not cause a bad data block - which might have been read - to change a second time. It also has the very real advantage that it is simple and has to be implemented anyway.

Option 1 is trivial for an admin to choose simply by failing that device. This then removes the possibility for a stripe to be inconsistent and this whole discussion no longer applies.

The question then remains, is option 3 something that is worth implementing? Will the conditions that make it the preferred option happen often enough that it could be used, and would a typical admin be able to detect such a condition and thus choose to use it? These are not easy to answer.

Related is the question of how to implement it. There are a few options.

The first is to integrate the smart block comparison into the existing resync code, with an option to select either "simple" or "smart". However, as mentioned above, this should really be combined with the ability to read and validate whole stripes before serving a read or write request against them. Otherwise there is a very real chance that a data block will appear to change and then change back again.

If such a scheme were implemented (which could be valuable despite its high cost), then including "smart" recovery as an option would make sense.

The next option is to run a normal scrub and log all the stripes which were inconsistent. This list could then be processed in user-space by some program which analyses all the inconsistencies to determine whether it is always just one block per stripe, and whether it is always the same device etc. It could then - possibly with sysadmin confirmation - request the RAID code to reconstruct individual blocks on individual devices as required.
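Such a user-space analysis could be very simple. The sketch below assumes a hypothetical log format of one 'sector device' pair per inconsistent stripe - md does not currently produce such a log, so both the format and the program are invented purely for illustration:

    import sys
    from collections import Counter

    def analyse_scrub_log(path):
        """Read a hypothetical log of 'sector device' lines, one per
        inconsistent stripe, and report whether the mismatches consistently
        implicate a single device."""
        per_device = Counter()
        with open(path) as log:
            for line in log:
                sector, device = line.split()
                per_device[device] += 1
        total = sum(per_device.values())
        if total == 0:
            print("no inconsistencies logged")
        elif len(per_device) == 1:
            device = next(iter(per_device))
            print(f"all {total} inconsistencies implicate {device}; "
                  "a targeted repair of that one device looks justifiable")
        else:
            print(f"{total} inconsistencies spread over {len(per_device)} devices; "
                  "prefer the simple repair, or investigate the hardware first")

    if __name__ == "__main__":
        analyse_scrub_log(sys.argv[1])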

Finally, it could be that once such errors are detected, it is best to stop using the array altogether until things are sorted out. In that case the "smart" recovery could all be done by a user-space program without bothering the in-kernel RAID code.

Conclusion

I think my preferred approach would be the second option. It seems to strike the right balance between being easy to work with, reasonably safe, and reasonably simple.

This would require that the scrub process provides a reliable log of inconsistent stripes, and that the raid code has the option to mark individual blocks as needing recovery. The former is something that we really want anyway, and the latter should fit well with the planned addition of a per-device bad-block-list. It could even be done entirely in user-space by suspending IO to the affected stripe (md already supports that), making the required update, then resuming IO.
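A rough sketch of that last, user-space-only sequence, using the suspend_lo/suspend_hi attributes that md exposes in sysfs (the exact units, ordering and repair step should be checked against the md documentation; this is a sketch, not tested code):

    def with_stripe_suspended(md_name, lo_sector, hi_sector, repair):
        """Block IO to a sector range of an md array via the suspend_lo and
        suspend_hi sysfs attributes, run a caller-supplied repair step, then
        resume IO."""
        base = f"/sys/block/{md_name}/md"
        def write_attr(name, value):
            with open(f"{base}/{name}", "w") as attr:
                attr.write(str(value))
        write_attr("suspend_hi", hi_sector)    # widen the suspended range first...
        write_attr("suspend_lo", lo_sector)    # ...then raise its lower bound
        try:
            repair()                           # e.g. rewrite the chosen block with O_DIRECT
        finally:
            write_attr("suspend_lo", 0)        # collapse the range again to resume IO
            write_attr("suspend_hi", 0)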

Doing most of the work in userspace would allow the affected stripe to be mapped through the filesystem back to a particular file so higher level considerations could be applied to choosing the correct content.

So to conclude the conclusion, I acknowledge that the extra complexity of a "smart" recovery can sometimes be justified. However it should not be automatic, nor a default. It should only be used when an informed sysadmin (or other expert system) has made an explicit choice.




Comments...

Re: Smart or simple RAID recovery?? (07 September 2010, 19:17 UTC)

You're right that automatic recovery has some complex implications, but simply having accurate reporting of the location of the error is extremely useful diagnostic information. When, during scrubbing, I encounter an unexpected inconsistency, I would dearly like to know which drive is the most likely suspect! I have been fighting this problem unsuccessfully in a RAID-5 array (the RAID-10 on another partition of the same drives never shows any problems, sigh), and am hoping that a 3-way mirrored setup will allow better diagnostics.

I'd like to have voting implemented inside the RAID code for that reason alone.

It could also tell me which logical block(s) on the md device are suspect, so I can find any affected files and possibly apply a higher-level consistency check. (Comparison with backups, for example.)

You're right that always reading and checking a whole stripe would be very valuable. Actually, in RAID-1 and -10, you only need to read ceil(n/2) copies. If they agree, any additional ones would be superfluous.


Re: Smart or simple RAID recovery?? (11 February 2011, 23:29 UTC)

Were any of the features mentioned in the conclusion implemented since then? Can we request 'check' to log the mismatch address? That would allow us to do userspace recovery.


Re: Smart or simple RAID recovery?? (12 February 2011, 01:01 UTC)

Unfortunately not. I've been caught up in other things and haven't made nearly the progress on the md code base that I would like to have done. I find this very frustrating and I will be making an effort to manage my time better and make some serious progress in md.

Thanks for your encouragement.


Do not let the perfect be the enemy of the good (16 March 2011, 04:12 UTC)

I agree with some of your conclusions, but disagree with many of your arguments.

  • I acknowledge that the extra complexity of a "smart" recovery can sometimes be justified. However it should not be automatic, nor a default

To me, that really means that a system administrator cannot abdicate their responsibility - they have to be involved in the decision at some point. In the RAID6 case, it makes sense to me to have the scrub report the errors, and then root can decide to take the array off-line at an appropriate time for repair. Depending on the system, this might be a minor umount+mdadm command, or a reboot-to-single-user-and-md_mod.start_ro=1 event with major downtime. But at least no applications will be holding the data in memory (or fiddling with the buffer cache via mmap) while the repair happens.

  • If the machine crashes or loses power while data is being written to the array it is not possible to know

Yes, no form of RAID is an adequate replacement for a UPS and a well configured nut. And with UPS prices below $100 for 600VA, it is cheaper to have a UPS than to have RAID. It would be nice if the kernel RAID implementation could recognise this situation, much like ext3 recognises it, and reports it to the system administrator: Warning: raid array md1 was crashed - automatic repair might make things either better or worse.

  • RAID6 with inconsistent P and Q ... there is a small possibility that the identified block isn't wrong ... The probability of this is rather small, but it is non-zero. This suggests that in the "unclean array" case, there is no reason to expect that the smart approach will do any better than the simple approach.

Statistically, the smart approach will do better with inconsistent P+Q. As you said, the chance of aligned-double-error is rather small. For a six-disk RAID6, the smart approach should get this wrong significantly less than half the time. The simple approach should get this wrong about two-thirds of the time. Some people might think that getting it wrong more often than getting it right is worse than useless.

  • The probability that the data changed while it was on disk should be incredibly small as modern storage devices have sophisticated checks which allow such corruptions to be easily detected.

Disks generally have an expected Uncorrected Bit-Error-Rate. With large disks these days, the BER times the size of the disk gives values greater than 1. This does not mean that an uncorrected error will occur (the probability has a few more exponentiations than I just used), but it does mean that an uncorrected error is more likely than unlikely, once the disk has been fully read and written one or more times.

  • The probability that the data changed while it was on disk should be incredibly small as modern storage devices have sophisticated checks which allow such corruptions to be easily detected.

But these checks are already rolled into the BER, and we still have errors. If the numbers reported in http://www.zdnet.com/blog/storage/data-corruption-is-worse-than-you-know/191 are accurate (they claim to be based upon CERN research, but I have not personally followed this up), then having a method to recover from such errors becomes pressing as soon as you get to Terabyte scale storage - which is now the default for most desktops!

  • 3/ assume most blocks are correct and try to find a single update that can fix the inconsistency (the smart approach).

    Each of these can be 'best' in different circumstances. Without any knowledge of a particular situation it is not possible to choose between them.

I think that root should be given the chance to choose between them. Option 3 is what will be true for most of the people most of the time, assuming that they have a UPS and they are not a bleeding-edge kernel hacker who likes to crash their system for educational purposes.

It would be nice if this could be done with zero downtime, but this might well be impractical. If the check routine is done on a regular basis (I think debian calls it once a week by default), then administrators may get enough warning of creeping single-bit or single-sector corruption (which were found to be equally likely in the CERN stats above) that they can do a recalculate_data during a scheduled maintenance window (for servers) or on their next reboot (for home machines).

It might also be nice to rename the existing repair option to recalculate_parity, since in many cases it will actually hide creeping corruption rather than repair it. But that is an argument for another time and thread.


Do not let the perfect be the enemy of the good (18 March 2011, 03:54 UTC)

Thank you for your thoughtful and careful comments.

There are a couple that I would like to respond to.

Firstly, you say:

Statistically, the smart approach will do better with inconsistent P+Q

You can only say something is statistically true if you have statistics. I suspect you don't. If you have a credible threat model you could simulate it and generate statistics. However my main point here is that you (or I) don't have a credible threat model.

Given a credible model, something could be done. Without that, the best we can do is report anything unusual that is found. I am completely supportive of informative error reports and of giving the sysadmin control to do whatever they want. I am not supportive of having any automatic action be taken without a model against which I can justify that action.

Secondly, you refer to BER and the ZDNet column.

This does provide statistics - quite interesting statistics. But those statistics do not provide a model against which to justify an automatic repair action. All they tell me is that it is good to keep checksums and backups, and I completely agree, but that is not a RAID-level issue.

"BER" is the rate of getting bit errors off the surface. Some of them are correctable. The vast majority of the rest are detectable. i.e. if you get an uncorrectable bit error, then you will know about it because the drive will report an error. And that is exactly the sort of situation that RAID already handles quite well - that is what it was designed for.

In the context of smart rebuild, we are only interested in bit errors that are undetectable, i.e. there were enough error bits in the one sector (or disk block) that the ECC couldn't even detect the error. While that is possible, it is exceptionally rare. It is very, very likely that you will discard a drive because it returns too many read errors long before you get even one undetectable error.

The errors discovered by the CERN study were almost certainly not disk surface errors.

The 512-byte sector errors were probably because the data was written to the wrong place, or read from the wrong place (and if you don't know which, there is nothing you can do to 'fix' it). The single bit errors were probably introduced during one of the many copies between the original and the media, or between the media and the ALU which performed the check. There must have been some bus or storage which didn't provide a proper ECC. I believe parallel-SCSI doesn't provide any checks on address information and was susceptible to this. I think SATA includes a CRC on each packet sent over the wire, so it is less susceptible.

In either case, there is no certainty that anything that RAID6 could do would have helped. Maybe it would help sometimes, but not often.

For the errors reported by CERN an application-level checksum/crc/hash is the best approach - that and replacing the 3% of nodes on which they detected problems.




