I'm in the middle of (finally) implementing a bad block list for Linux md/raid, and I find that the motivation and the desired behaviour aren't (or weren't) quite as obvious as I expected. So now that I think I have sorted it out, it seems sensible to write it up so that you, my faithful reader, can point out any glaring problems.
The bad block list is simply a list of blocks - one list for each device - which are to be treated as 'bad'. This does not include any relocation of bad blocks to some good location. That might be done by the underlying device, but md doesn't do it. md just tracks which blocks are bad and which, by implication, are good.
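To make that concrete, here is a minimal sketch of how such a per-device list might be represented and queried. None of this is md's actual on-disk format; the names (`struct bad_block`, `bb_is_bad`) and the flat sorted array are purely illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry in a device's bad block list: a run of consecutive
 * bad sectors.  Illustrative layout only, not md's real format. */
struct bad_block {
	uint64_t sector;	/* first bad sector */
	uint32_t length;	/* number of consecutive bad sectors */
};

struct bad_block_list {
	struct bad_block *entries;	/* kept sorted by sector */
	size_t count;
};

/* Return true if any sector in [sector, sector+len) is listed as
 * bad.  A linear scan is enough for a sketch; the list would be
 * small and sorted, so a binary search is the obvious refinement. */
static bool bb_is_bad(const struct bad_block_list *bbl,
		      uint64_t sector, uint32_t len)
{
	size_t i;

	for (i = 0; i < bbl->count; i++) {
		const struct bad_block *bb = &bbl->entries[i];

		if (bb->sector < sector + len &&
		    sector < bb->sector + bb->length)
			return true;	/* the ranges overlap */
	}
	return false;
}
```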
The difficulty comes in understanding exactly what "bad" means, why we need to record badness, and what to do when we find that we might want to perform IO against a recorded bad block.
There are three different conditions that can cause a block to be considered as "bad".
Firstly a read error on a degraded array for which the data cannot be found from other devices is a "bad" block and gets recorded. One could argue that there is no need to record such a bad block as it is easy to discover again that it is bad by simply trying to read it again. However when trying to read a block that happens to be bad, the underlying device may retry several times and hence take quite a while to report that the block is bad. If md already knows it is bad, it can redirect the read to other devices and get an answer much more quickly.
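As a sketch of why the record helps: the read path can consult the list before choosing a device, rather than waiting for a failing drive to time out. `choose_read_device` and its arguments are hypothetical scaffolding around the `bb_is_bad` helper above.

```c
/* Pick the first device with no recorded bad sectors in the
 * requested range.  Hypothetical scaffolding around bb_is_bad(). */
static int choose_read_device(struct bad_block_list *dev_bbl[],
			      int nr_devices,
			      uint64_t sector, uint32_t len)
{
	int d;

	for (d = 0; d < nr_devices; d++) {
		if (!bb_is_bad(dev_bbl[d], sector, len))
			return d;	/* known good: read from here */
	}
	return -1;	/* bad everywhere: report a read error */
}
```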
Secondly when recovery to a spare finds that a required block of data cannot be recovered (because it depends on a bad block), the target block on the spare should be marked as 'bad', even though the media itself is probably perfectly good. This ensures that a block in a file will not silently end up with incorrect data. It will either be correct, or will report an error when it is read. This is important so that higher-level backup strategies can be aware that a file is corrupted and not, for example, overwrite an old good copy with a new bad copy.
Finally a write error should cause a block to be marked as 'bad'. This is one area that I had to think a lot about. Obviously there is no doubt that something must be marked bad when a write error occurs, but it wasn't entirely clear that we shouldn't just mark the whole device as faulty. After all, hard drives handle media write errors by relocating the data so that the write error is not visible to the OS. Once a write error does get through, it either means there is a fairly serious problem with the device or some connector, or it means that drive's relocation area is full, which is again a fairly serious error.
However several of the possible serious errors can still leave a device in a state where most of the blocks can be read, and the purpose of RAID is to provide as much data availability as possible. So it follows that keeping the device active as long as possible is important. It may be that we only keep it limping along for another hour until a recovery process can save the data to some other device, but that extra hour is worth considerable programming effort. There could of course be a difficulty updating the bad-block-list on the device if writes to the device are having problems. That is an issue that I don't plan to address at first, though hopefully it will come later.
So now that we know what bad blocks are, we need to be sure that we know how to handle them.
When reading we simply want to avoid actually getting the data from any bad block. If the request is larger than a block (which it commonly is) we might need to handle different parts of the request differently.
For RAID1 we currently choose one device and submit the whole read request to that device. If there are bad blocks on all devices though, it might be necessary to submit different parts of the request to different devices. To keep things simple, we will always split a read request up into pieces such that, for each piece, every drive is either all-bad (so it can be ignored) or all-good. This means that we don't need to consult the bad-block lists again if we need to retry after a read error. It also allows us to easily obey the requirement that - when the array is out-of-sync - all reads must come from the "first" device which is working.
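A sketch of that splitting rule, building on the structures above: find the nearest sector at which any device's good/bad status changes, and end the current piece there. Both helpers are hypothetical.

```c
/* First sector beyond 'sector' at which this device's good/bad
 * status changes, or UINT64_MAX if it never does. */
static uint64_t bb_next_boundary(const struct bad_block_list *bbl,
				 uint64_t sector)
{
	uint64_t next = UINT64_MAX;
	size_t i;

	for (i = 0; i < bbl->count; i++) {
		const struct bad_block *bb = &bbl->entries[i];

		if (bb->sector > sector && bb->sector < next)
			next = bb->sector;		/* good -> bad edge */
		if (bb->sector + bb->length > sector &&
		    bb->sector + bb->length < next)
			next = bb->sector + bb->length;	/* bad -> good edge */
	}
	return next;
}

/* Length of the next piece of a request starting at 'sector',
 * chosen so every device is all-good or all-bad across the piece. */
static uint32_t split_len(struct bad_block_list *dev_bbl[],
			  int nr_devices,
			  uint64_t sector, uint32_t remaining)
{
	uint64_t end = sector + remaining;
	int d;

	for (d = 0; d < nr_devices; d++) {
		uint64_t b = bb_next_boundary(dev_bbl[d], sector);

		if (b < end)
			end = b;	/* stop the piece at the edge */
	}
	return (uint32_t)(end - sector);
}
```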
For RAID5 we already break requests up into PAGE-sized units, and it would seem unnecessarily complex to sub-divide requests further than this. So if any part of a PAGE on a device is bad, we will treat the whole PAGE as bad. As PAGEs are normally 4K and we are told that soon most drives will have 4K sectors, there is no long-term value in subdividing anyway. So if a PAGE is bad on one device, we treat that strip as though the device is bad and recover from parity.
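In code, that rounding might look like the following, using the `bb_is_bad` helper from the first sketch and assuming 512-byte sectors with a 4K PAGE (so eight sectors per PAGE):

```c
#define SECTORS_PER_PAGE 8	/* 4K PAGE / 512-byte sectors (assumed) */

/* For RAID5 we only track good/bad at PAGE granularity: if any
 * sector of the PAGE containing 'sector' is bad on this device,
 * the whole PAGE is treated as bad and the strip is recovered
 * from the other devices plus parity. */
static bool bb_page_is_bad(const struct bad_block_list *bbl,
			   uint64_t sector)
{
	uint64_t page_start = sector & ~(uint64_t)(SECTORS_PER_PAGE - 1);

	return bb_is_bad(bbl, page_start, SECTORS_PER_PAGE);
}
```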
When writing there are two possible ways to treat bad blocks. We could attempt the write and if it succeeds, we mark the blocks as not being bad any more. Or we could simply avoid writing those blocks on those devices completely.
Both of these are desirable in different circumstances. If a block is bad simply because it was once bad on some device when the array was degraded, and that device has been replaced with a perfectly good device, then we certainly want to write as that will remove the badness. Conversely if the device is failing and is likely to return an error to the write, we really want to avoid writing as it may put extra unwanted stress on the device and will likely cause retry/timeout delays.
After some thought it seems that the best way to choose between these is simply to ask whether we have ever seen a write error on the device. If we haven't, then we should write. If we have, then don't. It probably isn't necessary to store this state in the metadata, as one write attempt every time the array is assembled is a reasonable cost.
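A sketch of that policy, continuing the fragments above with purely illustrative names: `submit_write` stands in for the real write path, and `bb_clear` is simplified to drop only entries wholly inside the cleared range.

```c
#include <string.h>

struct device_state {
	struct bad_block_list bbl;	/* from the earlier sketch */
	bool seen_write_error;		/* has a write ever failed here? */
};

/* Stand-in for the real write path; returns 0 on success. */
static int submit_write(struct device_state *dev,
			uint64_t sector, uint32_t len)
{
	(void)dev; (void)sector; (void)len;
	return 0;	/* a real implementation submits the IO */
}

/* Simplified: drop list entries wholly inside [sector, sector+len);
 * a real version would also trim entries that partially overlap. */
static void bb_clear(struct bad_block_list *bbl,
		     uint64_t sector, uint32_t len)
{
	size_t i = 0;

	while (i < bbl->count) {
		struct bad_block *bb = &bbl->entries[i];

		if (bb->sector >= sector &&
		    bb->sector + bb->length <= sector + len) {
			memmove(bb, bb + 1,
				(bbl->count - i - 1) * sizeof(*bb));
			bbl->count--;
		} else {
			i++;
		}
	}
}

/* If the device has never failed a write, the recorded badness is
 * probably stale (e.g. inherited during a degraded rebuild), so
 * attempt the write and clear the record on success.  If a write
 * has failed before, skip the range rather than stress the device. */
static void write_bad_range(struct device_state *dev,
			    uint64_t sector, uint32_t len)
{
	if (dev->seen_write_error)
		return;

	if (submit_write(dev, sector, len) == 0)
		bb_clear(&dev->bbl, sector, len);
	else
		dev->seen_write_error = true;
}
```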
It would be quite easy to have this "seen-a-write-error" state visible via sysfs so that it can be cleared (or set) if the sysadmin thinks that would be a good idea but I suspect that in most cases the default handling would be acceptable.
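For what it's worth, exposing such a flag would follow the usual kernel sysfs show/store pattern; this fragment is a generic illustration only, not md's actual per-device attribute machinery.

```c
#include <linux/errno.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

/* Illustrative only: a single flag exposed read/write via sysfs so
 * the sysadmin can inspect, set or clear it. */
static bool seen_write_error;

static ssize_t seen_write_error_show(struct kobject *kobj,
				     struct kobj_attribute *attr, char *buf)
{
	return sprintf(buf, "%d\n", seen_write_error);
}

static ssize_t seen_write_error_store(struct kobject *kobj,
				      struct kobj_attribute *attr,
				      const char *buf, size_t count)
{
	if (count < 1)
		return -EINVAL;
	seen_write_error = (buf[0] == '1');
	return count;
}

static struct kobj_attribute seen_write_error_attr =
	__ATTR(seen_write_error, 0644,
	       seen_write_error_show, seen_write_error_store);
```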
As with reading, it is easiest to split write requests up into pieces such that, for each piece, every drive is either all-good or all-bad.
If a write request does fail, we have two choices. We can mark the whole region as bad, or we can try block-sized writes and only record badness when a block-sized write fails. The former would be a lot easier, but goes against the principle of maximum data availability, so we must choose the latter. It would probably require a similar amount of work to the code that currently tries to repair read errors, and could possibly share some of that code.
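A sketch of that fallback, continuing the earlier write-path fragments; `bb_add` is one more hypothetical helper.

```c
/* Hypothetical: record [sector, sector+len) as bad.  For the sketch,
 * assume 'entries' has spare capacity; a real version would grow the
 * array, keep it sorted, and merge adjacent ranges. */
static void bb_add(struct bad_block_list *bbl,
		   uint64_t sector, uint32_t len)
{
	bbl->entries[bbl->count].sector = sector;
	bbl->entries[bbl->count].length = len;
	bbl->count++;
}

/* After a large write fails, retry it in block-sized pieces so that
 * only the blocks which genuinely fail get recorded as bad, keeping
 * as much of the device usable as possible. */
static void retry_failed_write(struct device_state *dev,
			       uint64_t sector, uint32_t len,
			       uint32_t block_sectors)
{
	uint64_t s, end = sector + len;

	dev->seen_write_error = true;	/* informs future write policy */

	for (s = sector; s < end; s += block_sectors) {
		uint32_t n = block_sectors;

		if (s + n > end)
			n = (uint32_t)(end - s);
		if (submit_write(dev, s, n) != 0)
			bb_add(&dev->bbl, s, n); /* only this block is bad */
	}
}
```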
Some extra thought is needed for parity-arrays such as RAID5 or RAID6. The obvious handling for bad blocks is that they cause the containing device to appear to be faulty just for that strip (a strip is one page from each device, contrasting with a stripe which is one chunk from each device). Thus we need to read from other devices to recover the content and to compute parity.
It is tempting to think that if we know a block is bad to the extent that we have lost the data in that block, then we can exclude it from future parity calculations across that strip and thus survive the loss of a second block in the same stripe.
While it sounds appealing, it turns out that it would be more trouble than it is worth. We would need to record not only that the block was bad, but also that the block had been excluded from parity calculations. This would have to be recorded on the bad-block-list for that device. Then if we lost that whole device we wouldn't know what the parity meant, so we couldn't safely use it to recover anything else in the strip.
So despite the apparent room for optimisation, it seems best to treat a bad block as simply an unknown quantity that is best avoided.