Design notes for a bad-block list in md/raid

19 May 2010, 04:37 UTC

I'm in the middle of (finally) implementing a bad block list for Linux md/raid, and I find that the motivation and the desired behaviour isn't (or wasn't) quite as obvious as I expected. So now that I think I have sorted it out, it seems sensible to write it up so that you, my faithful reader, can point out any glaring problems.

The bad block list is simply a list of blocks - one list for each device - which are to be treated as 'bad'. This does not include any relocation of bad blocks to some good location. That might be done by the underlying device, but md doesn't do it. md just tracks which blocks are bad and which, by implication, are good.
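To make this concrete, each device's list can be pictured as a sorted set of (start, length) ranges with a membership test. The following Python sketch is purely illustrative of the concept, not the md implementation (the names and structure are mine):

```python
# Illustrative sketch only -- not the md kernel code.
# Each device carries a sorted list of (start, length) bad ranges.
import bisect

class BadBlockList:
    def __init__(self):
        self.ranges = []  # sorted list of (start, length) tuples

    def add(self, start, length):
        """Record a bad range (no merging of adjacent ranges, for brevity)."""
        bisect.insort(self.ranges, (start, length))

    def is_bad(self, sector):
        """Return True if 'sector' falls inside any recorded bad range."""
        # Find the last range whose start is <= sector.
        i = bisect.bisect_right(self.ranges, (sector, float('inf'))) - 1
        if i < 0:
            return False
        start, length = self.ranges[i]
        return start <= sector < start + length
```

A real implementation would merge adjacent ranges and bound the list size; this sketch skips that for brevity.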

The difficulty comes in understanding exactly what "bad" means, why we need to record badness, and what to do when we find that we might want to perform IO against a recorded bad block.

There are three different conditions that can cause a block to be considered as "bad".

Firstly, a read error on a degraded array, where the data cannot be found on other devices, yields a "bad" block which gets recorded. One could argue that there is no need to record such a bad block as it is easy to discover again that it is bad by simply trying to read it again. However, when trying to read a block that happens to be bad, the underlying device may retry several times and hence take quite a while to report that the block is bad. If md already knows it is bad, it can redirect the read to other devices and get an answer much more quickly.

Secondly when recovery to a spare finds that a required block of data cannot be recovered (as it depends on a bad block), the target block on the spare should be marked as 'bad', even though the media itself is probably perfectly good. This ensures that a block in a file will not silently end up with incorrect data. It will either be correct, or will report an error when it is read. This is important so that higher level backup strategies can be aware that a file is corrupted and so not e.g. over-write an old good copy with a new bad copy.

Finally a write error should cause a block to be marked as 'bad'. This is one area that I had to think a lot about. Obviously there is no doubt that something must be marked bad when a write error occurs, but it wasn't entirely clear that we shouldn't just mark the whole device as faulty. After all, hard drives handle media write errors by relocating the data so that the write error is not visible to the OS. Once a write error does get through, it either means there is a fairly serious problem with the device or some connector, or it means that drive's relocation area is full, which is again a fairly serious error.

However several of the possible serious errors can still leave a device in a state where most of the blocks can be read, and the purpose of RAID is to provide as much data availability as possible. So it follows that keeping the device active as long as possible is important. It may be that we only keep it limping along for another hour until a recovery process can save the data to some other device, but that extra hour is worth considerable programming effort. There could of course be a difficulty updating the bad-block-list on the device if writes to the device are having problems. That is an issue that I don't plan to address at first, though hopefully it will come later.

So now that we know what a bad block is, we need to be sure that we know how to handle one.

When reading we simply want to avoid actually getting the data from any bad block. If the request is larger than a block (which it commonly is) we might need to handle different parts of the request differently.

For RAID1 we currently choose one device and submit the whole read request to that device. If there are bad blocks on all devices, though, it might be necessary to submit different parts of the request to different devices. To keep things simple, we will always split a read request into pieces such that for each piece every drive is either all-bad (so it can be ignored) or all-good. This means that we don't need to further consult the bad-block lists if we need to retry after a read error. It also allows us to easily obey the requirement that, when the array is out-of-sync, all reads must come from the "first" device which is working.
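That splitting rule can be sketched as follows, assuming each device's bad ranges are available as (start, length) pairs. This is an illustration of the idea only; the real RAID1 code works in terms of bios and sectors:

```python
# Illustrative sketch: split a request [start, start+length) at every
# boundary of any device's bad ranges, so that within each piece a
# given device is either all-good or all-bad.

def split_request(start, length, bad_ranges_per_device):
    """bad_ranges_per_device: one list of (bad_start, bad_len) per device.
    Returns a list of (piece_start, piece_len)."""
    end = start + length
    cuts = {start, end}
    for ranges in bad_ranges_per_device:
        for bstart, blen in ranges:
            bend = bstart + blen
            if start < bstart < end:
                cuts.add(bstart)
            if start < bend < end:
                cuts.add(bend)
    points = sorted(cuts)
    return [(a, b - a) for a, b in zip(points, points[1:])]
```

For example, a request covering sectors 0-99, with one device bad in [10, 30) and another bad in [50, 60), splits into five pieces; within each piece every device is uniformly good or uniformly bad, so a retry never needs to re-check the lists.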

For RAID5 we already break requests up into PAGE sized units, and it would seem to be unnecessarily complex to sub-divide requests more than this. So if any part of a PAGE on a device is bad, we will treat the whole PAGE as bad. As PAGEs are normally 4K and we are told that soon most drives will have 4K sectors, there is no long-term value in subdividing anyway. So if a PAGE is bad on one device, we treat that strip as though the device is bad and recover from parity.
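At PAGE granularity the question "is this page bad on this device?" reduces to a simple interval-overlap test. A sketch, assuming 4K pages of 512-byte sectors (constants are illustrative):

```python
PAGE_SECTORS = 8  # 4K page / 512-byte sectors -- illustrative values

def page_is_bad(bad_start, bad_len, page_start):
    """A whole page is treated as bad if any sector in it is bad,
    i.e. if the bad range overlaps [page_start, page_start + PAGE_SECTORS)."""
    return bad_start < page_start + PAGE_SECTORS and \
           page_start < bad_start + bad_len
```

So a single bad sector anywhere in the page condemns the whole page for that device, and the strip is then handled as if that device had failed.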

When writing there are two possible ways to treat bad blocks. We could attempt the write and if it succeeds, we mark the blocks as not being bad any more. Or we could simply avoid writing those blocks on those devices completely.

Both of these are desirable in different circumstances. If a block is bad simply because it was once bad on some device when the array was degraded, and that device has been replaced with a perfectly good device, then we certainly want to write as that will remove the badness. Conversely if the device is failing and is likely to return an error to the write, we really want to avoid writing as it may put extra unwanted stress on the device and will likely cause retry/timeout delays.

After some thought it seems that the best way to choose between these is simply to ask if we have ever seen a write error on the device. If we haven't then we should write. If we have, then don't. It probably isn't necessary to store this state in the metadata as one write attempt every time you assemble the array is probably reasonable.

It would be quite easy to make this "seen-a-write-error" state visible via sysfs so that it can be cleared (or set) if the sysadmin thinks that would be a good idea, but I suspect that in most cases the default handling would be acceptable.

As with reading, it is easiest to split write requests into pieces such that, for each piece, every drive is either all-good or all-bad.

If a write request does fail, we have two choices. We can mark the whole region as bad, or we can retry with block-sized writes and only record badness when a block-sized write fails. The former would be a lot easier, but goes against the principle of maximum data availability, so we must choose the latter. It would probably be a similar amount of work to the existing code that tries to repair read errors, and could possibly share some of that code.
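The fallback might look roughly like this; `submit_write` and `record_bad` are hypothetical stand-ins for the real IO submission and metadata paths, and the block size is illustrative:

```python
# Sketch of the fallback: if a multi-block write fails, retry it one
# block at a time and record badness only for the blocks that fail.
BLOCK = 8  # sectors per block -- illustrative value

def write_with_badblock_fallback(dev, start, nblocks, submit_write, record_bad):
    """submit_write(dev, sector, nsectors) -> True on success.
    record_bad(dev, sector, nsectors) adds to the bad-block list."""
    if submit_write(dev, start, nblocks * BLOCK):
        return  # the common case: the whole write succeeded
    for i in range(nblocks):
        sector = start + i * BLOCK
        if not submit_write(dev, sector, BLOCK):
            record_bad(dev, sector, BLOCK)
```

Only the blocks that individually fail end up in the list, which preserves as much readable data on the device as possible.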

Some extra thought is needed for parity-arrays such as RAID5 or RAID6. The obvious handling for bad blocks is that they cause the containing device to appear to be faulty just for that strip (a strip is one page from each device, contrasting with a stripe which is one chunk from each device). Thus we need to read from other devices to recover the content and to compute parity.
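For RAID5 that recovery is the usual parity reconstruction applied per strip; with XOR parity it can be sketched as follows (illustrative only, not the stripe-cache code):

```python
# Sketch: reconstruct one page of a strip by XOR-ing the pages from
# all the other devices (data pages plus the parity page).
from functools import reduce

def recover_page(pages, bad_index):
    """pages: one bytes object per device in the strip (data + parity).
    Returns the reconstructed content of pages[bad_index]."""
    others = [p for i, p in enumerate(pages) if i != bad_index]
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), others)
```

The same routine serves whether the page is unreadable because the device failed or merely because it is on the bad-block list; either way that device is simply excluded for that strip.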

It is tempting to think that if we know a block is bad to the extent that we have lost the data in that block, then we can exclude it from future parity calculations across that strip and thus survive the loss of a second block in the same stripe.

While it sounds appealing, it turns out that it would be more trouble than it is worth. We would need to record not only that the block was bad, but also that it had been excluded from parity calculations. This would have to be recorded in the bad-block-list for that device. Then if we lost that whole device we wouldn't know what the parity meant.

So despite the apparent room for optimisation, it seems best to treat a bad block as simply an unknown quantity that is best avoided.


Re: Design notes for a bad-block list in md/raid (19 May 2010, 11:19 UTC)

Where will you store the bad block list?

Also, what superblock version do you recommend? I have always gone with the mdadm default of 0.90, but I don't understand if that is "the best".

Thanks for such an awesome project. I recently grew a 5-disk raid5 to a 7-disk raid6, and changed the block size while I was at it. Other than needing the git version, it worked flawlessly (online too!)


Re: Design notes for a bad-block list in md/raid (20 May 2010, 04:58 UTC)

I haven't finalised where the bad block table will be stored. However I will only support it on v1.x metadata. v1.x allows me to explicitly exclude space from being used for data so that I can put other things there, like the write-intent bitmap or a bad-block table.

mdadm will get to decide where to put the different bits when it creates the array. I may be able to get mdadm to add an empty bad-block table when assembling an array if there is enough spare space - not sure about that aspect yet.

I would recommend v1.2 for new arrays unless it is a RAID1 that you hope to boot from, in which case you probably need v1.0 (or maybe even 0.90) for grub/lilo to be able to understand it.

There are a few features that are not supported for v0.90 (simply because the format is not easy to extend). bad-block-lists will probably be the first really significant one.


Re: Design notes for a bad-block list in md/raid (20 May 2010, 11:31 UTC)

Is it possible to change superblock format after the RAID has been built?


Re: Design notes for a bad-block list in md/raid (20 May 2010, 12:05 UTC)

No, there is not currently any support for changing super-block version.


Re: Design notes for a bad-block list in md/raid (20 May 2010, 12:23 UTC)

If I'm remembering correctly, the 0.9 and 1.2 superblocks are both at the end of the partitions, so is it possible to do it by shrinking the raid a bit then stopping it and forcing creation of a new raid with exactly the same parameters (save for the superblock format) and telling it not to rebuild?

Which all sounds rather scary...

Also, is there a way to tell md to save more room for the superblock stuff (like the bad block lists or the bitmap)?

Which leads to the next question... is there a "good" size for the bitmap? I like what it does for me, so I assume a larger bitmap would lead to more granular information. So on a massive raid, having more pages might be better.

Thanks, as always.


Re: Design notes for a bad-block list in md/raid (21 May 2010, 06:33 UTC)

0.90 has metadata near the end of the device, as does 1.0. 1.1 and 1.2 have it at or near the start. Yes, you could possibly stop a 0.90 array and re-create it as 1.0 and still have all your data. You would want to use the "--size" option to ensure it remained the same size. Maybe one day mdadm will do this nicely for you.

No, there is not currently any way to tell mdadm to reserve extra space. It reserves enough to store whatever bitmap you ask for, and maybe some more.

Current evidence is that a 'good' size for a bitmap is for each bit to represent rather large blocks - the order of megabytes. That is sufficient to accelerate resync considerably, but keep to a minimum the number of bitmap updates that are required. The latest mdadm has a fairly good default.


Re: Design notes for a bad-block list in md/raid (21 May 2010, 14:17 UTC)

"...two possible ways to treat bad blocks. We could attempt the write and if it fails, we mark the blocks as not being bad any more."

Shouldn't that say, "we mark the blocks as being bad"?


Re: Design notes for a bad-block list in md/raid (21 May 2010, 21:58 UTC)

Thanks for pointing that out. The sentence is (or was) wrong but not quite as you thought.

We could try writing (to a known-bad block) and if that succeeds then we can remove it from the list of bad blocks.

I have changed it in the article.



Re: Design notes for a bad-block list in md/raid (30 May 2010, 21:00 UTC)

Regarding bitmaps, at one point the source code had a warning about them not being able to be larger than a certain size (for some reason 2^20 bits rings a bell, 128KiB?) or it could cause kernel problems, is that true?

Mostly asking as I'm working/experimenting with a 15-drive RAID array and going to be getting a HyperOS flash-and-battery-backed-RAM solid-state drive from a friend to test storing all the bitmaps on, so throughput will be limited purely by IOPs and the SATA/PCI Express bandwidth to/from the HyperOS.

I wanted to know if there are any gains from having more than 1GB of storage to put all the bitmaps on, if I expect all the drives to be 1.5TB or larger devices with newer 4K-sector internal formats. Ditto with the bad-block tables being added soon; I figured verifying the expected space consumed would be good.


Re: Design notes for a bad-block list in md/raid (30 May 2010, 23:18 UTC)

Yes, there is still a limit on the size of bitmaps, and the numbers you quote sound about right.

A larger bitmap tends to result in poorer performance as more bitmap updates are required. When a resync or rebuild is required and the bitmap can be used, then a larger bitmap can mean the resync is faster, but there are diminishing returns. Making the bitmap twice as large means half as much IO, but if that reduced the time from 10 seconds to 5 seconds, it might not be worth it.

I would suggest that each bit in the bitmap should correspond to about 1 second's worth of IO. It would be sequential IO, so about 60Meg. With 2TB devices, you only need 2^15 bits. Such a bitmap would reduce a 9 hour resync to a few seconds (depending on how busy the drive had been).
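Working through those numbers as a quick sanity check (rounding the "about 60Meg" per bit up to a power of two):

```python
# One bit per roughly one second of sequential IO, on a 2TB drive.
device_bytes = 2 * 10**12      # a 2TB device
chunk_bytes = 64 * 2**20       # ~60MB of IO per second, rounded to 64MB
bits = device_bytes // chunk_bytes   # number of bitmap bits needed
```

That works out to just under 2^15 bits (about 30,000), i.e. a 4KB bitmap comfortably covers the whole device at this granularity.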

Large bitmaps are only really interesting when you are mirroring over a slow link where a second worth of data may only be a Megabyte or less.

If you have such a drive, you would be much better off storing an external journal for the FS on it (assuming you use an FS that supports external journals).


Re: Bitmap sizes for external storage (30 May 2010, 23:43 UTC)

That explains how the bitmap size should be calculated in a much more understandable way, and makes a lot of sense now that it's mentioned. One of those "Oh, duh!" type of descriptions.

Also, good reminder on using an external journal for the main file system, though once it switches over to btrfs that will go out the proverbial window so I'm probably not going to rely on that. While it's still EXT4 it will definitely help though.

Even more than an SSD, storage on the HyperOS is spendy: 1-6 DDR2 RAM DIMMs plus the US$200-300 up-front cost for the chassis, plus an SD and/or CF card for persistent storage (the battery-backup only lasts long enough to write the DIMMs out to flash). But compared to the US$1650-2100 cost of 15 1.5-2.0TB hard drives, it seemed reasonable as a very small block of near-zero-latency persistent storage for things like the bitmaps.

Are the expected size requirements for the upcoming bad-block tables on par with the dirty-bitmap tables? Larger? Smaller?

Also, by default the write-bitmap is 'dirtied' appropriately before any actual writes occur to the array itself, then a given bit marked clean again after a timeout period, correct?


Re: Design notes for a bad-block list in md/raid (31 May 2010, 00:04 UTC)

My current draft implementation limits the bad-block list to 4K (512 entries). That may well be changed, but I doubt it would exceed 32K. You wouldn't really be able to put it on a separate device though. The bad-block-list needs to be on the device with the bad blocks.
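4K holding 512 entries implies 8 bytes (64 bits) per entry. One plausible packing, with purely illustrative field widths since the on-disk format was not finalised at the time, would be a wide start-sector field plus a small length field:

```python
# Illustrative 64-bit bad-block entry: 55 bits of start sector plus
# 9 bits of (length - 1).  Field widths are assumptions, not the
# actual md on-disk format.
LEN_BITS = 9

def pack(sector, length):
    assert 0 < length <= 1 << LEN_BITS
    return (sector << LEN_BITS) | (length - 1)

def unpack(entry):
    sector = entry >> LEN_BITS
    length = (entry & ((1 << LEN_BITS) - 1)) + 1
    return sector, length
```

With a layout like this, a single entry could describe a run of up to 512 consecutive bad sectors, which keeps the fixed 4K table useful even when bad sectors cluster.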

Your summary of how bits in the bitmap are updated is correct.



Re: NeilBrown Re: bad-block list internal (31 May 2010, 00:14 UTC)

Understood again, and one less issue to be concerned with then. I thought the bad-block chunk would be as mobile as the bitmap chunk, or might have to be wherever the bitmap chunk was stored, hence my concern about how large it could grow.

Thank you again for such a wonderful set of tools.


Re: Design notes for a bad-block list in md/raid (03 December 2010, 01:49 UTC)

Did this ever get implemented into an active kernel?


Re: Design notes for a bad-block list in md/raid (03 December 2010, 02:53 UTC)

No. I have code, but it is completely untested and has no user-space support. I really want to get back to it, but other things have intervened.


Re: Design notes for a bad-block list in md/raid (05 April 2011, 07:15 UTC)
I have read, in "Re: Design notes for a bad-block list in md/raid": "I would recommend v1.2 for new arrays unless it is a RAID1 that you hope to boot from, in which case you probably need v1.0 (or maybe even 0.90) for grub/lilo to be able to understand it."

What I want to report is: the grub 2 in Ubuntu 10.04 cannot recognize superblock 1.0, so it fails when I re-install grub 2 onto the RAID1.