In mid-December 2008 I wrote a bit of a "road-map" containing some of my thoughts about development work that could usefully be done on the MD/RAID driver in the Linux kernel. Some of it might get done. Some of it might not. It is not a promise at all, more of a discussion starter in case people want to encourage features or suggest different features.
But I really should put this stuff in my blog so, 6 weeks later, here it is.
Bad block list
The idea here is to maintain, and store on each device, a list of blocks that are known to be 'bad'. This effectively allows us to fail a single block rather than a whole device when we get a media write error. Of course, if updating the bad-block list gives an error we then have to fail the device.
We would also record a bad block if we get a read error on a degraded array. This would e.g. allow recovery for a degraded raid1 where the sole remaining device has a bad block.
An array could have multiple errors on different devices and just those stripes would be considered to be "degraded". As long as no single stripe had too many bad blocks, the data would still be safe. Naturally, as soon as you get one bad block, the array becomes susceptible to data loss on a single device failure, so it wouldn't be advisable to run with a non-empty bad-block list for an extended length of time. However, it might provide breathing space until the failing drive can be replaced.
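To make the idea concrete, here is a minimal sketch of what a per-device bad-block list and lookup might look like; the types and names are assumptions for illustration, not md's actual data structures.

	typedef unsigned long long sector_t;	/* stand-in for the kernel's sector_t */

	/* Hypothetical per-device bad-block list: a small array of
	 * (start, length) ranges kept sorted by start sector. */
	struct bad_range {
		sector_t start;		/* first bad sector */
		unsigned int len;	/* number of consecutive bad sectors */
	};

	struct badblock_list {
		struct bad_range *ranges;	/* sorted by start */
		int count;
	};

	/* Return 1 if any sector in [sector, sector + len) is marked bad. */
	static int is_badblock(const struct badblock_list *bbl,
			       sector_t sector, unsigned int len)
	{
		int i;

		for (i = 0; i < bbl->count; i++) {
			if (bbl->ranges[i].start >= sector + len)
				break;		/* sorted: nothing later can overlap */
			if (bbl->ranges[i].start + bbl->ranges[i].len > sector)
				return 1;	/* overlap found */
		}
		return 0;
	}

A write that hits a listed range would be failed (or the range extended), and a read from a degraded stripe that hits a bad block is what would trigger recording a new entry.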
hot-device-replace
This is probably the most asked-for feature of late. It would allow a device to be 'recovered' while the original was still in service. So instead of failing out a device and adding a spare, you can add the spare, build the data onto it, then fail out the device.
This meshes well with the bad block list. When we find a bad block, we start a hot-replace onto a spare (if one exists). If previously undetected ('sleeping') bad blocks are discovered during the hot-replace process, we don't lose data unless we find two bad blocks in the same stripe, and even then we only lose the data in that stripe.
Recording in the metadata that a hot-replace was happening might be a little tricky, so it could be that if you reboot in the middle, you would have to restart from the beginning. Similarly there would be no 'intent' bitmap involved for this resync.
Each personality would have to implement much of this independently, effectively providing a mini raid1 implementation. It would be very minimal, without read balancing, write-behind, and so on.
There would be no point implementing this in raid1. Just raid456 and raid10. It could conceivably make sense for raid0 and linear, but that is very unlikely to be implemented.
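Roughly, the per-stripe copy loop of such a mini-rebuild could look like the sketch below; all of the helper names are hypothetical, and the point is only the fallback order: copy from the old device if possible, reconstruct from the other devices otherwise, and only give up on a stripe if both fail.

	#include <errno.h>

	typedef unsigned long long sector_t;

	struct stripe { sector_t sector; };

	/* Hypothetical stand-ins for md internals. */
	int read_block(struct stripe *sh, int disk);
	int write_block(struct stripe *sh, int disk);
	int reconstruct_block(struct stripe *sh, int disk);
	void record_badblock(int disk, sector_t sector);

	/* Copy one stripe's worth of data from the device being replaced
	 * (src_disk) onto the spare (spare_disk). */
	int hot_replace_stripe(struct stripe *sh, int src_disk, int spare_disk)
	{
		if (read_block(sh, src_disk) == 0)
			return write_block(sh, spare_disk);

		/* The source block is bad: rebuild it from parity and the
		 * remaining data blocks. */
		if (reconstruct_block(sh, src_disk) == 0)
			return write_block(sh, spare_disk);

		/* Two bad blocks in the same stripe: record a bad block on
		 * the spare and continue; only this stripe's data is lost. */
		record_badblock(spare_disk, sh->sector);
		return -EIO;
	}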
split-mirror
This is really a function of mdadm rather than md. It is already quite possible to break a mirror into two separate single-device arrays. However, it is a sufficiently common operation that it is probably worth making it very easy to do with mdadm. I'm thinking of something like a single mdadm command that splits one device out of the mirror into a new stand-alone array.
raid5->raid6 conversion
This is also a fairly commonly asked-for feature. The first step would be to define a raid6 layout where the Q block was not rotated around the devices but was always on the last device. Then we could change a raid5 to a singly-degraded raid6 without moving any data.
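To illustrate why no data needs to move, here is a sketch of such a layout for a 3-drive raid5 growing into a 4-drive raid6; the exact rotation of D and P shown is just an example, and the helper is hypothetical.

	/*
	 * D = data, P = raid5 parity, Q = raid6 syndrome.  With Q pinned
	 * to the last device, the first three columns are an ordinary
	 * raid5 layout, so a raid5 plus a "missing" fourth device is
	 * already a valid singly-degraded raid6 of this shape:
	 *
	 *   stripe 0:   D0  D1  P  | Q
	 *   stripe 1:   D2  P   D3 | Q
	 *   stripe 2:   P   D4  D5 | Q
	 */
	static int q_disk(int raid_disks)
	{
		return raid_disks - 1;	/* Q never rotates: always the last device */
	}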
The next step would be to implement in-place restriping (a rough sketch of one pass follows the list). This involves:
- freezing a section of the array (all IO to that section blocks)
- copying the data out to a safe backup
- copying it back in with the new layout
- updating the metadata to indicate that the restripe has progressed.
This would probably be quite slow but it would achieve the desired result.
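Here is a rough sketch of what one pass over a section might look like; every helper name is hypothetical, and the real work (and difficulty) is hidden inside them.

	#include <errno.h>

	typedef unsigned long long sector_t;

	/* Hypothetical helpers standing in for md internals. */
	int freeze_section(sector_t start, sector_t len);	/* block IO to this range */
	void unfreeze_section(sector_t start, sector_t len);
	int copy_to_backup(sector_t start, sector_t len);	/* e.g. spare space or a file */
	int rewrite_with_new_layout(sector_t start, sector_t len);
	int update_restripe_checkpoint(sector_t progress);	/* record progress in metadata */

	/* Restripe one section of the array in place.  The backup copy is
	 * what makes the operation safe against a crash mid-rewrite. */
	int restripe_section(sector_t start, sector_t len)
	{
		int err;

		err = freeze_section(start, len);
		if (err)
			return err;

		err = copy_to_backup(start, len);
		if (!err)
			err = rewrite_with_new_layout(start, len);
		if (!err)
			err = update_restripe_checkpoint(start + len);

		unfreeze_section(start, len);
		return err;
	}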
Once we have in-place restriping we could change chunksize as well.
raid5 reduce number of devices
We can currently restripe a raid5 (or raid6) over a larger number of devices, but not over a smaller number. That means you cannot undo an increase that you didn't want.
It might be nice to allow this to happen at the same time as increasing --size (if the devices are big enough) so the array can be restriped onto fewer devices without changing the available space. For example, a raid5 over five 1TB devices provides 4TB; restriping it over four devices keeps that 4TB only if each device can supply at least 1.33TB.
cluster raid1
Allow a raid1 to be assembled on multiple hosts that share some drives, so a cluster filesystem (e.g. ocfs2) can be run over it. It requires co-ordination to handle failure events and resync/recovery. Most of this would probably be done in userspace.
Support for 'discard' commands
Flash-based devices, and some high-end storage devices that provide "thin provisioning", like to know when parts of the device are not needed any more so they can optimise their behaviour. Filesystems are starting to add the functionality of sending this information to the block device using a "discard" command. It might be useful for md to make use of this.
md would need to keep a data structure (a bitmap?) listing sections of the array that have been discarded. Initially this might be the whole array. When there is a write to a discarded section, that section would need to be resynced and then the write allowed to complete. A read from a discarded section is probably an error; maybe just return zeroes. When md decides it can discard part of the array, it tells the component devices that they can discard some data too. When the filesystem tells md it can discard part of the array, we might have a problem.
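A sketch of how the write and read paths might consult such a bitmap follows; the one-bit-per-region scheme and all of the names are assumptions for illustration.

	#include <errno.h>
	#include <string.h>

	typedef unsigned long long sector_t;

	/* One bit per fixed-size region; a set bit means "discarded". */
	struct discard_map {
		unsigned char *bits;
		sector_t region_sectors;	/* granularity of the map */
	};

	static int region_discarded(struct discard_map *m, sector_t sector)
	{
		sector_t r = sector / m->region_sectors;
		return (m->bits[r / 8] >> (r % 8)) & 1;
	}

	static void clear_region(struct discard_map *m, sector_t sector)
	{
		sector_t r = sector / m->region_sectors;
		m->bits[r / 8] &= ~(1 << (r % 8));
	}

	/* Hypothetical stand-ins for md internals. */
	int resync_region(sector_t sector);
	int submit_write(sector_t sector, const void *data, size_t len);

	/* Write path: a write into a discarded region first brings that
	 * region back into sync, then proceeds normally. */
	int md_write(struct discard_map *m, sector_t sector,
		     const void *data, size_t len)
	{
		if (region_discarded(m, sector)) {
			int err = resync_region(sector);
			if (err)
				return err;
			clear_region(m, sector);
		}
		return submit_write(sector, data, len);
	}

	/* Read path: a read from a discarded region just returns zeroes. */
	int md_read(struct discard_map *m, sector_t sector, void *buf, size_t len)
	{
		if (region_discarded(m, sector)) {
			memset(buf, 0, len);
			return 0;
		}
		return -ENOSYS;	/* normal read path omitted in this sketch */
	}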
If the discard request is smaller than the granularity that md is using, then we would need to ignore it. So this would only really work if the filesystem guaranteed some minimum size and alignment for its discard requests that md could use as its granularity, or if the filesystem aggregated and always sent the largest 'discard' that it could. I'm not sure whether they will do that, or to what extent.
If md were to maintain a granularity of one sector, then a 16TB array (not at all unrealistic these days) would require 32 billion bits to map what is in use. That is 4 billion bytes, or 4 gigabytes of bitmap: maybe a bit excessive. If we could rely on the filesystem rounding up to at least 1 megabyte (aligned) then only about 2 megabytes of bitmap would be needed, which is a little more realistic.
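The arithmetic, spelled out as a quick standalone calculation (decimal units, as in the text):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long array_bytes = 16ULL * 1000 * 1000 * 1000 * 1000; /* 16 TB */

		/* One bit per 512-byte sector. */
		unsigned long long sector_bits = array_bytes / 512;
		/* One bit per 1 MB region, assuming the filesystem rounds
		 * discards up to (aligned) 1 MB. */
		unsigned long long region_bits = array_bytes / (1000 * 1000);

		printf("per-sector bitmap: %llu bits = %llu bytes (~4 GB)\n",
		       sector_bits, sector_bits / 8);
		printf("per-1MB bitmap:    %llu bits = %llu bytes (~2 MB)\n",
		       region_bits, region_bits / 8);
		return 0;
	}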