Back in early 2006 md/raid5 gained the ability to increase the number of devices in a RAID5, thus making more space available. As you can imagine, this is a slow process as every block of data (except possibly those in the first stripe) needs to be relocated, i.e. read from one place and written to another. md/raid5 allows this reshaping to happen while the array is live. It temporarily blocks access to a few stripes at a time while those stripes are rearranged. So instead of the whole array being unavailable for several hours, little bits are unavailable for a fraction of a second each.
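For concreteness, a minimal sketch of such a grow, assuming the array is /dev/md0 and the new disk is /dev/sdd1 (both names are illustrative, not taken from the text above):

    # Add the new device as a spare, then ask md to reshape onto it.
    mdadm /dev/md0 --add /dev/sdd1
    mdadm --grow /dev/md0 --raid-devices=4

The array stays online throughout, and /proc/mdstat shows the reshape progressing in the background.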
Then in early 2007 we gained the same functionality for RAID6. This was no more complex than for RAID5; it just involved a little more code and testing.
Now, in mid 2009, we have most of the rest of the reshaping options that had been planned. These include changing the chunk size, changing the layout (i.e. where the parity blocks get stored) and reducing the number of devices.
Changing the layout provides valuable functionality as it is an important part of converting between RAID levels. For example, an N-drive RAID0 has exactly the same data layout as an (N+1)-drive RAID5 that keeps all of its parity on the last device, if that last device happens to be missing. So a RAID0 can instantly be treated as a degraded RAID5 with that unusual layout.
I'm sure you can see what comes next. After converting the RAID0 to a degraded RAID5 with an unusual layout, we would use the new change-the-layout functionality to convert it to a real RAID5.
A very similar process can now be used to convert a RAID5 to a RAID6. We first change the RAID5 to a RAID6 with a non-standard layout: the parity blocks are distributed as normal, but the Q blocks all live on the last device (a new device). This is RAID6 using the RAID6 driver, but with a non-RAID6 layout. We then "simply" change the layout and the job is done.
A RAID6 can be converted to a RAID5 by the reverse process. First we change the layout to one that is almost RAID5, but with an extra Q disk. Then we convert to a real RAID5 by forgetting about the Q disk.
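A hedged sketch of what the two directions might look like from the command line, assuming a 4-drive RAID5 on /dev/md0 with one spare already added; the device name, backup paths and the need for a backup file are my assumptions, not taken from the text:

    # RAID5 -> RAID6: md adopts the interim layout with Q on the new last
    # device, then redistributes Q across all devices (the slow part).
    mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/md0-to6.bak

    # RAID6 -> RAID5: the reverse; the layout is first made RAID5-like,
    # then the dedicated Q device is dropped.
    mdadm --grow /dev/md0 --level=5 --raid-devices=4 --backup-file=/root/md0-to5.bak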
In all of this the messiest part is ensuring that the data survives a crash or other system shutdown. With the first reshape, which just allowed increasing the number of devices, this was quite easy. For most of the time there is a gap on the devices between where data in the old layout is being read and where data in the new layout is being written. This gap allows us to have two copies of that data. If we disable writes to a small section while it is being reshaped, then after a crash we know that the old layout still has good data, and we simply re-layout the last few stripes from wherever we recorded that we were up to.
This doesn't work for the first few stripes as they require writing the new layout over the old layout. So after a crash the old layout is probably corrupted and the new layout may be incomplete. So mdadm takes care to make a backup of those first few stripes and when it assembles an array that was still in the early phase of a reshape it first restores from the backup.
For a reshape that does not change the number of devices, such as changing the chunk size or layout, every write over-writes the old layout of that same data. So after a crash there will definitely be a range of blocks for which we cannot know whether they hold the old layout, the new layout, or a bit of both. So we always need a backup of the range of blocks that is currently being reshaped.
This is the most complex part of the new functionality in mdadm 3.1 (which is not released yet but can be found in the devel-3.1 branch of git://neil.brown.name/mdadm). mdadm monitors the reshape, setting an upper bound on how far it can progress at any time and making sure the area that it is allowed to rearrange has writes disabled and has been backed up.
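As an illustration (the array name, chunk size and backup path are assumptions), a chunk-size change keeps the device count unchanged, so mdadm insists on somewhere to keep that rolling backup:

    # Reshape in place; the backup file holds whichever stripes are
    # currently being rewritten, in case of a crash.
    mdadm --grow /dev/md0 --chunk=128 --backup-file=/root/md0-chunk.bak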
This means that all the data is copied twice, once to the backup and once to the new layout on the array. This clearly means that such a reshape will go very slowly. But that is the price we have to pay for safety. It is like insurance. You might hate having to pay it, but you would hate it much more if you didn't and found that you needed it.
One way to avoid the extra cost of doing the backup is to increase the number of devices at the same time: the write position then quickly falls behind the read position, so old data is no longer being overwritten and only the first few stripes ever need backing up. e.g. if you had a 4-drive RAID5 you could convert it to a 6-drive RAID6 with a single command, sketched below.
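The command itself is not reproduced above; a plausible form, assuming the array is /dev/md0 and the two extra drives have already been added as spares, would be:

    # Grow from 4 to 6 devices while changing level; the backup file is
    # only needed for the brief critical section at the start.
    mdadm --grow /dev/md0 --level=6 --raid-devices=6 --backup-file=/root/md0.bak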
The other change worth discussing is the ability to reduce the number of devices. This can be useful to reverse an increase in the number of devices that was not intended, to shrink an array so as to get back a spare device, or as part of migrating an array from several smaller devices to fewer larger devices (thus saving power, for example).
While a reshape that increases the number of devices, or leaves it unchanged, always proceeds from the start of the devices towards the end, one that reduces the number of devices proceeds from the end of the devices towards the beginning. This means that the critical section, where a backup is needed, happens right at the end of the reshape, so mdadm runs in the background watching the array and does the backup just when it is needed.
Naturally, decreasing the number of devices reduces the amount of available space in the array, and the very first write-out to the new arrangement will destroy data that was at the end of the original array - data that hopefully isn't wanted any more. However, people sometimes make mistakes, and as reducing the number of devices is immediately destructive of data, mdadm requires a little more care. In particular, before reducing the number of devices in a RAID5 or RAID6 you must first reduce the size of the array using the new --array-size= option to mdadm --grow. This truncates the array in a non-destructive way. You can then check that the data you care about is still accessible, and when you are sure, use mdadm --grow --raid-disks= to reduce the number of devices.
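A sketch of that sequence for a 4-drive RAID5 being shrunk to 3 drives; the array name, backup path and the placeholder size are assumptions:

    # Step 1: truncate the array non-destructively. The size must be small
    # enough to fit on the reduced set of devices.
    mdadm --grow /dev/md0 --array-size=NEW_SIZE
    # Step 2: once you are satisfied the data is intact, do the destructive
    # reshape down to 3 devices.
    mdadm --grow /dev/md0 --raid-devices=3 --backup-file=/root/md0-shrink.bak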
If you have replaced all the devices with larger devices, you can avoid the need to reduce the size of the array by increasing the component size at the same time as reducing the number of devices, e.g. on a 4-disk RAID5 as sketched below.
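The original command is not shown above; assuming the array is /dev/md0 and that mdadm will accept both changes in a single --grow invocation (which is what "at the same time" suggests), it would look something like:

    # Drop to 3 devices while letting each component grow to the full size
    # of its (larger) underlying device.
    mdadm --grow /dev/md0 --raid-devices=3 --size=max --backup-file=/root/md0.bak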
So what reshaping is currently supported at each level?
RAID1: A RAID1 can change the number of devices or change the size of individual devices. A 2-drive RAID1 can also be converted to a 2-drive RAID5 (see the sketch after this list).
RAID4: A RAID4 can change the number of devices or the size of individual devices. It cannot be converted to RAID5 yet (though that should be trivial to implement).
RAID5: A RAID5 can change the number of devices, the size of the individual devices, the chunk size and the layout. A 2-drive RAID5 can be converted to RAID1, and a RAID5 with 3 or more drives can be converted to RAID6.
RAID6: A RAID6 can change the number of devices, the size of the individual devices, the chunk size and the layout. A RAID6 can be converted to RAID5 by first changing the layout to be similar to RAID5, then changing the level.
LINEAR: A LINEAR array can have a device added to it, which will simply increase its size.
RAID0 and RAID10: These arrays cannot be reshaped at all at present.
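As the concrete example promised in the RAID1 entry above (the array name is an assumption), converting a 2-drive RAID1 to a 2-drive RAID5 needs no data movement, since a 2-drive RAID5 stores the same blocks on each device as a RAID1 does; further grows then become possible:

    # Instant level change: no data needs to be relocated.
    mdadm --grow /dev/md0 --level=5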
While conversion from RAID0 to RAID5 was used as an example above, it isn't actually implemented yet. md/RAID0 is broader than the regular definition of RAID0 in that the devices can be different sizes. If they are, then the array cannot be converted to RAID5 at all, so a general conversion is not possible.
It would still be good to support a simple conversion for the case where the RAID0 does use the same amount of space from each device. It would also be nice to teach RAID5 to handle arrays with devices of different sizes. There are some complications there, as you could have a hot spare that can replace some devices but not all. There would also be a need to store the active sizes of all devices in the metadata - something RAID0 doesn't need as it doesn't have to cope with missing devices.
If we did have support for variable drive sizes in a RAID5, then we could implement extending a RAID0 by converting it to a degraded RAID5, performing the extension there using common code, and then converting back to RAID0.
It would also be nice to add some reshape options to RAID10. Currently you cannot change a RAID10 at all. The first step here would be enumerating exactly which conversions make sense with the various possible layouts (near/far/offset), and then finding how to implement them most easily. But that is a job for another day.