03 October 2010, 04:48 UTC

I've been generally quite pleased with mdadm for managing Linux RAIDs, but there is one reshaping process I have not been happy with, and I thought I would see if there is a better way to do it. I refer to growing a RAID (RAID-5 in this case) not by adding disks but by replacing the disks with larger ones over time. As you know, disks keep getting cheaper, so I have been through this process 3 times now, taking my 3 disks from 300GB to 1TB in various steps. I don't allocate the whole disk to the RAID; I use partitions, but that doesn't change things much.

The current method involves replacing all the disks, one by one, with a rebuild each time. But I typically replace a single disk at a time, over time, always buying a new, larger disk at the current pricing sweet spot. Doing 3 (or N) rebuilds in a row is a risky proposition. Any failure during one of those rebuilds and you have hosed your RAID -- even though the data to rebuild is still there on the old drive you marked as faulty.

So when you have three 500GB disks and you buy a 750GB, you still only use 500GB of the 750GB drive, of course. You could allocate the whole drive to the RAID, but 250GB of disk is usually worth having, so it is much nicer to use that 250GB for something else temporarily, like online backup (which does not need redundancy) or scratch space. Then you do it with a 2nd drive. Finally the day comes when you have 2x750GB and a 500GB, and you replace the 500GB -- not with a 750GB, but with a 1TB, because prices have dropped. Now that you have put in the 1TB, you can allocate either 750GB or the whole 1TB to the RAID -- again, that extra disk space is handy, so why leave it unused for a year? Either way you allocate at least 750GB, and it rebuilds a 500GB RAID component into it. Then you go to those two 750GB drives, which both have only 500GB in use.
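
To make the sizes in this walkthrough concrete, here is a minimal Python sketch of the arithmetic. It assumes the usual RAID-5 rule that usable capacity is (n - 1) times the smallest component; the function names are hypothetical and exist only for illustration.

```python
def raid5_usable_gb(component_sizes_gb):
    """Usable RAID-5 capacity: (n - 1) * smallest component (assumed rule)."""
    n = len(component_sizes_gb)
    if n < 2:
        raise ValueError("need at least 2 components")
    return (n - 1) * min(component_sizes_gb)


def leftover_gb(disk_sizes_gb, component_sizes_gb):
    """Space on each disk that is not part of its RAID component."""
    return [d - c for d, c in zip(disk_sizes_gb, component_sizes_gb)]


# Three 500GB components: the starting point.
print(raid5_usable_gb([500, 500, 500]))                   # 1000

# One disk swapped for a 750GB drive: the array is no bigger,
# but 250GB on the new drive is free for scratch space.
print(leftover_gb([750, 500, 500], [500, 500, 500]))      # [250, 0, 0]

# End state: 750GB components on the 750GB, 750GB and 1TB drives.
print(raid5_usable_gb([750, 750, 750]))                   # 1500
print(leftover_gb([750, 750, 1000], [750, 750, 750]))     # [0, 0, 250]
```

The array only gets bigger once the smallest component does, which is why nothing is gained until the last small disk is gone.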

You have to rebuild them, too. You remove the partition and the spare one after it, create a new, larger 750GB partition, and re-add it to the RAID. And lo, it rebuilds everything, even though it is in general writing back to the drive exactly what is already on it. About all that really happens is that it moves and updates the metadata block, if I understand this right. This is the part that bothers me. I put my array at risk by running it degraded through a rebuild that accomplishes almost nothing.

Then I do it again for the other drive. Finally I have a RAID on three 750GB partitions using 500GB of each, and with yet another operation I can grow that to size=max and have my 3x750GB RAID.
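
The whole sequence for the two 750GB drives can be sketched as a dry-run plan. This Python snippet only prints commands and runs nothing. The device names (/dev/md0, /dev/sda2, /dev/sdb2) are placeholders, the repartitioning step is left as a manual comment, and the mdadm invocations are the standard --fail, --remove, --add, --wait and --grow --size=max forms as I understand them, so check them against your mdadm man page before use.

```python
def enlarge_components_plan(array, components):
    """List the steps to enlarge each component the slow way, then grow.

    Every component costs one degraded period and one full rebuild.
    """
    steps = []
    for part in components:
        steps += [
            f"mdadm {array} --fail {part}",
            f"mdadm {array} --remove {part}",
            f"# manually delete {part} and recreate it larger (fdisk, parted, ...)",
            f"mdadm {array} --add {part}",
            f"mdadm --wait {array}",   # block until the rebuild has finished
        ]
    steps.append(f"mdadm --grow {array} --size=max")
    return steps


# Hypothetical device names, for illustration only.
plan = enlarge_components_plan("/dev/md0", ["/dev/sda2", "/dev/sdb2"])
for step in plan:
    print(step)

rebuilds = sum(1 for s in plan if "--add" in s)
print(f"full rebuilds while degraded: {rebuilds}")
```

Each pass through the loop is a window in which a single read error on another drive can take the array down, which is the risk described above.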

What could fix this?

a) A tool to allow me to grow a RAID partition into the empty space beyond it on the disk. I.e., modify a program like gparted to be able to grow RAID component partitions up into empty space, updating their metadata. Or possibly even move them if the RAID is offline.

b) Since this is obviously easier to do if the metadata is at the front, encourage people to create RAID-5s they plan to move with version 1.2 metadata. And/or offer a means to change the metadata format of an array from 0.90 to 1.2.
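
To illustrate why metadata at the front helps, here is a small Python sketch of where the superblock sits for each metadata version. The offsets are my reading of the md on-disk layouts (0.90: 64KiB-aligned, at least 64KiB before the end; 1.0: 4KiB-aligned, at least 8KiB before the end; 1.1: at the start; 1.2: 4KiB from the start) and should be treated as an assumption to verify, not a reference. The function name is hypothetical.

```python
SECTOR = 512  # bytes per sector, as md counts them


def superblock_offset_sectors(version, device_sectors):
    """Sector at which the md superblock starts (assumed layouts)."""
    if version == "0.90":
        reserved = 64 * 1024 // SECTOR            # 128 sectors
        return (device_sectors & ~(reserved - 1)) - reserved
    if version == "1.0":
        return (device_sectors - 16) & ~7         # 8KiB back, 4KiB aligned
    if version == "1.1":
        return 0
    if version == "1.2":
        return 8                                  # 4KiB from the start
    raise ValueError(f"unknown metadata version: {version}")


old = 500 * 10**9 // SECTOR   # a 500GB component
new = 750 * 10**9 // SECTOR   # the same partition grown to 750GB

for version in ("0.90", "1.0", "1.1", "1.2"):
    before = superblock_offset_sectors(version, old)
    after = superblock_offset_sectors(version, new)
    print(version, before, after, "moves" if before != after else "stays put")
```

With 0.90 or 1.0 metadata the superblock has to be relocated when the partition grows; with 1.1 or 1.2 it stays where it is and only the recorded sizes would need updating.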

These functions would help a lot, but another function in the kernel engine would improve the robustness of the RAID, namely a way to hot replace a working drive. In a hot replace, the drive to be removed would stay in the system alongside the new drive, which is added as a spare. Instead of marking the old drive "faulty", we would mark it to be replaced. The system would then start building on the new spare, either as normal or just by copying blocks from the original, whichever works best. However, in the event of any read error during this procedure, all the original drives would be available to supply the parity info needed to survive the failure.
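
The difference can be shown with a toy model. This Python sketch is not how md works internally; it is a simplified 3-drive array with one integer per block and parity always on the last drive (no rotation), and all names are hypothetical. It compares a hot replace, which copies from the outgoing drive and falls back to parity on a read error, with a conventional degraded rebuild.

```python
from functools import reduce
from operator import xor


class ReadError(Exception):
    pass


class Drive:
    def __init__(self, blocks, bad=()):
        self.blocks = list(blocks)
        self.bad = set(bad)          # indexes of unreadable blocks

    def read(self, i):
        if i in self.bad:
            raise ReadError(i)
        return self.blocks[i]


def degraded_rebuild(drives, missing):
    """Old drive already removed: every block must come from parity."""
    others = [d for k, d in enumerate(drives) if k != missing]
    size = len(others[0].blocks)
    return Drive(reduce(xor, (d.read(i) for d in others)) for i in range(size))


def hot_replace(drives, outgoing):
    """Old drive still present: copy from it, use parity only on errors."""
    others = [d for k, d in enumerate(drives) if k != outgoing]
    new_blocks, reconstructed = [], 0
    for i in range(len(drives[outgoing].blocks)):
        try:
            block = drives[outgoing].read(i)
        except ReadError:
            block = reduce(xor, (d.read(i) for d in others))
            reconstructed += 1
        new_blocks.append(block)
    return Drive(new_blocks), reconstructed


rows = [(1, 2), (3, 4), (5, 6), (7, 8)]           # two data blocks per stripe
drives = [Drive(a for a, b in rows),
          Drive(b for a, b in rows),
          Drive(a ^ b for a, b in rows)]          # parity drive
drives[0].bad = {1}    # the outgoing drive has one unreadable block
drives[1].bad = {3}    # a remaining drive has a different unreadable block

new, reconstructed = hot_replace(drives, outgoing=0)
print(new.blocks, reconstructed)                  # [1, 3, 5, 7] 1

try:
    degraded_rebuild(drives, missing=0)
except ReadError as err:
    print("degraded rebuild fails at stripe", err)
```

In this model the hot replace survives a bad block on either drive as long as the two errors fall in different stripes, while the degraded rebuild is lost on the first read error from a remaining drive.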

Without this, we have a situation during a replacement where the RAID has made our data less protected rather than more, which is not the philosophy of redundant drives.

Thoughts? To my mind this is the most common type of reshaping I have actually had to do, so while it's nice to convert a 5 to a 6, this is what I think many people would like.

- Brad Templeton