log segments and RAID6 reshaping

08 March 2011, 07:47 UTC

Part of the design approach of LaFS - and any other log structured filesystem - is to divide the device space into relatively large segments. Each segment is many megabytes in size so the time to write a whole segment is much more than the time to seek to a new segment. Writes happen sequentially through a segment, so write throughput should be as high as the device can manage.

(Obviously there needs to be a way to find or create segments with no live data so they can be written to. This is called cleaning and will not be discussed further here.)

One of the innovations of LaFS is to allow segments to be aligned with the stripes in a RAID5 or RAID6 array so that each segment is a whole number of stripes and so that LaFS knows the details of the layout including chunk size and width (number of data devices).

This allows LaFS to always write in whole 'strips' - where a 'strip' is one block from each device chosen such that they all contribute to the one parity block. Blocks in a strip may not be contiguous (they only are if the chunk size matches the block size), so one would not normally write a single strip. However doing so is the most efficient way to write to RAID6 as no pre-reading is needed. So as LaFS knows the precise geometry and is free to choose where it writes, it can easily write just a strip if needed. It can also pad out the write with blocks of NULs to make sure a whole strip is written each time.

Normally one would hope that several strips would be written at once, hopefully a whole stripe or more, but it is very valuable to be able to write whole strips at a time.
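
As a rough illustration of the padding described above, here is a minimal sketch that works out how many strips a write needs and how many NUL blocks must be added to fill the last one. The function name pad_to_strips and its arguments are made up for this example; it is not LaFS code.

```python
def pad_to_strips(n_blocks: int, width: int) -> tuple[int, int]:
    """Return (strips, pad_blocks) for a write of n_blocks data blocks.

    width is the number of data devices, i.e. the number of data
    blocks in one strip.  pad_blocks is how many NUL blocks must be
    appended so the write ends on a strip boundary.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    if n_blocks < 0:
        raise ValueError("n_blocks must not be negative")
    strips = -(-n_blocks // width)      # ceiling division
    pad_blocks = strips * width - n_blocks
    return strips, pad_blocks


if __name__ == "__main__":
    # 10 blocks on a 4-data-device array: 3 strips, 2 NUL blocks of padding.
    print(pad_to_strips(10, 4))
```

Because every write then covers complete strips, the parity blocks can be computed from the data being written and nothing has to be read back first.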

This is lovely in theory but in practice there is a problem. People like to make their RAID6 arrays bigger, often by adding one or two devices to the array and "restriping" or "reshaping" the array. When you do this the geometry changes significantly and the alignment of strips and stripes and segments will be quite different. Suddenly the efficient IO practice of LaFS becomes very inefficient.

There are two ways to address this: one which I have had in mind since the beginning, and one which only occurred to me recently.

Answer 1 - Two-dimensional arrays

The first idea was to create a new sort of RAID6 array. Rather than striping the blocks across all devices, we differentiate 'data' devices from 'parity' devices (like RAID4) and concatenate the data devices (like md/linear). The parity devices, whether there are 1 or 2 of them, still hold the same sort of parity data as before, computed from data blocks across all devices, but the logical addresses of those blocks would be very different.

I would probably arrange the addressing so that there is space between the devices to allow for the individual devices to be grown. So maybe 40 address bits address sectors within a device, and 8 bits address devices in the array.
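
To make that address split concrete, here is a small sketch of packing and unpacking such an address. The 40/8 split comes from the paragraph above; the names pack_addr and unpack_addr, and putting the device number in the high bits, are assumptions for illustration only.

```python
SECTOR_BITS = 40   # sector offset within one device
DEVICE_BITS = 8    # which data device in the array

SECTOR_MASK = (1 << SECTOR_BITS) - 1


def pack_addr(device: int, sector: int) -> int:
    """Combine a device number and a sector offset into one logical address."""
    if not 0 <= device < (1 << DEVICE_BITS):
        raise ValueError("device number out of range")
    if not 0 <= sector <= SECTOR_MASK:
        raise ValueError("sector offset out of range")
    return (device << SECTOR_BITS) | sector


def unpack_addr(addr: int) -> tuple[int, int]:
    """Split a logical address back into (device, sector)."""
    return addr >> SECTOR_BITS, addr & SECTOR_MASK


if __name__ == "__main__":
    addr = pack_addr(3, 12345)
    print(hex(addr), unpack_addr(addr))
```

With this layout each device owns a fixed 2^40-sector window of the address space, so growing one device never moves the addresses of blocks on any other device.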

If we make the first 2 devices the parity devices, then we can add a device to the array at any time by zeroing it and instantly merging it into the array (no reshape needed at that point - the parity blocks will still be correct). We then tell LaFS and it can see the extra space and will know each strip is a little bit bigger.
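
The claim that a zeroed device can be merged in without touching parity can be checked with a small sketch. It assumes the usual RAID6 construction: P is the XOR of the data blocks and Q is a weighted sum over GF(2^8) using the polynomial 0x11d and generator 2, with the weight depending only on the data device's index. One byte stands in for each block, and all function names are made up for the example.

```python
def gf_mul2(x: int) -> int:
    """Multiply by 2 in GF(2^8) with reduction polynomial 0x11d."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
    return x & 0xff


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = gf_mul2(a)
        b >>= 1
    return result


def gf_pow2(i: int) -> int:
    """Return 2**i in GF(2^8), the Q weight for data device i."""
    x = 1
    for _ in range(i):
        x = gf_mul2(x)
    return x


def parity(strip: list[int]) -> tuple[int, int]:
    """Return (P, Q) for one byte from each data device in a strip."""
    p = 0
    q = 0
    for i, d in enumerate(strip):
        p ^= d
        q ^= gf_mul(gf_pow2(i), d)
    return p, q


if __name__ == "__main__":
    strip = [0x12, 0x34, 0x56]
    # Appending a zeroed device leaves both parity values unchanged.
    print(parity(strip) == parity(strip + [0]))
```

A zero byte contributes nothing to either sum, which is why the new device has to be appended after the existing data devices rather than inserted between them: inserting would change the Q weights of the devices that follow.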

At first the extra space would be spread out evenly over all of the segments. The extra added to unused segments would be immediately available. The extra added to in-use segments (which is probably most of the segments if we are bothering to add space) will not be available until those segments are "cleaned" and have their data relocated into other (now bigger) segments. So the space becomes available gradually as the cleaning progresses.

So having created a 2-D RAID6 layout that only LaFS could make any use of, you add devices by zeroing them, adding them to the array, and telling LaFS. Some space becomes available instantly, but most takes a while and a lot of 'cleaning' to arrive.

Answer 2 - Choosing segment size carefully

The problem with Answer 1 is that it needs careful forethought to create the right sort of array. People who have just created a regular RAID6 might find it works well, and then performance goes through the floor when they reshape the array. The important observation I made recently, though, was that performance doesn't have to go through the floor.

LaFS only uses its knowledge of strip geometry when writing, not when reading. So the fine details can change as long as the coarse details don't. In this case the fine details are the chunk size and width, while the coarse details are the segment size.

Now RAID6 arrays tend not to be very very large. 2 to 6 data devices is fairly common. Up to maybe 12 isn't unusual. More than that is probably asking for trouble as the probability of multiple errors gets a lot higher than RAID6 can really protect you from. It becomes better to have multiple RAID6 arrays and join them together.

So the 'width' of a RAID6 array is probably one of 2,3,4,5,6,7,8,9,10,11,12. If we take the simple expedient of excluding '11', then all of these divide 2520.

So if we always create LaFS filesystems with a segment size which is a multiple of 2520 blocks, and probably a few more multiples of 2 to allow for larger chunk sizes, we have a segment size that can always align with whole stripes no matter what size the RAID6 is reshaped to - providing it doesn't have exactly 11 or more than 12 data disks.
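
The arithmetic behind 2520 is easy to check: it is the least common multiple of the widths 2 to 12 with 11 left out. The sketch below recomputes it and then lists which widths a given segment size aligns with for a given chunk size. The helper whole_stripe_widths is hypothetical, and it assumes a stripe holds width times chunk-size data blocks.

```python
from functools import reduce
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


# Every width from 2 to 12 data devices, excluding 11.
WIDTHS = [w for w in range(2, 13) if w != 11]
BASE = reduce(lcm, WIDTHS)


def whole_stripe_widths(segment_blocks: int, chunk_blocks: int,
                        max_width: int = 12) -> list[int]:
    """List the widths for which a segment is a whole number of stripes.

    A stripe is assumed to hold width * chunk_blocks data blocks.
    """
    return [w for w in range(2, max_width + 1)
            if segment_blocks % (w * chunk_blocks) == 0]


if __name__ == "__main__":
    print(BASE)
    # A segment of 2520 * 16 blocks with 16-block chunks.
    print(whole_stripe_widths(BASE * 16, 16))
```

Multiplying the base by the chunk size (here 16 blocks) is one way to read the "few more multiples of 2": it keeps the segment a whole number of stripes, not just a whole number of strips, for every supported width.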

While the reshape is happening, LaFS won't be able to write perfectly sized strips as the perfect size will be changing, but we don't expect peak performance during a reshape anyway. Once the reshape has finished we can tell LaFS of the new arrangement and it can start using the new size and start writing whole strips again.

As with Answer 1, care is still needed, but the care is different. Choosing the right cluster size can be performed by mkfs.lafs, so the user can create a LaFS on any device and it should work well. Choosing the right number of devices to grow to needs a little care (avoid 11, 13), but this will be done after the users have been working with LaFS for a while and hopefully will have read the documentation.

So in this case, the extra space doesn't become available until the reshape completes, but then it all becomes available instantly without any cleaning.

Answer 1 is probably the best answer as the rearranging of blocks can be spread out and writes are always optimal, but Answer 2 is a fairly good second-best, particularly if 2-D RAID6 arrays are not available for some reason (I haven't actually coded them yet ... but then I haven't coded all of LaFS yet either).