It is about 2 years since I last published a road-map for md/raid so I thought it was time for another one. Unfortunately quite a few things on the previous list remain undone, but there has been some progress.
I think one of the problems with some to-do lists is that they aren't detailed enough. High-level design, low level design, implementation, and testing are all very different sorts of tasks that seem to require different styles of thinking and so are best done separately. As writing up a road-map is a high-level design task it makes sense to do the full high-level design at that point so that the tasks are detailed enough to be addressed individually with little reference to the other tasks in the list (except what is explicit in the road map).
A particular need I am finding for this road map is to make explicit the required ordering and interdependence of certain tasks. Hopefully that will make it easier to address them in an appropriate order, and mean that I waste less time saying "this is too hard, I might go read some email instead".
So the following is a detailed road-map for md raid for the coming months.
# Bad Block Log
# Hot Replace
# Reversible Reshape
# Change data offset during reshape
# Bitmap of non-sync regions.
# Assume-clean when increasing array --size
# Enable 'reshape' to also perform 'recovery'.
# When reshaping an array to fewer devices, allow 'size' to be increased
# Allow write-intent-bitmap to be added to an array during reshape/recovery.
# Support resizing of write-intent-bitmap prior to reshape
# Support reshape of RAID10 arrays.
# Better reporting of inconsistencies.
As devices grow in capacity, the chance of finding a bad block increases, and the time taken to recover to a spare also increases. So the practice of ejecting a device from the array as soon as a write-error is detected is getting more and more problematic.
For some time we have avoided ejecting devices for read errors, by computing the expected data from elsewhere and writing back to the device - hopefully fixing the read error. However this cannot help degraded arrays and they will still eject a device (and hence fail the whole array) on a single read error. This is not good.
A particular problem is that when a device does fail and we need to recover the data, we typically read all of the other blocks on all of the other devices. If we are going to hit any read errors, this is the most likely time. It is also the worst possible time, as it means that the recovery doesn't complete and the array gets stuck in a degraded state where it is very susceptible to substantial loss if another failure happens.
Part of the answer to this is to implement a "bad block log". This is a record of blocks that are known to be bad, i.e. either a read or a write has recently failed. Doing this allows us to just eject that block from the array rather than the whole device. Similarly instead of failing the whole array, we can fail just one stripe. Certainly this can mean data loss, but the loss of a few K is much less traumatic than the loss of a terabyte.
But using a bad block list isn't just about keeping the data loss small, it can be about keeping it to zero. If we get a write error on a block in a non-degraded array, then recording the bad block means we lose redundancy in just that stripe rather than losing it across the whole array. If we then lose a different block on a different drive, the ability to record the bad block means that we can continue without data loss. Had we needed to eject both whole drives from the array we would have lost access to all of our data.
The bad block list must be recorded to stable storage to be useful, so it really needs to be on the same drives that store the data. The bad-block list for a particular device is only of any interest to that device. Keeping information about one device on another is pointless. So we don't have a bad block list for the whole array, we keep multiple lists, one for each device.
It would be best to keep at least two copies of the bad block list so that if the place where the list is stored goes bad we can keep working with the device. However the same logic applies to other metadata which currently cannot be duplicated. So implementing this feature will not address metadata redundancy. A separate feature should address metadata redundancy and it can duplicate the bad block list as well as other metadata.
There are doubtless lots of ways that the bad block list could be stored, but we need to settle on one. For externally managed metadata we need to make the list accessible via sysfs in a generic way so that a user-space program can store it as appropriate.
So: for v0.90 we choose not to store a bad block list. There isn't anywhere convenient to store it and new installations of v0.90 are not really encouraged.
For v1.x metadata we record in the metadata an offset (from the superblock) and a size for a table, and a 'shift' value which can be used to shift from sector addresses to block numbers. Thus the unit that is failed when an error is detected can be larger than one sector.
Each entry in the table is 64bits in little-endian. The most significant 55 bits store a block number which allows for 16 exbibytes with 512byte blocks, or more if a larger shift size is used. The remaining 9 bits store a length of the bad range which can range from 1 to 512. As bad blocks can often be consecutive, this is expected to allow the list to be quite efficient. A value of all 1's cannot correctly identify a bad range of blocks and so it is used to pad out the tail of the list.
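The entry layout can be illustrated with a small Python sketch. It is only an illustration of the description above, not md code. It assumes the 9-bit length field holds "length minus one" (so that 1 to 512 fits in 9 bits); the function names are made up for this example.

```python
import struct

BLOCK_BITS = 55                 # most significant bits: block number
LEN_BITS = 9                    # least significant bits: length of the range
MAX_LEN = 1 << LEN_BITS         # 512
PAD_ENTRY = (1 << 64) - 1       # all 1's: pads out the tail of the list


def sector_to_block(sector, shift):
    """Convert a sector address to a block number using the 'shift' value."""
    return sector >> shift


def pack_entry(block, length):
    """Encode one bad range as 8 little-endian bytes.

    Assumption: the length field stores length - 1.
    """
    if not 0 <= block < (1 << BLOCK_BITS):
        raise ValueError("block number does not fit in 55 bits")
    if not 1 <= length <= MAX_LEN:
        raise ValueError("length must be between 1 and 512")
    value = (block << LEN_BITS) | (length - 1)
    if value == PAD_ENTRY:
        raise ValueError("all 1's is reserved for padding")
    return struct.pack("<Q", value)


def unpack_entry(raw):
    """Decode 8 bytes into (block, length), or None for a padding entry."""
    (value,) = struct.unpack("<Q", raw)
    if value == PAD_ENTRY:
        return None
    return value >> LEN_BITS, (value & (MAX_LEN - 1)) + 1
```

With a shift of 3, for example, sector 4096 falls in block 512, and a run of 8 bad blocks starting there needs just one table entry.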
The bad block list is exposed through sysfs via a directory called 'badblocks' containing several attribute files.
"shift" stores the 'shift' number described above and can be set as long as the bad block list is empty.
"all" and "unacknowledged" each contain a list of bad ranges, giving the start (in blocks, not sectors) and the length (1-512) of each. Each can also be written to with a string of the same format as is read out. This can be used to add bad blocks to the list or to acknowledge bad blocks. Writing effectively says "this bad range is securely recorded on stable storage".
All bad blocks appear in the "badblocks/all" file. Only "unacknowledged" bad blocks appear in "badblocks/unacknowledged". These are ranges which appear to be bad but are not yet known to be stored on stable storage.
When md detects a write error, or a read error which it cannot correct, it adds the block to the list and marks the range that it is part of as 'unacknowledged'. Any write that depends on this block is then blocked until the range is acknowledged. This ensures that an application isn't told that a write has succeeded until the data really is safe.
If the bad block list is being managed by v1.x metadata internally, then the bad block list will be written out and the ranges will be acknowledged and writes unblocked automatically.
If the bad block list is being managed externally, then the bad ranges will be reported in "badblocks/unacknowledged". The metadata handler should read this, update the on-disk metadata and write the range back to "badblocks/all". This completes the acknowledgment handshake and writes can continue.
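The handshake for externally managed metadata might look roughly like the following Python sketch. It assumes each line read from sysfs has the form "start length"; the exact text format is not specified here. `persist` and `write_back` are hypothetical callables supplied by the metadata handler: the first records a range in the on-disk metadata, the second writes a string to the sysfs file that accepts acknowledgements.

```python
def parse_ranges(text):
    """Parse lines of 'start length' into (start, length) tuples."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            ranges.append((int(fields[0]), int(fields[1])))
    return ranges


def acknowledge_pending(unacknowledged_text, persist, write_back):
    """Record each unacknowledged range, then acknowledge it to md.

    The order matters: a range is only written back once persist() has
    returned, i.e. once it is safely on stable storage.
    """
    acknowledged = []
    for start, length in parse_ranges(unacknowledged_text):
        persist(start, length)
        write_back(f"{start} {length}\n")
        acknowledged.append((start, length))
    return acknowledged
```

A real handler would run this whenever the list of unacknowledged ranges changes, since writes that depend on those blocks stay blocked until the write-back happens.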
RAID1, RAID10 and RAID456 should all support bad blocks. Every read or write should perform a lookup of the bad block list. If a read finds a bad block, that device should be treated as failed for that read. This includes reads that are part of resync or recovery.
If a write finds a bad block there are two possible responses. Either the block can be ignored as with reads, or we can try to write the data in the hope that it will fix the error. Always taking the second action would seem best as it allows blocks to be removed from the bad-block list, but as a failing write can take a long time, there are plenty of cases where it would not be good.
To choose between these we make the simple decision that once we see a write error we never try to write to bad blocks on that device again. This may not always be the perfect strategy, but it will effectively address common scenarios. So if a bad block is marked bad due to a read error when the array was degraded, then a write (presumably from the filesystem) will have the opportunity to correct the error. However if it was marked bad due to a write error we don't risk paying the penalty of more write errors.
This 'have seen a write error' status is not stored in the array metadata. So when restarting an array with some bad blocks, each device will have one chance to prove that it can correctly handle writes to a bad block. If it can, the bad block will be removed from the list and the data is that little bit safer. If it cannot, no further writes to bad blocks will be tried on the device until the next array restart.
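The write policy can be modelled in a few lines of Python. This is a toy model of the decision only; the class and method names are invented for the example and the bad block list is simplified to a set of block numbers.

```python
class DeviceBadBlocks:
    """Toy model of one device's bad block list and its write policy."""

    def __init__(self, bad_blocks=()):
        self.bad = set(bad_blocks)
        # Not stored in the metadata, so it starts out False again
        # every time the array is restarted.
        self.seen_write_error = False

    def is_bad(self, block):
        return block in self.bad

    def should_write(self, block):
        """Good blocks are always written; bad blocks only until the
        device has shown its first write error."""
        return block not in self.bad or not self.seen_write_error

    def write_completed(self, block, ok):
        """Update the list after a write that was actually attempted."""
        if ok:
            self.bad.discard(block)     # the write fixed the block
        else:
            self.bad.add(block)
            self.seen_write_error = True
```

So a block marked bad by a read error on a degraded array still gets its chance to be repaired by a later write, while a device that has already failed a write is not asked to retry its bad blocks.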
"Hot replace" is my name for the process of replacing one device in an array by another one without first failing the one device. Thus there can be two devices in an array filling the same 'role'. One device will contain all of the data, the other device will only contain some of it and will be undergoing a 'recovery' process. Once the second device is fully recovered it is expected that the first device will be removed from the array.
This can be useful whenever you want to replace a working device with another device, without letting the array go degraded. Two obvious cases are:
1/ when you want to replace a smaller device with a larger device
2/ when you have a device with a number of bad blocks and want to replace it with a more reliable device.
For '2' to be realised, the bad block log described above must be implemented, so it should be completed before this feature.
Hot replace is really only needed for RAID10 and RAID456. For RAID1, simply increasing the number of devices in the array while the new device recovers, then failing the old device and decreasing the number of devices in the array is sufficient.
For RAID0 or LINEAR it would be sufficient to:
- stop the array
- make a RAID1 without superblocks for the old and new device
- re-assemble the array using the RAID1 in place of the old device.
This is certainly not as convenient but is sufficient for a case that is not likely to be commonly needed.
So for both the RAID10 and RAID456 modules we need:
- the ability to add a device as a hot-replace device for a specific slot
- the ability to record hot-replace status in the metadata.
- a 'recovery' process to rebuild a device, preferably only reading from the device to be replaced, though reading from elsewhere when needed
- writes to go to both primary and secondary device.
- reads to come from either if the second has recovered far enough.
- to promote a secondary device to primary when the primary device (that has a hot-replace device) fails.
It is not clear whether the primary should be automatically failed when the rebuild of the secondary completes. Commonly this would be ideal, but if the secondary experienced any write errors (that were recorded in the bad block log) then it would be best to leave both in place until the sysadmin resolves the situation. So in the first implementation this failing should not be automatic.
The identification of a spare as a 'hot-replace' device is achieved through the 'md/dev-XXXX/slot' sysfs attribute. This is usually 'none' or a small integer identifying which slot in the array is filled by this device. If a number followed by a plus (e.g. '1+') is written, then the device takes the role of a hot-replace device for that slot. This syntax requires there be at most one hot-replace device per slot. This is a deliberate decision to manage complexity in the code. Allowing more would be of minimal value but require substantial extra complexity.
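As an illustration of the syntax, here is a small Python parser for values of the 'slot' attribute. The function name and the returned tuple are choices made for this example only.

```python
def parse_slot(value):
    """Parse a 'slot' value into (slot, is_hot_replace).

    'none' -> (None, False)
    '3'    -> (3, False)   ordinary member of slot 3
    '1+'   -> (1, True)    hot-replace device for slot 1
    """
    text = value.strip()
    if text == "none":
        return None, False
    hot_replace = text.endswith("+")
    digits = text[:-1] if hot_replace else text
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"unrecognised slot value: {value!r}")
    return int(digits), hot_replace
```

Because the plus sign is a single suffix on the slot number, there is no way to name a second replacement for the same slot, which matches the one-per-slot restriction.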
v0.90 metadata is not supported. v1.x sets a 'feature bit' on the superblock of any 'hot-replace' device and naturally records in 'recover_offset' how far recovery has progressed. Externally managed metadata can support this, or not, as they choose.
It is possible to start a reshape that cannot be reversed until the reshape has completed. This is occasionally problematic. While we might hope that users would never make errors, we should try to be as forgiving as possible.
Reversing a reshape that changes the number of data-devices is possible as we support both growing and shrinking and these happen in opposite directions so one is the reverse of the other. Thus at worst, such a reshape can be reversed by:
- stopping the array
- re-writing the metadata so it looks like the change is going in the other direction
- restarting the array.
However for a reshape that doesn't change the number of data devices, such as a RAID5->RAID6 conversion or a change of chunk-size, reversal is currently not possible as the change always goes in the same direction.
This is currently only meaningful for RAID456, though at some later date it might be relevant for RAID10.
A future change will make it possible to move the data_offset while performing a reshape, and that will sometimes require the reshape to progress in a certain direction. It is only when the data_offset is unchanged and the number of data disks is unchanged that there is any doubt about direction. In that case it needs to be explicitly stated.
So we need:
- some way to record in the metadata the direction of the reshape
- some way to ask for a reshape to be started in the reverse direction
- some way to reverse a reshape that is currently happening.
We have a new sysfs attribute "reshape_direction" which is "low-to-high" or "high-to-low". This defaults to "low-to-high" but will be forced to "high-to-low" if the particular reshape requires it, or can be explicitly set by a 'write' before the reshape commences.
Once the reshape has commenced, writing a new value to this field can flip the reshape causing it to be reverted.
In both v0.90 and v1.x metadata we record a reversing reshape by setting the most significant bit in reshape_position. For v0.90 we also increase the minor number to 91. For v1.x we set a feature bit as well.
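The use of the most significant bit can be sketched as follows in Python. This assumes reshape_position is handled as a 64-bit value; the helper names are made up, and the minor-number and feature-bit changes are not modelled.

```python
REVERSE_FLAG = 1 << 63      # most significant bit of a 64-bit field


def encode_reshape_position(position, reversing):
    """Combine a position with the 'reversing' flag."""
    if not 0 <= position < REVERSE_FLAG:
        raise ValueError("position must fit in the low 63 bits")
    return (position | REVERSE_FLAG) if reversing else position


def decode_reshape_position(raw):
    """Split a stored value into (position, reversing)."""
    return raw & (REVERSE_FLAG - 1), bool(raw & REVERSE_FLAG)
```

An old kernel that does not know about the flag would see an implausibly large position, which is presumably why the version and feature-bit changes are needed as well.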
One of the biggest problems with reshape currently is the need for the backup file. This is a management problem as it cannot easily be found at restart, and it is a performance problem as the extra writing is expensive.
In some cases we can avoid the need for a backup file completely by changing the data-offset. i.e. the location on the devices where the array data starts.
For reshapes that increase the number of devices, only a single backup is required at the start. If the data_offset is moved just one chunk earlier we can do without a separate backup. This obviously requires that space was left when the array was first created. Recent versions of mdadm do leave some space with the default metadata, though more would probably be good.
For reshapes that decrease the number of devices, only a small backup is required right at the end of the process (at the beginning of the devices). If we move the data_offset forward by one chunk that backup too can be avoided. As we are normally reducing the size of the array in this process, we just need to reduce it a little bit more.
For reshapes that neither increase nor decrease the number of devices a somewhat larger change in data_offset is needed to get reasonable performance. A single chunk (of the larger chunk size) would work, but would require updating the metadata after each chunk which would be prohibitively slow unless chunks were very large. A few megabytes is probably sufficient for reasonable performance, though testing would be helpful to be sure. Current mdadm leaves no space at the start of 1.0, and about 1Meg at the start of 1.1 and 1.2 arrays.
This will generally not be enough space. In these cases it will probably be best to perform the reshape in the reverse direction (helped by the previous feature). This will probably require shrinking the filesystem and the array slightly first. Future versions of mdadm should aim to leave a few megabytes free at start and end to make these reshapes work better.
Moving the data offset is not possible for 0.90 metadata as it does not record a data offset.
For 1.x metadata it is possible to have a different data_offset on each device. However for simplicity we will only support changing the data offset by the same amount on each device. This amount will be stored in currently-unused space in the 1.x metadata. There will be a sysfs attribute "delta_data_offset" which can be set to a number of sectors - positive or negative - to request a change in the data offset and thus avoid the need for a backup file.
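The rules above for choosing a data offset change could be sketched like this in Python. It is a rough reading of the text, not a specification: the function name, the `space_before`/`space_after` parameters (free sectors before and after the data on every device) and the 2 megabyte figure used for "a few megabytes" are all assumptions.

```python
def choose_delta_data_offset(delta_disks, chunk_sectors,
                             space_before, space_after,
                             same_count_sectors=4096):
    """Suggest a value for 'delta_data_offset' in sectors.

    Negative moves the data earlier, positive moves it later.
    Returns None when there is not enough room, in which case a backup
    file would still be needed.
    """
    if delta_disks > 0:
        # Growing: move the data one chunk earlier.
        return -chunk_sectors if space_before >= chunk_sectors else None
    if delta_disks < 0:
        # Shrinking: move the data one chunk later; the array is being
        # made smaller anyway, so the space is taken from its size.
        return chunk_sectors
    # Same number of devices: want a larger move for performance.
    needed = max(same_count_sectors, chunk_sectors)
    if space_before >= needed:
        return -needed
    if space_after >= needed:
        return needed          # implies a reshape in the reverse direction
    return None
```

With about 1Meg (2048 sectors) free at the start of a 1.2 array, a same-device-count reshape would fall through to the "reverse direction" branch, provided space has been freed at the end.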
There are a couple of reasons for having regions of an array that are known not to contain important data and are known to not necessarily be in-sync.
1/ When an array is first created it normally contains no valid data. The normal process of a 'resync' to make all parity/copies correct is largely a waste of time.
2/ When the filesystem uses a "discard" command to report that a region of the device is no-longer used it would be good to be able to pass this down to the underlying devices. To do this safely we need to record at the md level that the region is unused so we don't complain about inconsistencies and don't try to re-sync the region after a crash.
If we record which regions are not in-sync in a bitmap then we can meet both of these needs.
A read to a non-in-sync region would always return 0s. A 'write' to a non-in-sync region should cause that region to be resynced. Writing zeros would in some sense be ideal, but to do that we would have to block the write, which would be unfortunate. As the fs should not be reading from that area anyway, it shouldn't really matter.
The granularity of the bit is probably quite hard to get right. Having it match the block size would mean that no resync would be needed and that every discard request could be handled exactly. However it could result in a very large bitmap - 30 Megabytes for a 1 terabyte device with a 4K block size. This would need to be kept in memory and looked up for every access, which could be problematic.
Having a very coarse granularity would make storage and lookups more efficient. If we make sure the bitmap would fit in 4K, we would have about 32 megabytes per bit. This would mean that each time we triggered a resync it would resync for a second or two which is probably a reasonable time as it wouldn't happen very often. But it would also mean that we can only service a 'discard' request if it covers whole blocks of 32 megabytes, and I really don't know how likely that is. Actually I'm not sure if anyone knows, the jury seems to still be out on how 'discard' will work long-term.
So probably aiming for a few K to a few hundred K seems reasonable. That means that the in-memory representation will have to be a two-level array. A page of pointers to other pages can cover (on a 64bit system) 512 pages or 2Meg of bitmap space which should be enough.
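The arithmetic behind these figures can be checked with a few lines of Python. The helper names are made up; the 4K page and 8-byte pointer sizes are the usual values on a 64bit system and are stated as assumptions.

```python
PAGE_BYTES = 4096       # assumed page size
POINTER_BYTES = 8       # pointer size on a 64bit system


def ceil_div(a, b):
    return -(-a // b)


def bitmap_bytes(device_bytes, region_bytes):
    """Bitmap size needed when each bit covers region_bytes of the device."""
    return ceil_div(ceil_div(device_bytes, region_bytes), 8)


def region_bytes_for(device_bytes, bitmap_size_bytes):
    """How much of the device each bit must cover to fit the bitmap size."""
    return ceil_div(device_bytes, bitmap_size_bytes * 8)


def two_level_capacity(page=PAGE_BYTES, pointer=POINTER_BYTES):
    """Bytes of bitmap reachable from one page of pointers to pages."""
    return (page // pointer) * page
```

For a binary terabyte (2**40 bytes), a bit per 4K block needs a 32 MiB bitmap (about 30.5 million bytes if the terabyte is taken as 10**12), while squeezing the bitmap into 4K makes each bit cover 32 MiB. One page of pointers reaches 512 pages, or 2 MiB of bitmap.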
As always we need a way to:
- record the location and size of the bitmap in the metadata
- allow the granularity to be set via sysfs
- allow bits to be set via sysfs, and allow the current bitmap to be read via sysfs.
For v0.90 metadata we won't support this as there is no room. We could possibly store about 32 bytes directly in the superblock allowing for 4Gig sections but this is unlikely to be really useful.
For v1.x metadata we use 8 bytes from the 'array state info'. 4 bytes give an offset from the metadata of the start of the bitmap, 2 bytes give the space reserved for the bitmap (max 32Meg) and 2 bytes give a shift value from sectors to in-sync chunks. The actual size of the bitmap must be computed from the known size of the array and the size of the chunks.
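A possible layout for those 8 bytes, and the size computation, is sketched below in Python. Several details are assumptions made for the example: the field order, little-endian encoding, a signed offset (so the bitmap could sit before the metadata), and the reserved space being counted in 512-byte sectors (which is what makes a 2-byte field top out near 32Meg).

```python
import struct

SECTOR_BYTES = 512
LAYOUT = "<iHH"     # offset (4 bytes), reserved sectors (2), shift (2)


def pack_nonsync_info(bitmap_offset, reserved_sectors, shift):
    """Encode the three fields into 8 bytes."""
    return struct.pack(LAYOUT, bitmap_offset, reserved_sectors, shift)


def unpack_nonsync_info(raw):
    """Decode 8 bytes into (bitmap_offset, reserved_sectors, shift)."""
    return struct.unpack(LAYOUT, raw)


def bitmap_bits(array_sectors, shift):
    """Number of in-sync chunks, and so bits, needed for the array."""
    chunk_sectors = 1 << shift
    return -(-array_sectors // chunk_sectors)


def bitmap_sectors(array_sectors, shift):
    """Actual bitmap size in sectors, computed rather than stored."""
    bitmap_size_bytes = -(-bitmap_bits(array_sectors, shift) // 8)
    return -(-bitmap_size_bytes // SECTOR_BYTES)
```

For example, an array of 2**21 sectors (1 GiB) with a shift of 7 (64K chunks) needs 16384 bits, which fit in 4 sectors.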
We present the bitmap in sysfs similar to the way we present the bad block list. A file 'non-sync/regions' contains start and size of regions (measured in sectors) that are known to not be in-sync. A file 'non-sync/now-in-sync' lists ranges that actually are in sync but are still recorded as non-in-sync in the metadata. User-space reads 'now-in-sync', updates the metadata, and writes to 'regions'.
Another file 'non-sync/to-discard' lists ranges for which a discard request has been made. These need to be recorded in the metadata. They are then written back to the file which allows the discard request to complete.
The granularity can be set via sysfs by writing to 'non-sync/chunksize'.
When a RAID1 is created, --assume-clean can be given so that the largely-unnecessary initial resync can be avoided. When extending the size of an array with --grow --size=, there is no way to specify --assume-clean.
If a non-sync bitmap (see above) is configured this doesn't matter as the extra space will simply be marked as non-in-sync. However if a non-sync bitmap is not supported by the metadata or is not configured it would be good if md/raid1 can be told not to sync the extra space - to assume that it is in-sync.
So when a non-sync bitmap is not configured (the chunk-size is zero), writing to the non-sync/regions file tells md that we don't care about the region being in-sync. So the sequence:
- freeze sync_action
- update size
- write range to non-sync/regions
- unfreeze sync_action
will effect a "--grow --size=bigger --assume-clean" reshape.
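As a sketch, the sequence could be driven from user-space along these lines. Everything concrete here is an assumption for illustration: `write_attr` is a hypothetical helper that writes a string to an attribute under the array's md sysfs directory, the attribute names and the "frozen"/"idle" values are guesses at how each step would be spelled, the units of the size value are left to the caller, and the range format follows the "start size" convention described for 'non-sync/regions'.

```python
def grow_assume_clean(write_attr, new_size, old_end_sector, new_end_sector):
    """Grow the per-device size without resyncing the added space."""
    extra = new_end_sector - old_end_sector
    if extra <= 0:
        raise ValueError("the new size must be larger than the old one")
    write_attr("sync_action", "frozen")            # freeze sync_action
    try:
        write_attr("component_size", str(new_size))              # update size
        write_attr("non-sync/regions", f"{old_end_sector} {extra}")
    finally:
        write_attr("sync_action", "idle")          # unfreeze sync_action
```

The freeze matters because otherwise md could start resyncing the new space in the window between the size change and the write to 'non-sync/regions'.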
As a 'reshape' re-writes all the data in the array it can quite easily be used to recover to a spare device. Normally these two operations would happen separately. However if a device fails during a reshape and a spare is available it makes sense to combine them.
Currently if a device fails during a reshape (leaving the array degraded but functional) the reshape will continue and complete. Then if a spare is available it will be recovered. This means a longer total time until the array is optimal.
When the device fails, the reshape actually aborts, and then restarts from where it left off. If instead we allow spares to be added between the abort and the restart, and cause the 'reshape' to actually do a recovery until it reaches the point where it was already up to, then we minimise the time to getting an optimal array.
The 'size' of an array is the amount of space on each device which is used by the array. Normally the 'size' of an array cannot be set beyond the amount of space available on the smallest device.
However when reshaping an array to have fewer devices it can be useful to be able to set the 'size' to be the smallest of the remaining devices - those that will still be in use after the reshape.
Normally reshaping an array to have fewer devices will make the array size smaller. However if we can simultaneously increase the size of the remaining devices, the array size can stay unchanged or even grow.
This can be used after replacing (ideally using hot-replace) a few devices in the array with larger devices. The net result will be a similar amount of storage using fewer drives, each larger than before.
This should simply be a case of allowing size to be set larger when delta_disks is negative. It also requires that when converting the excess devices to spares, we fail them if they are smaller than the new size.
As a reshape can be reversed, we must make sure to revert the size change when reversing a reshape.
Currently it is not possible to add a write-intent-bitmap to an array that is being reshaped/resynced/recovered. There is no real justification for this, it was just easier at the time.
Implementing this requires a review of all code relating to the bitmap, checking that a bitmap appearing - or disappearing - during these processes will not be a problem. As the array is quiescent when the bitmap is added, no IO will actually be happening so it *should* be safe.
This should also allow a reshape to be started while a bitmap is present, as long as the reshape doesn't change the implied size of the bitmap.
When we increase the 'size' of an array (the amount of the device used), that implies a change in size of the bitmap. However the kernel cannot unilaterally resize the bitmap as there may not be room.
Rather, mdadm needs to be able to resize the bitmap first. This requires the sysfs interface to expose the size of the bitmap - which is currently implicit.
Whether the bitmap coverage is increased by increasing the number of bits or increasing the chunk size, some updating of the bitmap storage will be necessary (particularly in the second case).
So it makes sense to allow user-space to remove the bitmap then add a new bitmap with a different configuration. If there is concern about a crash between these two, writes could be suspended for the (short) duration.
Currently the 'sync_size' stored in the bitmap superblock is not used. We could stop updating that, and could allow the bitmap to automatically extend up to that boundary.
So: we have a well defined 'sync_size' which can be set via the superblock or via sysfs. A resize is permitted as long as there is no bitmap, or the existing bitmap has a sufficiently large sync_size.
RAID10 arrays currently cannot be reshaped at all. It is possible to convert a 'near' mode RAID10 to RAID0, but that is about all. Some real reshape is possible and should be implemented.
1/ A 'near' or 'offset' layout can have the device size changed quite easily.
2/ Device size of 'far' arrays cannot be changed easily. Increasing device size of 'far' would require re-laying out a lot of data. We would need to record the 'old' and 'new' sizes which metadata doesn't currently allow. If we spent 8 bytes on this we could possibly manage a 'reverse reshape' style conversion here.
EDIT: if we stored data on drives a little differently this could be a lot easier. If, instead of starting the second slab of data at the same location on all devices, we start it an appropriate fraction into the size of 'this' device, then replacing all devices in a raid10-far with larger drives would be very effective. However just increasing the size of the device (e.g. using LVM) would not work very well.
3/ Increasing the number of devices is much the same for all layouts. The data needs to be copied to the new location. As we currently block IO while recovery is actually happening, we could just do that for reshape as well, and make sure reshape happens in whole chunks at a time (or whatever turns out to be the minimum recordable unit). We switch to 'clean' before doing any reshape so a write will switch to 'dirty' and update the metadata.
4/ Decreasing the number of devices is very much the reverse of increasing. Here is a weird thought: We have introduced the idea that we can increase the size of remaining devices when we decrease the number of devices in the array. For 'raid10-far', the re-layout for increasing the device size is very much like that for decreasing the number of devices - just that the number doesn't actually decrease.
5/ changing layouts between 'near' and 'offset' should be manageable providing enough 'backup' space is available. We simply copy a few chunks worth of data and move reshape_position.
6/ changing layout to or from 'far' is nearly impossible... With a change in data_offset it might be possible to move one stripe at a time, always into the place just vacated. However keeping track of where we are and where it is safe to read from would be a major headache - unless it fell out with some really neat maths, which I don't think it does. So this option will be left out.
So the only 'instant' conversion possible is to increase the device size for 'near' and 'offset' arrays.
'reshape' conversions can modify chunk size, increase/decrease number of devices and swap between 'near' and 'offset' layout providing a suitable number of chunks of backup space is available.
The device-size of a 'far' layout can also be changed by a reshape providing the number of devices is not increased.
When a 'check' finds a data inconsistency it would be useful if it was reported. That would allow a sysadmin to try to understand the cause and possibly fix it.
One simple approach would be to simply log all inconsistencies through the kernel logs. This would have to be limited to 'check' and possibly 'repair' passes, as logging a 'sync' pass (which also finds inconsistencies) can be expected to be very noisy.
Another approach is to use a sysfs file to export a list of addresses. This would place some upper limit on the number of addresses that could be listed, but if there are more inconsistencies than that limit, then the details probably aren't all that important.
It makes sense to follow both of these paths.
- some easy-to-parse logging of inconsistencies found.
- a sysfs file that lists as many inconsistencies as possible.
Each inconsistency is listed as a simple sector offset. For RAID1 and RAID4/5/6, it is an offset from the start of data on the individual devices. For RAID10 it is an offset from the start of the array. So this can only be interpreted with a full understanding of the array layout.
The actual inconsistency may be in some sector immediately following the given sector as md performs checks in blocks larger than one sector and doesn't bother refining. So any process that uses this information should read forward from the address to make sure it has found all of the inconsistency. For striped arrays, at most 1 chunk need be examined. For non-striped (i.e. RAID1) the window size is currently 64K. The actual size can be found by dividing 'mismatch_cnt' by the number of entries in the mismatch list.
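The window calculation is simple enough to show in Python. The function name is invented, 'mismatch_cnt' is taken to be in sectors, and the division only gives the right answer when the list has not been cut short by the upper limit on the number of addresses.

```python
def mismatch_windows(mismatch_cnt, addresses):
    """Return (start_sector, sectors_to_examine) for each listed address.

    The window size is inferred by dividing the mismatch count by the
    number of listed addresses.
    """
    if not addresses:
        return []
    window = mismatch_cnt // len(addresses)
    return [(address, window) for address in addresses]
```

For a RAID1 with three listed addresses and a mismatch_cnt of 384, each window comes out as 128 sectors, i.e. the 64K mentioned above.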
This has no dependencies on other features. It relates slightly to the bad-block list as one way of dealing with an inconsistency is to tell md that a selected block in the stripe is 'bad'.