Closing the RAID5 write hole

14 June 2011, 10:17 UTC

Over a year ago I wrote some thoughts about closing the RAID5 write hole in an answer to a comment on a blog post: and

I recently had some interest shown in this so I thought it might be useful to write up some thoughts more coherently and completely.

The "write hole" is a simple concept that applies to any stripe+parity RAID layout like RAID4, RAID5, RAID6 etc. The problem occurs when the array is started from an unclean shutdown without all devices being available, or if a read error is found before parity is restored after the unclean shutdown.

When an unclean shutdown happens the system *might* have been in the process of writing out a strip (or several) and might have successfully written some of the blocks in the strip but not all. If this is the case then the content of the parity block will not accurately reflect the contents of all the data blocks. So if we use the parity block plus some data blocks to calculate some other data block (possibly a data block that was not even being written at the time) then we could get an incorrect result.

When RAID starts up after an unclean shutdown it will recalculate all of the parity blocks on stripes that might have been in the process of being updated and will write out new values if they are wrong. This is known as "resync" in md parlance. If some devices are not present or give read errors then this process will fail.

With md, if some devices are not present, the array will not be started unless "--force" is given to mdadm, or start_dirty_degraded=1 is given on the kernel command line. This ensures that the admin at least knows there is a problem.

If a device fails during the resync, md doesn't take special action - it just allows the array to be used without a resync even though there could be corrupt data. There isn't really much that can be done about this easily. Stopping the array in the middle of activity because of a single failure would probably not be well received.

So this time between an unclean shutdown and resync completing is the "write hole". A single device failure during this time can leave corrupted data. While this possibility of corruption in real, there is no evidence that it is common. I am not aware of even a single case of corruption due to this. This maybe simply that the corruption goes un-noticed - which is quite possible - or that people don't bother to report corruption - which is equally possible. I would expect to hear about it at least occasionally if it happened much at all. But I never have. So while it is certainly theoretically possible, I wouldn't lose sleep over it.

But what can be done for those who do lose sleep?

My preferred solution is that this should be fixed in the filesystem. If the filesystem manages space such that a strip of data (one block across all devices in the RAID5) is (part of) the unit of allocation, so that a strip is always either unused, fully stable, or in the process of being written, and if the filesystem known which, then the write hole becomes a non-problem. Corruption can only happen inside a strip that is being written. In this ideal filesystem such a strip only contains new data that has not yet been acknowledged as stable, so any corruption doesn't destroy anything of value. I am working on such a filesystem (and have been for too many years) but I appreciate that other people might want a different solution.

The other obvious solution is to use some sort of high-speed device to journal the updates, just like a filesystem can journal updates to avoid inconsistencies due to a crash. Before writing any part of a strip to the array devices, the parts that have changed including the parity block(s) must first be written to the high-speed device with metadata such that the stripe can be found and written out correctly after an unclean restart. So this effectively places the high speed device 'after' the array, caching writes on their way to the underlying devices.

The "obvious" high speed device in non-volatile RAM of some sort, possibly FLASH RAM, but probably battery-backed DRAM would be a better choice in lots of cases, if the hardware were easily available.

Such RAM could be used as a cache in various different ways to provide extra functionality beyond "just" closing the write hole. Part of my purpose in writing this is to explore those interactions and see which make sense and which do not.

What to use the cache for?

There are essentially two other purposes that a RAM device could be put to in combination with closing the write hole. The first is to accelerate writes. The second is to also accelerate reads.

Writes to a RAID5 can be quite slow. In general the RAID implementation will need to read some other blocks off some devices so that it can calculate correct parity. So a write becomes a read cycle followed by parity calculation followed by write. This can add substantially latency and kill throughput. It should be noted in passing that my ideal filesystem will always write full strips and so will never need to pre-read and so will not suffer any slowness. But not all filesystems are ideal.

Putting an NVRAM cache in front of a RAID5 can allow write requests to be satisfied by the RAM quickly so latency is reduced. It is not so easy to increase total throughput as we still need to pre-read and compute parity. However we might improve throughput in some cases if we can collect more writes to a single strip before beginning to handle it. So NVRAM in front of a RAID5 can reduce latency and sometimes improve throughput. This is slightly unfortunate as the best place of an NVRAM cache to close the write hole after the RAID5, not before, as noted above.

Accelerating READs only makes sense if the NVRAM cache is much bigger than main memory. The OS already uses RAM to cache anything that has been read so caching it again only make sense if the secondary cache is much much bigger. In that case the index table for the cache would also be quite large and would need to live in the non-volatile storage somewhere. By contrast when a small write cache is used the index could be all in main memory. The non-volatile storage would need to record which block was which so that updates could be replayed after an unclean shutdown but a full index on storage would not be necessary.

I believe that the positioning and use of the index is the key differentiator and it correlates closely with size. So a cache that accelerates READs is a very different thing from a cache that accelerates only WRITEs. Closing the write-hole would be associated with the second, not the first.

So I think it makes sense to have a large non-volatile cache in front of any large storage device, which accelerates all writes and tries to accelerate reads by some-how knowing which data is more likely to be needed quickly and frequently and keeping that in the cache.

Having a separate small non-volatile cache just for accelerating writes and for closing the write-hole can also make sense for ARRAYs that are suseptible to the write hole, but it should be quite different from any READ cache. This cache is treated like a log in that blocks of data/parity are written to it together with sufficient header information to be able to subsequently find out what has been written.

Details of a write-behind cache.

As we have seen it seems to make sense to put this both before the RAID and after the RAID. Unfortunately that is not possible and for correct functionality it would need to be integrated with the RAID. It would appear as a single 'journal device' which was added to the array. If it failed, the array itself would fail.

When a write request arrives from the filesystem, a 'stripe_head' is allocated for each strip which the write will over-write. Then the part of the request for that strip is sent to the journal device. When the journal write completes, the request can be acknowledged.

When a stripe_head is full, or too old, the 'handle_stripe' handler is activated to process the stripe, possibly by reading some data from a device, or by computing parity or whatever.

The 'compute_parity' operation currently involves copying in data from the write request. That would need to change to load data from the journal device - from a location known to the stripe_head. It might make sense to copy to an allocated page instead. For RAID6 which always performs a reconstruct-write we wouldn't need an extra page but when performing a READ-MODIFY-WRITE for RAID5 on a degraded array we might end up needing more pages than are normally allocated.

Once the parity has been computed it gets written to the journal and then the data and parity gets written to the array devices. When that completes (with a flush eventually) we record that the data in the cache is no longer primary.

This process means that the parity can be much later in the journal than the data. That is OK as long as it doesn't get too much later thereby pinning lots of the journal as busy.

I imagine dividing the journal in to 3 sections. We keep a simple count of how much 'live' data (or parity) there is in each section. Once the count becomes zero we send a 'flush' request to all devices and then treat that section as clean and ready for more data to be written.

One section is being written to, one section contains data from strips that are waiting for more data to be added. One section is either being actively flushed or is free. We keep track of how many parity blocks will be needed by strips in each section and make sure that much space is reserved in the section which will be active when this section is being flushed. Thus we always know where to write, and know there will be enough space to flush out everything pending.

One unfortunate complication is READs. While we might expect the filesystem to never want to read something that it has just written, it is possible and we must support it correctly. So a read request cannot simply be passed down to the relevant device. We must first check if any of the data to be read is in the cache. The most straight forward way to do this would be to look up each relevant stripe_head and see if a write is pending. If there is no stripe_head or no pending write we can go directly to the device. If there does seem to be a write pending, we go the slower handle_stripe route to get the data.

It might be more efficient to have some other data structure to record active writes which we can search more quickly. Possibly a test against the write-intent-bitmap would be enough to catch the common cases. In any case something must to done to make sure that reads are safe.

During the restart after the unclean shutdown we would need to examine the journal device to see what needs to be replayed. In the first instance we would examine all the metadata and build a list of blocks so that duplicates could be removed. If one stripe was written multiple times we only want to re-write the last one.

The blocks then need to be gathered into strips. Each strip will either have the requires parity blocks or not. If the parity blocks are present, then the write could have started but not completed, so we simple write all of the blocks to their final destination again.

If we find data blocks in a strip without all the parity blocks, then the write never started. So we calculate the correct parity block based on what is on the devices, and then write it all out. Then we clear the log and restart the array. This should all be done in user-space, there is very little point adding this functionality into the kernel.

How would this affect resync, recovery, and reshape?

This journal device would make resync completely unnecessary. Recovery would be unchanged. Reshape would be largely unchanged, but replaying the journal would need to make allowances for a reshape that may have been underway. Strips either side of the reshape point would need to be handled differently.

Format of the journal device.

It is easy to say "write this to the journal". It is slightly harder to achieve it. In particular we need to record what gets written where, so we need metadata blocks mixed with the data blocks. The location of the metadata blocks must be deterministic. If we needed to search for them we could get confused by data that looks the same.

So each section should start with a metadata block and it describes among other things where the next metadata block is. Thus the metadata block describes the blocks that come after it. This means that we either write blocks out of order, or we hold all pending blocks until the metadata block is ready. The former would be most efficient on battery backed RAM devices. The latter probably best on flash devices. It probably makes sense to design for the latter but allow the former to be selected by a configuration knob.

One problem with writing the metadata first is that we don't know if the data blocks were successfully written. One option is to ignore everything associated with the last metadata block. This would require promptly writing a new metadata block when data writes complete, even if it would be empty. Another option would be to store a CRC across all of the blocks being written. This might be expensive though.

The metadata block should contain:

So as you can see there would be quite a lot of work required just to close the write hole and reduce latency on writes a bit.

A simpler version.

It is reasonable to ask how much simpler it would be if we decided not to accelerate writes at all. After all if a read-accelerating cache were present it would also accelerate writes so there would be no point in including that complexity in the write-hole-closing cache.

If we chose not to accelerate writes we could put the NVRAM cache strictly 'after' the array. It could be kept completely separate, as a set of caches on top of each device. However some interaction between the different caches would be need to align data with parity so it might be best to keep some knowledge in the RAID code.

READ issues could be completely ignored. We would not acknowledge a write until it was safe on the file devices, so any read could go just to the devices. If a file system asks to read a block that it recently wrote and is still waiting for the reply to, then it is really asking for trouble.

We would still need a metadata block, but could possibly use on for each strip if we really wanted to. It would contain the data and the parity blocks for a single stripe or a whole number of strips. This would make replay of the journal a little bit easier.

There would be no need to read-back from the cache or perform an extra copy. All we would do would be to wait for a strip to be ready, then write all the interesting blocks to the journal, then when that completes write them to the main devices. This could have a slightly higher latency in some cases, but should be able to maintain the same throughput as without a cache providing the internal stripe-cache was large enough.

It seems that all the functionality for the simple version would also be used by the more complex write-accelerating version so it would make sense to implement the simple one first and see how it performs. There would be little wasted time.

An even simpler version.

If the NVRAM were as fast and as accessible as normal RAM it might be even easier. The NVRAM would need to be usable as the target for a DMA to the storage devices and ideally would still have a 'struct page' associated with it in Linux.

Then the stripe cache would simply use pages from the NVRAM and state information would be copied from the stripe_head structure into the NVRAM by direct writes at suitable times. We would probably have a data structure at the start of NVRAM which described the entire cache which data address and validity. Then this would be updated at relevant times during normal stripe processing.

So providing there was an interface to ask for NVRAM pages from a given device. Unfortunately the 'direct_access()' block_device_operations op doesn't provide a "struct page" though it would make sense to use this interface and then find a 'struct page' somehow. This approach is temptingly simple, but would require very specific hardware that may not even be available.