Re:Battery backed write-through cache?

23 April 2010, 12:37 UTC

The easiest way to make use of some sort of fast-but-small storage (such as RAM drives) is to configure your filesystem with an external journal on the RAM drive and turn on data journalling.

That way everything gets written to the RAM drive first, which has very low latency, and then gets written to the large/slow storage (e.g. RAID6) at a more sedate pace but with no-one waiting for it (unless memory gets tight etc).

I have used this to get good performance on an NFS server (which needs very low latency in its storage subsystem) and it worked quite well. There was at the time (8 years ago?) room for improvement in ext3's handling of data journalling, but I managed to tune it to do quite well. The issues I hit might be fixed now.

This addresses most of what you want from a RAM drive but not quite everything. The one thing it doesn't do is close the 'write hole'. If you crash and come up with a degraded array (or the array goes degraded before a resync finishes) you can get silent corruption. This is virtually never a problem in real life, but it would be nice to fix it and the only credible fix is to journal to a v.fast device. And if you are going to journal to a v.fast device, you get all the other low-latency benefits for free. (This of course assumes that your low-latency device has very good ECC so the chance of data loss in there is virtually zero.)
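
To make the 'write hole' concrete, here is a toy sketch in Python. It assumes a single stripe of three data blocks plus one XOR parity block; the block contents and the `xor_blocks` helper are made up for illustration and have nothing to do with md's real code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A toy stripe: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Update block 0, but "crash" before the matching parity write lands.
data[0] = b"ZZZZ"        # the new data reached its disk
stale_parity = parity    # the parity still describes the old stripe

# The array comes up degraded: the disk holding block 1 is gone, so
# block 1 has to be rebuilt from the surviving blocks plus parity.
rebuilt = xor_blocks([data[0], data[2], stale_parity])

# Block 1 was never written, yet it no longer reads back as b"BBBB".
print(rebuilt == b"BBBB")
```

The block that gets damaged is not the one being written, which is why the corruption is silent.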

Adding this to md/raid5 would certainly not be trivial but it would be quite possible. If I had a RAM drive to experiment with I might even try it....

Every incoming write would be queued for writing to the journal with a header describing the block. You would also need to get the data into the stripe cache, and doing that efficiently would probably be the tricky bit. It might even be easiest to write out to the RAM drive, then read back into the stripe cache in order to calculate parity. After you have a stripe ready to write and have calculated the new parity, you need to log that to the journal before writing anything out to the array.
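
The ordering described above (log the data, compute parity, log the parity, and only then touch the array) can be sketched roughly as follows. This is a toy model of one stripe, not md code: `ToyJournaledStripe`, its methods, the tuple "headers" and the `crash_after` knob are all hypothetical names invented for the illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

class ToyJournaledStripe:
    """One stripe of a RAID5-like array with a write-ahead journal."""

    def __init__(self, ndata, blocksize=4):
        self.ndata = ndata
        # ndata data blocks followed by one parity block, all zeroed.
        self.disks = [bytes(blocksize) for _ in range(ndata + 1)]
        self.journal = []   # stands in for the fast RAM drive

    def write(self, index, payload, crash_after=None):
        # 1. Log the incoming data with a header describing the block.
        self.journal.append(("data", index, payload))
        # 2. Work out the new stripe contents and the new parity.
        new = list(self.disks[:self.ndata])
        new[index] = payload
        parity = xor_blocks(new)
        # 3. Log the parity before writing anything out to the array.
        self.journal.append(("parity", self.ndata, parity))
        # 4. Write to the array; crash_after simulates a crash part way.
        for n, (slot, block) in enumerate([(index, payload),
                                           (self.ndata, parity)]):
            if crash_after is not None and n >= crash_after:
                return False
            self.disks[slot] = block
        # 5. The stripe is safe in the array, so drop it from the journal.
        self.journal.clear()
        return True

    def recover(self):
        """Replay everything still in the journal onto the array."""
        for _kind, slot, block in self.journal:
            self.disks[slot] = block
        self.journal.clear()

    def consistent(self):
        return xor_blocks(self.disks[:self.ndata]) == self.disks[self.ndata]

stripe = ToyJournaledStripe(ndata=3)
stripe.write(0, b"AAAA")
stripe.write(1, b"BBBB", crash_after=1)  # data lands, parity does not
print(stripe.consistent())               # parity is stale at this point
stripe.recover()
print(stripe.consistent())               # replay brings it back in sync
```

Because both the data and the parity are in the journal before the array is touched, replaying the journal after a crash always produces a consistent stripe, whichever array writes actually completed.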

I would probably want to make the stripe cache a lot larger, but not allocate pages permanently to every entry. Rather, the stripe cache is used to record where all the data that is not yet safe on storage is, and as memory becomes available we connect the memory to a stripe, read in the required data from the log or from the array, calculate the parity, write it to the journal, then write out the stripe to the array.

So there are three main parts to this:

1/ allow entries in the stripe cache to not have any pages attached. When a stripe is ready for pages it waits for them to be available, attaches them, and uses them. So we have a v.large pool of 'struct stripe_head' which can even grow dynamically, and a more restricted pool of buffers.

2/ A journal management module that queues blocks, creates a header, writes out blocks and header to the RAM drive, and keeps track of when data is safely in the array so that it can be dropped from the journal.

3/ Tie it all together with appropriate extensions to the metadata and mdadm so that the journal is found and attached properly.
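
Part 1/ of the list above, a large pool of stripe heads with a much smaller pool of buffers attached on demand, might look something like this in outline. It is only a sketch: `StripeHead` here is a stand-in for the kernel's 'struct stripe_head', and `ToyStripeCache`, `note_dirty`, `attach_pages` and `release` are hypothetical names, as are the page counts and sizes.

```python
from collections import deque

class StripeHead:
    """Cheap bookkeeping entry; holds no page memory until it needs some."""

    def __init__(self, sector):
        self.sector = sector
        self.pages = None

class ToyStripeCache:
    def __init__(self, max_buffers):
        self.heads = {}                   # large pool, grows dynamically
        self.free_buffers = max_buffers   # restricted pool of page sets
        self.waiting = deque()            # stripes waiting for pages

    def note_dirty(self, sector):
        """Record that this stripe has data that is only in the journal."""
        head = self.heads.get(sector)
        if head is None:
            head = self.heads[sector] = StripeHead(sector)
            self.waiting.append(head)
        return head

    def attach_pages(self, pages_per_stripe=4, pagesize=4096):
        """Attach buffers to as many waiting stripes as the pool allows."""
        attached = []
        while self.waiting and self.free_buffers > 0:
            head = self.waiting.popleft()
            head.pages = [bytearray(pagesize) for _ in range(pages_per_stripe)]
            self.free_buffers -= 1
            attached.append(head)
        return attached

    def release(self, head):
        """The stripe is safe in the array: give back its pages and head."""
        head.pages = None
        self.free_buffers += 1
        del self.heads[head.sector]

cache = ToyStripeCache(max_buffers=2)
for sector in range(5):
    cache.note_dirty(sector)
ready = cache.attach_pages()
print(len(ready), len(cache.waiting))   # how many got pages, how many wait
```

Stripes that got pages would then go through the read, parity, journal and write-out steps, and calling `release` afterwards lets the next waiting stripe take the freed buffers.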

I think I would only do this for RAID5/6. There is no 'write hole' problem for RAID1 or RAID10 so the fs-journal approach should be completely sufficient.