Off-the-road-map: Data checksums

27 February 2011, 11:42 UTC

Among the responses I received to my recent post of a development road-map for md/raid were some suggestions for features that I believe are wrong and should not be implemented. So rather than being simple omissions, they are deliberate exclusions. One of these suggestions is the idea of calculating, storing, and checking a checksum of each data block.

Checksums are in general a good idea. Whether it is a simple parity bit, an ECC, a CRC or a full cryptographic hash, a checksum can help detect single-bit and some multi-bit errors and stop those errors propagating further into a system. It is generally better to know that you have lost some data rather than believe that some wrong data is actually good, and checksums allow you to do that.

So I am in favour of checksums in general, but I don't think it is appropriate to sprinkle them around everywhere and in particular I don't think that it is the role of md to manage checksums for all data blocks.

To make this belief more concrete, I see that there are two classes of places where checksums are important. I call these "link checksums" and "end-to-end checksums".

Link checksums relate to the lowest levels of data abstraction. When the data is stored and as it is moved from place to place (along links) there should be some checksum appropriate to that storage or link. So memory should have ECC checks implemented and enabled, the bus from memory to device controller (such as PCI-express) should have a CRC, and the bus from controller to device (SATA) has CRCs on packets. The disk drive's electronics should have ECC memory and should create an ECC code when writing data out, and check it when reading data in. I believe all of these generally are in place - though I'm not sure if hard drives have ECC memory.

Thus the data should be protected as it moves around a system. In some cases errors can be corrected transparently, in others they are detected in time to arrange a re-transmit which is just as good. In the rare case that errors cannot be corrected in one of these ways, they are at least reported.

In theory all of these should be enough to guard against undetectable errors, but in practice they sometimes don't quite reach the level of reliability required. Each is not quite perfect for various reasons, and as they are chained together the whole becomes weaker than the sum of the parts. Also the various parts don't really know how important the data is and cannot scale their level of safety appropriately.
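As a rough illustration of why a chain of nearly-perfect links is weaker than any single link, the following sketch combines per-link probabilities of an undetected error. The link names come from the text above, but every number is a made-up placeholder, and the links are assumed to fail independently, which real hardware need not do.

```python
import math

def chain_undetected_probability(link_probs):
    """Probability that at least one link passes an error undetected.

    Assumes the links fail independently. Uses log1p/expm1 so that very
    small probabilities are not lost to floating-point rounding.
    """
    log_all_ok = sum(math.log1p(-p) for p in link_probs)
    return -math.expm1(log_all_ok)

# Hypothetical per-transfer probabilities; NOT measured figures.
links = {
    "memory ECC": 1e-15,
    "PCI-express CRC": 1e-14,
    "SATA CRC": 1e-14,
    "drive ECC": 1e-14,
}

combined = chain_undetected_probability(links.values())
worst = max(links.values())
print(f"worst single link: {worst:.2e}")
print(f"whole chain:       {combined:.2e}")
```

For small probabilities the chain's figure is very close to the sum of the individual figures, so every extra hop makes the whole path a little less trustworthy than its weakest member.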

For this reason it makes lots of sense to have end-to-end checksums. These are imposed at the highest level of data abstraction, typically by the application or even by the user. The GPG signature that is often included with software distribution archives is a good example of this. It protects against errors in all of the various hardware levels, and also various modes of human interference or accident. The checksum that file compression formats include is another example, as is the hash used by 'git' and similar tools.
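A minimal sketch of an end-to-end check, assuming a producer that publishes a SHA-256 digest alongside its data and a consumer that recomputes it on receipt. The function names and payload are hypothetical. Note that a bare hash like this only catches accidental corruption; unlike a GPG signature it does not protect against deliberate tampering by someone who can also replace the digest.

```python
import hashlib

def publish(payload: bytes):
    """Producer side: return the payload together with its SHA-256 digest."""
    return payload, hashlib.sha256(payload).hexdigest()

def verify(payload: bytes, expected_digest: str) -> bool:
    """Consumer side: recompute the digest and compare with the published one."""
    return hashlib.sha256(payload).hexdigest() == expected_digest

original, digest = publish(b"contents of a release archive")

# Simulate a single flipped bit somewhere between producer and consumer.
damaged = bytearray(original)
damaged[0] ^= 0x01

print("intact copy verifies: ", verify(original, digest))
print("damaged copy verifies:", verify(bytes(damaged), digest))
```

The check does not care which layer introduced the fault: memory, bus, disk, network or a person. It only reports whether what arrived is what was sent.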

In each case, the style of checksum can be chosen based on the perceived needs of the data, and the policy for dealing with errors is equally set at a level where that policy can be appropriately tuned and makes sense.

It is not immediately clear that having extra levels of checksums between these two is sensible. Each level of checksum introduces a cost, either in terms of performance or in terms of hardware (and software) complexity, or both. So at each level there needs to be a clear justification of that cost.

The justification will normally combine a threat model and a benefit model. It must be clear what sort of problems are expected and what their cause might be, and it must be clear that the information provided by the checksum will be genuinely useful.

In the case of link checksums, the threat is clear due to the non-digital nature of the medium carrying the data and the various physical and electrical irregularities that can reduce data fidelity. The benefit model involves - as already mentioned - the possibility of ECC-based correction or of data retransmission.

In the case of end-to-end checksums the threat model is somewhat more vague though also more broad - "lots of unlikely things are all possible". However the benefit model is a lot clearer. It will be different in each case, but in general it is a strong signal that the data is, or is not, reliable.

Were we to consider implementing some sort of checks at the md level we would need to be clear what the threat was, and what the risks were.

As with end-to-end checksums the threat is rather vague. There really shouldn't be a problem as the disk drive and other electronics already check for and report errors. On the other hand the benefit model is also rather vague. Without a clear threat model, an ECC isn't really practical, so we can only hope to detect errors.

In an array with redundancy (i.e. not a degraded array) a detected error can be useful as we can treat that block as faulty and get the data from elsewhere. We already do that when the disk drive reports an error. Having an internal checksum mechanism might slightly increase the number of cases where this is possible.
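To make the "get the data from elsewhere" case concrete, here is a toy sketch of a mirrored read that falls back to another copy when a stored checksum does not match. This is not how md works today (md relies on the drive reporting the error); the function name, the use of CRC32, and the in-memory "copies" are all assumptions for illustration.

```python
import zlib

def read_block(copies, stored_crc):
    """Return the first copy whose CRC32 matches, plus indices of bad copies.

    `copies` stands in for the same block as read from each mirror.
    """
    bad = []
    for index, data in enumerate(copies):
        if zlib.crc32(data) == stored_crc:
            return data, bad
        bad.append(index)  # treat this copy as faulty and try the next one
    raise IOError("no copy matches the stored checksum")

good = b"block contents as originally written"
crc = zlib.crc32(good)

# The first mirror silently returns wrong data; the second is intact.
silently_wrong = b"block contents as originally writteN"
data, bad = read_block([silently_wrong, good], crc)
print("recovered:", data == good, "faulty mirrors:", bad)
```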

However as has been noted, the main value of checksums above the link-layer is to be able to detect problems, or the lack thereof. If problems are found at the end-to-end level, it almost certainly means a fault somewhere. It might be a human fault, or a system-design fault, or a hardware fault. In any case the fault must be found and fixed before it is sensible to continue using the system.

The same would be the case for any checksum mechanism in the md layer. If it found an error it probably should not try to fix it - the system has demonstrated that it isn't working correctly, so doing anything to it could make the situation worse as easily as it could make it better. So the only really useful thing that an md checksum could do is exactly what an end-to-end checksum could do, only the end-to-end checksum could do it much better. And if the end-to-end checksum is going to do it, then it is a waste of resources for md to do it too.

So my conclusion is that md should not be doing data checksumming. Higher levels probably should. In particular, they should have enough redundancy that small corruptions are easily detected. For example, a program source file (for a sensible language with static checking) doesn't really need a checksum as any corruption will almost certainly result in a compiler error. But a checksum that was easy to include wouldn't hurt.

This doesn't mean md shouldn't use checksums at all - only that it shouldn't use checksums for data. For metadata - such as the 'super block' or various bitmaps - a checksum makes lots of sense. In these cases md is the highest level of abstraction working with the data so it is correctly positioned to use an end-to-end checksum to avoid working with incorrect data. md already does this for the superblock and probably should for the write-intent-bitmap. And a case could certainly be made for strengthening those checksums if they turn out to be weak (which is likely). It is just the data which md has no business in checksumming.
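The idea of a layer checksumming its own metadata can be sketched as follows. The record layout (a magic string, a version and an event counter) and the choice of CRC32 are invented for this example; this is not md's on-disk superblock format nor the checksum algorithm md actually uses.

```python
import struct
import zlib

# Hypothetical little-endian layout: 4-byte magic, 32-bit version,
# 64-bit event counter, followed by a 32-bit CRC of those fields.
HEADER = struct.Struct("<4sIQ")
CRC = struct.Struct("<I")

def pack_metadata(version, events):
    """Serialise the record and append a CRC32 of its bytes."""
    body = HEADER.pack(b"DEMO", version, events)
    return body + CRC.pack(zlib.crc32(body))

def unpack_metadata(blob):
    """Check the CRC before trusting any field; raise if it does not match."""
    body = blob[:HEADER.size]
    (stored,) = CRC.unpack(blob[HEADER.size:])
    if zlib.crc32(body) != stored:
        raise ValueError("metadata checksum mismatch")
    _magic, version, events = HEADER.unpack(body)
    return version, events

blob = pack_metadata(version=1, events=42)
print(unpack_metadata(blob))

# Flip one bit in the event counter: the record is now rejected outright.
torn = bytearray(blob)
torn[8] ^= 0x01
try:
    unpack_metadata(bytes(torn))
except ValueError as err:
    print("rejected:", err)
```

Because the layer that wrote the record is also the one that interprets it, the check here is end-to-end in the sense used above: nothing higher up could verify these fields on its behalf.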

I believe the same logic applies to filesystems. It makes lots of sense for a filesystem to add checksums to any and all of its metadata - inodes, indexing blocks, directories etc. But it doesn't seem justifiable to do the same for data blocks, as that would duplicate work that would be more effectively done at a different level.

And as one final comment, if it turned out that disk drives didn't actually provide checksums that worked reliably, then the md layer might be a suitable place to make up for that shortcoming. I could imagine a separate md personality which simply added checksums to the data and reported errors. It would be a challenge to make it really reliable as updating the data and the checksum atomically would require some sort of journalling scheme. And it would be an even greater challenge to make it work at an acceptable speed. So if hardware really were not reliable it could make sense, but to be honest - it doesn't seem likely to me.
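The atomicity problem mentioned above can be shown with a toy model. The class below is purely hypothetical: it keeps data and checksums in two separate dictionaries standing in for two separate on-disk locations, and simulates a crash between the two writes. No journalling is implemented; the point is only to show what goes wrong without it.

```python
import zlib

class ChecksummedStore:
    """Toy block store that keeps data and CRC32 values in separate places."""

    def __init__(self):
        self.data = {}
        self.crc = {}

    def write(self, block, payload, crash_before_checksum=False):
        self.data[block] = payload
        if crash_before_checksum:
            return  # simulated power loss between the two writes
        self.crc[block] = zlib.crc32(payload)

    def read(self, block):
        payload = self.data[block]
        if zlib.crc32(payload) != self.crc[block]:
            raise IOError(f"checksum mismatch on block {block}")
        return payload

store = ChecksummedStore()
store.write(0, b"old")
store.write(0, b"new", crash_before_checksum=True)

try:
    store.read(0)
except IOError as err:
    # The data is perfectly good, but the stale checksum says otherwise.
    print("false alarm:", err)
```

Without something like a journal tying the two updates together, the checksum layer itself becomes a source of reported errors on data that was written correctly.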