14 June 2011, 10:17 UTC Closing the RAID5 write hole
Over a year ago I wrote some thoughts about closing the RAID5 write hole
in an answer to a comment on a blog post:
http://neil.brown.name/blog/20090129234603 and http://neil.brown.name/blog/20090129234603-028.
I recently had some interest shown in this so I thought it might be useful to write up some thoughts more coherently and completely.
28 March 2011, 02:54 UTC Another mdadm release: 3.2.1
Hot on the heels of mdadm-3.1.5 I have just released 3.2.1.
The 3.2 series contains two particular sets of new functionality.
Firstly there is the "policy" framework. This allows us to set policy for different devices based on where they are connected (e.g. which controller) so that, for example, when a device is hot-plugged it can immediately be made a hot-spare for an array without further operator intervention. It also allows broader control of spare migration between arrays. It is likely that more functionality will be added to this framework over time.
Secondly, the support for Intel Matrix Storage Manager (IMSM) arrays has been substantially enhanced. Spare migration is now possible, as is level migration and OLCE (OnLine Capacity Expansion). This support is not quite complete yet and requires MDADM_EXPERIMENTAL=1 in the environment to ensure people only use it with care. In particular, if you start a reshape in Linux and then shut down and boot into Windows, the Windows driver may not correctly restart the reshape. And vice-versa.
If you don't want any of the new functionality then it is probably safest to stay with 3.1.5 as it has all recent bug fixes. But if you are at all interested in the new functionality, then by all means give 3.2.1 a try. It should work fine and is no more likely to eat your data than any other program out there.
23 March 2011, 04:59 UTC Release of mdadm-3.1.5
The last release of mdadm that I mentioned in this blog was 2.6.1. As I am now announcing 3.1.5 you can see that I missed a few. That's OK though as I keep the release announcements in the source distribution so you can always go and read them there.
3.1.5 is just bugfixes. It is essentially 3.1.4 plus all the bug fixes found while working on 3.2 and 3.2.1. The list from the release announcement is:
- Fixes for v1.x metadata on big-endian machines.
- man page improvements
- Improve '--detail --export' when run on partitions of an md array.
- Fix regression with removing 'failed' or 'detached' devices.
- Fixes for "--assemble --force" in various unusual cases.
- Allow '-Y' to mean --export. This was documented but not implemented.
- Various fixes for handling 'ddf' metadata. This is now more reliable but could benefit from more interoperability testing.
- Correctly list subarrays of a container in "--detail" output.
- Improve checks on whether the requested number of devices is supported by the metadata - both for --create and --grow.
- Don't remove partitions from a device that is being included in an array until we are fully committed to including it.
- Allow "--assemble --update=no-bitmap" so an array with a corrupt bitmap can still be assembled.
- Don't allow --add to succeed if it looks like a "--re-add" is probably wanted, but cannot succeed. This avoids inadvertently turning devices into spares when an array is failed.
As you can see - lots of little bits and pieces.
I hope to release 3.2.1 soon. For people who want to use the Intel metadata format (Intel Matrix Storage Manager - IMSM) on Intel motherboards which have BIOS support and MS-Windows support, you should probably wait for 3.2.1. For anyone else, 3.1.5 is what you want.
3.2.1 should be released soonish. I probably won't even start on 3.2.2 for a couple of months, though I already have a number of thoughts about what I want to include. A lot of it will be cleaning up and re-organising the code: stuff I wanted to do for 3.2 but ran out of time.
As always, mdadm can be found via git at git://neil.brown.name/mdadm/ or from http://www.kernel.org/pub/linux/utils/raid/mdadm/.
27 February 2011, 11:42 UTC Off-the-road-map: Data checksums
Among the responses I received to my recent post of a development road-map for md/raid were some suggestions for features that I believe are wrong and should not be implemented. So rather than being simple omissions, they are deliberate exclusions. One of these suggestions is the idea of calculating, storing, and checking a checksum of each data block.
Checksums are in general a good idea. Whether it is a simple parity bit, an ECC, a CRC or a full cryptographic hash, a checksum can help detect single-bit and some multi-bit errors and stop those errors propagating further into a system. It is generally better to know that you have lost some data rather than believe that some wrong data is actually good, and checksums allow you to do that.
So I am in favour of checksums in general, but I don't think it is appropriate to sprinkle them around everywhere and in particular I don't think that it is the role of md to manage checksums for all data blocks.
To make this belief more concrete, I see that there are two classes of places where checksums are important. I call these "link checksums" and "end-to-end checksums".
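To illustrate the general point that a checksum lets you know data has been damaged rather than silently trusting it, here is a minimal sketch using a CRC32 from Python's standard library. The helper names (`store`, `verify`, `flip_bit`) are invented for this example, and nothing here reflects how md or any particular filesystem handles checksums.

```python
import zlib


def store(block: bytes) -> tuple[bytes, int]:
    """Return the block together with its CRC32, as a writer might record it."""
    return block, zlib.crc32(block)


def verify(block: bytes, checksum: int) -> bool:
    """Recompute the CRC32 on read and compare it with the recorded value."""
    return zlib.crc32(block) == checksum


def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit error somewhere between writer and reader."""
    damaged = bytearray(block)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


data = bytes(range(256)) * 16          # a 4096-byte block of sample data
stored, crc = store(data)

assert verify(stored, crc)             # an undamaged block passes
corrupted = flip_bit(stored, 12345)
assert not verify(corrupted, crc)      # a single flipped bit is detected
```

The checksum only detects the error; it says nothing about which layer should compute it or where it should be stored, which is the question the post goes on to discuss.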
16 February 2011, 04:40 UTC MD/RAID road-map 2011
It is about 2 years since I last published a road-map for md/raid so I thought it was time for another one. Unfortunately quite a few things on the previous list remain undone, but there has been some progress.
I think one of the problems with some to-do lists is that they aren't detailed enough. High-level design, low level design, implementation, and testing are all very different sorts of tasks that seem to require different styles of thinking and so are best done separately. As writing up a road-map is a high-level design task it makes sense to do the full high-level design at that point so that the tasks are detailed enough to be addressed individually with little reference to the other tasks in the list (except what is explicit in the road map).
A particular need I am finding for this road map is to make explicit the required ordering and interdependence of certain tasks. Hopefully that will make it easier to address them in an appropriate order, and mean that I waste less time saying "this is too hard, I might go read some email instead".
So the following is a detailed road-map for md raid for the coming months.
08 September 2010, 07:20 UTC A talk on dm/md convergence
I know that slides from a talk tend to raise more questions than they answer as all the discussion is missing. But maybe raising questions is good...
Anyway, here are the slides of a talk I gave in July about possibilities of convergence between md and dm.
Enjoy ... or not.
19 May 2010, 04:37 UTC Design notes for a bad-block list in md/raid
I'm in the middle of (finally) implementing a bad block list for Linux md/raid, and I find that the motivation and the desired behaviour aren't (or weren't) quite as obvious as I expected. So now that I think I have sorted it out, it seems sensible to write it up so that you, my faithful reader, can point out any glaring problems.
The bad block list is simply a list of blocks - one list for each device - which are to be treated as 'bad'. This does not include any relocation of bad blocks to some good location. That might be done by the underlying device, but md doesn't do it. md just tracks which blocks are bad and which, by implication, are good.
The difficulty comes in understanding exactly what "bad" means, why we need to record badness, and what to do when we find that we might want to perform IO against a recorded bad block.
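As a rough picture of the data structure described above (one list per device, tracking only which blocks are bad, with no relocation), here is a small sketch. The class name, method names and device names are hypothetical and chosen for illustration; this is not the on-disk format or the kernel implementation.

```python
class BadBlockList:
    """Tracks which blocks of ONE device are considered bad.

    Hypothetical illustration only: it records badness and nothing else.
    It does not relocate data to a good location.
    """

    def __init__(self) -> None:
        self._bad: set[int] = set()

    def mark_bad(self, start: int, count: int = 1) -> None:
        """Record blocks start .. start+count-1 as bad."""
        self._bad.update(range(start, start + count))

    def is_bad(self, start: int, count: int = 1) -> bool:
        """True if any block in the requested range is recorded as bad."""
        return any(b in self._bad for b in range(start, start + count))


# One list for each device in the array (device names are made up).
bad_blocks = {"sda": BadBlockList(), "sdb": BadBlockList()}

bad_blocks["sda"].mark_bad(1000, count=8)

assert bad_blocks["sda"].is_bad(1004)           # inside the bad range
assert bad_blocks["sda"].is_bad(990, count=20)  # range overlaps bad blocks
assert not bad_blocks["sda"].is_bad(2000)       # by implication, good
assert not bad_blocks["sdb"].is_bad(1004)       # other device unaffected
```

What the sketch deliberately leaves out is the hard part raised in the post: deciding what "bad" means and what to do when IO is wanted against a recorded bad block.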
11 February 2010, 05:03 UTC Smart or simple RAID recovery??
I frequently see comments, particularly on the linux-raid mailing list, to the effect that md should be more clever when recovering from an inconsistent stripe in an array.
In particular, it is suggested that for a RAID1 with more than 2 devices, a vote should be held and if one content occurs more often than the others (e.g. 2 devices have the same content, the third is different) then the majority vote should rule and the most common content be copied over the less common content.
Similarly with RAID6 if the P and Q blocks don't match the data blocks, it may be possible to find exactly one data block which can be corrected so as to make both P and Q match - so we could change just one data block instead of two "parity" blocks to achieve consistency.
I will call this approach the "Smart recovery" approach.
The assertion is that smart recovery will not only make the stripe consistent, but will also make it "correct".
I do not agree with these comments. It is my position that if there is an inconsistency that needs to be corrected then it should be corrected in a simple predictable way and that any extra complexity is unjustified. For RAID1, that means copying the first block over all the others. For RAID6, that means calculating new P and Q blocks based on the data. This is the "simple recovery" approach.
This note is an attempt to justify this position, both to myself and to you, my loyal reader.
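To make the two RAID1 policies concrete, here is a minimal sketch that treats each device's copy of a block as a byte string. The function names are invented for this example and it is not md code; the RAID6 case (recomputing P and Q versus correcting one data block) is not shown.

```python
from collections import Counter


def simple_recovery(copies: list[bytes]) -> list[bytes]:
    """Simple recovery: copy the first block over all the others."""
    return [copies[0]] * len(copies)


def smart_recovery(copies: list[bytes]) -> list[bytes]:
    """'Smart' recovery: hold a vote and copy the most common content
    over the rest. Ties go to whichever content was seen first."""
    winner, _votes = Counter(copies).most_common(1)[0]
    return [winner] * len(copies)


# A 3-device RAID1 where the first device disagrees with the other two.
stripe = [b"AAAA", b"BBBB", b"BBBB"]

assert simple_recovery(stripe) == [b"AAAA", b"AAAA", b"AAAA"]
assert smart_recovery(stripe) == [b"BBBB", b"BBBB", b"BBBB"]
```

Both functions leave the stripe consistent. They differ only in which content survives, and neither has any way of knowing which content is actually "correct".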
17 August 2009, 00:09 UTC Converting RAID5 to RAID6 and other shape changing in md/raid
Back in early 2006 md/raid5 gained the ability to increase the number of devices in a RAID5,
thus making more space available. As you can imagine, this is a slow process as every
block of data (except possibly those in the first stripe) needs to be relocated, i.e.
they need to be read from one place and written to another. md/raid5 allows this reshaping to
happen while the array is live. It temporarily blocks access to a few stripes at a time while
those stripes are rearranged. So instead of the whole array being unavailable for several hours,
little bits are unavailable for a fraction of a second each.
Then in early 2007 we gained the same functionality for RAID6. This was no more complex than RAID5, it just involved a little more code and testing.
Now, in mid 2009, we have most of the rest of the reshaping options that had been planned. These include changing the stripe size, changing the layout (i.e. where the parity blocks get stored) and reducing the number of devices.
Changing the layout provides valuable functionality as it is an important part of converting a RAID5 to a RAID6.
29 January 2009, 23:46 UTC Road map for md/raid driver - sort of
In mid-December 2008 I wrote a bit of a "road-map" containing some of my thoughts about development work that could usefully be done on the MD/RAID driver in the Linux kernel. Some of it might get done. Some of it might not. It is not a promise at all, more of a discussion starter in case people want to encourage features or suggest different features.
But I really should put this stuff in my blog so, 6 weeks later, here it is.
22 February 2007, 04:22 UTC mdadm 2.6.1 released
26 May 2006, 10:14 UTC mdadm 2.5 released
21 May 2006, 09:26 UTC Auto-assembly mode for mdadm
27 July 2005, 14:31 UTC TODO list for mdadm
27 July 2005, 14:15 UTC TODO list of Linux md/raid
15 June 2005, 09:55 UTC mdadm 1.12.0 released
17 December 2004, 16:11 UTC Linux md/raid update - UPDATED-1
15 December 2004, 10:03 UTC Linux md/raid throughput measurements
27 August 2004, 22:54 UTC RAID10 in Linux MD driver
11 June 2004, 16:50 UTC --grow option for "linear" md arrays
07 June 2004, 13:27 UTC Linux Software RAID
07 June 2004, 12:37 UTC New "mdadm"
13 May 2004, 15:56 UTC sysfs??
13 May 2004, 15:55 UTC Event monitoring
13 May 2004, 13:36 UTC auto-correcting read errors
13 May 2004, 13:24 UTC consistancy check/correct
13 May 2004, 12:58 UTC New style RAID superblock
13 May 2004, 11:23 UTC RAID10
