29 June 2012, 06:53 UTC
A typical week in RAID-land
I should probably write more. I enjoy writing but don't do enough of it. I probably have Elizabeth Bennet's condition. She says to Darcy on the dance floor:

"We are each of an unsocial, taciturn disposition, unwilling to speak, unless we expect to say something that will amaze the whole room, and be handed down to posterity with all the eclat of a proverb."

I sometimes feel unwilling to write unless I expect to say something that will amaze the whole Internet. But I think I should try to push through that.
So what has happened this week? I doubt it is really a typical week as no week is really typical - I imagine them all outside my window screaming in chorus "We are all individuals". But I can't really know unless I record it, then compare it with future weeks.
15 June 2012, 07:32 UTC
A Nasty md/raid bug
There is a rather nasty RAID bug in some released versions of the Linux kernel. It won't destroy your data, but it could make it hard to access that data.
If you are concerned that this might affect you, the first thing you should do (after not panicking) is to gather the output of
and save this somewhere that is not on a RAID array. The second thing to do is read to the end of this note and then proceed accordingly. You most likely will never need to use the output of that command, but if you do it could be extremely helpful.
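The command itself has been lost from the text above; a plausible reconstruction (this exact invocation is my assumption, not necessarily the author's) is:

```shell
# Dump the md superblock recorded on every component device
# (run as root; -E examines metadata, -s scans for devices, -v is verbose).
mdadm --examine --scan --verbose > /some/safe/place/md-examine.txt
```

Whatever the precise flags, the point is to capture the per-device RAID metadata while it is still readable.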
14 June 2011, 10:17 UTC
Closing the RAID5 write hole
Over a year ago I wrote some thoughts about closing the RAID5 write hole in an answer to a comment on a blog post:
I recently had some interest shown in this so I thought it might be useful to write up some thoughts more coherently and completely.
28 March 2011, 02:54 UTC
Another mdadm release: 3.2.1
Hot on the heels of mdadm-3.1.5, I have just released 3.2.1.
The 3.2 series contains two particular sets of new functionality.
Firstly there is the "policy" framework. This allows us to set policy for different devices based on where they are connected (e.g. which controller), so that, for example, a hot-plugged device can immediately be made a hot spare for an array without further operator intervention. It also allows broader control of spare migration between arrays. It is likely that more functionality will be added to this framework over time.
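A hypothetical mdadm.conf fragment gives the flavour of the framework (the domain name and controller path here are made up; check the mdadm.conf man page for the exact keywords your version supports):

```
# Any bare device hot-plugged into a port behind this controller path
# automatically becomes a spare for arrays in the "backplane" domain.
POLICY domain=backplane path=pci-0000:00:1f.2-* action=spare
```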
Secondly, the support for Intel Matrix Storage Manager (IMSM) arrays has been substantially enhanced. Spare migration is now possible, as are level migration and OLCE (OnLine Capacity Expansion). This support is not quite complete yet and requires MDADM_EXPERIMENTAL=1 in the environment to ensure people only use it with care. In particular, if you start a reshape in Linux and then shut down and boot into Windows, the Windows driver may not correctly restart the reshape. And vice versa.
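For example, growing an IMSM container by one disk might look roughly like this (device names are placeholders and the container-level --grow invocation is my assumption; consult the release notes before trying anything like this on real data):

```shell
# Opt in to the experimental IMSM reshape support.
export MDADM_EXPERIMENTAL=1

# Add a new disk to the IMSM container, then expand to four raid devices.
mdadm --add /dev/md/imsm0 /dev/sdd
mdadm --grow /dev/md/imsm0 --raid-devices=4
```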
If you don't want any of the new functionality then it is probably safest to stay with 3.1.5 as it has all recent bug fixes. But if you are at all interested in the new functionality, then by all means give 3.2.1 a try. It should work fine and is no more likely to eat your data than any other program out there.
23 March 2011, 04:59 UTC
Release of mdadm-3.1.5
The last release of mdadm that I mentioned in this blog was 2.6.1. As I am now announcing 3.1.5 you can see that I missed a few. That's OK though as I keep the release announcements in the source distribution so you can always go and read them there.
3.1.5 is just bugfixes. It is essentially 3.1.4 plus all the bug fixes found while working on 3.2 and 3.2.1. The list from the release announcement is:
- Fixes for v1.x metadata on big-endian machines.
- man page improvements
- Improve '--detail --export' when run on partitions of an md array.
- Fix regression with removing 'failed' or 'detached' devices.
- Fixes for "--assemble --force" in various unusual cases.
- Allow '-Y' to mean --export. This was documented but not implemented.
- Various fixes for handling 'ddf' metadata. This is now more reliable but could benefit from more interoperability testing.
- Correctly list subarrays of a container in "--detail" output.
- Improve checks on whether the requested number of devices is supported by the metadata - both for --create and --grow.
- Don't remove partitions from a device that is being included in an array until we are fully committed to including it.
- Allow "--assemble --update=no-bitmap" so an array with a corrupt bitmap can still be assembled.
- Don't allow --add to succeed if it looks like a "--re-add" is probably wanted, but cannot succeed. This avoids inadvertently turning devices into spares when an array is failed.
As you can see - lots of little bits and pieces.
I hope to release 3.2.1 soon. If you want to use the Intel metadata format (Intel Matrix Storage Manager - IMSM) on Intel motherboards which have BIOS and MS-Windows support, you should probably wait for 3.2.1. For anyone else, 3.1.5 is what you want.
3.2.1 should be released soonish. I probably won't even start on 3.2.2 for a couple of months, though I already have a number of thoughts about what I want to include. A lot of it will be cleaning up and re-organising the code: stuff I wanted to do for 3.2 but ran out of time.
08 March 2011, 07:47 UTC
log segments and RAID6 reshaping
Part of the design approach of LaFS - and any other log structured filesystem - is to divide the device space into relatively large segments. Each segment is many megabytes in size so the time to write a whole segment is much more than the time to seek to a new segment. Writes happen sequentially through a segment, so write throughput should be as high as the device can manage.
(Obviously there needs to be a way to find or create segments with no live data so they can be written to. This is called cleaning and will not be discussed further here.)
One of the innovations of LaFS is to allow segments to be aligned with the stripes in a RAID5 or RAID6 array so that each segment is a whole number of stripes and so that LaFS knows the details of the layout including chunk size and width (number of data devices).
This allows LaFS to always write in whole 'strips' - where a 'strip' is one block from each device, chosen such that they all contribute to the one parity block. Blocks in a strip may not be contiguous (they only are if the chunk size matches the block size), so one would not normally write a single strip. However, doing so is the most efficient way to write to RAID6 as no pre-reading is needed. So as LaFS knows the precise geometry and is free in how it chooses where to write, it can easily write just a strip if needed. It can also pad out the write with blocks of NULs to make sure a whole strip is written each time.
Normally one would hope that several strips would be written at once, hopefully a whole stripe or more, but it is very valuable to be able to write whole strips at a time.
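To make the geometry concrete, here is a small sketch (not md's or LaFS's actual layout code, and ignoring parity rotation) of which array blocks make up one strip:

```python
# Simplified RAID6 address math: a "strip" is one block from each data
# device, all contributing to the same parity block.  chunk_blocks is
# the chunk size measured in filesystem blocks.

def strip_blocks(strip, n_devs, chunk_blocks):
    """Return the array-relative data block numbers forming one strip.

    A stripe contains chunk_blocks strips; block i of every data chunk
    in a stripe belongs to the same strip, so consecutive blocks of a
    strip are chunk_blocks apart in the array's address space.
    """
    n_data = n_devs - 2                    # RAID6 keeps two parity devices
    stripe, offset = divmod(strip, chunk_blocks)
    base = stripe * n_data * chunk_blocks  # first data block of the stripe
    return [base + d * chunk_blocks + offset for d in range(n_data)]

# With 6 devices (4 data) and 16-block chunks, strip 0 covers blocks
# 0, 16, 32 and 48 -- one block per data chunk, not contiguous.
print(strip_blocks(0, 6, 16))
```

Note that with a one-block chunk size the strip is contiguous (`strip_blocks(0, 6, 1)` gives blocks 0-3), matching the parenthetical remark above.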
This is lovely in theory but in practice there is a problem. People like to make their RAID6 arrays bigger, often by adding one or two devices to the array and "restriping" or "reshaping" the array. When you do this the geometry changes significantly and the alignment of strips and stripes and segments will be quite different. Suddenly the efficient IO practice of LaFS becomes very inefficient.
There are two ways to address this, one which I have had in mind since the beginning, one which only occurred to me recently.
27 February 2011, 11:42 UTC
Off-the-road-map: Data checksums
Among the responses I received to my recent post of a development road-map for md/raid were some suggestions for features that I believe are wrong and should not be implemented. So rather than being simple omissions, they are deliberate exclusions. One of these suggestions is the idea of calculating, storing, and checking a checksum of each data block.
Checksums are in general a good idea. Whether it is a simple parity bit, an ECC, a CRC or a full cryptographic hash, a checksum can help detect single-bit and some multi-bit errors and stop those errors propagating further into a system. It is generally better to know that you have lost some data than to believe that some wrong data is actually good, and checksums allow you to do that.
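As a minimal illustration of that detection property (using CRC32 from Python's zlib, my choice of checksum for the example):

```python
import zlib

# A CRC32 over a data block: flipping even a single bit changes the
# checksum, so corruption is detected rather than silently believed.
data = bytearray(b"some block of data" * 16)
good = zlib.crc32(bytes(data))

data[7] ^= 0x01                         # corrupt one bit
assert zlib.crc32(bytes(data)) != good  # the damage is detected
```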
So I am in favour of checksums in general, but I don't think it is appropriate to sprinkle them around everywhere, and in particular I don't think that it is the role of md to manage checksums for all data blocks.
To make this belief more concrete, I see that there are two classes of places where checksums are important. I call these "link checksums" and "end-to-end checksums".
16 February 2011, 04:40 UTC
MD/RAID road-map 2011
It is about 2 years since I last published a road-map for md/raid so I thought it was time for another one. Unfortunately quite a few things on the previous list remain undone, but there has been some progress.
I think one of the problems with some to-do lists is that they aren't detailed enough. High-level design, low level design, implementation, and testing are all very different sorts of tasks that seem to require different styles of thinking and so are best done separately. As writing up a road-map is a high-level design task it makes sense to do the full high-level design at that point so that the tasks are detailed enough to be addressed individually with little reference to the other tasks in the list (except what is explicit in the road map).
A particular need I am finding for this road map is to make explicit the required ordering and interdependence of certain tasks. Hopefully that will make it easier to address them in an appropriate order, and mean that I waste less time saying "this is too hard, I might go read some email instead".
So the following is a detailed road-map for md raid for the coming months.
08 September 2010, 07:20 UTC
A talk on dm/md convergence
I know that slides from a talk tend to raise more questions than they answer as all the discussion is missing. But maybe raising questions is good...
Anyway, here are the slides of a talk I gave in July about possibilities of convergence between md and dm.
Enjoy ... or not.
19 May 2010, 04:37 UTC
Design notes for a bad-block list in md/raid
I'm in the middle of (finally) implementing a bad block list for Linux md/raid, and I find that the motivation and the desired behaviour aren't (or weren't) quite as obvious as I expected. So now that I think I have sorted it out, it seems sensible to write it up so that you, my faithful reader, can point out any glaring problems.
The bad block list is simply a list of blocks - one list for each device - which are to be treated as 'bad'. This does not include any relocation of bad blocks to some good location. That might be done by the underlying device, but md doesn't do it. md just tracks which blocks are bad and which, by implication, are good.
The difficulty comes in understanding exactly what "bad" means, why we need to record badness, and what to do when we find that we might want to perform IO against a recorded bad block.
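The track-but-don't-relocate behaviour described above can be sketched as a simple data structure (this is an illustration, not md's implementation; the names are mine):

```python
# A per-device bad-block list: it only records which blocks are bad,
# and never relocates them -- exactly the division of labour described
# above, where any remapping is left to the underlying device.

class BadBlockList:
    def __init__(self):
        self._bad = {}          # device name -> set of bad block numbers

    def record(self, dev, block):
        """Mark a block bad; reads of it should no longer be trusted."""
        self._bad.setdefault(dev, set()).add(block)

    def is_bad(self, dev, block):
        """Blocks are good by implication unless recorded as bad."""
        return block in self._bad.get(dev, set())

    def clear(self, dev, block):
        """A later successful write can make a block good again."""
        self._bad.get(dev, set()).discard(block)

bbl = BadBlockList()
bbl.record("sda", 1234)
print(bbl.is_bad("sda", 1234), bbl.is_bad("sdb", 1234))
```

The interesting policy questions (what to do when IO is requested against a recorded bad block) sit on top of a structure like this.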
11 February 2010, 05:03 UTC
Smart or simple RAID recovery??

17 August 2009, 00:09 UTC
Converting RAID5 to RAID6 and other shape changing in md/raid

28 February 2009, 12:37 UTC
The LaFS directory structure

29 January 2009, 23:46 UTC
Road map for md/raid driver - sort of

22 February 2007, 04:22 UTC
mdadm 2.6.1 released

17 June 2006, 08:24 UTC
Another TODO list : nfsd

26 May 2006, 10:14 UTC
mdadm 2.5 released

21 May 2006, 09:26 UTC
Auto-assembly mode for mdadm

27 July 2005, 14:31 UTC
TODO list for mdadm

27 July 2005, 14:15 UTC
TODO list of Linux md/raid

15 June 2005, 09:55 UTC
mdadm 1.12.0 released

17 December 2004, 16:11 UTC
Linux md/raid update - UPDATED-1

15 December 2004, 10:03 UTC
Linux md/raid throughput measurements

27 August 2004, 22:54 UTC
RAID10 in Linux MD driver

11 June 2004, 16:50 UTC
--grow option for "linear" md arrays

08 June 2004, 16:12 UTC
Entry

08 June 2004, 15:24 UTC
Better FSID management

07 June 2004, 13:28 UTC
Linux NFS server

07 June 2004, 13:27 UTC
Linux Software RAID

07 June 2004, 12:37 UTC
New "mdadm"