19 May 2010, 04:37 UTC Design notes for a bad-block list in md/raid
I'm in the middle of (finally) implementing a bad block list for Linux md/raid, and I find that the motivation and the desired behaviour aren't (or weren't) quite as obvious as I expected. So now that I think I have sorted it out, it seems sensible to write it up so that you, my faithful reader, can point out any glaring problems.
The bad block list is simply a list of blocks - one list for each device - which are to be treated as 'bad'. This does not include any relocation of bad blocks to some good location. That might be done by the underlying device, but md doesn't do it. md just tracks which blocks are bad and which, by implication, are good.
The difficulty comes in understanding exactly what "bad" means, why we need to record badness, and what to do when we find that we might want to perform IO against a recorded bad block.
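As a rough illustration of the idea described above (one list per device, recording badness only, with no relocation), here is a minimal sketch. The class and method names are hypothetical and are not taken from the md code; it only shows the bookkeeping, not what md should do when IO targets a bad block.

```python
class BadBlockList:
    """Hypothetical per-device record of blocks to be treated as bad."""

    def __init__(self):
        # device name -> set of block numbers recorded as bad
        self._bad = {}

    def mark_bad(self, device, block):
        """Record that `block` on `device` is bad. Nothing is relocated."""
        self._bad.setdefault(device, set()).add(block)

    def is_bad(self, device, block):
        """Any block not recorded as bad is, by implication, good."""
        return block in self._bad.get(device, ())


bbl = BadBlockList()
bbl.mark_bad("sda", 1024)
```

Each device has its own list, so block 1024 being bad on "sda" says nothing about block 1024 on any other device.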
11 February 2010, 05:03 UTC Smart or simple RAID recovery??
I frequently see comments, particularly on the linux-raid mailing list, to the effect that md should be more clever when recovering from an inconsistent stripe in an array.
In particular, it is suggested that for a RAID1 with more than 2 devices, a vote should be held and if one content occurs more often than the others (e.g. 2 devices have the same content, the third is different) then the majority vote should rule and the most common content be copied over the less common content.
Similarly with RAID6 if the P and Q blocks don't match the data blocks, it may be possible to find exactly one data block which can be corrected so as to make both P and Q match - so we could change just one data block instead of two "parity" blocks to achieve consistency.
I will call this approach the "Smart recovery" approach.
The assertion is that smart recovery will not only make the stripe consistent, but will also make it "correct".
I do not agree with these comments. It is my position that if there is an inconsistency that needs to be corrected then it should be corrected in a simple, predictable way, and that any extra complexity is unjustified. For RAID1, that means copying the first block over all the others. For RAID6, that means calculating new P and Q blocks based on the data. This is the "simple recovery" approach.
This note is an attempt to justify this position, both to myself and to you, my loyal reader.
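To make the "simple recovery" rule concrete, here is a small sketch under stated assumptions: blocks are modelled as Python `bytes`, the function names are hypothetical, and only the P block (plain XOR of the data blocks) is recomputed. The Q block needs Galois-field arithmetic and is left out. This is an illustration, not the md implementation.

```python
from functools import reduce


def simple_raid1_recover(copies):
    """Simple recovery for RAID1: the first copy overwrites all the others."""
    return [copies[0]] * len(copies)


def recompute_p(data_blocks):
    """Recalculate the P block as the byte-wise XOR of the data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*data_blocks))


# Three mirrors where the first one disagrees with the other two.
mirrors = [b"new", b"old", b"old"]
recovered = simple_raid1_recover(mirrors)

# The data blocks are trusted and parity is rebuilt from them.
p_block = recompute_p([b"\x0f\x01", b"\xf0\x01"])
```

Note that in the mirror example a majority vote would have picked `b"old"`, whereas the simple rule picks whatever the first device holds; the outcome is predictable without inspecting the other copies.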
17 August 2009, 00:09 UTC Converting RAID5 to RAID6 and other shape changing in md/raid
Back in early 2006 md/raid5 gained the ability to increase the number of devices in a RAID5, thus making more space available. As you can imagine, this is a slow process as every block of data (except possibly those in the first stripe) needs to be relocated, i.e. they need to be read from one place and written to another. md/raid5 allows this reshaping to happen while the array is live. It temporarily blocks access to a few stripes at a time while those stripes are rearranged. So instead of the whole array being unavailable for several hours, little bits are unavailable for a fraction of a second each.
Then in early 2007 we gained the same functionality for RAID6. This was no more complex than RAID5, it just involved a little more code and testing.
Now, in mid 2009, we have most of the rest of the reshaping options that had been planned. These include changing the stripe size, changing the layout (i.e. where the parity blocks get stored) and reducing the number of devices.
Changing the layout provides valuable functionality as it is an important part of converting a RAID5 to a RAID6.
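The claim that growing an array relocates nearly every block except those in the first stripe can be checked with a deliberately simplified model. The sketch below is an assumption-laden illustration: it ignores parity placement entirely and just maps a data chunk number to a (stripe, slot) pair. The function names are hypothetical and this is not how md computes its layout.

```python
def locate(chunk, data_disks):
    """Map a data chunk number to (stripe, slot), ignoring parity rotation."""
    return divmod(chunk, data_disks)


def moved_chunks(total_chunks, old_data_disks, new_data_disks):
    """List the chunks whose location changes when the disk count changes."""
    return [c for c in range(total_chunks)
            if locate(c, old_data_disks) != locate(c, new_data_disks)]


# Growing from 3 to 4 data disks: only chunks 0, 1 and 2 (the first
# stripe of the old layout) stay where they were.
moved = moved_chunks(8, 3, 4)
```

In this model every chunk beyond the first old stripe has to be read from one place and written to another, which is why the reshape is slow.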
29 January 2009, 23:46 UTC Road map for md/raid driver - sort of
In mid-December 2008 I wrote a bit of a "road-map" containing some of my thoughts about development work that could usefully be done on the MD/RAID driver in the Linux kernel. Some of it might get done. Some of it might not. It is not a promise at all, more of a discussion starter in case people want to encourage features or suggest different features.
But I really should put this stuff in my blog so, 6 weeks later, here it is.
22 February 2007, 04:22 UTC mdadm 2.6.1 released
Yes, I forgot to announce 2.6 here, sorry about that.
2.6.1 is just some minor bug fixes. The release is motivated primarily by the fact that I have implemented raid6 reshape (i.e. add one or more devices to a raid6 while online). For the moment you need to collect patches from the linux-raid mailing list or wait for the next -mm release. They will hopefully be in 2.6.21-rc2. Earlier versions of mdadm can start a raid6 reshape with a new kernel, but there is one small case where it didn't quite do the right thing so I wanted to get that fix out.
2.6 introduced --incremental mode. This is intended for interfacing with 'udev'. When a new device is discovered it is passed to "mdadm --incremental" and mdadm tries to include it in an md array if that is appropriate. As soon as all devices become available, the array is ready. Of course if one device is missing, we have a problem. Do we start the array degraded as soon as possible, or wait for the missing device to appear, possibly waiting forever... No good answers to this question yet. mdadm allows you to try either.
17 June 2006, 08:24 UTC Another TODO list : nfsd
I love todo-lists. They are liberating. If you don't know what to
work on next, you write a TODO list. Then you just work through the
things on the list. Or maybe you don't, but having written them all
down, they no longer rattle around in your brain and distract you from
more important things.
Anyway, to the point. Someone recently asked me what my priorities for the Linux NFS server were and while I'm not spending a lot of time on it at the moment, there are things that I would like to see done, so I wrote a little todo list. And having written it, I might as well share it. So here it is...
- auto-adjust the number of nfsd threads. I'd really like the sysadmin *not* to have to choose a number, but it should still be possible to set a maximum.
  I imagine:
  - slow growth when there is high load and we have never had this many running before
  - slow decay in numbers when <50% are in use
  - fast growth when we have dropped below the highwater mark and load is high
  - A maximum that is somehow based on the requested number of threads
  - Some way of measuring if extra threads are actually improving throughput, and feeding that in to the growth calculations.
- Find a way to overcome the current bottleneck when replying to requests on a UDP socket.
- explore whether it would help to make the scheduling of nfsd threads more SMP (and NUMA) aware.
- Finalise and implement upcall changes to support new NFSv4 features like auth-type selection and fs_locations
- make exporting of filesystem via NFSv4 work more smoothly (mountd should automatically create the 'virtual filesystem' thing).
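The thread auto-adjustment rules imagined in the first item of the list above could be sketched roughly as follows. This is only one possible reading: the function name, the step sizes, and the choice to cap fast growth at the high-water mark are all assumptions made for illustration, and the throughput-feedback idea is not modelled at all.

```python
def next_thread_count(current, busy, high_water, maximum, high_load,
                      slow_step=1, fast_step=4):
    """Return a new nfsd thread count for one adjustment interval.

    current    -- threads running now
    busy       -- threads currently in use
    high_water -- the most threads ever run at once
    maximum    -- administrator-supplied upper bound
    high_load  -- True when load is considered high
    The step sizes are hypothetical tuning knobs.
    """
    if high_load and current < high_water:
        # Fast growth back towards a level we have run at before.
        target = min(current + fast_step, high_water)
    elif high_load:
        # Slow growth into territory we have never reached before.
        target = current + slow_step
    elif busy * 2 < current:
        # Slow decay when fewer than 50% of threads are in use.
        target = current - slow_step
    else:
        target = current
    # Never drop below one thread or exceed the configured maximum.
    return max(1, min(target, maximum))
```

For example, with 8 threads running, a high-water mark of 16 and high load, this sketch jumps to 12; with the high-water mark already at 8 it only creeps up to 9.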
26 May 2006, 10:14 UTC mdadm 2.5 released
I have just released mdadm 2.5. It is available from kernel.org and freshmeat knows about it.
I had originally expected this to be a fairly small update of assorted bug fixes. But when it came to putting it together, there turned out to be quite a lot of enhancements.
One - the major one - is the auto-assembly that I mentioned in an earlier post. Others were due to the fact that the maintainer of the Debian package decided at the same time that it was time to sort through bug reports and forward some to me. Still others were just normal stuff on the linux-raid list.
All in all there is a reasonable amount of stuff in there. Hopefully it will get some testing, and even better: will get some feedback. The only way to make it the best is for people to tell me what is wrong with it.
21 May 2006, 09:26 UTC Auto-assembly mode for mdadm
Probably the most wanted feature of mdadm is auto-assembly. People want it to just do-the-right-thing. They want to simply be able to assemble all of their arrays without having to worry about creating and maintaining config files or anything like that.
I've always been against blind auto-assembly as it can cause (and occasionally has caused) problems when the wrong thing gets assembled.
However it is possible to find a middle ground, that isn't completely blind, but that requires minimal configuration effort. I've finally figured out how I want to implement that and scheduled the time to do it, and so it should appear in mdadm-2.5.
The core idea is to record the host name in each raid array. mdadm can then assemble every array that it can find, provided it is for 'this' host.
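The selection rule described above amounts to a simple filter on the host name associated with each array. The sketch below assumes each discovered array is represented as a dictionary with hypothetical `name` and `host` fields; it does not reflect mdadm's actual data structures or superblock format.

```python
def arrays_for_this_host(arrays, this_host):
    """Keep only the arrays whose host name matches this machine's."""
    return [a for a in arrays if a["host"] == this_host]


found = [
    {"name": "md0", "host": "alpha"},
    {"name": "md1", "host": "beta"},   # e.g. disks moved from another box
    {"name": "md2", "host": "alpha"},
]
to_assemble = arrays_for_this_host(found, "alpha")
```

Arrays carrying another machine's name are simply left alone, which avoids assembling the wrong thing without needing a per-array config file.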
27 July 2005, 14:31 UTC TODO list for mdadm
I not only have a TODO list for linux/md/raid, but for mdadm -- the userspace md management tool -- too.
It is mostly focussed on getting 2.0 ready for release, but there are some bits that can wait until after 2.0.
It includes a test-suite, a '--hostid' flag to tie arrays to a host and make automatic assembly more feasible, and improvements to support for version-1 superblocks.
27 July 2005, 14:15 UTC TODO list of Linux md/raid
I've just spent a while hunting through old emails and todo-lists and patches and my brain, to try to create a fairly complete TODO list of md/raid in linux.
Rather than keeping it to myself, I thought I would let you, my loyal reader, see it too.
It mentions various enhancements, including not kicking drives on read errors, background check/repair, sysfs support and adding devices to linear arrays, as well as fixes, particularly involving version-1 superblocks but also improving read-only mode, making 'linear' cope with very large devices, and other things.
15 June 2005, 09:55 UTC mdadm 1.12.0 released
17 December 2004, 16:11 UTC Linux md/raid update - UPDATED-1
15 December 2004, 10:03 UTC Linux md/raid throughput measurements
27 August 2004, 22:54 UTC RAID10 in Linux MD driver
11 June 2004, 16:50 UTC --grow option for "linear" md arrays
08 June 2004, 16:12 UTC Entry
08 June 2004, 15:24 UTC Better FSID management
07 June 2004, 13:28 UTC Linux NFS server
07 June 2004, 13:27 UTC Linux Software RAID
07 June 2004, 12:37 UTC New "mdadm"
13 May 2004, 15:56 UTC sysfs??
13 May 2004, 15:55 UTC Event monitoring
13 May 2004, 13:36 UTC auto-correcting read errors
13 May 2004, 13:24 UTC consistancy check/correct
13 May 2004, 12:58 UTC New style RAID superblock
13 May 2004, 11:23 UTC RAID10
13 May 2004, 10:01 UTC Better read-ahead management and other file-open issues
12 May 2004, 17:17 UTC More efficient COMMIT
12 May 2004, 17:14 UTC RPCSEC/GSS
12 May 2004, 13:40 UTC IPv6 support