TODO list of Linux md/raid

27 July 2005, 14:15 UTC

I've just spent a while hunting through old emails and todo-lists and patches and my brain, to try to create a fairly complete TODO list of md/raid in linux.

Rather than keeping it to myself, I thought I would let you, my loyal reader, see it too.

It mentions various enhancements, including not kicking drives on read errors, background check/repair, sysfs support, and adding devices to linear arrays, plus fixes, particularly involving version-1 superblocks, but also improving read-only mode, making 'linear' cope with very large devices, and other things.

# Enhancements
# Fixes
# Comments...

Enhancements

Readonly assembly

It should be possible to assemble an array in 'read-only' mode. In this mode the superblocks would not be updated, and resync would not be started. Read requests would be allowed, but not write requests.

This would allow a raid1 to be assembled as soon as hotplug found any device. As other devices were found they could be added. When either all devices were found, or write access was required, it could be switched to writeable and any required resync or reconstruction could then happen.

Similarly a raid5 could be assembled before the last device was found.

Possibly "mdadm --assemble", which will currently not start a degraded array without "--run" or "--scan" could be change to start it in read-only mode, which is just as safe.

It might be nice for all assembly to be read-only, with the first write request automatically switching the array to read-write and starting off any resync/recovery.
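For illustration, this might look something like the following from userspace (a sketch only; the "--readonly" assembly flag is an assumption here, with "mdadm --readwrite" as the natural switch back):

    # assemble read-only: no superblock updates, no resync started
    mdadm --assemble --readonly /dev/md0 /dev/sda1 /dev/sdb1

    # once all devices are present, or write access is needed, switch
    # to read-write; any required resync/recovery starts at this point
    mdadm --readwrite /dev/md0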

Don't kick drives on read errors

This is probably the most requested enhancement: as drives get bigger, the chance of individual block read errors seems to be increasing.

When a read error is detected, we flag the fact but don't fail the drive. The data is recovered from elsewhere and we attempt a re-write (if the array is not read-only). If the re-write succeeds, everyone is happy, but it should be logged and possibly counted in the superblock.

If the re-write fails we should probably kick the drive and start a reconstruction. We could possibly hold on to the drive and allow reads to succeed until reconstruction finishes (if ever). This means that write errors on different blocks of different devices don't completely kill the array.

On a write error, we behave as with a re-write failure.

background check/repair

md already does background resync/recovery. It would be good to also do background read checks.

This would involve reading all blocks of all devices and checking that all redundancy information is consistent. For raid5/raid6 this means checking that the parity (and syndrome) blocks are correct. For raid1/raid10 it means checking that all copies are the same.

This has two particular values. One is to check that all blocks can actually be read. The other is to check consistency. If errors are found they can be reported and optionally corrected.

sysfs support

md currently has two tunables -- max and min resync speed -- and these apply equally across all devices. It would be nice to allow more tunables (e.g. cache size for raid 5/6) and have them per-array. This would be best done with entries in sysfs.
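To give a flavour, the tunables might appear as writable files under each array's sysfs directory (the names here are illustrative, not a settled interface):

    # per-array resync speed limits, in KB/sec
    echo 1000  > /sys/block/md0/md/sync_speed_min
    echo 50000 > /sys/block/md0/md/sync_speed_max

    # per-array stripe-cache size for raid5/6
    echo 4096 > /sys/block/md0/md/stripe_cache_size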

event reporting

Linux seems to have lots of ways to report events to user-space. I want to report events like "a drive has failed" directly rather than having to re-read /proc/mdstat all the time. I need to find the best mechanism to do this.

online add-device to linear arrays

It is conceptually straightforward to add a drive to a linear array while the array is online. It needs to be made practically straightforward too.
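The obvious user-visible interface would be something like this (hypothetical until implemented):

    # append one more device to a running linear array
    mdadm --grow /dev/md0 --add /dev/sdc1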

raid1 write-mostly and write-behind

To support mirroring to a remote device (similar to DRBD) it would be good to add 'write-mostly' and 'write-behind' flags to devices in raid1.

'write-mostly' is easy, and means that we should never read from this device unless there is no alternative.

'write-behind' is harder. It means "don't wait for the write to complete before returning", and requires the data to be copied before we submit that write request.
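Creation might then look like the following sketch (the flags are proposals at this point, and /dev/nbd0 merely stands in for some remote device; write-behind ties in with the write-intent bitmap, which is what bounds the outstanding writes):

    # local disk first; the remote device is write-mostly, and up to
    # 256 writes to it may complete 'behind' the application
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          --bitmap=internal --write-behind=256 \
          /dev/sda1 --write-mostly /dev/nbd0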

Different device-create interface

Currently, /dev/mdX needs to exist before mdX can be assembled. This is contrary to the hotplug/udev approach. A different interface that could create mdX through some other window into the kernel would be good.

more transparency

It would be nice if md were more integrated with dm. It would also be nice if a "normal" device (e.g. a SCSI drive) could be transparently turned into a raid1 with one disk, which could then be extended.

Fixes

barriers

The block layer has a concept of 'barriers' and 'flushing' which md barely understands, if at all. It should be educated.

version-1 super

Probably a few fixes needed here, but particularly:

It might be good to flag spares more obviously.

bypass cache for some raid5/6 reads

Currently a read from a raid5 (or raid6) reads the data into the stripe cache, and then copies it into the client buffer. This is an unneeded copy in many cases. We should bypass the cache in the simple cases where there is no active write on the same stripe and the array isn't degraded.

read-only support

You can currently only switch an md array to read-only when no-one is using it. It would be better if this were 'no-one is writing to it', and a 'force read-only even if there are writers' option could be useful.
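The userspace side of the switch already exists; what needs to change is when the kernel will accept it:

    # switch to read-only -- today this fails if anyone has the array open
    mdadm --readonly /dev/md0

    # and back to read-write
    mdadm --readwrite /dev/md0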

start active degraded raid5 at boot time

md will not currently start an 'active' (i.e. dirty) raid5 if it is degraded. mdadm can re-write the superblock to convince md otherwise, but this is no good for booting from raid5. A kernel parameter to force this is needed.
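Something along these lines on the kernel command line would do (the parameter name is only illustrative at this point):

    # allow a dirty, degraded raid5 to be started at boot
    md-mod.start_dirty_degraded=1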

linear on v.large devices

Currently on 32-bit systems, a linear array cannot have components bigger than 2 terabytes. This should be relaxed.

linear chunksize of 0 should be allowed

'linear' treats 'chunksize' as 'rounding' and reduces the effective size of the device to a multiple of 'chunksize'. We should allow a 'chunksize' of 0 meaning 'no rounding' but we currently don't.
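In mdadm terms, where the --rounding option carries the 'chunksize', the desired usage would be something like (not currently accepted):

    # create a linear array with no rounding of component sizes
    mdadm --create /dev/md0 --level=linear --raid-devices=2 \
          --rounding=0 /dev/sda1 /dev/sdb1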

allow quiescent spares

md currently updates the superblock quite frequently in some circumstances (e.g. occasional write traffic). This includes the superblocks on spares. This means that 'spares' cannot spin down, should you want that.

We should allow spares to be incorporated into an array even if their event count is old, and then stop updating superblocks on spares when simply switching between active and inactive mode.

Set date for deprecation of START_ARRAY

Currently this is marked for deprecation "after 2.6". That doesn't mean anything any more. We need a date.




Comments...

Re: TODO list of Linux md/raid (11 December 2006, 01:21 UTC)

What is the current development/production status regarding "background check/repair"?


Re: TODO list of Linux md/raid (12 December 2006, 01:14 UTC)

(I should really do another TODO list shouldn't I ...)

"background check/repair" is stable in 2.6.18 and later. (maybe even eariler, I'm not sure).. You

echo check > /sys/block/mdX/md/sync_action
where X is the number of the md array. This causes a background check to run which will read all block and trigger auto-recovery of any read errors. Any inconsistencies will be counted and included in
cat /sys/block/mdX/md/mismatch_cnt
If you instead

echo repair > /sys/block/mdX/md/sync_action

it will do the same check process but if any inconsistencies are found they will automatically be corrected.

Finally you can

echo idle > /sys/block/mdX/md/sync_action

to abort any running check or repair.


Re: TODO list of Linux md/raid (12 December 2006, 12:20 UTC)
Thank you very much for your answer. I should have noticed linux/Documentation/md.txt and md(4) earlier, both peacefully resting on my hard disk; while trying to search the web, I completely forgot to check the standard places for documentation -- but hey, at least I found mdadm(8) :-).

Perhaps you could give us some annotations to the above TODO list on what has already been implemented. I noticed that you did implement "Don't kick drives on read errors", "sysfs support" and "start active degraded raid5 at boot time", but I'm not sure about the others. Alternatively, you could provide an up-to-date TODO list with all features/fixes still missing.

However and nonetheless, I'd very much like to thank you for all of this (md/mdadm), especially those features mentioned earlier. Keep up the good work. :-)


Re: TODO list of Linux md/raid (15 January 2007, 14:39 UTC)

My understanding is that on host power-failure, drives must be resynced in full. Are there any plans to enable resyncing only those parts that are not in sync, i.e. those areas of the drives which were being written to at the time of failure? I'm aware of the work along these lines in the context of 'dm'.

Thanks


Re: TODO list of Linux md/raid (15 January 2007, 23:11 UTC)

Partial resync has been supported since about 2.6.16, via 'bitmaps'. A 'bitmap' tracks which blocks of the device are potentially out-of-sync, and when a resync is needed, only those blocks are considered.

The bitmap can be stored 'internally' i.e. on each drive near the superblock, or externally in a file on another device.

There is some performance overhead, as the bitmap has to be written before the data can be written. Depending on workload, this can be unnoticeable, or as much as 20%.
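For example (device and file names assumed), a bitmap can be added to an existing array with mdadm:

    # internal bitmap, stored near the superblock on each drive
    mdadm --grow /dev/md0 --bitmap=internal

    # or an external bitmap, kept in a file on another device
    mdadm --grow /dev/md0 --bitmap=/var/lib/md0-bitmap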


Comment (13 November 2007, 07:17 UTC)
How about hotplug for RAID 1 (mirrors)?


Re: TODO list of Linux md/raid (13 November 2007, 07:28 UTC)

http://www.mail-archive.com/linux-raid@vger.kernel.org/msg06802.html

(In reply to a question posed in that post, "how does this work in a pure scsi environment?" answer: it doesn't work. Failing a hotswap aacraid JBOD drive kills the md raid1 devices on it. Here's a follow-up question: of what use is md raid1 if high availability is intended to be the main benefit of mirrors? answer: sadly none.)


Re: TODO list of Linux md/raid (16 November 2007, 03:58 UTC)

I think your final statement is overly simplistic. There certainly are configurations where raid1 cannot save you - the classic being 2 PATA drives on the same channel - one master, one slave. In this situation you will live through simple media errors, but not much else.

On a good SATA controller, any failure on one drive - including firmware crashing - should not affect any other device for more than a few seconds, so your uptime guarantees will be much better.

md/raid1 is certainly not a complete solution in itself. But when combined with sensible hardware choices, it can make your availability significantly higher.


allow quiescent spares (19 November 2007, 05:39 UTC)

Yes, quiescent spares could help implement MAID "massive array of idle disks".

So does "stopping a disk when you're not using it" really make it last longer than keeping it spinning all the time? (Continuous spinning wears out the spindle bearings faster; start/stop cycles wear out the read-write heads faster).



Re: TODO list of Linux md/raid (22 January 2008, 16:01 UTC)

I'm looking for a way to set --write-mostly and --write-behind online.

In http://www.issociate.de/board/post/296710/Fw:_mdadm_options,_and_man_page.html I read that you were thinking about this issue:

> > Questions:
> > 1. Will it be possible to set/unset the --write-mostly, --write-behind options online?
> > (I know, currently not.)
> Hmmm... I'll look into that.

Is there already some way to set this via /sys/*? mdadm can't do this for me online.


Re: TODO list of Linux md/raid (06 February 2008, 01:00 UTC)

If you write "writemostly" to /sys/block/mdX/md/dev-YYYY/state it will turn on write-mostly for that device. You can turn it off by writing "-writemostly".

"Writebehind" is a function of the bitmap. You cannot current change this directly online, but you can remove the current bitmap and add a new one with write-behind enabled.



Re: TODO list of Linux md/raid (14 February 2008, 23:06 UTC)

In professional life I use Proliants with SmartArrays so everything just always works. I'm new to md so forgive me if I make some "newbie" suggestions.

First, I like it A LOT. As I told someone recently who wanted to know "why I bothered" to mirror boot & root: "If I have the luxury of reloading an OS from CD or alternate disk, and then have the time it takes to configure it enough to restore a backup tape, then I can do the same with the user's data." To me, without boot & root, raid would be useless.

But Disk Druid needs to have a MUCH clearer understanding of what Raid is and how people use it. Four drives, /boot & root Raid1 while /usr, /var & /home are Raid5 ... that can take you all night to set up, and sometimes Druid gets so confused that Anaconda crashes. I have MANY Anaconda issues regarding Raid. Many crashes, many reformats ... many issues.

That said ... the md system itself ....

First, I'd like a monolithic config file. It seems that too many utilities pull too many things from different places. Then again, maybe one exists and I just don't know. One day I'll Read The Friendly Manual.

Next, while a GUI would be welcomed from an admin standpoint, the priority use should be monitoring: an xterm-style graphical display of volume health that can be left on the desktop. While printing /proc/mdstat is OK, it's impossible to tell someone ELSE what they're looking at if you have to lead them through it.

Next thing is ... WAY too many error messages stating "md: invalid raid superblock magic on /dev/md{whatever}" or "/dev/md{x} does not contain valid ext2 superblock" or other messages. Of those two ... why would anyone think that the first was acceptable/normal but not the second?

Last -- what others have said -- I want raid for high availability, so it needs to bust its buns to restart. If a degraded array won't start (btw, that's news to me!!!) then it's not a degraded array, it's a broken array.


Re: TODO list of Linux md/raid (18 February 2008, 04:47 UTC)


1. Yes, read the Fine Manual. That would be a great idea.
2. No, there is no monolithic config file. The config for each array is stored on the array itself. Information about how to find the various arrays is stored in /etc/mdadm.conf. It is possible that more info could go there, e.g. setting stripe_cache sizes etc. But without a concrete description of the problems you see, it is hard to address them.
3. I agree, a GUI would probably be nice. I don't write GUIs. Maybe somebody else does.
4. visual monitoring tool - find one that you like (gkrellm isn't bad) and if it doesn't monitor what you want it to monitor, ask the maintainers to improve it.
5. I would need more context about the error message to comment usefully.
6. A degraded raid5 array won't start after a system crash precisely because it is in fact a broken array. raid5 cannot withstand multiple failures. A missing drive plus a crash is multiple failures.

If you want to discuss md/raid further, I suggest linux-raid@vger.kernel.org would be the best place.


Re: TODO list of Linux md/raid (11 June 2009, 09:37 UTC)
Hi! I think there is a way to increase raid1 read performance on long files. Currently raid1 reads only from one disk, because striping would require head seeks. The raid10 solution with "far" replicas also has disadvantages.

But we could use the second (or third) disk to read ahead of the current request. For example, on a read request for blocks "ABCD", let the first drive serve it, but also make a request to the second disk (if it has no other requests) to read blocks "abcd" into a read-ahead cache. Blocks "abcd" sit at an offset from "ABCD". This means that when a long file is read, all the offset blocks can be taken from the cache filled by the second drive. The value of this offset could be set via /proc, or recalculated dynamically. This method could double read performance without any changes to the stored data.

What do you think about this algorithm? Artem Ryabov


Re: TODO list of Linux md/raid (11 June 2009, 09:47 UTC)

It is conceivable that something like that could work. raid1 would need to keep track of how many pending read requests there are against each drive and when some threshold is crossed, prefer the other drive even if the head appears to be further away.

Why don't you give it a try!!


Re: TODO list of Linux md/raid (11 June 2009, 10:20 UTC)
Oh, I am not good enough with kernel internals to implement this.


Re: TODO list of Linux md/raid (28 November 2009, 12:03 UTC)

Since I recognize some features as implemented by now, but am not clear about, for example, desktop notification, could you mark the listed items as done/pending/etc., or link to an updated status page?


Re: TODO list of Linux md/raid (30 November 2009, 02:30 UTC)

Yes, I think everything on this list is "DONE".

A more recent similar document is http://neil.brown.name/blog/20090129234603

I don't know what you are referring to when you mention "Desktop notification".


Comment (02 December 2009, 15:12 UTC)
Thank you for your answer and the link!






