Auto-assembly mode for mdadm

21 May 2006, 09:26 UTC

Probably the most wanted feature of mdadm is auto-assembly. People want it to just do-the-right-thing. They want to simply be able to assemble all of their arrays without having to worry about creating and maintaining config files or anything like that.

I've always been against blind auto-assembly as it can cause (and occasionally has caused) problems when the wrong thing gets assembled.

However it is possible to find a middle ground, that isn't completely blind, but that requires minimal configuration effort. I've finally figured out how I want to implement that and scheduled the time to do it, and so it should appear in mdadm-2.5.

The core idea is to record the host name in each RAID array. mdadm can then assemble every array that it can find, provided it is for 'this' host.

# What are the dangers?
# What is the solution?
# The fine details
# Bootstrapping
# Finding the host name
# Comments...


What are the dangers?

The chief problem with blind auto-assembly is that if you move the drives comprising a RAID array from one host to another, and if the destination host already has RAID arrays, then the auto-assembly could assemble the new drives rather than the old drives, resulting in a situation that is at least confusing, and may be detrimental.

The other problem, which is substantially less significant, is that if someone has been experimenting with RAID and created a few different arrays with different devices, then these old arrays might 'magically' reappear, and could get confused with arrays that should exist.

What is the solution?

As mentioned, the solution is simply to store the host name in the superblock, and only to assemble arrays which contain the right host name. This removes the problem of incorrectly assembling arrays imported from another machine.

We also need to check the create time for the array (which is already stored in the superblock) and if there is any ambiguity (multiple arrays with the same name) assemble the most recently created array.

Storing the host name in a version-1 superblock is quite easy. We have a 32 character 'name' field which can be used for whatever mdadm wants. We decide to treat that as having an optional hostname prefix separated from the rest of the name by a ':'. So if the name contains a colon, everything before that is the host name.
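As a rough illustration of the colon convention, here is a minimal Python sketch. The function names `split_name` and `join_name` are made up for this example and are not part of mdadm; the only facts taken from the text are the 32 character field and the ':' separator.

```python
from typing import Optional, Tuple

NAME_FIELD_SIZE = 32  # size of the version-1 'name' field, per the text


def split_name(name_field: str) -> Tuple[Optional[str], str]:
    """Split a name field into (host, array_name).

    If there is no colon, the host is unknown (None) and the whole
    field is the array name.
    """
    host, sep, rest = name_field.partition(":")
    if sep:
        return host, rest
    return None, name_field


def join_name(host: str, array_name: str) -> str:
    """Build a 'host:name' field, refusing values that would not fit."""
    field = f"{host}:{array_name}"
    if len(field.encode()) > NAME_FIELD_SIZE:
        raise ValueError("host and array name do not fit in 32 bytes")
    return field
```

Note that the split happens at the first colon, so in this sketch the array name may itself contain colons but the host name may not.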

Storing the host name in a version-0.90 superblock isn't quite so easy. There isn't really anywhere to store it. While there is some unused space, squeezing a hostname in there is rather ugly.

So instead, we borrow some of the 16 byte uuid. This is normally chosen randomly to ensure that different arrays will have different uuids and make sure bits of different arrays don't get confused. If a hostname is known when an array is created, we will now use 8 bytes of random data, and 8 bytes taken from the SHA1 hash of the host name. This should provide a very similar guarantee of uniqueness, but allow us to tell if an array is intended for a particular machine or not.

Doing it this way does mean that we cannot easily tell what host an array is intended for, but that is a fairly small cost. If you want this functionality, use version-1 superblocks.
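The uuid trick can be sketched in a few lines of Python. This is only an illustration: which 8 bytes of the SHA1 digest are used, and which half of the uuid they occupy, are assumptions made for the example and may not match what mdadm actually does.

```python
import hashlib
import os


def host_tag(hostname: str) -> bytes:
    """8 bytes derived from the host name (assumed: start of the SHA1 digest)."""
    return hashlib.sha1(hostname.encode()).digest()[:8]


def make_uuid(hostname: str) -> bytes:
    """16 byte uuid: 8 random bytes followed by 8 bytes of host hash."""
    return os.urandom(8) + host_tag(hostname)


def uuid_is_for_host(uuid: bytes, hostname: str) -> bool:
    """True if the host half of the uuid matches this host name."""
    return len(uuid) == 16 and uuid[8:] == host_tag(hostname)
```

The check only works in one direction: given a host name we can test whether a uuid belongs to it, but given a uuid we cannot recover the host name, which is the cost mentioned above.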

The fine details

So assuming all of your arrays have been tagged by the host name, how is mdadm going to auto-assemble them?

We make multiple passes through the list of available devices. The first superblock we find for an array on 'this' host is put aside and all other superblocks for the same array are put with it. When we have found all available devices for the array, we try to assemble.

While looking, if we find a superblock for a different array (different uuid) but the same name (more on that later) then we check the create time and choose this superblock in place of the current set if this is newer.
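The selection rule can be sketched as follows. This is a simplified, hypothetical model rather than mdadm's code: it groups superblocks in one sweep instead of making multiple passes, and the `Superblock` fields are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Superblock:
    device: str   # e.g. "/dev/sda1"
    uuid: str     # identifies the array
    name: str     # array name, e.g. "0"
    host: str     # host the array is tagged for
    ctime: int    # array creation time


def choose_arrays(superblocks: Iterable[Superblock],
                  this_host: str) -> Dict[str, List[Superblock]]:
    """Return, for each array name, the member superblocks to assemble."""
    # Group the superblocks for 'this' host by array uuid.
    by_uuid: Dict[str, List[Superblock]] = {}
    for sb in superblocks:
        if sb.host == this_host:
            by_uuid.setdefault(sb.uuid, []).append(sb)

    # Where two arrays share a name, keep the most recently created one.
    chosen: Dict[str, List[Superblock]] = {}
    for members in by_uuid.values():
        name = members[0].name
        current = chosen.get(name)
        if current is None or members[0].ctime > current[0].ctime:
            chosen[name] = members
    return chosen
```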

When we find a collection of superblocks that form an array, we need to decide what device name to use for assembling the array. For version-0.90 arrays, we use the minor number stored in the superblock to determine the device name.

For version-1 arrays, we take the remainder of the name field. If this is numeric, we treat it just like the minor number of version-0.90 arrays. If it is non-numeric, we choose a free minor number, and create a device with the given name under /dev/md/.
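A small Python sketch of the naming rule for version-1 arrays. The function `device_for` is hypothetical, and picking the lowest free minor number is an assumption; the text only says that a free minor is chosen.

```python
import itertools
from typing import Set, Tuple


def device_for(array_name: str, used_minors: Set[int]) -> Tuple[int, str]:
    """Return (minor, device path) for the array-name part of a v1 name."""
    if array_name.isascii() and array_name.isdigit():
        # Numeric name: use it directly as the minor number.
        minor = int(array_name)
        return minor, f"/dev/md{minor}"
    # Non-numeric name: take a free minor (assumed: the lowest one)
    # and expose the array by name under /dev/md/.
    minor = next(m for m in itertools.count() if m not in used_minors)
    return minor, f"/dev/md/{array_name}"
```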

Bootstrapping

One problem with introducing this functionality is that people have pre-existing arrays that aren't tagged with the host name. To help with this we will have a new 'update' option of --assemble: --update=homehost which will update the host information in the superblock prior to assembly. This is usable fairly easily for everything except an array holding a root filesystem. In order to avoid needing to boot from different media, there will be an option that can safely be used from an initramfs which will do the right thing.

This will probably be called --auto-update-home-host.

This option is only meaningful when doing a hostname based auto assembly. If the autoassembly process finds anything to assemble, the option is ignored. However if nothing is found with the right host name, then a second pass is made. On this pass, any md array that is found is updated to belong to the current host, and is automatically assembled.

Thus it should be safe to always run mdadm with --auto-update-home-host in initramfs. It will only do its magic once, and after that the arrays should always assemble properly.
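The two-pass behaviour can be modelled in a few lines. This is a hypothetical sketch of the logic described above, not mdadm code; `FoundArray` and `select_for_assembly` are names invented for the example, and "updating" an array is reduced to rewriting a field in memory.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FoundArray:
    uuid: str
    host: str  # host name recorded in the superblock


def select_for_assembly(found: List[FoundArray], this_host: str,
                        auto_update_home_host: bool = False) -> List[FoundArray]:
    """Return the arrays that would be assembled on this host."""
    # First pass: only arrays already tagged for this host.
    mine = [a for a in found if a.host == this_host]
    if mine or not auto_update_home_host:
        return mine
    # Second pass: nothing matched, so adopt every array that was found.
    for a in found:
        a.host = this_host
    return list(found)
```

Once the second pass has run, the first pass succeeds on every later boot, which is why the option only does its magic once.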

Finding the host name

All that is left is to find the host name. Often the host name is stored in a file on the root filesystem. If this is on a raid array, then we won't be able to assemble the array until we know the host name, and cannot find the host name until we assemble the array. To break this deadlock it is recommended that the host name be passed as a kernel parameter via whichever boot loader is being used.
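On Linux the kernel command line is readable from /proc/cmdline, so an initramfs script could pick the host name out of it along these lines. The parameter name `md_homehost` is purely an assumption for this example; the text does not name the kernel parameter.

```python
from typing import Optional


def homehost_from_cmdline(cmdline: str, key: str = "md_homehost") -> Optional[str]:
    """Return the value of key=value on a kernel command line, or None."""
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name == key:
            return value
    return None


def read_homehost(path: str = "/proc/cmdline") -> Optional[str]:
    """Read the kernel command line and extract the (hypothetical) parameter."""
    with open(path) as f:
        return homehost_from_cmdline(f.read())
```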




Comments...

Re: Auto-assembly mode for mdadm (02 September 2006, 22:48 UTC)

why not just sign each device in the raid with some array specific signature, and if you want to double check make it some hash that includes information about the configuration of that array plus some random/specified but likely to be unique key as well. So if you do move the device to a new host and there does happen to be another array there, it is unlikely for the hash to match any of the arrays on the new host. Also for auto assembly then you just need the hash from any one device to know where to look for the other components of this array.

Host name is not bad, but there is the problem of having to look on disk or something for this, unless say it's written into the kernel.

If there were some key written into the kernel to uniquely id that kernel and build you could try using that too.

I don't think host name is a very good option just because in many cases it requires being able to read contents from your array before you assemble it. Of course if you have a separate boot partition you could just have a file there containing the name. But I think that having all the information in the per device (right term?) superblock would be better and would solve the concern you listed.



Re: Auto-assembly mode for mdadm (05 September 2006, 05:18 UTC)

Each device does contain an 'array specific signature' - the uuid in the superblock.

We do ensure that if drives are moved from one host to another the drives from one don't get confused with the drives from the other. However that isn't the main issue.

One of the main issues is that each array records its 'name'. So the array with '0' recorded becomes /dev/md0. The array with '1' recorded becomes /dev/md1. If you find 2 arrays that both claim to be '0', which one do you believe? I propose using the hostname to distinguish.

mdadm doesn't require that the string given to --homehost actually be a hostname. It can be any string that will identify the host. You cannot use something from the kernel build as sites often run exactly the same kernel on all (or many) machines.

Getting a host name isn't really that hard. In some situations you could get it from dhcp. In others you could build it in to the initrd. Or you could use MAC address if you are sure to only have one. But somehow you need to identify 'this' host.

Thanks for your thoughts.


Re: Auto-assembly mode for mdadm (08 December 2006, 12:54 UTC)

Looks good - here's another example which should be fixed by using a combination of hostname and creation date:

  • Install to RAID1 pair
  • Disconnect one half
  • Reinstall on the other half (new UUID)
  • Connect both halves
  • Boot
  • Ugliness may ensue

FYI - This is a common telco approach to upgrades - backup, break the mirror, fresh install on one half, restore databases, attach mirror. Or on a bad upgrade, revert to the "good" half of the mirror.

Refs:

http://www.contribs.org/bugzilla/show_bug.cgi?id=961
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=184570

Gordon


Re: Auto-assembly mode for mdadm (16 December 2006, 22:45 UTC)

I'm sort of new to mdadm, but I have 15 years of experience with Novell NetWare, and in that time I have seen and touched a lot of servers, most having software mirroring. NetWare has a "mirror group" ID (32 bit) and all disks have a 32-bit ID (based on timestamps). It is highly unlikely to find two disks from two different systems having the same group ID and a disk ID belonging to it (not to mention the same size for the partition). I have swapped/changed disks and broken/created mirrors on hundreds of occasions during migrations, and I never had any problem with that. I strongly believe that it is much more likely to find two systems with the same hostname... (I mean like: "server", "backup", etc...).

I would really like to understand how mdadm works in technical depth, possibly without studying the source code. For one thing, I just cannot imagine a "wrong thing gets assembled" scenario. Can someone help me with this? I mean having two disks from two systems with the same 16 byte UUID and the same partition size (if that is needed for mdadm) is so unlikely that we can just skip that. Or am I missing something here?

About "One of the main issues ... If you find 2 arrays that both claim to be '0', which one do you believe?" Well, such an issue can only happen if someone moves one disk to another system on purpose. Also I fail to see why this is an "assemble" problem, if we have different UUIDs on both disks. Then there should simply be two broken md0-s, the "second" md0 being called md1 (or the next available #). This is more like a mounting/booting issue, as if I wanted to boot from md0(:1) then I'd need to specify md1. This would be a temporary problem if I could just "rename" the separate array member disk "name".

Also consider that some people (like myself) use "ironclad" kernels, eg. no "initrd and stuff". (This leaves -e 0.90 for system partitions if I see correctly.)


Re: Auto-assembly mode for mdadm (15 January 2007, 04:42 UTC)

Replying to Gordon:

From looking at the bug reports, the core problem is that you want to have (during a transition period) two split mirrors on the one system. These are independent system images, the old and the new. If you boot from the 'old' device, you want the partitions on that device to be assembled in md0, md1,... as degraded arrays. When you boot from the 'new' device you want partitions from that device to become the mdX arrays.

This is a perfect example of where in-kernel autodetect won't work, as it cannot really know.

User-space would have a hard time 'knowing' too, but you could tell it.

One approach would be to

  • use a different homehost name for the 'new' installation (it is just a string after all), maybe $HOSTNAME-new
  • Have some custom code in the initrd figure out which device was booted from and call "mdadm --autoassemble" with an appropriate homehost.

Fundamentally this is a very special case scenario. The choice of which array to make /dev/md0 cannot be determined in a totally generic way. You need some special case code. The obvious place for that is in initrd. The new 'homehost' concept should make it fairly easy for that special-case code to tell mdadm what to do.


Does that seem reasonable?


Re: Auto-assembly mode for mdadm (15 January 2007, 05:06 UTC)

Replying to the comment just after Gordon's (I guess I need hierarchical comments, don't I ...).

  • Yes, having two hosts with the same name is a real possibility. In Gordon's case it was two instances of the same host. Other cases should be fairly uncommon in the one data center though.

    I'm not sure how best to deal with that. If mdadm is to reliably assemble one specific array as 'md0', then there needs to be some way for it to know which is the right md0. Aside from host name... what is there?

  • Yes, having the same UUID on two arrays is a possibility not worth worrying about. So if you identify the md device by UUID then there is no problem. But people don't seem to want to do that. They want to mount '/dev/md0', not '/dev/md/xxxxx-yyyyyy-dddddd-qqqqqqq'.
  • Maybe you are right. Maybe it is a boot/mount problem, not an md assemble problem. Maybe I should have an option that assembles all arrays that are found, assigning random unit numbers to each, then rely on mount-by-uuid or similar to mount the right filesystem. That might work for some people, but some people won't like it.

  • I suspect it won't be very long before you cannot use linux without an initrd. It really is the way of the future, and fighting against it is not going to get you anywhere. Much better to embrace it, steer it, and make it work for you. It is a bit like 'udev'. It might not be perfect but it is here to stay.

Thanks for your input.


Re: Auto-assembly mode for mdadm (15 January 2007, 06:17 UTC)

>> If you boot from the 'old' device, you want the partitions on that device to be assembled in md0, md1,... as degraded arrays. When you boot from the 'new' device you want partitions from that device to become the mdX arrays.

Yep. We dealt with the issue by avoiding raidautorun and using mdassemble to access the correct set of devices in the initrd. Since we know which devices exist at install time, we can decide which ones we need.

>> The new 'homehost' concept should make it fairly easy for that special-case code to tell mdadm what to do.

>> Does that seem reasonable?

Yes. Another option which would work would be a "root=UUID-of-root" option or "md 0=UUID-of-md0,1=UUID-of-md1,..."

Gordon


Re: Auto-assembly mode for mdadm (28 January 2007, 10:03 UTC)

>> Yes, having two hosts with the same name is a real possibility. In Gordon's case it was two instances of the same host. Other cases should be fairly uncommon in the one data center though.

Depending if we're talking about fqdn or just a name like "server", "mail", "backup", etc. but that is not the main problem.


>> I'm not sure how best to deal with that. If mdadm is to reliably assemble one specific array as 'md0', then there needs to be some way for it to know which is the right md0. Aside from host name... what is there?

Let's ask the question: _when_ is such to be dealt with?

I believe the "multiple md0" scenario only happens if someone wilfully puts two md0-s (or parts of them) into one PC (/disk subsystem). Then one _must know_ that there will be minor problems. (Also I believe this mostly happens for a temporary period, such as when migrating one disk's data to a new mirror, etc.)

So my (and possibly others') concern is:

- Two different md0's partial disks should NEVER be assembled into one md0 (as it would result in data loss for one of the disks). (So a different UUID is more important here; I'm not talking about "mount /dev/md/xxxxx-yyyyyy-dddddd-qqqqqqq".)

- The ordering: As I usually need it for a temporary solution, this wouldn't matter too much for me, but even so, if it follows some logic (eg: always use the same disk for md0), permanent automount shouldn't be a problem. If possible, I "suggest" using the first reported (mirror member) disk as md0, then the next as md1, etc. The "fix-named" arrays could be shifted, or just not mounted. If I have a "working" system (md0), then I should be able to sort out the rest.


>> I suspect it won't be very long before you cannot use linux without an initrd.

That would mean, I cannot compile a "single-file" kernel with all my required modules in it. I don't see that coming. There are several servers of mine which need to have a "brick and mortar" kernel, not allowing "kernel module injection" and not relying on "module names" which can be changed, like it happened before to me, when lan-driver name was changed and server rebooted normally, but without lan driver... I think this part is getting off topic, so let's concentrate on the md* part :-)


Cheers, Gabor



Re: Auto-assembly mode for mdadm (31 October 2008, 10:50 UTC)
OK, all is well, but what happens when the hostname changes? For example I change from "server" to "server-john". Would the md0 for example be assembled? If I am right, I think this would be a big annoyance.


Re: Auto-assembly mode for mdadm (21 December 2008, 00:17 UTC)

I think today I noticed a problem described here. I have a running server with two disks (2x400) running in RAID-1, /dev/md0-5. I prepared another set of disks (2x500) on another machine: /dev/md0, /dev/md1, /dev/md2. And now, when I put these 2x500 disks into my server which already has a RAID-1 array (2x400), the server WON'T boot. Kernel panic - could not mount root. Right before that, some messages like:

/dev/md0 (driver?) /dev/md1 (driver?)


So, I think, I need to "destroy" the 2x500 array on the other machine and then create it again directly on my server. I found this site while looking for a way to tell mdadm to create the arrays statically or something.

Peace.





