A Nasty md/raid bug

15 June 2012, 07:32 UTC

There is a rather nasty RAID bug in some released versions of the Linux kernel. It won't destroy your data, but it could make it hard to access that data.

If you are concerned that this might affect you, the first thing you should do (after not panicking) is to gather the output of

mdadm -Evvvvs

and save this somewhere that is not on a RAID array. The second thing to do is read to the end of this note and then proceed accordingly. You most likely will never need to use the output of that command, but if you do it could be extremely helpful.

# Who is vulnerable
# What does it do
# What should I do to avoid being bitten by the bug
# What should I do if I have already been bitten
# What was the bug anyway
# Comments...

Who is vulnerable

The bug was introduced by

commit c744a65c1e2d59acc54333ce8 md: don't set md arrays to readonly on shutdown.

and fixed by

commit 30b8aa9172dfeaac6d77897c67ee9f9fc574cdbb md: fix possible corruption of array metadata on shutdown.

These entered the upstream kernel for v3.4-rc1 and v3.4-rc5 respectively, so no main-line released kernel is vulnerable.

However the first patch was tagged "Cc: stable@vger.kernel.org" as it fixed a bug, and so it was added to some stable releases.

For v3.3.y the bug was introduced by commit ed1b69c5592d1 in v3.3.1 and fixed by commit ff459d1ea87ea7 in v3.3.4, so v3.3.1, v3.3.2, and v3.3.3 are vulnerable.

For v3.2.y the bug was introduced by commit 6bd620a44f7fd in v3.2.14 and fixed by commit 31097a1c490c in v3.2.17, so v3.2.14, v3.2.15, and v3.2.16 are all vulnerable.

The bug was not backported to any other kernel.org kernels, so only those 6 are vulnerable. Some distributors may have picked up the patch and applied it to their own kernels, so it is possible that other kernels are vulnerable too.

For example, SLES11-SP2 introduced the bug in 3.0.26-0.7 and fixed it in 3.0.31-0.9.

Ubuntu-precise introduced the bug in Ubuntu-3.2.0-22.35 and fixed it in Ubuntu-3.2.0-24.38.
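As a rough aid, the kernel.org version list above can be turned into a small shell check. This is only a sketch: the function name is made up for illustration, it knows about the six kernel.org releases named in this note and nothing else, and it cannot tell you anything about vendor kernels, which use their own numbering.

```shell
# Sketch: succeed if the given kernel.org version is one of the six
# releases listed in this note. Vendor kernels are NOT covered.
is_vulnerable_kernel() {
    case "$1" in
        3.3.1|3.3.2|3.3.3)    return 0 ;;  # v3.3.y: introduced in 3.3.1, fixed in 3.3.4
        3.2.14|3.2.15|3.2.16) return 0 ;;  # v3.2.y: introduced in 3.2.14, fixed in 3.2.17
        *)                    return 1 ;;
    esac
}

# Example use on the running kernel, stripping any suffix such as "-1-ARCH":
#   running=$(uname -r)
#   is_vulnerable_kernel "${running%%-*}" && echo "kernel is on the vulnerable list"
```

A "not on the list" answer only means the version string does not match; if you run a distribution kernel you still need to check with your vendor.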


What does it do

The bug only fires when you shutdown/poweroff/reboot the machine. While the machine remains up the bug is completely inactive. So you will only notice the bug when you boot up again.

The effect of the bug is to erase important information from the metadata that is stored on the disk drives. In particular the level, chunksize, number of devices, data_offset and role of each device in the array are erased ... and probably some other information too. This means that if you know those details you can recover your data, but if you don't, it will be harder. Hence the "mdadm -E" command suggested earlier.

The bug will only fire if, while the machine is shutting down, an array is partially assembled but not yet started. If the array is started and running it is safe ... as long as it stays that way.

The typical way that an array can get into this "partially assembled but not started" state is for "mdadm --incremental" to have been run on some, but not all, of the devices in the array. If it had been run on all it would have started the array, but until then it is assembled but not started and this is the dangerous situation.

Another way it can happen is if you use "mdadm -A" to assemble the array and explicitly list some, but not all, of the devices that make up the array. This also will assemble the array and not start it.

It is very unlikely that an array will be in this state at shutdown, though certainly not impossible, so it seems unlikely that the bug will affect many people. But it has affected some.

Most of the people who have reported problems have been running Ubuntu. This might just be because Ubuntu happened to make a release with a vulnerable kernel and no other distro did, or it might be that Ubuntu does something during shutdown that makes triggering the bug more likely.

For example, suppose Ubuntu stops arrays as part of shutdown (which is a good idea), and has udev scripts which automatically pass changed devices to "mdadm --incremental" in case they are part of an array (which is also a good idea, and I think they do). If something then causes udev to think that the devices that were part of the array had changed, which seems unlikely but is not completely impossible, then arrays could get partially re-assembled late in the shutdown sequence, which could trigger exactly the bug we see.

Note that I don't know that Ubuntu does this, and I'd be surprised if they did, and there is probably some other explanation. But I like to try to think of all possibilities to try to understand things.

What should I do to avoid being bitten by the bug

If you are running one of the kernels identified above as containing the fix (3.4, 3.3.4, or 3.2.17), or a newer one, then you are not vulnerable and there is nothing else that you need to do.

If you are running a kernel that was compiled before March 2012 (when the bug was created), then you aren't vulnerable, and as long as you don't upgrade to a vulnerable kernel you will continue to be safe.

If you are running a vendor kernel compiled since 19th March 2012 but with a version earlier than those listed above as containing the fix, then I cannot tell you if you are vulnerable or not. You might need to check with your vendor.

If you decide to upgrade your kernel, you should do so carefully. Remember that the bug triggers on shutdown/reboot so you aren't safe until the new kernel is running.

To be completely safe you must ensure that no arrays are partially assembled at the moment that the shutdown process in the kernel looks at md arrays. To do this you can run

mv /sbin/mdadm /sbin/mdadm.moved
/sbin/mdadm.moved --stop --scan

Any array that is partially assembled (and any other array that is not in use) will be stopped. As "/sbin/mdadm" will not exist, no array can be partially - or fully - assembled.

Doing this may cause the shutdown process to complain if it cannot find mdadm to stop arrays itself, but this should not be a problem.

The boot process in the new kernel might also complain as it won't be able to find /sbin/mdadm either. You might have to boot into a rescue mode and

mv /sbin/mdadm.moved /sbin/mdadm

back into place. Then boot again.

Also this fiddling with moving mdadm might be completely unnecessary. Or it might not. I cannot be sure. What I can be sure of is that doing it this way is safest.

Once you are running a non-vulnerable kernel, you are safe and cannot be bitten - by this bug at least.

What should I do if I have already been bitten

You will know that you have been bitten if some array or arrays don't seem to work and all the devices appear to be spares, and if you then use "mdadm --examine" to look at the devices and find that the RAID level is "-unknown-". At this point you might want to send mail to linux-raid@vger.kernel.org and ask for help. Or you might be able to work your way through the following and fix it by yourself.
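The symptom described above can be checked mechanically. The sketch below reads "mdadm --examine" output on standard input and succeeds if any device reports the "-unknown-" level. The function name is invented for illustration, and the exact spelling of the field ("Raid Level : -unknown-") is an assumption about how your mdadm prints it.

```shell
# Sketch: succeed if "mdadm --examine" output on stdin contains a device
# whose RAID level is reported as "-unknown-".
# Assumption: the field is printed as "Raid Level : -unknown-".
looks_bitten() {
    grep -Eq 'Raid Level[[:space:]]*:[[:space:]]*-unknown-'
}

# Example use (run as root on a suspect member device):
#   mdadm --examine /dev/sda1 | looks_bitten && echo "metadata looks affected"
```

This only reads metadata, so it is safe to run on an array you have not yet tried to repair.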

The way to fix an array that has been affected by the bug is to "Create" the array again with mdadm. Only the metadata has been destroyed and "mdadm --create" only writes new metadata so this is a simple and effective fix and makes all data available again. Constructing the correct "mdadm --create" command might be tricky. If you have recent "mdadm --examine" output, then that will be a big help. If not, you can probably get some of the required information from somewhere. Maybe from your memory, maybe from kernel logs, maybe from guessing and seeing if it works.

First you need to know the metadata version that was in use. This is one piece of information that is not destroyed so "mdadm --examine" of one of the devices will give it to you. It might be 0.90 for older arrays, or 1.2 for newer arrays, or possibly 1.0 or 1.1 if they were chosen when the array was originally created.

Next you need to know the number of devices, RAID level, layout, and chunk size. Some of these you might simply know (Maybe you know it is a 5-device RAID6 array) or can determine by examining kernel logs or asking a colleague. Others you can probably assume were the default if you don't know otherwise. People rarely set a non-default layout for RAID5 or RAID6, though they do for RAID10. If you have 0.90 metadata, you probably have 64K chunks while if you have 1.2 you probably have 512K chunks so if you don't have any reason to think otherwise, this is a good place to start.

It is entirely possible that there is one device that was in the array in the recent past, but wasn't in the array at the moment when it was corrupted. If this happens, then that device is a good source for some of this information.

Then you need to know the order of devices, and whether the array was degraded, in which case some devices will have to be specified as "missing". As devices can change order when there are failures and spares are added, you cannot be sure that the obvious order is the correct order. For example, it might be sda1, sda2, sda3, and sda4, but it could be sda1, sda2, sda4, sda3. For RAID1, this doesn't matter of course. For RAID4, 5, 6, and 10 it is very important. Your kernel logs might have this information in a "RAID conf printout", so it is worth checking. Remember however that device names can change after a reboot so prepare to be flexible.

If you have a recent output of "cat /proc/mdstat" that might be useful, but be careful. The numbers in brackets, as in "sda1[3]", are not always the position of the device in the array. For 0.90 metadata they are, but for 1.x they show the order in which devices were added to the array. When a spare is changed to being an active member of the array, this number does not change. So it might be an indicator, but it is definitely not a promise.

Finally you need to know the "Data Offset" of each device. This only applies to 1.1 and 1.2 metadata. Different versions of mdadm use different values for the default Data Offset, so it is best to try to recreate the array using the same version of mdadm as was used to create the array.

Worse - if you created the array with an older version of mdadm, then added a spare with a newer version then it is possible that different devices have different data offsets. If that seems to be the case it would be best to ask for help on linux-raid@vger.kernel.org as you'll need a special non-released version of mdadm to fix things up.

Once you have assembled all this information you can try creating the array. Note that you don't need information about whether there was a write-intent bitmap or not. Just assume there wasn't and don't try adding one. Once you have the data back you can always add a bitmap later.
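If you did save "mdadm -Evvvvs" output while the array was healthy, most of the values discussed above can be pulled out of it with a filter. This is a sketch only: the function name is hypothetical, and the field labels are assumed to match what mdadm prints for 1.x metadata (0.90 metadata is laid out differently, so check the raw output as well).

```shell
# Sketch: keep only the lines of saved "mdadm --examine" output that are
# needed to rebuild a "mdadm --create" command.
# Assumption: field labels as printed for 1.x metadata.
examine_summary() {
    grep -E '^(/dev/|[[:space:]]*(Version|Raid Level|Raid Devices|Chunk Size|Layout|Data Offset|Device Role)[[:space:]]*:)'
}

# Example use on a file saved earlier (the path is a placeholder):
#   examine_summary < /root/mdadm-examine.txt
```

Lines beginning with "/dev/" are kept so that each group of fields stays attached to the device it describes.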

When you create the array, give all the details and make sure to specify "--assume-clean". This ensures that mdadm doesn't start any resync. This is important because, if you get details wrong and need to try again, a resync could corrupt data, making subsequent attempts pointless.

So something like:

mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=4 \
      --chunk 128 --metadata=1.2 /dev/sda1 /dev/sda2 missing /dev/sda4

This will write out new metadata, assemble the array, but not write to any data. You should then try to determine if the array contains the correct data. How you do this depends on what was there before. If you had an ext2/3/4 filesystem "fsck -n /dev/md0" is the thing to use. If XFS, then xfs_check is the tool of choice. If you used LVM, then you might need to try starting LVM (vgchange -ay) and if that works, run 'fsck' or whatever on the contents.

Only if 'fsck' reports positive results should you try mounting the filesystem. Even if you mount the filesystem read-only, some filesystems might try to write to the device anyway, and you don't want that until you are sure.

If you don't know the correct chunk size, or don't know the correct order of devices, you might need to try multiple permutations until you find one that works. To speed this up a little you can assume that once "fsck" or "vgchange" reports that it sees something that looks vaguely right, even if there are lots of errors, then you probably have the first device in the array correct, so you only need to continue permuting the other devices.
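Listing the candidate device orders by hand gets error-prone quickly, so here is a minimal sketch that prints them. It only prints "mdadm --create" commands; it never runs them. The function name is invented, and the device names, level, and chunk size are placeholders taken from the example command above, not values for your array.

```shell
# permute PREFIX DEV...: print every ordering of DEV..., each line starting
# with PREFIX (the devices whose positions are already settled; may be "").
# The body runs in a subshell "( ... )" so each recursion level keeps its
# own variables and positional parameters.
permute() (
    prefix=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "$prefix"
    else
        n=$#
        i=0
        while [ "$i" -lt "$n" ]; do
            first=$1
            shift
            permute "${prefix:+$prefix }$first" "$@"
            set -- "$@" "$first"   # rotate: move the tried device to the end
            i=$((i + 1))
        done
    fi
)

# Print (not run) one candidate command per ordering of the last three
# devices, with /dev/sda1 assumed to be known as the first device.
permute /dev/sda1 /dev/sda2 /dev/sda3 /dev/sda4 | while read -r order; do
    echo "mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=4 --chunk 128 --metadata=1.2 $order"
done
```

Passing an empty string as the first argument permutes every device. A degraded array can be handled by including the word "missing" in the device list.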

If your array is RAID6, and if it is missing at most one device, then an alternate approach to checking if the order is correct is to run

echo check > /sys/block/md0/md/sync_action

(where "md0" might be replaced by a different mdX in your case).

If /sys/block/md0/md/mismatch_cnt grows into the thousands quickly, then you certainly have the order wrong and should stop the array and try again. If it reports zero or only a small number of hundreds then the order is probably correct and it is worth running 'fsck' etc.
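The rule of thumb above can be written down as a small helper. This is a sketch under a stated assumption: the cut-off of 1000 is my reading of "into the thousands" and is not an exact threshold, and the function name is made up. It reads the counter from the file given as its argument, defaulting to the md0 path used above.

```shell
# Sketch: classify a mismatch count read from a sysfs-style file.
# Assumption: 1000 is used as the cut-off for "into the thousands".
mismatch_verdict() {
    count=$(cat "${1:-/sys/block/md0/md/mismatch_cnt}")
    if [ "$count" -ge 1000 ]; then
        echo "wrong"       # stop the array and try another order
    else
        echo "plausible"   # worth going on to 'fsck' etc.
    fi
}

# Example use a short while after starting the check:
#   mismatch_verdict /sys/block/md0/md/mismatch_cnt
```

A count that is still small after only a few seconds proves little, so let the check run for a while before trusting a "plausible" answer.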

This does not work for RAID5, as the parity used for RAID5 is insensitive to the order of devices. It can be useful for RAID10 but a positive result is not as strong as a positive result for RAID6.

Once you have a valid filesystem and you can see all your data, I recommend a "check" and possibly a "repair" sync_action, a full "fsck", and probably it's a good time to refresh your backups.

What was the bug anyway

So, you want to know the background do you? Well..... are you sitting comfortably?

md has always had a "reboot notifier" that tried to make sure nice things happened to arrays at shutdown. The particular purpose was to reduce the possibility that a resync might be needed on reboot, even if the shutdown wasn't particularly clean.

So at shutdown, md would try to switch all arrays to "read-only" mode so no more writes are allowed, and so the array would be marked as clean.

This originally applied only to arrays which were not active, but when people started having root on md more often, they started getting complaints that md couldn't stop all arrays - because the array holding the root filesystem would still be busy at this point.

So in 2008, for 2.6.27, md changed to switch any array to read-only, even if it was still in use. No more writes should be happening at this point, the only use should effectively be read-only, so everyone should be happy.

However sometime around the 3.0 kernel there were changes to the way filesystem write-back was happening so that writes could still arrive while the reboot was progressing. A normal clean shutdown that called "sync" first should avoid that, but people don't always do that - sometimes with good reason.

md strongly believes that no-one should be writing to a read-only device so it has a BUG_ON if a write arrives while the device is readonly. Around 3.0/3.1 people started reporting this bug being triggered at shutdown. Obviously a BUG at shutdown isn't a very big problem, but it is untidy, so once the problem was identified it needed to be fixed.

As switching to read-only was no longer an option, I decided to switch to immediate-safemode instead.

'safemode' is a precursor to write-intent bitmaps. It is like a WIB, but with only one bit. Before writing, md makes sure the "dirty" bit is set in the metadata. 200ms after the last write the bit is cleared in the metadata. immediate-safemode is the same without the 200ms delay. So as soon as there are no outstanding writes, the bit is cleared. This is very nearly as safe as switching to readonly mode, and if people are rebooting without calling 'sync' first, then they are deliberately giving up some safety and I don't feel the need to try any harder. If no writes come, immediate-safemode is just as good as read-only.

So the code was changed to set immediate-safemode instead of read-only, and this patch was marked for 'stable' kernels as it fixed a BUG_ON crash.

Unfortunately the patch was imperfect. Setting immediate-safemode involves clearing the bit and writing the metadata out immediately (if there are no outstanding writes). This is normally good ... unless the array is not actually active. If the array is not active, the code still writes out the metadata, but does this from its knowledge of the array, which is that the array is inactive. The result is that it corrupts the metadata exactly as described above.

The fix, once the problem was understood, was simple. Only set immediate-safemode if the array is active.





Comments...

Re: A Nasty md/raid bug (18 June 2012, 22:54 UTC)

In all Ubuntu releases, we do incremental assembly for all raids on boot via udev rules 85-mdadm.rules. Maybe that is the reason.

Re: A Nasty md/raid bug (26 June 2012, 23:29 UTC)

Any thoughts on sorting this out when the (Ubuntu) system in question has whole-disk encryption at boot stacked on top of the raid (and LVM on top of that)?

There are actually 2 partitions mirrored by raid in this case -- a small one with /boot installed (no encryption) and a larger one with /swap and an encrypted LVM volume containing / .

Assuming the system were affected, if one were to move mdadm as suggested prior to shutting down, it wouldn't be possible to move it back after boot without mounting the encrypted disk and LVM volume via busybox/cryptsetup or a rescue disk, but wouldn't these be on the inaccessible array and therefore... inaccessible?

Thanks for any thoughts you may be willing to share.

Re: A Nasty md/raid bug (27 June 2012, 06:26 UTC)

Your configuration of lvm over raid over crypt should not present any particular problem. For it to work at all, the initrd must contain mdadm and other tools needed to assemble and mount the root filesystem. Once it is assembled and mounted, you can move mdadm back.


Re: A Nasty md/raid bug (27 June 2012, 17:18 UTC)
Ahh, so one would be moving the copy of mdadm in /sbin that would have otherwise been called on shutdown, potentially activating the bug, but the mdadm in initrd/initramfs that allows the system to start up will still be present, allowing progress as normal.

Makes perfect sense, thanks for that -- puts me a little more at ease with the 3 boxes that uname -a tells me are running a kernel right in the "danger zone."

Thanks again.

Re: A Nasty md/raid bug (05 August 2012, 06:57 UTC)
Hi Neil!

I am experiencing perhaps a new, but similar rather nasty bug.

My goal: create a RAID 1 array between a partition on the hard drive and a ramdisk and mount this md array at /usr. Once populated, the vast majority of my programs will run out of RAM, and be very fast! I have 8 GB of RAM, so why not?


On startup, the partition, which is a linux_raid_member, is detected as such, and /dev/md127 is automatically created.

The problem: /dev/md127 is not usable. "sudo blkid /dev/md127" yields nothing.

Since this is RAID _one_ it should be able to function just fine with only one member, but it doesn't.

In fact, to get it going again, I have to do:

sudo mdadm --stop /dev/md127
sudo mdadm -A /dev/md127 -f /dev/sda2
[works at this point, now to add in the ramdisk:]
sudo mdadm /dev/md127 -a /dev/ram0


Then "sudo blkid /dev/md127" says that it's ext4 formatted, that it has a UUID, etc., and it is usable. After mounting it after fixing it, all my data is there.

Problem restated: /usr needs to be mounted so early in startup that there is no script that I can have automate the fixing of my /usr RAID device in time for it to be mounted when it needs to be.

I have a thread going on about this here: http://bbs.archbang.org/viewtopic.php?id=3179 .

I have even tried renaming /sbin/mdadm to /sbin/mdadm.moved and doing "sudo /sbin/mdadm.moved --stop --scan" before shutdown. After restart, the automatically generated md device for the found linux_raid_member partition is still unusable.

I have tried both kernel 3.3.4-1-ARCH with mdadm 3.2.3-2 and kernel 3.4.7-1 with mdadm 3.2.5-2.

Re: A Nasty md/raid bug (06 August 2012, 22:13 UTC)

Hi,

1/ If this actually improves your performance, then there is something seriously wrong with the Linux page cache. If you have lots of memory, then everything that is read into memory should stay there.

2/ It is very strange that stopping and restarting the array (which I think is what you are saying you are doing) makes any difference. Can you post the output of "mdadm -D /dev/md127" both when the array is not working and when the array is working again? Preferably post it to linux-raid@vger.kernel.org (you don't need to be subscribed).

Re: A Nasty md/raid bug (12 August 2012, 22:29 UTC)
Hi Neil!

Sorry for the delayed response. I kept checking on my iPhone for a response, and for some reason or lack thereof didn't see it. I figured that I'd have to wait a few days for my post to go through. I started checking over here everyday and never got a response: http://marc.info/?l=linux-raid&m=134415164808677&w=2 . I guess I neglected this blog. And I was out camping for a few days.

When I do "sudo /sbin/mdadm.moved -D /dev/md127" after startup (before it's working), it says:

mdadm: md device /dev/md127 does not appear to be active.

(I now moved /sbin/mdadm.moved back to /sbin/mdadm for ease of use.)

When I do "sudo mdadm --misc --query /dev/md127", I get:

/dev/md127: is an md device which is not active


I found out that I can do:

"sudo mdadm -R /dev/md127" and that starts it up just fine and everything is good again. (Don't have to stop it or anything.)

I can then do: sudo mdadm /dev/md127 -a /dev/ram0.

However, the problem still remains that it is not started on start-up, thus I cannot have /usr be such a raid device, which needs to be mounted very early in startup, before any scripts get ran. Therefore I can't just put "mdadm -R /dev/md127" in a script somewhere to fix the issue. I even tried the raid=autodetect option at the kernel parameters line (And yes, "mdadm_udev" was in the HOOKS line in mkinitcpio.conf and "md_mod" and "raid0" were in the MODULES line in mkinitcpio.conf when I built the initial ramdisk with mkinitcpio.) I also tried doing "md=1,/dev/sda2" at the kernel line, which was ignored, because /dev/md1 was not created. Instead, I got the regular /dev/md127 (which is not active). Also tried "md=1,/dev/sda2,missing". Same result. Also tried "md=1,/dev/sda2,/dev/ram0". Same result. Also tried "md=1,/dev/ram0,/dev/sda2". Same result. Also tried "md=1,missing,/dev/sda2". Same result. I also tried md-mod.start_dirty_degraded=1. Still wasn't active on startup.

After running the above two commands (before the kernel parameters paragraph) to get the raid device going, "sudo mdadm -D /dev/md127" gives:

/dev/md127:
        Version : 1.2
  Creation Time : Wed Aug 1 19:33:20 2012
     Raid Level : raid1
     Array Size : 3144692 (3.00 GiB 3.22 GB)
  Used Dev Size : 3144692 (3.00 GiB 3.22 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Sun Aug 12 13:39:56 2012
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 9% complete

           Name : archbang:0
           UUID : 31f0bda6:4cd69924:46a0e3b2:4f7e32ba
         Events : 166

    Number   Major   Minor   RaidDevice State
       2       1        0        0      spare rebuilding   /dev/ram0
       1       8        2        1      active sync writemostly   /dev/sda2



Also, when I first startup, if I do: sudo mdadm --manage /dev/md127 --remove /dev/sda2

I get: mdadm: cannot get array info for /dev/md127

And if I do: sudo mdadm --manage /dev/md* --remove /dev/sda2

I get: mdadm: Must give one of -a/-r/-f for subsequent devices at /dev/md127

What's going on with the above two command outputs?


I will put a copy of this reply here: http://marc.info/?l=linux-raid&m=134415164808677&w=2

For those of you looking at the linux-raid mailing list, I am coming from here: http://neil.brown.name/blog/20120615073245 .

The idea of RAID1ing with a ramdisk came from here:https://bbs.archlinux.org/viewtopic.php?pid=493773#p493773 .

Cheers, Jake

Re: A Nasty md/raid bug (12 August 2012, 22:34 UTC)

If you ask a question on the blog, you get a reply on the blog. If you ask a question on linux-raid you get a reply on linux-raid....

The array isn't starting at boot time because it is degraded. This suggests a problem with the boot scripts. I suspect your distro ('arch'?) is relying on "mdadm -I" being run from udev to start the arrays. This by itself isn't always enough. After all devices have been discovered you need "mdadm -Irs" or "mdadm --incremental --run --scan" to assemble any newly-degraded arrays.

So in some script before it tries to mount filesystems you want something like:

udevadm settle
mdadm --incremental --run --scan

This will have to go in a script in the initrd of course.


Re: A Nasty md/raid bug (12 August 2012, 23:57 UTC)
sudo mdadm --create /dev/answers --level=1 --raid-devices=2 /dev/blog,/dev/linux-raid@vger.kernel.org

Thanks for the responses Neil!

I did try the following kernel parameter:

md-mod.start_dirty_degraded=1

Does my kernel simply not honor that parameter? I'll try and look into my initial ramdisk and see some of these scripts you're talking about.

Ok, I'll post more after I look at what you talked about.

Jake

Comment (13 August 2012, 07:44 UTC)
Note: the ramdisk conversation has been moved to the mailing list: http://marc.info/?l=linux-raid&m=134484064215775&w=2 .

There is still another (the original) conversation about this here: http://bbs.archbang.org/viewtopic.php?id=3179 .

Re: A Nasty md/raid bug (13 August 2012, 07:50 UTC)
Oh, and SOOOOOLLVVED!!!! (My raiding with a ramdisk and mounting that for /usr, that is.) See the mailing list and ArchBang thread, which I give links to in a previous comment.

Cheers, Jake

Re: A Nasty md/raid bug (28 August 2012, 06:14 UTC)

> Worse - if you created the array with an older version of mdadm, then added a spare with a newer version then it is possible that different devices have different data offsets. If that seems to be the case it would be best to ask for help on linux-raid@vger.kernel.org as you'll need a special non-released version of mdadm to fix things up.

In fact, being able to specify the data offsets (at least at creation time) would be a very interesting feature when using metadata 1.1 or 1.2 with HD with specific alignment constraints. You can read more details in http://bugs.debian.org/614841

Can you consider adding such an option?

Regards, Vincent

Comment (27 September 2012, 23:54 UTC)
Just a note about creating a custom hook to automate fixing a system-critical md device, premount (because it gets corrupted on shutdown, and you can't keep it from being corrupted on shutdown, because you can't stop it before shutdown, because you can't unmount it before shutdown, because it's system-critical):

See my "P.P.S" on the 25th post here: http://bbs.archbang.org/viewtopic.php?pid=17316#p17316, or the linux-raid mailing list archive "thread" associated with all this: http://marc.info/?t=134415172800001&r=1&w=2.

Cheers, Jake

Re: A Nasty md/raid bug (09 January 2013, 09:24 UTC)

I haven't hit this bug but I've had to re-run mdadm --create on an existing array for another reason. What isn't pointed out above is that different versions of mdadm use different data_offsets.

My array is about 750GB and was initially created with 3.1.5, and that set data_offset to 2048 sectors (1MiB), but after I re-ran mdadm --create, data_offset was set to 134217728 (128MiB), a long way into the data, so it just looked like garbage and couldn't be used. Re-running --create with 3.1.5 set the data_offset back to 2048 and now I can access the data again.

I had to spend several hours last night debugging mdadm with gdb to figure all this out. I wonder how many times other people have been bitten by this and given up on the array as lost.

Re: A Nasty md/raid bug (14 January 2013, 21:06 UTC)
> What isn't pointed out above is that different versions of mdadm use different data_offsets.

Well, "data_offset" isn't mentioned, which I had searched for, but "data offset" is, sorry...
