Tedious bugs

06 July 2012, 12:29 UTC

Some bugs can be educational, but others are just tedious. I guess one still learns, but there must be a better way.

One of the many issues with my Open Phoenux GTA04 is that it hangs occasionally. Note that I'm not complaining about there being issues. I deliberately got this phone because I wanted to experiment and explore and if it all worked perfectly, where would the fun be? But there are issues and some are fun to fix. Others..

So anyway, it hangs. The symptom is that sometimes I ask it to wake up from suspend and it doesn't. The console is completely silent (no sysrq even). This seemed to correlate with someone trying to call or TXT me, and the message not getting through. So my initial assumption was that it would hang on resume. But only occasionally.

So to find out more I installed a kernel with debugging enabled (this is a 3.2 based kernel, it currently seems most stable), connected a serial console, (via a 3.5mm phono jack that I manage to wire up) and programmed the phone to set the RTC alarm every minute (fortunately I got the RTC alarm working!) so it would suspend and resume every 60 seconds. Then let it go.

Within about 30 minutes it would be very likely to hang.... providing the USB cable wasn't plugged in. Even after I discovered this requirement I kept leaving the USB cable in after copying a new kernel across, and so wasted lots of testing time - I would leave it for 30 minutes with the cable in, and then realise that it wasn't going to hang like that. Like I said: tedious.

The first hang showed that it was just after "PM: early resume of devices complete" so I added some more tracing and found that it was in resume_device_irqs(). Then more and it was just as interrupt 92 - musb-hdrc - was enabled. There was a pending interrupt for this device and something got confused in there. A few more cycles (about an hour each!) found the bug for me.

omap2430_musb_set_vbus in omap2430.c contains:

while (musb_readb(musb->mregs, MUSB_DEVCTL) & 0x80) { cpu_relax(); if (time_after(jiffies, timeout)) { dev_err(musb->controller, "configured as A device timeout"); ret = -EINVAL; break; } }

where 'timeout' is set one second in the future. Now looping for up to one second in code that can be called from an interrupt handler (as this is called from musb_stage0_irq), is pretty poor form to start with, but it is worse than that.

When resume_device_irqs() calls __enable_irq() which checks for pending interrupts and calls the handler, it does this with all interrupts disabled (and having interrupts disabled in an interrupt handler is pretty common, so this should not be a problem). With interrupts disable, the timer doesn't tick, so jiffies doesn't ever change. You can see where this is going, can't you.

If that interrupt is pending on resume, we get to this code and don't just spin for 1 second, we spin forever - with interrupts disabled. This exactly matches what I saw.

Working around this should be easy - just put in a maxiumum loop count ... I'll get to that in a minute. But why is there an interrupt at all?

Well the interrupt handler - generic_interrupt() in musb_core.c - reads the interrupt status registers and the only bit that is set is MUSB_INTR_SESSRQ in int_usb. i.e. There is a request to start a new session.

I *think* this means that the ID pin in the USB port was grounded. A USB-OTG cable will do this to request that the USB-OTG port switch to host mode and start talking to a gadget. This is what the code seems to be doing. In the first instance it turns on VBUS (the 5V power supply for the USB). So it all seems consistent.

However I never plugged anything into the USB port, and definitely didn't ground the ID pin. Remember that when I did have the USB port plugged in the problem didn't happen. Maybe that is related.

My guess is that some electrical noise either during suspend or resume triggers the interrupt. Maybe there really should be a capacitor on that 'ID' pin - it wouldn't be the first time that a missing capacitor had caused problems for Openmoko devices. I don't know how to test this, or whether to care if I can just make it work.

But it seems I cannot, at least not yet.

I imposed a 1000-loops count on that loop and it got further, but not much. It took about 5 second (so I can make the max a lot smaller), the interrupt handler gave up, and we moved on to resume all devices (having completed early-resume). It got up to the MMC drivers (hsmmc), stopped for 180 seconds, then the soft-lockup timer triggered. So not it is not hanging with interrupts disabled, so it might be easier to debug, but it is still hanging.

The stack trace shows

[ 4758.457092] mmcqd/0 D c0449efc 5644 59 2 0x00000000 [ 4758.463806] [<c0449efc>] (__schedule+0x57c/0x608) from [<c044a3e0>] (schedule_timeout+0x1c/0x1d0) [ 4758.473052] [<c044a3e0>] (schedule_timeout+0x1c/0x1d0) from [<c044985c>] (wait_for_common+0xd8/0x150) [ 4758.482666] [<c044985c>] (wait_for_common+0xd8/0x150) from [<c02f4408>] (mmc_wait_for_req_done+0x24/0xa4) [ 4758.492645] [<c02f4408>] (mmc_wait_for_req_done+0x24/0xa4) from [<c02f4bc0>] (mmc_start_req+0x50/0x144) [ 4758.502410] [<c02f4bc0>] (mmc_start_req+0x50/0x144) from [<c02fe460>] (mmc_blk_issue_rw_rq+0x78/0x4dc) [ 4758.512115] [<c02fe460>] (mmc_blk_issue_rw_rq+0x78/0x4dc) from [<c02fecc8>] (mmc_blk_issue_rq+0x404/0x434) [ 4758.522155] [<c02fecc8>] (mmc_blk_issue_rq+0x404/0x434) from [<c02ff824>] (mmc_queue_thread+0x98/0x100) [ 4758.531951] [<c02ff824>] (mmc_queue_thread+0x98/0x100) from [<c0055060>] (kthread+0x80/0x88) [ 4758.540771] [<c0055060>] (kthread+0x80/0x88) from [<c000f30c>] (kernel_thread_exit+0x0/0x8)

which suggests that a block io request to the mmc is being retried immediately on resume, and not getting anywhere. Now it could be that this is completely unrelated to the USB and I should run more tests, but not today.

The suspend/resume thread is :

[ 4758.669219] susman D c0449efc 5060 1423 1420 0x00000000 [ 4758.675872] [<c0449efc>] (__schedule+0x57c/0x608) from [<c02f36c8>] (__mmc_claim_host+0xb8/0x154) [ 4758.685119] [<c02f36c8>] (__mmc_claim_host+0xb8/0x154) from [<c02f8edc>] (mmc_sd_resume+0x34/0x5c) [ 4758.694458] [<c02f8edc>] (mmc_sd_resume+0x34/0x5c) from [<c02f2b20>] (mmc_resume_host+0xc8/0x15c) [ 4758.703704] [<c02f2b20>] (mmc_resume_host+0xc8/0x15c) from [<c0307874>] (omap_hsmmc_resume+0xa0/0xe4) [ 4758.713317] [<c0307874>] (omap_hsmmc_resume+0xa0/0xe4) from [<c024e760>] (platform_pm_resume+0x44/0x54) [ 4758.723114] [<c024e760>] (platform_pm_resume+0x44/0x54) from [<c0252bcc>] (pm_op+0x6c/0xb8) [ 4758.731811] [<c0252bcc>] (pm_op+0x6c/0xb8) from [<c0253778>] (device_resume+0x190/0x228) [ 4758.740234] [<c0253778>] (device_resume+0x190/0x228) from [<c025391c>] (dpm_resume+0x10c/0x244) [ 4758.749298] [<c025391c>] (dpm_resume+0x10c/0x244) from [<c0253a60>] (dpm_resume_end+0xc/0x18) [ 4758.758178] [<c0253a60>] (dpm_resume_end+0xc/0x18) from [<c00768cc>] (suspend_devices_and_enter+0x1d0/0x22c) [ 4758.768432] [<c00768cc>] (suspend_devices_and_enter+0x1d0/0x22c) from [<c0076a54>] (enter_state+0x12c/0x18c) [ 4758.778656] [<c0076a54>] (enter_state+0x12c/0x18c) from [<c0075884>] (state_store+0x94/0x118) [ 4758.787536] [<c0075884>] (state_store+0x94/0x118) from [<c01d4d8c>] (kobj_attr_store+0x1c/0x24) [ 4758.796600] [<c01d4d8c>] (kobj_attr_store+0x1c/0x24) from [<c01172e4>] (sysfs_write_file+0x108/0x13c) [ 4758.806213] [<c01172e4>] (sysfs_write_file+0x108/0x13c) from [<c00bfc5c>] (vfs_write+0xac/0x180) [ 4758.815368] [<c00bfc5c>] (vfs_write+0xac/0x180) from [<c00bfde8>] (sys_write+0x40/0x6c) [ 4758.823699] [<c00bfde8>] (sys_write+0x40/0x6c) from [<c000e9c0>] (ret_fast_syscall+0x0/0x3c)

So it looks like it is waiting for the mmc device, which itself is hanging. My feeling here is that this is a very different problem. The block queue should block everything on suspend so no requests should be pending at this point. I guess I'm going to have to look into that some more.

So I've probably learned a bit, but really not much. Mostly just some very silly code in an interrupt handler, which took hours to find. Hopefully examining the block-device issue will be more fun.


UPDATE:

The MMC bug was fairly easy to find - the 'suspend' callback was simply not being called. This was easily fixed by applying commit 32d317c60e56c2a34463b51fc0336cc96b3e1735 from Linux-3.4.

So now my GTA04 doesn't crash on resume any more. Hurray.



Comments...

Re: Tedious bugs (27 August 2012, 10:55 UTC)

I have a little theory about this and some other bugs found in electronic devices. This is from years of seeing strange coding with no comments and just puzzling it out. I think that a lot of the "best practice" learning that goes on happens in forums and dev mailing lists that are english only. For devices created (or electronic components therein) in places like Taiwan and China, they maybe learn code a little different from us westerners.

I say this after years of seeing bugs and thinking "Why's he/she doing that? wtf?" and in my old age (40!) realizing that maybe people who don't speak English just don't grow up / learn the same way we do and with our conventions.

I say this as no disrespect whatever to our non-english (speaking) friends. If I am even partially correct then it just makes the case for finding better ways for expressing our best practices in programming. It is even more important with the explosion in open source that is going on right now.

Thanks for your blog Neil - I found it by accident during this 5 day marathon at finding a cure for a raid problem in Ubuntu and getting to use mdadm ("M, Dad, M!" is how I say it ;))

Kevin McCaughey Northern Ireland kevinmccaughey.org

[permalink][hide]




[æ]