Re: BUG spinlock lockup, rtc related, 3.0-rc7+

From: Ben Greear
Date: Tue Jul 19 2011 - 18:24:58 EST

Next message: Rafał Bilski: "[PATCH] pata_via: Add SATA registers for VX800 SATA/PATA controller"
Previous message: Terry Loftin: "Re: [PATCH 2/2] sched: Fix "divide error: 0000" in find_busiest_group"
In reply to: john stultz: "Re: BUG spinlock lockup, rtc related, 3.0-rc7+"

On 07/19/2011 03:17 PM, john stultz wrote:

On Wed, Jul 13, 2011 at 10:29 AM, Ben Greear<greearb@xxxxxxxxxxxxxxx> wrote:
This is on the same NFS testing machine I've been posting about. It has
some additional NFS patches included and is running tests that mount, do
I/O, and unmount over and over again. It seems the NFS bugs might finally
be fixed, but the system is still unstable in general when under load.

This info was printed after several other warnings that I previously posted
to lkml.

This one appears to lock up the machine pretty badly though...can't ssh into
it anymore, and similar messages keep spewing every few minutes.

I *think* the BUG at the end of this email is the important part, but
maybe it's just a symptom of something else...

Huh. So does this trigger frequently, or was this just a one time
thing? I suspect the latter.

It seems I have been hitting a lot of rcu-boost locking issues
on this system with my nfs mount/unmount testing.

The system was having various lockups and bugs, but I don't think
I saw this particular one more than once or perhaps twice.

I plan to run some more tests with the rcu-boost locking fixes
applied to the kernel shortly.

At the time I reported this, I wasn't aware of the rcu boost bugs,
but perhaps that is the root cause here as well...I don't know enough
about the code in question to make an educated guess.

From the looks of it, there's the btserver process (on cpu4) which
during exit is caught up spinning trying to get the hrtimer base lock
from hrtimer_cancel() in rtc_irq_set_state() when cleaning up from
rtc_device_release().

Meanwhile, on cpu0, an rtc periodic timer has fired and we're stuck in
rtc_handle_legacy_irq(), likely waiting for the irq_task_lock held by
cpu4 in rtc_irq_set_state().

The rest of the cpus are idle, with the exception of the one that
detected the stall from the normal timer tick.

Hrmm.. It sounds like a circular lock between the rtc->irq_task_lock
and the hrtimer base lock.

rtc_irq_set_state: Grab irq_task_lock -> call hrtimer_cancel -> grab
hrtimer_base_lock

IRQ: grab hrtimer_base_lock -> run timers -> rtc_handle_legacy_irq ->
grab irq_task_lock

But looking at __run_hrtimer(), the base lock should be released
before the timer is run.
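
The lock-ordering argument above can be checked with a small user-space model. This is only a sketch, not kernel code: the lock ids, the order table, and every model_* function are hypothetical names made up for illustration. It records which lock is taken while another is held, and reports an inversion only when both orders have been seen.

```c
#include <stdbool.h>

/* Hypothetical lock ids for this sketch; not kernel identifiers. */
enum { IRQ_TASK_LOCK, HRTIMER_BASE_LOCK, NR_LOCKS };

/* order[a][b] is true once lock b has been taken while a was held. */
static bool order[NR_LOCKS][NR_LOCKS];
static bool held[NR_LOCKS];

static void reset_model(void)
{
	for (int a = 0; a < NR_LOCKS; a++) {
		held[a] = false;
		for (int b = 0; b < NR_LOCKS; b++)
			order[a][b] = false;
	}
}

/* Record "every currently held lock -> id", then mark id as held. */
static void model_lock(int id)
{
	for (int a = 0; a < NR_LOCKS; a++)
		if (held[a])
			order[a][id] = true;
	held[id] = true;
}

static void model_unlock(int id)
{
	held[id] = false;
}

/* Two locks invert if each was ever taken while the other was held. */
static bool has_inversion(int a, int b)
{
	return order[a][b] && order[b][a];
}

/* Process-context path: rtc_irq_set_state() -> hrtimer_cancel(). */
static void model_set_state_path(void)
{
	model_lock(IRQ_TASK_LOCK);
	model_lock(HRTIMER_BASE_LOCK);
	model_unlock(HRTIMER_BASE_LOCK);
	model_unlock(IRQ_TASK_LOCK);
}

/* Timer path; drop_base_lock selects whether the callback runs unlocked. */
static void model_timer_path(bool drop_base_lock)
{
	model_lock(HRTIMER_BASE_LOCK);
	if (drop_base_lock)
		model_unlock(HRTIMER_BASE_LOCK);
	model_lock(IRQ_TASK_LOCK);	/* stands in for rtc_handle_legacy_irq() */
	model_unlock(IRQ_TASK_LOCK);
	if (drop_base_lock)
		model_lock(HRTIMER_BASE_LOCK);	/* retaken after the callback */
	model_unlock(HRTIMER_BASE_LOCK);
}

/* Run both paths once and report whether the two locks invert. */
static bool scenario_inverts(bool drop_base_lock)
{
	reset_model();
	model_set_state_path();
	model_timer_path(drop_base_lock);
	return has_inversion(IRQ_TASK_LOCK, HRTIMER_BASE_LOCK);
}
```

In this model, scenario_inverts(false) reports an inversion and scenario_inverts(true) does not, which matches the reasoning that dropping the base lock before the callback rules out a plain AB/BA cycle between these two locks. It says nothing about other ways the two paths could wait on each other.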

So I'm not really sure what would be gumming up things here.

Thomas: Any thoughts? There shouldn't be an issue calling
hrtimer_cancel or other hrtimer operations from an hrtimer handler,
right?
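
One detail that may bear on the question: as far as I understand it, hrtimer_cancel() keeps retrying hrtimer_try_to_cancel() for as long as that returns -1, meaning the timer's callback is currently running. The user-space model below is only a sketch of that assumed behaviour. struct model_timer, model_try_to_cancel() and model_cancel() are hypothetical names, and the spin budget exists only so the example terminates.

```c
#include <stdbool.h>

/* Hypothetical, simplified timer state; not the kernel's struct hrtimer. */
struct model_timer {
	bool enqueued;		/* timer is queued and waiting to fire */
	bool callback_running;	/* handler is executing right now */
};

/*
 * Modelled on the return values of hrtimer_try_to_cancel():
 *  1 if the timer was active and got removed,
 *  0 if the timer was not active,
 * -1 if the callback is currently running and cannot be stopped.
 */
static int model_try_to_cancel(struct model_timer *t)
{
	if (t->callback_running)
		return -1;
	if (t->enqueued) {
		t->enqueued = false;
		return 1;
	}
	return 0;
}

/*
 * Cancel loop with a spin budget (an assumption of this sketch).
 * Returns the try_to_cancel result, or -1 if the budget ran out,
 * which stands in for a caller that would otherwise spin forever.
 */
static int model_cancel(struct model_timer *t, int max_spins)
{
	for (int i = 0; i < max_spins; i++) {
		int ret = model_try_to_cancel(t);
		if (ret >= 0)
			return ret;
	}
	return -1;
}
```

If that reading is right, a caller that holds a lock the running callback is waiting for would never see the loop finish, because the callback cannot complete until the lock is released.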

thanks
-john

--
Ben Greear <greearb@xxxxxxxxxxxxxxx>
Candela Technologies Inc http://www.candelatech.com

