On Wed, Jul 13, 2011 at 10:29 AM, Ben Greear <greearb@xxxxxxxxxxxxxxx> wrote:
> This is on the same nfs testing machine I've been posting about. This
> has some additional nfs patches included, running tests to mount, do io,
> unmount over and over again. Seems that the NFS bugs might be finally
> fixed, but system is still unstable in general when under load.
>
> This info was printed after several other warnings that I previously
> posted to lkml.
>
> This one appears to lock up the machine pretty badly though...can't ssh
> into it anymore, and similar messages keep spewing every few minutes.
>
> I *think* the BUG at the end of this email is the important part, but
> maybe it's just a symptom of something else...

Huh. So does this trigger frequently, or was this just a one time
thing? I suspect the latter.

From the looks of it, there's the btserver process (on cpu4) which,
during exit, is caught up spinning trying to get the hrtimer base lock
from hrtimer_cancel() in rtc_irq_set_state() when cleaning up from
rtc_device_release().

Meanwhile, on cpu0, an rtc periodic timer has fired and we're stuck in
rtc_handle_legacy_irq(), likely waiting for the irq_task_lock held by
cpu4 in rtc_irq_set_state().
The rest of the cpus are idle, with the exception of the one that
detected the stall from the normal timer tick.

Hrmm.. It sounds like a circular lock dependency between the
rtc->irq_task_lock and the hrtimer base lock:

  rtc_irq_set_state: grab irq_task_lock -> call hrtimer_cancel ->
                     grab hrtimer_base_lock
  IRQ: grab hrtimer_base_lock -> run timers -> rtc_handle_legacy_irq ->
       grab irq_task_lock
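
To make the suspected ordering concrete, here is a minimal user-space sketch
(not kernel code) that records the order in which each path takes the two
locks and checks whether they are ever taken in both orders. The lock ids,
struct names and helper functions are all hypothetical and exist only for
this illustration; it models the two sequences exactly as written above,
including the assumption that the base lock is still held when the handler
runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical ids standing in for the two locks named in this thread. */
enum lock_id { IRQ_TASK_LOCK, HRTIMER_BASE_LOCK, NR_LOCKS };

/* after[a][b] is true if some path acquired b while already holding a. */
struct order_graph {
	bool after[NR_LOCKS][NR_LOCKS];
};

/* The set of locks one code path currently holds. */
struct path {
	bool held[NR_LOCKS];
};

void graph_init(struct order_graph *g) { memset(g, 0, sizeof(*g)); }
void path_init(struct path *p) { memset(p, 0, sizeof(*p)); }

/* Record "held -> l" ordering edges, then mark l as held by this path. */
void path_acquire(struct order_graph *g, struct path *p, enum lock_id l)
{
	assert(l < NR_LOCKS);
	for (int h = 0; h < NR_LOCKS; h++)
		if (p->held[h] && h != (int)l)
			g->after[h][l] = true;
	p->held[l] = true;
}

void path_release(struct path *p, enum lock_id l)
{
	assert(l < NR_LOCKS);
	p->held[l] = false;
}

/* True if any pair of locks was taken in both orders (an ABBA pattern). */
bool graph_has_inversion(const struct order_graph *g)
{
	for (int a = 0; a < NR_LOCKS; a++)
		for (int b = 0; b < NR_LOCKS; b++)
			if (a != b && g->after[a][b] && g->after[b][a])
				return true;
	return false;
}

/* Replays the two sequences from the message and reports an inversion. */
bool suspected_ordering_inverts(void)
{
	struct order_graph g;
	struct path set_state, irq;

	graph_init(&g);
	path_init(&set_state);
	path_init(&irq);

	/* rtc_irq_set_state path: irq_task_lock, then the hrtimer base lock. */
	path_acquire(&g, &set_state, IRQ_TASK_LOCK);
	path_acquire(&g, &set_state, HRTIMER_BASE_LOCK);

	/* IRQ path as suspected: base lock held while taking irq_task_lock. */
	path_acquire(&g, &irq, HRTIMER_BASE_LOCK);
	path_acquire(&g, &irq, IRQ_TASK_LOCK);

	return graph_has_inversion(&g);
}
```

Calling suspected_ordering_inverts() returns true for these two sequences,
which is just the ABBA pattern restated. Whether the real IRQ path actually
holds the base lock at that point is a separate question.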
But looking at __run_hrtimer(), the base lock should be released
before the timer is run.
So I'm not really sure what would be gumming up things here.
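
The point about __run_hrtimer() can be sketched the same way. This is a toy
user-space model, not the kernel implementation: the "lock" is a plain flag,
and every name here (toy_base, toy_timer, toy_run_timer and so on) is made up
for the illustration. It only shows the pattern of dropping the base lock
around the callback, so the handler does not run with the base lock held.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a timer base; "locked" models the base lock state. */
struct toy_base {
	bool locked;
};

struct toy_timer {
	struct toy_base *base;
	void (*fn)(struct toy_timer *t);
	bool base_locked_in_handler;	/* what the handler observed */
};

void toy_lock(struct toy_base *b) { b->locked = true; }
void toy_unlock(struct toy_base *b) { b->locked = false; }

/* Hypothetical handler: just records whether the base lock is held. */
void toy_handler(struct toy_timer *t)
{
	t->base_locked_in_handler = t->base->locked;
}

/* Called with the base lock held; returns with it held again. */
void toy_run_timer(struct toy_timer *t)
{
	toy_unlock(t->base);	/* drop the base lock before the callback */
	t->fn(t);		/* handler runs without the base lock */
	toy_lock(t->base);	/* re-take it before returning */
}

/* Runs one timer and reports whether the handler saw the base locked. */
bool handler_sees_base_locked(void)
{
	struct toy_base base = { .locked = false };
	struct toy_timer t = {
		.base = &base,
		.fn = toy_handler,
		.base_locked_in_handler = true,
	};
	bool seen;

	toy_lock(&base);
	toy_run_timer(&t);
	seen = t.base_locked_in_handler;
	toy_unlock(&base);
	return seen;
}
```

In this model handler_sees_base_locked() returns false, so a handler that
takes irq_task_lock would not be nesting it inside the base lock.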

Thomas: Any thoughts? There shouldn't be an issue calling
hrtimer_cancel or other hrtimer operations from an hrtimer handler,
right?

thanks
-john