Re: [GIT PULL rcu/next] rcu commits for 2.6.40

From: Frederic Weisbecker
Date: Mon May 16 2011 - 19:52:50 EST


On Mon, May 16, 2011 at 02:24:49PM -0700, Paul E. McKenney wrote:
> On Mon, May 16, 2011 at 02:23:29PM +0200, Ingo Molnar wrote:
> >
> > * Ingo Molnar <mingo@xxxxxxx> wrote:
> >
> > > > In the meantime, would you be willing to try out the patch at
> > > > https://lkml.org/lkml/2011/5/14/89? This patch helped out Yinghai in
> > > > several configurations.
> > >
> > > Wasn't this the one i tested - or is it a new iteration?
> > >
> > > I'll try it in any case.
> >
> > oh, this was a new iteration, mea culpa!
> >
> > And yes, it solves all problems for me as well. Mind pushing it as a fix? :-)
>
> ;-)
>
> Unfortunately, the only reason I can see that it works is (1) there
> is some obscure bug in my code or (2) someone somewhere is failing to
> call irq_exit() on some interrupt-exit path. Much as I might be tempted
> to paper this one over, I believe that we do need to find whatever the
> underlying bug is.
>
> Oh, yes, there is option (3) as well: maybe if an interrupt deschedules
> a process, the final irq_exit() is omitted in favor of rcu_enter_nohz()?
> But I couldn't see any evidence of this in my admittedly cursory scan
> of the x86 interrupt-handling code.
>
> So until I learn differently, I am assuming that each and every
> irq_enter() has a matching call to irq_exit(), and that rcu_enter_nohz()
> is called after the final irq_exit() of a given burst of interrupts.
>
> If my assumptions are mistaken, please do let me know!

About 2), I believe that such a mismatch would have been detected before
your whole patchset was merged.
For example, if an interrupt failed to call rcu_irq_exit(), we would have
hit cases like this:

rcu_enter_nohz()
<irq>
rcu_irq_enter()
</irq>
rcu_exit_nohz()

And then that last call would trigger "WARN_ON_ONCE(!(rdtp->dynticks & 0x1))".
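
The sequence above can be replayed in a small user-space toy model. This is
only a sketch: the model_*() names, the nesting counter and its initial
values are my assumptions for illustration, not the actual kernel code. The
only parts taken from the discussion are that rdtp->dynticks is odd outside
of the extended quiescent state and that rcu_exit_nohz() warns when it finds
the counter even.

```c
#include <assert.h>

/* Toy per-CPU state. Odd dynticks: RCU is watching this CPU.
 * Even dynticks: the CPU is in an extended quiescent state. */
struct dynticks_model {
	unsigned long dynticks;
	long nesting;   /* assumed nesting counter, hypothetical */
	int warnings;   /* number of WARN_ON_ONCE-style checks that fired */
};

void model_init(struct dynticks_model *m)
{
	m->dynticks = 1;   /* start out non-idle, hence odd */
	m->nesting = 1;
	m->warnings = 0;
}

void model_warn_on(struct dynticks_model *m, int cond)
{
	if (cond)
		m->warnings++;
}

void model_enter_nohz(struct dynticks_model *m)
{
	m->dynticks++;
	m->nesting--;
	model_warn_on(m, m->dynticks & 0x1);      /* must now be even */
}

void model_exit_nohz(struct dynticks_model *m)
{
	m->dynticks++;
	m->nesting++;
	model_warn_on(m, !(m->dynticks & 0x1));   /* must now be odd */
}

void model_irq_enter(struct dynticks_model *m)
{
	if (m->nesting++)
		return;                            /* was already non-idle */
	m->dynticks++;
	model_warn_on(m, !(m->dynticks & 0x1));   /* must now be odd */
}

void model_irq_exit(struct dynticks_model *m)
{
	assert(m->nesting > 0);
	if (--m->nesting)
		return;                            /* still nested */
	m->dynticks++;
	model_warn_on(m, m->dynticks & 0x1);      /* must now be even */
}

/* Replay the sequence; returns how many warnings fired. */
int replay_sequence(int irq_calls_exit)
{
	struct dynticks_model m;

	model_init(&m);
	model_enter_nohz(&m);
	model_irq_enter(&m);
	if (irq_calls_exit)
		model_irq_exit(&m);
	model_exit_nohz(&m);
	return m.warnings;
}
```

In this model replay_sequence(1) returns 0, while replay_sequence(0), the
case with the missing exit call, returns 1: the counter is even after the
final increment, so the check in the exit-nohz step fires.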

But maybe there was a patch in your set that touched one of these
rcu_irq_... call sites.

About 3), it shouldn't happen because preempt_schedule_irq() is called in the
exit path of the low level interrupt handler. rcu_irq_exit() is called from
the higher level, before returning to the low level.

That said, there might be something nasty that the old checks in the QS APIs
were missing.

I think it would be nice to add some checks to rcu-lockdep inside
rcu_read_lock()/rcu_dereference() to ensure that rdtp->dynticks is not even,
i.e. that we are not in an extended quiescent state. That's something I
planned to add in the next version of my nohz task patchset, because it does
more juggling with the extended quiescent state, but given the problems we
are facing today, it may be better to add it sooner.
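
To make the idea a bit more concrete, here is a rough user-space sketch of
such a check. Everything in it is hypothetical (the eqs_model struct, the
model_*() helpers and the splat counter); it only illustrates the rule that
a read-side primitive must never run while the dynticks counter is even.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU dynticks counter. */
struct eqs_model {
	unsigned long dynticks;   /* even: extended QS, odd: RCU watching */
	int lockdep_splats;       /* illegal read-side uses caught so far */
};

/* True when the CPU is in an extended quiescent state. */
bool model_in_extended_qs(const struct eqs_model *m)
{
	return !(m->dynticks & 0x1);
}

/* Hypothetical debug hook that the read-side primitives would call. */
void model_check_not_extended_qs(struct eqs_model *m)
{
	if (model_in_extended_qs(m))
		m->lockdep_splats++;
}

void model_rcu_read_lock(struct eqs_model *m)
{
	model_check_not_extended_qs(m);
	/* a real primitive would enter the read-side critical section here */
}

/* One reader while RCU is watching, one inside an extended QS, and one
 * after leaving it again. Returns the number of splats. */
int demo_read_side_checks(void)
{
	struct eqs_model m = { .dynticks = 1, .lockdep_splats = 0 };

	model_rcu_read_lock(&m);   /* odd: fine */
	m.dynticks++;              /* models entering nohz: now even */
	model_rcu_read_lock(&m);   /* even: caught by the check */
	m.dynticks++;              /* models leaving nohz: odd again */
	model_rcu_read_lock(&m);   /* odd: fine */
	return m.lockdep_splats;
}
```

Here demo_read_side_checks() returns 1: only the reader that runs while the
counter is even gets flagged.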