Re: [GIT PULL rcu/next] fixes and breakup of memory-barrier-decrease patch

From: Paul E. McKenney
Date: Sun May 22 2011 - 12:17:45 EST


On Sun, May 22, 2011 at 11:04:40AM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>
> > > I mean, without Frederic's patch we are getting very long hangs due to the
> > > barrier patch, right?
> >
> > Yes. The reason we are seeing these hangs is that HARDIRQ_ENTER() invoked
> > irq_enter(), which calls rcu_irq_enter(), but the matching HARDIRQ_EXIT()
> > invoked __irq_exit(), which does not call rcu_irq_exit(). This resulted in
> > calls to rcu_irq_enter() that were not balanced by matching calls to
> > rcu_irq_exit(). Therefore, after these tests completed, RCU's dyntick-idle
> > nesting count was a large number, which caused RCU to conclude that the
> > affected CPU was not in dyntick-idle mode when in fact it was.
> >
> > RCU would therefore incorrectly wait for this dyntick-idle CPU.
> >
> > With Frederic's patch, these tests don't ever call either rcu_irq_enter() or
> > rcu_irq_exit(), which works because the CPU running the test is already
> > marked as not being in dyntick-idle mode.
> >
> > So, with Frederic's patch, the rcu_irq_enter() and rcu_irq_exit() calls are
> > balanced and things work.
> >
> > The reason that the imbalance was not noticed before the barrier patch was
> > applied is that the old implementation of rcu_enter_nohz() ignored the
> > nesting depth. This could still result in delays, but much shorter ones.
> > Whenever there was a delay, RCU would IPI the CPU with the unbalanced nesting
> > level, which would eventually result in rcu_enter_nohz() being called, which
> > in turn would force RCU to see that the CPU was in dyntick-idle mode.
> >
> > Hmmm... I should add this line of reasoning to one of the commit logs,
> > shouldn't I? (Added it. Which of course invalidates my pull request.)
>
> Well, the thing i was missing from the tree was Frederic's fix patch. Or was
> that included in one of the commits?

Ah! I don't see any evidence of anyone else having taken it, so I just
now queued it.

> I mean, if we just revert the revert, we reintroduce the delay, no matter who
> is to blame - not good! :-)

Good point! ;-)

> > > Even if the barrier patch is not to blame - somehow it still managed to
> > > produce these hangs - and we do not understand it yet.
> >
> > From Yinghai's message https://lkml.org/lkml/2011/5/12/465, I believe
> > that the residual delay he is seeing is not due to the barrier patch,
> > but rather due to a26ac2455 (move TREE_RCU from softirq to kthread).
> >
> > More on this below.
>
> Ok - we can treat that regression differently. Also, that seems like a much
> shorter delay, correct? The delays fixed by Frederic's patch were huge (i think
> i saw a 1 hour delay once) - they were essentially not delays but hangs.

Yes, the delays fixed by Frederic's patch were hours in length, while
the remaining delays measure in seconds. And I am looking at the code
and at how grace-period duration has varied, so hope to get to the
bottom of it in a few days. Hopefully sooner. ;-)

Thanx, Paul
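
For readers following the thread, the nesting imbalance described in the
quoted text above can be illustrated with a toy user-space model. This is
only a sketch under stated assumptions, not the kernel code: the struct,
the model_*() functions and idle_after_test() are hypothetical names made
up for illustration, and the starting count of 1 for a non-idle CPU is a
modeling choice. It shows why unmatched "enter" calls leave a nonzero
count that a nesting-aware idle entry never clears, while an idle entry
that ignores the depth still ends up marking the CPU idle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one CPU's state as RCU would see it. */
struct cpu_model {
	long nesting;    /* outstanding reasons the CPU is considered non-idle */
	bool seen_idle;  /* does the model treat the CPU as dyntick-idle? */
};

/* Assumption: a CPU running a test starts non-idle with a count of 1. */
static void model_init(struct cpu_model *c)
{
	c->nesting = 1;
	c->seen_idle = false;
}

/* Stands in for an irq entry that is reported to RCU. */
static void model_irq_enter(struct cpu_model *c)
{
	c->nesting++;
	c->seen_idle = false;
}

/* Stands in for an irq exit that is reported to RCU. */
static void model_irq_exit(struct cpu_model *c)
{
	assert(c->nesting > 0);
	if (--c->nesting == 0)
		c->seen_idle = true;
}

/* Idle entry that honors the nesting depth. */
static void model_enter_idle_counted(struct cpu_model *c)
{
	assert(c->nesting > 0);
	if (--c->nesting == 0)
		c->seen_idle = true;
}

/* Idle entry that ignores the nesting depth entirely. */
static void model_enter_idle_uncounted(struct cpu_model *c)
{
	c->seen_idle = true;
}

/*
 * Run `enters` reported irq entries and `exits` reported irq exits
 * (exits must not exceed enters), then go idle. Returns whether the
 * model ends up treating the CPU as idle.
 */
static bool idle_after_test(int enters, int exits, bool honor_nesting)
{
	struct cpu_model c;
	int i;

	model_init(&c);
	for (i = 0; i < enters; i++)
		model_irq_enter(&c);
	for (i = 0; i < exits; i++)
		model_irq_exit(&c);
	if (honor_nesting)
		model_enter_idle_counted(&c);
	else
		model_enter_idle_uncounted(&c);
	return c.seen_idle;
}
```

In this model, idle_after_test(1000, 0, true) corresponds to the
unbalanced case (the CPU is never seen as idle), idle_after_test(1000, 0,
false) to an idle entry that ignores the depth, and idle_after_test(0, 0,
true) to a test that makes neither call.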