Re: [RFC][PATCH 12/13] stop_machine: Remove lglock

From: Paul E. McKenney
Date: Thu Jun 25 2015 - 10:52:04 EST


On Thu, Jun 25, 2015 at 04:20:11PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 25, 2015 at 06:47:55AM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 25, 2015 at 01:07:34PM +0200, Peter Zijlstra wrote:
> > > I'm still somewhat confused by the whole strict-order sequence vs this
> > > non-ordered 'polling' of global state.
> > >
> > > This funnel thing basically waits random times depending on the
> > > contention of these mutexes and tries again. Ultimately serializing on
> > > the root funnel thing.
> >
> > Not random at all!
>
> No, they are random per definition: it depends on the amount of
> contention, and since that's random, the rest is too.

Not sure how to parse this one. ;-)

> > The whole funnel is controlled by the root ->exp_funnel_mutex holder,
> > who is going to hold the lock for a single expedited grace period, then
> > release it. This means that any time a task acquires a lock, there is
> > very likely to have been a recent state change. Hence the checks after
> > each lock acquisition.
> >
> > So in the heavy-use case, what tends to happen is that there are one
> > or two expedited grace periods, and then the entire queue of waiters
> > acquiring ->exp_funnel_mutex simply evaporates -- they can make use of
> > the expedited grace period whose completion resulted in their acquisition
> > completing and thus them being awakened. No fuss, no muss, no unnecessary
> > contention or cache thrashing.
>
> Plenty of cache thrashing, since your 'tree' is not at all cache-aligned
> or even remotely coherent with the actual machine topology -- I'll keep
> reminding you :-)

And, as I keep reminding you, if you actually show me system-level data
demonstrating that this is a real problem, I might consider taking some
action. And also reminding you that in the meantime, you can experiment
by setting the fanout sizes to match a given system and see if it makes
any visible difference. (Yes, I do understand the odd numbering of
hyperthreads, but you can still run a reasonable experiment.)
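
For concreteness, the knobs I have in mind look something like the
following -- exact names, ranges, and defaults depend on your kernel
version, so please treat this as a sketch rather than a recipe. The
idea is to pick the leaf fanout so that each leaf rcu_node covers
something like one core's or one socket's worth of CPUs:

	# Compile-time shaping of the rcu_node tree (Kconfig):
	CONFIG_RCU_FANOUT=4		# children per interior rcu_node
	CONFIG_RCU_FANOUT_LEAF=8	# CPUs per leaf rcu_node

	# Boot-time alternative for the leaf level; rcu_fanout_exact=1
	# disables auto-balancing so the requested shape is honored:
	rcutree.rcu_fanout_leaf=8 rcutree.rcu_fanout_exact=1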

> But I must admit that the workings of the sequence thing eluded me this
> morning. Yes, that's much better than the strict ticket order of before.

OK, good!
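
In case it helps anyone else following along, here is a rough user-space
sketch of that sequence-snapshot-plus-funnel idea, with pthreads mutexes
and a C11 atomic counter standing in for the rcu_node ->exp_funnel_mutex
tree and the expedited sequence counter. It is only the shape of the
thing, not the kernel code: the real implementation hands the mutexes
off up the tree instead of holding them nested, and uses wraparound-safe
sequence comparisons.

/*
 * Illustrative user-space sketch only -- not the kernel code.
 * Two funnel levels (leaf, root) and a sequence counter that
 * advances by two per expedited grace period (odd = in flight).
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NLEAF 4

static atomic_ulong exp_seq;	/* even: idle, odd: GP in flight */
static pthread_mutex_t leaf_funnel[NLEAF] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};
static pthread_mutex_t root_funnel = PTHREAD_MUTEX_INITIALIZER;

/* Sequence value whose completion would satisfy a caller starting now. */
static unsigned long exp_seq_snap(void)
{
	return (atomic_load(&exp_seq) + 3) & ~0x1UL;
}

/* Has a full expedited GP elapsed since the snapshot was taken? */
static bool exp_seq_done(unsigned long s)
{
	return atomic_load(&exp_seq) >= s;	/* kernel compare is wrap-safe */
}

static void do_one_expedited_gp(void)
{
	atomic_fetch_add(&exp_seq, 1);		/* GP start (odd) */
	/* ... force each CPU/thread through a quiescent state ... */
	atomic_fetch_add(&exp_seq, 1);		/* GP end (even) */
}

void expedited_gp_sketch(int leaf)
{
	unsigned long s = exp_seq_snap();

	/* Funnel toward the root, rechecking after every acquisition. */
	pthread_mutex_lock(&leaf_funnel[leaf]);
	if (exp_seq_done(s))
		goto out;		/* someone else's GP already covered us */
	pthread_mutex_lock(&root_funnel);
	if (!exp_seq_done(s))
		do_one_expedited_gp();	/* root holder drives exactly one GP */
	pthread_mutex_unlock(&root_funnel);
out:
	pthread_mutex_unlock(&leaf_funnel[leaf]);
}

The recheck after each acquisition is what makes the queue of waiters
evaporate: by the time a given task gets its leaf mutex, the root holder
has most likely already completed an expedited grace period that covers
that task's snapshot.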

> > > You also do not take the actual RCU state machine into account -- this
> > > is a parallel state.
> > >
> > > Can't we integrate the force quiescent state machinery with the
> > > expedited machinery -- that is instead of building a parallel state, use
> > > the expedited thing to push the regular machine forward?
> > >
> > > We can use the stop_machine calls to force the local RCU state forward,
> > > after all, we _know_ we just made a context switch into the stopper
> > > thread. All we need to do is disable interrupts to hold off the tick
> > > (which normally drives the state machine) and just unconditionally
> > > advance our state.
> > >
> > > If we use the regular GP machinery, you also don't have to strongly
> > > order the callers, just stick them on whatever GP was active when they
> > > came in and let them roll, this allows much better (and more natural)
> > > concurrent processing.
> >
> > That gets quite complex, actually. Lots of races with the normal grace
> > periods doing one thing or another.
>
> How so? I'm probably missing several years of RCU trickery and detail
> again, but since we can advance from the tick, we should be able to
> advance from the stop work with IRQs disabled with equal ease.
>
> And since the stop work and the tick are fully serialized, there cannot
> be any races there.
>
> And the stop work against other CPUs is the exact same races you already
> had with tick vs tick.
>
> So please humour me and explain how all this is far more complicated ;-)

Yeah, I do need to get RCU design/implementation documentation put together.

In the meantime, RCU's normal grace-period machinery is designed to be
quite loosely coupled. The idea is that almost all actions occur locally,
reducing contention and cache thrashing. But an expedited grace period
needs tight coupling in order to be able to complete quickly. Making
something that switches between loose and tight coupling in short order
is not at all simple.
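
If it helps, the contrast looks roughly like this -- a purely
illustrative user-space sketch, with made-up per-CPU flags standing in
for the real per-CPU and rcu_node state:

#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 8

static atomic_bool need_qs[NR_CPUS];	/* a grace period wants a report */
static atomic_bool passed_qs[NR_CPUS];	/* this CPU has reported */

/*
 * Loosely coupled (normal GP): each CPU notices the request from its
 * own tick, in its own good time, and reports purely locally.
 */
void tick_sketch(int cpu)
{
	if (atomic_load(&need_qs[cpu]))
		atomic_store(&passed_qs[cpu], true);
}

/*
 * Tightly coupled (expedited): one task demands a report from every
 * CPU right now and waits until all of them have answered.  The
 * kernel forces the issue (IPIs, stop_machine) rather than spinning.
 */
void expedited_sketch(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		atomic_store(&need_qs[cpu], true);
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		while (!atomic_load(&passed_qs[cpu]))
			;
}

Teaching one set of state to do both of those things at once, without
slowing down the common case, is where the complexity comes from.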

> > However, it should be quite easy to go the other way and make the normal
> > grace-period processing take advantage of expedited grace periods that
> > happened to occur at the right time. I will look into this, thank you
> > for the nudge!
>
> That should already be happening, right? Since we force context
> switches, the tick driven RCU state machine will observe those and make
> progress -- assuming it was trying to make progress at all of course.

It is to an extent, but I believe that I can do better. On the other hand,
it is quite possible that this is a 6AM delusion on my part. ;-)

If it is not a delusion, the eventual solution will likely be a much more
satisfying answer to your "why not merge into the normal RCU grace period
machinery" question. But I need to complete reworking the expedited
machinery first!

Thanx, Paul
