Re: [RFC][PATCH 12/13] stop_machine: Remove lglock

From: Paul E. McKenney
Date: Thu Jun 25 2015 - 09:48:13 EST


On Thu, Jun 25, 2015 at 01:07:34PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 24, 2015 at 08:23:17PM -0700, Paul E. McKenney wrote:
> > Here is what I had in mind, where you don't have any global thrashing
> > except when the ->expedited_sequence gets updated. Passes mild rcutorture
> > testing.
>
> > /*
> > + * Each pass through the following loop works its way
> > + * up the rcu_node tree, returning if others have done the
> > + * work or otherwise falls through holding the root rnp's
> > + * ->exp_funnel_mutex. The mapping from CPU to rcu_node structure
> > + * can be inexact, as it is just promoting locality and is not
> > + * strictly needed for correctness.
> > */
> > + rnp0 = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
> > + for (; rnp0 != NULL; rnp0 = rnp0->parent) {
> > + if (sync_sched_exp_wd(rsp, rnp1, &rsp->expedited_workdone1, s))
> > return;
> > + mutex_lock(&rnp0->exp_funnel_mutex);
> > + if (rnp1)
> > + mutex_unlock(&rnp1->exp_funnel_mutex);
> > + rnp1 = rnp0;
> > + }
> > + rnp0 = rnp1; /* rcu_get_root(rsp), AKA root rcu_node structure. */
> > + if (sync_sched_exp_wd(rsp, rnp0, &rsp->expedited_workdone2, s))
> > + return;
>
> I'm still somewhat confused by the whole strict order sequence vs this
> non ordered 'polling' of global state.
>
> This funnel thing basically waits random times depending on the
> contention of these mutexes and tries again. Ultimately serializing on
> the root funnel thing.

Not random at all!

The whole funnel is controlled by the root ->exp_funnel_mutex holder,
who is going to hold the lock for a single expedited grace period, then
release it. This means that any time a task acquires a lock, there is
very likely to have been a recent state change. Hence the checks after
each lock acquisition.
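
To make "the checks after each lock acquisition" concrete, here is a
minimal sketch of the sequence-counter test they perform, written as
ordinary user-space C rather than the kernel code; the helper names
exp_seq_snap() and exp_gp_done() are invented for illustration, with the
bottom bit of the counter meaning "expedited GP in progress":

	static unsigned long exp_seq_snap(unsigned long seq)
	{
		/* First counter value that guarantees a full expedited GP,
		 * beginning after this snapshot, has also ended. */
		return (seq + 3) & ~0x1UL;
	}

	static int exp_gp_done(unsigned long seq, unsigned long snap)
	{
		/* Wrap-safe "has the counter reached my snapshot?" test. */
		return (long)(seq - snap) >= 0;
	}

In the loop quoted above, sync_sched_exp_wd() plays roughly this role:
if the test succeeds, the task drops whatever ->exp_funnel_mutex it
holds and returns, because the grace period it needs has already
completed.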

So in the heavy-use case, what tends to happen is that there are one
or two expedited grace periods, and then the entire queue of waiters
acquiring ->exp_funnel_mutex simply evaporates -- each of them can make
use of the expedited grace period whose completion is what allowed its
mutex acquisition to complete and woke it up in the first place. No
fuss, no muss, no unnecessary contention or cache thrashing.
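
As a rough analogy for that evaporation (ordinary pthreads, reusing the
exp_seq_snap()/exp_gp_done() helpers from the sketch above, and eliding
the per-node walk down to a single root mutex; none of this is the real
kernel locking or memory ordering):

	#include <pthread.h>
	#include <stdatomic.h>

	static atomic_ulong exp_seq;	/* stand-in for ->expedited_sequence */
	static pthread_mutex_t root_funnel = PTHREAD_MUTEX_INITIALIZER;

	static void expedited_wait(void)
	{
		unsigned long snap = exp_seq_snap(atomic_load(&exp_seq));

		pthread_mutex_lock(&root_funnel);
		if (exp_gp_done(atomic_load(&exp_seq), snap)) {
			/* A GP completed while we were queued: evaporate. */
			pthread_mutex_unlock(&root_funnel);
			return;
		}
		atomic_fetch_add(&exp_seq, 1);	/* GP starts, counter odd  */
		/* ... drive all CPUs through a quiescent state ... */
		atomic_fetch_add(&exp_seq, 1);	/* GP ends, counter even   */
		pthread_mutex_unlock(&root_funnel);
	}

Tasks whose snapshots are satisfied by a just-completed grace period
take the early-return path the moment the lock is handed to them, which
is the queue draining described above.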

> So on the one hand you have to strictly order these expedited caller,
> but then you don't want to actually process them in order. If 'by magic'
> you manage to process the 3rd in queue, you can drop the 2nd because it
> will have waited long enough. OTOH the 2nd will have waited too long.

Let's take the example of a 4096-CPU system with default configuration of
CONFIG_RCU_FANOUT=64 and CONFIG_RCU_FANOUT_LEAF=16. There will then be
256 leaf rcu_node structures, each of which is subordinate to one of four
internal rcu_node structures, each of which is subordinate to the root
rcu_node structure. There can then be up to 260 tasks waiting on non-root
rcu_node ->exp_funnel_mutex, with an additional task holding the root
rcu_node ->exp_funnel_mutex and carrying out an expedited grace period.
Once that grace period completes, one of the tasks holding an internal
->exp_funnel_mutex acquires the root ->exp_funnel_mutex. If it can use
the just-completed grace period, it releases its ->exp_funnel_mutex,
and the cycle repeats, until the queue drains. If not, then it will
carry out another grace period, perhaps making some of the queue wait
unnecessarily -- but that can happen in the strictly queued case as well,
due to delays between snapshotting the counter and getting on the queue.
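
The arithmetic behind those numbers, as a quick user-space check
(nothing kernel-specific, just the default fanout values cited above):

	#include <stdio.h>

	int main(void)
	{
		int cpus = 4096, fanout_leaf = 16, fanout = 64;
		int leaves = (cpus + fanout_leaf - 1) / fanout_leaf;	/* 256 */
		int internal = (leaves + fanout - 1) / fanout;		/* 4   */
		int nonroot = leaves + internal;			/* 260 */

		printf("%d leaf + %d internal rcu_node structures\n",
		       leaves, internal);
		printf("up to %d tasks blocked on non-root ->exp_funnel_mutex\n",
		       nonroot);
		return 0;
	}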

The key advantage of the funnel approach is that many tasks can be
concurrently discovering that the grace period they need has already
happened.

Of course, if there are more than 260 tasks queued, the excess tasks will
queue on the leaf ->exp_funnel_mutex mutexes. But they will eventually
start draining 256 at a time, in parallel.

And nothing comes for free. In an idle system, the single task wanting
an expedited grace period must work its way up the rcu_node tree. In
the 4096-CPU case with default configuration, it must acquire three
uncontended mutexes. But this is way down in the noise compared to
the 4095 cache misses required to determine that all the rest of the
CPUs are idle. So the funnel approach is a good tradeoff.

> You also do not take the actual RCU state machine into account -- this
> is a parallel state.
>
> Can't we integrate the force quiescent state machinery with the
> expedited machinery -- that is instead of building a parallel state, use
> the expedited thing to push the regular machine forward?
>
> We can use the stop_machine calls to force the local RCU state forward,
> after all, we _know_ we just made a context switch into the stopper
> thread. All we need to do is disable interrupts to hold off the tick
> (which normally drives the state machine) and just unconditionally
> advance our state.
>
> If we use the regular GP machinery, you also don't have to strongly
> order the callers, just stick them on whatever GP was active when they
> came in and let them roll, this allows much better (and more natural)
> concurrent processing.

That gets quite complex, actually. Lots of races with the normal grace
periods doing one thing or another.

However, it should be quite easy to go the other way and make the normal
grace-period processing take advantage of expedited grace periods that
happened to occur at the right time. I will look into this, thank you
for the nudge!

Thanx, Paul
