Re: dyntick-idle CPU and node's qsmask

From: Paul E. McKenney
Date: Wed Nov 21 2018 - 09:40:02 EST


On Tue, Nov 20, 2018 at 08:37:22PM -0800, Joel Fernandes wrote:
> On Tue, Nov 20, 2018 at 06:41:07PM -0800, Paul E. McKenney wrote:
> [...]
> > > > > I was thinking if we could simplify rcu_note_context_switch (the parts that
> > > > > call rcu_momentary_dyntick_idle), if we did the following in
> > > > > rcu_implicit_dynticks_qs.
> > > > >
> > > > > Since we already call rcu_qs in rcu_note_context_switch, that would clear the
> > > > > rdp->cpu_no_qs flag. Then there should be no need to call
> > > > > rcu_momentary_dyntick_idle from rcu_note_context_switch.
> > > >
> > > > But does this also work for the rcu_all_qs() code path?
> > >
> > > Could we not do something like this in rcu_all_qs, as some over-simplified
> > > pseudocode:
> > >
> > > rcu_all_qs() {
> > >     if (!urgent_qs || !heavy_qs)
> > >         return;
> > >
> > >     rcu_qs(); // This clears the rdp->cpu_no_qs flags which we can monitor in
> > >               // the diff in my last email (from rcu_implicit_dynticks_qs)
> > > }
> >
> > Except that rcu_qs() doesn't necessarily report the quiescent state to
> > the RCU core. Keeping down context-switch overhead and all that.
>
> Sure yeah, but I think the QS will be reported indirectly anyway by the
> force_qs_rnp() path if we detect that rcu_qs() happened on the CPU?

The force_qs_rnp() path won't see anything that has not already been
reported to the RCU core.
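
To make the capture-versus-report distinction concrete, here is a toy
userspace sketch. It is not kernel code: every toy_* name is made up for
illustration, and it assumes a simplified model in which the remote scan
looks only at a remotely readable counter and never at the CPU-private
flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only. All names are hypothetical, not kernel APIs. */
struct toy_cpu {
	bool no_qs;             /* CPU-private: true while a QS is still owed */
	unsigned long dynticks; /* counter that other CPUs are allowed to sample */
	unsigned long snap;     /* dynticks value recorded at grace-period start */
};

struct toy_node {
	uint64_t qsmask;        /* one bit per CPU that has not yet reported */
};

/* Capture: the CPU records its own quiescent state and nothing more. */
static void toy_qs(struct toy_cpu *c)
{
	c->no_qs = false;
}

/* Report: run by the CPU itself, pushes a captured QS up to the node. */
static void toy_report_own_qs(struct toy_node *n, struct toy_cpu *c, int cpu)
{
	assert(cpu >= 0 && cpu < 64);
	if (!c->no_qs)
		n->qsmask &= ~((uint64_t)1 << cpu);
}

/* Remote scan: compares counters only, never reads the private flag. */
static void toy_force_qs(struct toy_node *n, const struct toy_cpu *cpus,
			 int ncpus)
{
	assert(ncpus >= 0 && ncpus <= 64);
	for (int cpu = 0; cpu < ncpus; cpu++) {
		uint64_t bit = (uint64_t)1 << cpu;

		if ((n->qsmask & bit) && cpus[cpu].dynticks != cpus[cpu].snap)
			n->qsmask &= ~bit;
	}
}

/* CPU 0 only captures a QS; CPU 1 changes its remotely visible counter. */
static uint64_t toy_demo(bool cpu0_reports)
{
	struct toy_cpu cpus[2] = { { true, 0, 0 }, { true, 0, 0 } };
	struct toy_node node = { 0x3 };

	toy_qs(&cpus[0]);
	cpus[1].dynticks++;
	toy_force_qs(&node, cpus, 2);
	if (cpu0_reports)
		toy_report_own_qs(&node, &cpus[0], 0);
	return node.qsmask;
}
```

In this model, toy_demo(false) leaves CPU 0's bit set in qsmask even
though CPU 0 has passed through a quiescent state, because nothing the
remote scan is allowed to look at has changed.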

> > > > > I think this would simplify cond_resched as well. Could this avoid the need
> > > > > for having an rcu_all_qs at all? Hopefully I didn't miss some Tasks-RCU corner cases..
> > > >
> > > > There is also the code path from cond_resched() in PREEMPT=n kernels.
> > > > This needs rcu_all_qs(). Though it is quite possible that some additional
> > > > code collapsing is possible.
> > > >
> > > > > Basically for some background, I was thinking can we simplify the code that
> > > > > calls "rcu_momentary_dyntick_idle" since we already register a qs in other
> > > > > ways (like by resetting cpu_no_qs).
> > > >
> > > > One complication is that rcu_all_qs() is invoked with interrupts
> > > > and preemption enabled, while rcu_note_context_switch() is
> > > > invoked with interrupts disabled. Also, as you say, Tasks RCU.
> > > > Plus rcu_all_qs() wants to exit immediately if there is nothing to
> > > > do, while rcu_note_context_switch() must unconditionally do rcu_qs()
> > > > -- yes, it could check, but that would be redundant with the checks
> > >
> > > This immediate exit is taken care of in the above pseudocode, would that
> > > help the cond_resched performance?
> >
> > It looks like you are cautiously edging towards the two wrapper functions
> > calling common code, relying on inlining and simplification. Why not just
> > try doing it? ;-)
>
> Sure yeah. I was thinking more of the ambitious goal of getting rid of the
> complexity and exploring the general design idea, rather than
> containing/managing the complexity by reducing code duplication. :D
>
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index c818e0c91a81..5aa0259c014d 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -1063,7 +1063,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
> > > > > * read-side critical section that started before the beginning
> > > > > * of the current RCU grace period.
> > > > > */
> > > > > - if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap)) {
> > > > > + if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap) || !rdp->cpu_no_qs.b.norm) {
> > > >
> > > > If I am not too confused, this change could cause trouble for
> > > > nohz_full CPUs looping in the kernel. Such CPUs don't necessarily take
> > > > scheduler-clock interrupts, last I checked, and this could prevent the
> > > > CPU from reporting its quiescent state to core RCU.
> > >
> > > Would that still be a problem if rcu_all_qs called rcu_qs? Also the above
> > > diff is an OR condition so it is more relaxed than before.
> >
> > Yes, because rcu_qs() is only guaranteed to capture the quiescent
> > state on the current CPU, not necessarily report it to the RCU core.
>
> The reporting to the core is necessary to call rcu_report_qs_rnp so that the
> QS information is propagated up the tree, right?
>
> Wouldn't that reporting be done anyway by:
>
> force_qs_rnp
> -> rcu_implicit_dynticks_qs (which returns 1 because rdp->cpu_no_qs.b.norm
> was cleared by rcu_qs() and we detect that
> with help of above diff)

Ah. It is not safe to sample rdp->cpu_no_qs.b.norm off-CPU, and that
is what your patch would do. This is intentional -- if it were safe to
sample off-CPU, then it would be more expensive to read/update on-CPU.
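
A toy userspace sketch of that trade-off, using C11 atomics. This is not
kernel code: the demo_* names are hypothetical, and the atomic counter
merely stands in for whatever state is designed to be sampled from other
CPUs. The private flag is updated with a plain store, while the
remotely visible counter needs a fully ordered read-modify-write.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model only. All names are hypothetical, not kernel APIs. */
struct demo_percpu {
	bool no_qs;            /* plain field: only the owning CPU touches it */
	atomic_ulong dynticks; /* atomic field: other CPUs may load it */
};

static void demo_init(struct demo_percpu *p)
{
	p->no_qs = true;
	atomic_init(&p->dynticks, 0);
}

/* Cheap on-CPU update: a plain store with no ordering guarantees. */
static void demo_local_qs(struct demo_percpu *p)
{
	p->no_qs = false;
}

/* More expensive update: sequentially consistent atomic add. */
static void demo_momentary_idle(struct demo_percpu *p)
{
	atomic_fetch_add(&p->dynticks, 2);
}

/* What a remote observer records at the start of its wait. */
static unsigned long demo_remote_snap(struct demo_percpu *p)
{
	return atomic_load(&p->dynticks);
}

/* The remote observer compares counters and never reads no_qs. */
static bool demo_remote_saw_qs(struct demo_percpu *p, unsigned long snap)
{
	return atomic_load(&p->dynticks) != snap;
}

/* Returns true if the remote observer notices the quiescent state. */
static bool demo_run(bool bump_counter)
{
	struct demo_percpu cpu;
	unsigned long snap;

	demo_init(&cpu);
	snap = demo_remote_snap(&cpu);
	demo_local_qs(&cpu);
	assert(!cpu.no_qs);
	if (bump_counter)
		demo_momentary_idle(&cpu);
	return demo_remote_saw_qs(&cpu, snap);
}
```

Here demo_run(false) returns false: clearing the private flag alone is
invisible to the remote observer. Only the atomic update in
demo_run(true) is something it can safely rely on.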

> -> rcu_report_qs_rnp is called with mask bit set for corresponding CPU that
> has the !rdp->cpu_no_qs.b.norm
>
>
> I think that's what I am missing - why the above scheme wouldn't work.
> The only difference is reporting to the RCU core might invoke pending
> callbacks but I'm not sure if that matters for this. I'll try these changes,
> and try tracing it out and study it more. Thanks for the patience,

There are a lot of moving parts and you have not yet gotten to all
of them. I suggest next taking a look at the relationship between
rcu_check_callbacks() and rcu_process_callbacks(), including the
open_softirq(). These have old names -- they handle the interface
between the CPU and RCU code, among other things. Including invoking
callbacks, but only for some configurations. :-/

Thanx, Paul