Re: dyntick-idle CPU and node's qsmask

From: Joel Fernandes
Date: Sun Nov 11 2018 - 16:04:13 EST


On Sun, Nov 11, 2018 at 10:36:18AM -0800, Paul E. McKenney wrote:
[..]
> > > > > CPU will with high probability report its own quiescent state before three
> > > > > jiffies pass, in which case the cache misses on the rcu_data structures
> > > > > would be wasted motion.
> > > >
> > > > If all the CPUs are busy and reporting their QS themselves, then I think the
> > > > qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
> > > > force_qs_rnp) wouldn't be called and so there would be no cache misses on
> > > > rcu_data right?
> > >
> > > Yes, but assuming that all CPUs report their quiescent states before
> > > the first call to rcu_gp_fqs(). One exception is when some CPU is
> > > looping in the kernel for many milliseconds without passing through a
> > > quiescent state. This is because for recent kernels, cond_resched()
> > > is not a quiescent state until the grace period is something like 100
> > > milliseconds old. (For older kernels, cond_resched() was never an RCU
> > > quiescent state unless it actually scheduled.)
> > >
> > > Why wait 100 milliseconds? Because otherwise the increase in
> > > cond_resched() overhead shows up all too well, causing the 0day test robot
> > > to complain bitterly. Besides, I would expect that in the common case,
> > > CPUs would be executing usermode code.
> >
> > Makes sense. I was also wondering about this other thing you mentioned about
> > waiting for 3 jiffies before reporting the idle CPU's quiescent state. Does
> > that mean that even if a single CPU is dyntick-idle for a long period of
> > time, then the minimum grace period duration would be at least 3 jiffies? In
> > our mobile embedded devices, jiffies is set to 3.33ms (HZ=300) to keep power
> > consumption low. Not that I'm saying it's an issue or anything (since IIUC if
> > someone wants shorter grace periods, they should just use expedited GPs), but
> > it sounds like it would be a shorter GP if we just set the qsmask early on
> > somehow and we can manage the overhead of doing so.
>
> First, there is some autotuning of the delay based on HZ:
>
> #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
>
> So at HZ=300, you should be seeing a two-jiffy delay rather than the
> usual HZ=1000 three-jiffy delay. Of course, this means that the delay
> is 6.67ms rather than the usual 3ms, but the theory is that lower HZ
> rates often mean slower instruction execution and thus a desire for
> lower RCU overhead. There is further autotuning based on number of
> CPUs, but this does not kick in until you have 256 CPUs on your system,
> and I bet that smartphones aren't there yet. Nevertheless, check out
> RCU_JIFFIES_FQS_DIV for more info on this.

Got it. I agree with that heuristic.
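
To check my reading of the macro, here is a minimal userspace sketch that
evaluates the same expression for a few HZ values and converts the result to
wall-clock time. The function names are made up for illustration and are not
kernel symbols, and the sketch ignores the further CPU-count-based tuning
(RCU_JIFFIES_FQS_DIV) mentioned above.

```c
#include <assert.h>

/*
 * Userspace re-statement of the quoted RCU_JIFFIES_TILL_FORCE_QS macro,
 * with HZ turned into a parameter so that several tick rates can be
 * compared. Hypothetical helper, not a kernel function.
 */
static int jiffies_till_force_qs(int hz)
{
	return 1 + (hz > 250) + (hz > 500);
}

/*
 * Delay before the first force-quiescent-state scan, in microseconds
 * (integer division, so the result is truncated).
 */
static long fqs_delay_us(int hz)
{
	assert(hz > 0);
	return (long)jiffies_till_force_qs(hz) * 1000000L / hz;
}
```

With these helpers, HZ=300 gives two jiffies (about 6.67ms) and HZ=1000 gives
three jiffies (3ms), which matches the numbers above.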

> But you can always override this autotuning using the following kernel
> boot parameters:
>
> rcutree.jiffies_till_first_fqs
> rcutree.jiffies_till_next_fqs
>
> You can even set the first one to zero if you want the effect of pre-scanning
> for idle CPUs. ;-)
>
> The second must be set to one or greater.
>
> Both are capped at one second (HZ).

Got it. Thanks a lot for the explanations.
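
For my own notes, the limits described above could be sketched as below. This
only models the stated rules (the first value may be zero, the second must be
at least one, both are capped at HZ). The helper names are hypothetical, and
treating an out-of-range value by clamping it is my assumption about the
behavior, not something I have checked against the code.

```c
#include <assert.h>

/*
 * Hypothetical model of the limit on rcutree.jiffies_till_first_fqs:
 * zero is allowed, anything above HZ (one second) is capped at HZ.
 */
static unsigned long limit_first_fqs(unsigned long requested, unsigned long hz)
{
	assert(hz >= 1);
	return requested > hz ? hz : requested;
}

/*
 * Hypothetical model of the limit on rcutree.jiffies_till_next_fqs:
 * must be one or greater, and is likewise capped at HZ.
 * Clamping (rather than rejecting) bad values is an assumption.
 */
static unsigned long limit_next_fqs(unsigned long requested, unsigned long hz)
{
	assert(hz >= 1);
	if (requested < 1)
		return 1;
	return requested > hz ? hz : requested;
}
```

So on our HZ=300 devices, booting with rcutree.jiffies_till_first_fqs=0 and
rcutree.jiffies_till_next_fqs=1 would be the most aggressive setting these
rules allow.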

> > > > Anyway it was just an idea that popped up when I was going through traces :)
> > > > Thanks for the discussion and happy to discuss further or try out anything.
> > >
> > > Either way, I do appreciate your going through this. People have found
> > > RCU bugs this way, one of which involved RCU uselessly calling a particular
> > > function twice in quick succession. ;-)
> >
> > Thanks. It is my pleasure and happy to help :) I'll keep digging into it.
>
> Looking forward to further questions and patches. ;-)

Will do! Thanks,

- Joel