Re: INFO: rcu_sched detected stalls on CPUs/tasks with `kswapd` and `mem_cgroup_shrink_node`

From: Peter Zijlstra
Date: Thu Dec 01 2016 - 00:30:48 EST


On Wed, Nov 30, 2016 at 11:40:19AM -0800, Paul E. McKenney wrote:

> > See commit:
> >
> > 4a81e8328d37 ("rcu: Reduce overhead of cond_resched() checks for RCU")
> >
> > Someone actually wrote down what the problem was.
>
> Don't worry, it won't happen again. ;-)
>
> OK, so the regressions were in the "open1" test of Anton Blanchard's
> "will it scale" suite, and were due to faster (and thus more) grace
> periods rather than path length.
>
> I could likely counter the grace-period speedup by regulating the rate
> at which the grace-period machinery pays attention to the rcu_qs_ctr
> per-CPU variable. Actually, this looks pretty straightforward (famous
> last words). But see patch below, which is untested and probably
> completely bogus.

Possible I suppose. Didn't look too hard at it.
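
For concreteness, the sort of rate limiting Paul describes might look
something like the below -- an entirely untested sketch; the names
qs_sample_deadline, QS_SAMPLE_INTERVAL and should_sample_rcu_qs_ctr()
are made up for illustration, not anything actually in -rcu:

/*
 * Untested sketch (needs <linux/jiffies.h>), not actual -rcu code:
 * throttle how often the grace-period machinery samples the per-CPU
 * rcu_qs_ctr, so that faster grace periods do not also translate
 * into proportionally more sampling overhead.
 */
static unsigned long qs_sample_deadline;        /* made-up name */
#define QS_SAMPLE_INTERVAL      (HZ / 10)       /* made-up tunable */

static bool should_sample_rcu_qs_ctr(void)
{
        if (time_before(jiffies, qs_sample_deadline))
                return false;   /* sampled recently, skip this pass */
        qs_sample_deadline = jiffies + QS_SAMPLE_INTERVAL;
        return true;
}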

> > > > Also, I seem to have missed, why are we going through this again?
> > >
> > > Well, the reason I brought that up is that having basically two
> > > APIs for cond_resched() is more than confusing. Basically all longer
> > > in-kernel loops do cond_resched(), but it seems that this will not
> > > help silence the RCU lockup detector in those rare cases where
> > > nothing really wants to schedule. I am really not sure whether we
> > > want to sprinkle cond_resched_rcu_qs() at random places just to
> > > silence the RCU detector...
> >
> > Right.. now, this is obviously all PREEMPT=n code, which therefore also
> > implies this is rcu-sched.
> >
> > Paul, now doesn't rcu-sched, when the grace period has been long in
> > coming, try to force it? And doesn't that forcing include prodding
> > CPUs with resched_cpu()?
>
> It does in the v4.8.4 kernel that Boris is running. It still does in my
> -rcu tree, but only after an RCU CPU stall (something about people not
> liking IPIs). I may need to do a resched_cpu() halfway to stall-warning
> time or some such.

Sure, we all dislike IPIs, but I'm thinking this half-way point is
sensible; no point in issuing a user-visible annoyance if indeed we can
prod things back to life, no?

Only if we utterly fail to make it respond should we bug the user with
our failure.

> > I'm thinking not, because if it did, that would make cond_resched()
> > actually schedule, which would then call into rcu_note_context_switch(),
> > which would then make RCU progress, no?
>
> Sounds plausible, but from what I can see some of the loops pointed
> out by Boris's stall-warning messages don't have cond_resched().
> There was another workload that apparently worked better when moved from
> cond_resched() to cond_resched_rcu_qs(), but I don't know what kernel
> version was running.

Egads.. cursed if you do, cursed if you don't, eh?
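
For anyone following along at home, the distinction we keep going back
and forth over is roughly the below -- sketch only, the loop body is
made up:

/*
 * In a long PREEMPT=n loop, cond_resched() only helps RCU if it
 * actually schedules; when nothing else wants the CPU there is no
 * context switch, so rcu-sched sees no quiescent state.
 * cond_resched_rcu_qs() additionally reports a quiescent state to
 * RCU even when it does not end up scheduling.
 */
while (more_work_to_do()) {             /* made-up loop body */
        do_a_chunk_of_work();           /* made up */
        cond_resched_rcu_qs();          /* QS even without a switch */
}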