Re: [PATCH 3/3] sched: Use cpu_dying() to fix balance_push vs hotplug-rollback

From: Peter Zijlstra
Date: Tue Apr 13 2021 - 02:52:13 EST


On Mon, Apr 12, 2021 at 06:22:42PM +0100, Valentin Schneider wrote:
> On 12/04/21 14:03, Peter Zijlstra wrote:
> > On Thu, Mar 11, 2021 at 03:13:04PM +0000, Valentin Schneider wrote:
> >> Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
> >> > @@ -7910,6 +7908,14 @@ int sched_cpu_deactivate(unsigned int cp
> >> > }
> >> > rq_unlock_irqrestore(rq, &rf);
> >> >
> >> > + /*
> >> > + * From this point forward, this CPU will refuse to run any task that
> >> > + * is not: migrate_disable() or KTHREAD_IS_PER_CPU, and will actively
> >> > + * push those tasks away until this gets cleared, see
> >> > + * sched_cpu_dying().
> >> > + */
> >> > + balance_push_set(cpu, true);
> >> > +
> >>
> >> AIUI with cpu_dying_mask being flipped before even entering
> >> sched_cpu_deactivate(), we don't need this to be before the
> >> synchronize_rcu() anymore; is there more than that to why you're punting it
> >> back this side of it?
> >
> > I think it does need to be like this, we need to clearly separate
> > the active=true and balance_push_set(). If we were to somehow observe
> > both balance_push_set() and active==false, we'd be in trouble.
> >
>
> I'm afraid I don't follow; we're replacing a read of rq->balance_push with
> cpu_dying(), and those are still written on the same side of the
> synchronize_rcu(). What am I missing?

Yeah, I'm not sure anymore either; I tried to work out why I'd done
that but upon closer examination everything fell flat.

Let me try again today :-)

> Oooh, I can't read, only the boot CPU gets its callback uninstalled in
> sched_init()! So secondaries keep push_callback installed up until
> sched_cpu_activate(), but as you said it's not effective unless a rollback
> happens.
>
> Now, doesn't that mean we should *not* uninstall the callback in
> sched_cpu_dying()? AFAIK it's possible for the initial secondary CPU
> boot to go fine, but the next offline+online cycle fails while going up -
> that would need to rollback with push_callback installed.

Quite; I removed that shortly after sending this, when I tried to write
a comment and found it.