Re: Requirements to control kernel isolation/nohz_full at runtime

From: Paul E. McKenney
Date: Fri Sep 04 2020 - 16:47:48 EST


On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
> Hi,
>
> I'm currently working on making nohz_full/nohz_idle runtime toggable
> and some other people seem to be interested as well. So I've dumped
> a few thoughts about some pre-requirements to achieve that for those
> interested.
>
> As you can see, there is a bit of hard work in the way. I'm iterating
> that in https://pad.kernel.org/p/isolation, feel free to edit:
>
>
> == RCU nocb ==
>
> Currently controllable with "rcu_nocbs=" boot parameter and/or through nohz_full=/isolcpus=nohz
> We need to make it toggeable at runtime. Currently handling that:
> v1: https://lwn.net/Articles/820544/
> v2: coming soon

Looking forward to seeing it!

> == TIF_NOHZ ==
>
> Need to get rid of that in order not to trigger syscall slowpath on CPUs that don't want nohz_full.
> Also we don't want to iterate all threads and clear the flag when the last nohz_full CPU exits nohz_full
> mode. Prefer static keys to call context tracking on archs. x86 does that well.

Would it help if RCU was able to, on a per-CPU basis, distinguish between
nohz_full userspace execution on the one hand and idle-loop execution
on the other? Or do you have some other trick in mind?

Thanx, Paul

> == Proper entry code ==
>
> We must make sure that a given arch never calls exception_enter() / exception_exit().
> This saves the previous state of context tracking and switch to kernel mode (from context tracking POV)
> temporarily. Since this state is saved on the stack, this prevents us from turning off context tracking
> entirely on a CPU: The tracking must be done on all CPUs and that takes some cycles.
>
> This means that, considering early entry code (before the call to context tracking upon kernel entry,
> and after the call to context tracking upon kernel exit), we must take care of few things:
>
> 1) Make sure early entry code can't trigger exceptions. Or if it does, the given exception can't schedule
> or use RCU (unless it calls rcu_nmi_enter()). Otherwise the exception must call exception_enter()/exception_exit()
> which we don't want.
>
> 2) No call to schedule_user().
>
> 3) Make sure early entry code is not interruptible or preempt_schedule_irq() would rely on
> exception_entry()/exception_exit()
>
> 4) Make sure early entry code can't be traced (no call to preempt_schedule_notrace()), or if it does it
> can't schedule
>
> I believe x86 does most of that well. In the end we should remove exception_enter()/exit implementations
> in x86 and replace it with a check that makes sure context_tracking state is not in USER. An arch meeting
> all the above conditions would earn a CONFIG_ARCH_HAS_SANE_CONTEXT_TRACKING. Being able to toggle nohz_full
> at runtime would depend on that.
>
>
> == Cputime accounting ==
>
> Both write and read side must switch to tick based accounting and drop the use of seqlock in task_cputime(),
> task_gtime(), kcpustat_field(), kcpustat_cpu_fetch(). Special ordering/state machine is required to make that without races.
>
> == Nohz ==
>
> Switch from nohz_full to nohz_idle. Mind a few details:
>
> 1) Turn off 1Hz offlined tick handled in housekeeping
> 2) Handle tick dependencies, take care of racing CPUs setting/clearing tick dependency. It's much trickier when
> we switch from nohz_idle to nohz_full
>
> == Unbound affinity ==
>
> Restore kernel threads, workqueue, timers, etc... wide affinity. But take care of cpumasks that have been set through other
> interfaces: sysfs, procfs, etc...