Re: Requirements to control kernel isolation/nohz_full at runtime

From: peterz
Date: Mon Sep 07 2020 - 11:38:39 EST



(your mailer broke and forgot to keep lines shorter than 78 chars)

On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:

> == TIF_NOHZ ==
>
> Need to get rid of that in order not to trigger the syscall slowpath on
> CPUs that don't want nohz_full. Also we don't want to iterate all
> threads and clear the flag when the last nohz_full CPU exits nohz_full
> mode. Prefer static keys to guard the context tracking calls on each
> arch. x86 does that well.

Build on the common entry code I suppose. Then any arch that uses that
gets to have the new features.

> == Proper entry code ==
>
> We must make sure that a given arch never calls exception_enter() /
> exception_exit(). These save the previous context tracking state and
> switch to kernel mode (from the context tracking POV) temporarily.
> Since this state is saved on the stack, this prevents us from turning
> off context tracking entirely on a CPU: The tracking must be done on
> all CPUs and that takes some cycles.
>
> This means that, considering early entry code (before the call to
> context tracking upon kernel entry, and after the call to context
> tracking upon kernel exit), we must take care of a few things:
>
> 1) Make sure early entry code can't trigger exceptions. Or if it does,
> the given exception can't schedule or use RCU (unless it calls
> rcu_nmi_enter()). Otherwise the exception must call
> exception_enter()/exception_exit() which we don't want.

I think this is true for x86. Early entry has interrupts disabled, any
exception that can still happen is NMI-like and will thus use
rcu_nmi_enter().

On x86 that now includes #DB (which is also excluded due to us refusing
to set execution breakpoints on entry code), #BP, NMI and MCE.

> 2) No call to schedule_user().

I'm not sure what that is supposed to do, but x86 doesn't appear to have
it, so all good :-)

> 3) Make sure early entry code is not interruptible or
> preempt_schedule_irq() would rely on
> exception_enter()/exception_exit()

This is so for x86.

> 4) Make sure early entry code can't be traced (no call to
> preempt_schedule_notrace()), or if it is traced it can't schedule

noinstr is your friend.

> I believe x86 does most of that well.

It does now.