Re: [RFC 2/2] rcu: Remove ->dynticks_nmi_nesting from struct rcu_dynticks

From: Paul E. McKenney
Date: Fri Jun 22 2018 - 14:12:32 EST


On Fri, Jun 22, 2018 at 12:01:49PM -0400, Steven Rostedt wrote:
> On Fri, 22 Jun 2018 06:28:43 -0700
> "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>
> > It has been some years since I traced the code flow, but what happened
> > back then is that it switched itself from being an interrupt handler to
> > not being one without actually returning from the interrupt. This can
> > only happen when interrupting a non-idle process, thankfully, and RCU's
> > dyntick-idle code relies on this restriction. If I remember correctly,
> > the code ends up executing in the context of the interrupted process,
> > but it has been some years, so please apply appropriate skepticism.
>
> If irq_enter() is not paired with irq_exit() then major things will
> break. Especially since that pairing is what in_interrupt() and friends
> rely on to work.
>
> Now, perhaps rcu_irq_enter() is called elsewhere (as a git grep appears
> it may be), and that rcu_irq_enter() may not be paired with
> rcu_irq_exit(). But that's not anything to do with the irq_enter() and
> irq_exit() routines being paired or not.

The non-irq_enter() calls to rcu_irq_enter() and the non-irq_exit()
calls to rcu_irq_exit() do appear to be balanced as of v4.17.

If I recall correctly, the offending piece of functionality was the
usermode helpers, which on some architectures did a syscall exception
from within the kernel to make a system call happen. This seems to now
be common code using workqueues, kernel threads, and do_execve().
Here is the best reference I could find to the specific problem
I encountered back in the day:

https://groups.google.com/forum/#!msg/linux.kernel/B5hZX1tJRs8/sOVVfhrirL8J

I do recall that there were real failures. There is no way I would have
written code tolerating half-interrupts without cause, any more than I
would have written code handling what looks to RCU like interrupts from
NMI handlers without cause. ;-)

One approach would be for me to add a WARN_ON_ONCE() to check for
misnesting. If this did not trigger over a period long enough for the
check to propagate to the various distros' users, then this code could
be simplified. That said, the simplification would not be a big deal,
just the removal of a store or two.
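
To make the idea concrete, here is a rough userspace sketch of such a
misnesting check. It is a model only, not kernel code: struct
nesting_model, warn_on_once(), and the model_*() functions are all
hypothetical names invented for illustration, and the single
irq_nesting counter stands in for whatever per-CPU nesting state RCU
actually tracks. It warns once if the idle-entry path sees a nonzero
interrupt nesting count (a "half-interrupt"), and then resets the
counter, which is the sort of store that could go away if the warning
never fired.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of one CPU's interrupt-nesting bookkeeping. */
struct nesting_model {
	long irq_nesting;	/* current interrupt nesting depth */
	int warnings;		/* warnings emitted so far (at most 1) */
	bool warned;		/* set after the first warning */
};

/*
 * Userspace stand-in for WARN_ON_ONCE(): complain the first time
 * cond is true, and always return cond so callers can branch on it.
 */
static bool warn_on_once(struct nesting_model *m, bool cond,
			 const char *what)
{
	if (cond && !m->warned) {
		m->warned = true;
		m->warnings++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Interrupt entry: one level deeper. */
static void model_irq_enter(struct nesting_model *m)
{
	m->irq_nesting++;
}

/* Interrupt exit: complain about an exit with no matching entry. */
static void model_irq_exit(struct nesting_model *m)
{
	if (warn_on_once(m, m->irq_nesting <= 0,
			 "irq exit without matching irq enter"))
		return;
	m->irq_nesting--;
}

/*
 * Idle entry: if every interrupt really returned, the depth is zero.
 * A half-interrupt leaves a stale count, so warn and reset it.
 */
static void model_idle_enter(struct nesting_model *m)
{
	if (warn_on_once(m, m->irq_nesting != 0,
			 "entering idle with nonzero irq nesting"))
		m->irq_nesting = 0;	/* the store that tolerates misnesting */
}
```

In this model, a balanced model_irq_enter()/model_irq_exit() pair
followed by model_idle_enter() stays silent, while an entry with no
matching exit triggers the warning exactly once, no matter how many
times the misnesting recurs.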

Thanx, Paul