Re: [PATCH tip/core/rcu 11/15] rcu: Avoid spurious RCU CPU stall warnings

From: Paul E. McKenney
Date: Thu Sep 06 2012 - 17:59:21 EST


On Thu, Sep 06, 2012 at 05:41:01PM -0400, Steven Rostedt wrote:
> On Thu, 2012-09-06 at 14:03 -0700, Paul E. McKenney wrote:
>
> > Here are a few other ways that stalls can happen:
> >
> > o A CPU looping in an RCU read-side critical section.
>
> For a minute? That's a bug.
>
> >
> > o A CPU looping with interrupts disabled. This condition can
> > result in RCU-sched and RCU-bh stalls.
>
> Also a bug.
>
> >
> > o A CPU looping with preemption disabled. This condition can
> > result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
> > stalls.
>
> Bug as well.
>
> >
> > o A CPU looping with bottom halves disabled. This condition can
> > result in RCU-sched and RCU-bh stalls.
>
> Bug too.
>
> >
> > o For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel
> > without invoking schedule().
>
> Another bug.
>
> >
> > o A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
> > happen to preempt a low-priority task in the middle of an RCU
> > read-side critical section. This is especially damaging if
> > that low-priority task is not permitted to run on any other CPU,
> > in which case the next RCU grace period can never complete, which
> > will eventually cause the system to run out of memory and hang.
> > While the system is in the process of running itself out of
> > memory, you might see stall-warning messages.
>
> Buggy system.
>
> >
> > o A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
> > is running at a higher priority than the RCU softirq threads.
> > This will prevent RCU callbacks from ever being invoked,
> > and in a CONFIG_TREE_PREEMPT_RCU kernel will further prevent
> > RCU grace periods from ever completing. Either way, the
> > system will eventually run out of memory and hang. In the
> > CONFIG_TREE_PREEMPT_RCU case, you might see stall-warning
> > messages.
>
> Not really a bug, but the developers need a spanking.

And RCU does what it can, which is limited to a splat on the console.

> > o A hardware or software issue shuts off the scheduler-clock
> > interrupt on a CPU that is not in dyntick-idle mode. This
> > problem really has happened, and seems to be most likely to
> > result in RCU CPU stall warnings for CONFIG_NO_HZ=n kernels.
>
> Driving the bug.
>
> >
> > o A bug in the RCU implementation.
>
> Bug in the name.
>
> >
> > o A hardware failure. This is quite unlikely, but has occurred
> > at least once in real life. A CPU failed in a running system,
> > becoming unresponsive, but not causing an immediate crash.
> > This resulted in a series of RCU CPU stall warnings, eventually
> > leading to the realization that the CPU had failed.
>
> Hardware bug.
>
> So, where's the "spurious RCU CPU stall warnings"?

I figured that would count as a bug in the RCU implementation. ;-)

> All these cases deserve a warning.

Agreed, and that is the whole purpose of the stall warnings.
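
For readers following along, here is a rough userspace sketch of why every
case in the list above ends in a warning. It is a toy model, not the kernel's
implementation: all of the toy_* names are hypothetical, and the 60-second
timeout is an assumption chosen to match the "For a minute?" remark above.
A grace period cannot end until every CPU has reported a quiescent state, so
a CPU that loops without reporting one eventually trips the timeout.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS       4
#define TOY_STALL_TIMEOUT 60UL	/* seconds; assumed value for this sketch */

/* Hypothetical bookkeeping for one grace period. */
struct toy_gp {
	unsigned long start;		/* time at which the grace period began */
	bool qs_seen[TOY_NR_CPUS];	/* has each CPU reported a quiescent state? */
};

/* Begin a new grace period: no CPU has reported yet. */
static void toy_gp_start(struct toy_gp *gp, unsigned long now)
{
	gp->start = now;
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		gp->qs_seen[cpu] = false;
}

/* Record that a CPU passed through a quiescent state. */
static void toy_note_qs(struct toy_gp *gp, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	gp->qs_seen[cpu] = true;
}

/*
 * Return a bitmask of the CPUs still holding up the grace period once the
 * timeout has elapsed, or 0 if it is too early to complain or if every
 * CPU has already reported.
 */
static unsigned int toy_check_stall(const struct toy_gp *gp, unsigned long now)
{
	unsigned int mask = 0;

	if (now - gp->start < TOY_STALL_TIMEOUT)
		return 0;
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (!gp->qs_seen[cpu])
			mask |= 1u << cpu;
	return mask;
}
```

In this model the check cannot tell a looping CPU from a dead one or from one
whose scheduler-clock interrupt was shut off. It only knows that no quiescent
state arrived in time, so the best it can do is name the CPU and complain.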

Thanx, Paul
