Re: rcu_sched stalls in idle task introduced in pre-4.8?

From: Rich Felker
Date: Mon Sep 12 2016 - 11:14:05 EST


On Sat, Sep 10, 2016 at 03:19:38AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 08, 2016 at 06:16:53PM -0400, Rich Felker wrote:
> > On Wed, Aug 03, 2016 at 09:16:31AM -0700, Paul E. McKenney wrote:
> > > On Tue, Aug 02, 2016 at 01:45:04PM -0700, Paul E. McKenney wrote:
> > > > On Tue, Aug 02, 2016 at 04:32:17PM -0400, Rich Felker wrote:
> > > > > On Tue, Aug 02, 2016 at 12:48:02PM -0700, Paul E. McKenney wrote:
> > >
> > > [ . . . ]
> > >
> > > > > > Does the problem reproduce easily?
> > > > >
> > > > > Yes, it happens right after boot and repeats every 30-90 seconds or
> > > > > so.
> > > >
> > > > Well, that at least makes it easier to test any patches!
> > > >
> > > > > > A bisection might be very helpful.
> > > > >
> > > > > Bisection would require some manual work to set up because the whole
> > > > > reason I was rebasing on Linus's tree was to adapt the drivers to
> > > > > upstream infrastructure changes (the new cpuhp stuff replacing
> > > > > notifier for cpu starting). The unfortunate way it was done, each
> > > > > driver adds an enum to linux/cpuhotplug.h so all the patches have
> > > > > gratuitous conflicts. In addition, for older revisions in Linus's
> > > > > tree, there's at least one show-stopping (hang during boot) bug that
> > > > > needs a cherry-pick to fix. There may be other small issues too. I
> > > > > don't think they're at all insurmountable, but it requires an annoying
> > > > > amount of scripting.
> > > >
> > > > I had to ask! Might eventually be necessary, but let's see what we
> > > > can learn from what you currently have.
> > >
> > > And at first glance, my overnight run looks uglier than I would expect.
> > > I am now running tests at v4.7, and will run other tests to see if
> > > there really is a statistically significant degradation. If there is,
> > > then I might be able to bisect, though with nine-hour runs this could
> > > take quite some time.
> >
> > Any more thoughts on this? I'm testing v4.8-rc5 (plus jcore drivers
> > not yet upstream) and it's still happening.
>
> Not seeing it, but please do send me a recent splat from your dmesg and
> your .config.
>
> Because I am not seeing it, I also suggest inspecting your jcore drivers
> with the information in Documentation/RCU/stallwarn.txt in mind.

I found a plausible cause -- it looks like the kernel timer system is
programming the clock event device with intervals so short that
another timer interrupt is already pending before the first handler
finishes, and the new hard irq immediately preempts the softirq half
of timer processing. Here's a snippet from ftrace event tracing:

<idle>-0 [001] d.h. 2646.702790: irq_handler_entry: irq=72 name=jcore_pit
<idle>-0 [001] d.h. 2646.703398: softirq_raise: vec=1 [action=TIMER]
<idle>-0 [001] d.h. 2646.703607: softirq_raise: vec=9 [action=RCU]
<idle>-0 [001] d.h. 2646.703884: softirq_raise: vec=7 [action=SCHED]
<idle>-0 [001] d.h. 2646.704191: irq_handler_exit: irq=72 ret=handled
<idle>-0 [001] d.H. 2646.704509: irq_handler_entry: irq=72 name=jcore_pit
<idle>-0 [001] d.H. 2646.704990: softirq_raise: vec=1 [action=TIMER]
<idle>-0 [001] d.H. 2646.705182: softirq_raise: vec=9 [action=RCU]
<idle>-0 [001] d.H. 2646.705465: softirq_raise: vec=7 [action=SCHED]
<idle>-0 [001] d.H. 2646.705761: irq_handler_exit: irq=72 ret=handled
<idle>-0 [001] d.H. 2646.706071: irq_handler_entry: irq=72 name=jcore_pit
<idle>-0 [001] d.H. 2646.706328: irq_handler_exit: irq=72 ret=handled
<idle>-0 [001] ..s. 2646.706530: softirq_entry: vec=1 [action=TIMER]
<idle>-0 [001] ..s. 2646.706708: softirq_exit: vec=1 [action=TIMER]
<idle>-0 [001] ..s. 2646.706816: softirq_entry: vec=7 [action=SCHED]
<idle>-0 [001] ..s. 2646.707270: softirq_exit: vec=7 [action=SCHED]
<idle>-0 [001] ..s. 2646.707397: softirq_entry: vec=9 [action=RCU]
<idle>-0 [001] ..s. 2646.707641: softirq_exit: vec=9 [action=RCU]
ksoftirqd/1-14 [001] ..s. 2646.708569: softirq_entry: vec=1 [action=TIMER]
ksoftirqd/1-14 [001] ..s. 2646.708727: softirq_exit: vec=1 [action=TIMER]
ksoftirqd/1-14 [001] ..s. 2646.708829: softirq_entry: vec=7 [action=SCHED]
ksoftirqd/1-14 [001] ..s. 2646.709073: softirq_exit: vec=7 [action=SCHED]
ksoftirqd/1-14 [001] ..s. 2646.709180: softirq_entry: vec=9 [action=RCU]
ksoftirqd/1-14 [001] ..s. 2646.709882: softirq_exit: vec=9 [action=RCU]
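
To make the suspected mechanism concrete, here is a rough sketch of
the hard irq path where this bites (hypothetical names, not the
actual jcore_pit code -- the real handler just invokes the
clockevent's event_handler, which for oneshot mode ends by
reprogramming the timer):

static irqreturn_t jcore_pit_interrupt(int irq, void *dev_id)
{
	struct clock_event_device *ced = dev_id;

	/*
	 * event_handler() raises the TIMER/RCU/SCHED softirqs and, in
	 * oneshot mode, ends by calling ced->set_next_event() with the
	 * next expiry. If that delta is shorter than the time this
	 * handler itself takes (up to ~1.5ms here at 50 MHz), the next
	 * interrupt is already pending when we return, and the softirq
	 * half never gets a chance to run.
	 */
	ced->event_handler(ced);
	return IRQ_HANDLED;
}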

Presumably it usually hits just one or two levels of nesting like
above (the 'H' in the ftrace flags field marks a hard irq that
arrived while softirq processing was in progress), but in the worst
case it gets into some sort of feedback loop and the backlog grows
until the stall lasts >21 sec. I was able to see that there are
plenty of shorter stalls (too short for the stall detector to kick
in) by removing the SD card and observing a multi-second delay
before:

[ 401.375518] mmc0: SPI card removed

The SPI MMC host code that detects removal (there is no card-detect
irq) polls every 1s, so if the message takes more than 1s to appear
(the expected delay is 0.5s on average, 1s worst case), timers must
be stalled.

Other interrupt activity (like serial or ethernet) breaks the stall
immediately, so it's not noticeable in most interactive usage.

I was able to make the stalls go away by increasing the min_delta for
the clock event device to 2.5ms (chosen because I measured timer
interrupts taking up to ~1.5ms -- keep in mind this is at 50 MHz),
but hard-coding a number like that does not seem like a proper
solution. The kernel timer subsystem should have some way to choose
the oneshot expiration appropriately based on feedback, or it should
otherwise ensure that the feedback loop can't happen (maybe by
waiting to program the next expiration from the softirq instead of
the hard irq, but I'm not sure if that's possible...?).
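
For reference, the workaround amounts to registering the clockevent
with a larger min_delta. A minimal sketch, with illustrative names
(PIT_HZ, jcore_pit_register) rather than the actual driver code --
2.5ms at a 50 MHz timer clock is 125000 ticks:

#define PIT_HZ		50000000	/* 50 MHz timer clock */
#define PIT_MIN_DELTA	125000		/* 2.5ms; measured handler worst case ~1.5ms */

static void jcore_pit_register(struct clock_event_device *ced)
{
	/* min/max delta are given in timer ticks at the stated frequency */
	clockevents_config_and_register(ced, PIT_HZ,
					PIT_MIN_DELTA, 0xffffffff);
}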

Rich