Re: rcu_sched stall detected, but no state dump

From: Miroslav Benes
Date: Fri Dec 19 2014 - 06:09:56 EST


On Mon, 15 Dec 2014, Miroslav Benes wrote:

> On Fri, 12 Dec 2014, Paul E. McKenney wrote:
>
> > On Fri, Dec 12, 2014 at 03:06:20PM +0100, Miroslav Benes wrote:
> >

[...]

> > > One last thing. Is there a reason that there is not any similar INFO in
> > > the console for tiny implementation? I enabled RCU_TRACE and still got
> > > nothing. I have only shallow knowledge of RCU, though.
> >
> > It is supposed to work given CONFIG_RCU_TRACE=y, and it looks like
> > it should, at least as long as scheduling-clock interrupts are being
> > delivered. One thing to try for diagnosis is to have every (say) 1000th
> > call to check_cpu_stall() (in kernel/rcu/tiny_plugin.h) do a printk()
> > and same for rcu_check_callbacks() (in kernel/rcu/tiny.c).
>
> Unfortunately nothing is there. I'll look into it when I have more time
> later this week and let you know.

So I looked into it. Scheduling-clock interrupts are delivered and
check_cpu_stall() in tiny_plugin.h is called twice each tick, once for
rcu_bh and once for rcu_sched. What happens is the following (part of the
printk output in check_cpu_stall)...

...
[ 276.049053] name: rcu_bh, jiffies: 4294943345, jiffies_stall: 4294964342, rcp->curtail: (null)
[ 276.050046] name: rcu_sched, jiffies: 4294943345, jiffies_stall: 4294943357, rcp->curtail: (null)
[ 276.051312] name: rcu_bh, jiffies: 4294943347, jiffies_stall: 4294964345, rcp->curtail: (null)
[ 276.052305] name: rcu_sched, jiffies: 4294943347, jiffies_stall: 4294943357, rcp->curtail: (null)
[ 276.054634] name: rcu_bh, jiffies: 4294943356, jiffies_stall: 4294964347, rcp->curtail: (null)
[ 276.054634] name: rcu_sched, jiffies: 4294943356, jiffies_stall: 4294943357, rcp->curtail: (null)
[ 276.062989] name: rcu_bh, jiffies: 4294943358, jiffies_stall: 4294964356, rcp->curtail: (null)
[ 276.063982] name: rcu_sched, jiffies: 4294943358, jiffies_stall: 4294943357, rcp->curtail: (null)
[ 276.065285] name: rcu_bh, jiffies: 4294943361, jiffies_stall: 4294964358, rcp->curtail: (null)
[ 276.066278] name: rcu_sched, jiffies: 4294943361, jiffies_stall: 4294964358, rcp->curtail: (null)
[ 276.068566] name: rcu_bh, jiffies: 4294943370, jiffies_stall: 4294964361, rcp->curtail: (null)
[ 276.068566] name: rcu_sched, jiffies: 4294943370, jiffies_stall: 4294964358, rcp->curtail: (null)
[ 276.076867] name: rcu_bh, jiffies: 4294943372, jiffies_stall: 4294964370, rcp->curtail: (null)
[ 276.077861] name: rcu_sched, jiffies: 4294943372, jiffies_stall: 4294964358, rcp->curtail: (null)
[ 276.079199] name: rcu_bh, jiffies: 4294943375, jiffies_stall: 4294964372, rcp->curtail: (null)
[ 276.080193] name: rcu_sched, jiffies: 4294943375, jiffies_stall: 4294964358, rcp->curtail: (null)
[ 276.082565] name: rcu_bh, jiffies: 4294943384, jiffies_stall: 4294964375, rcp->curtail: (null)
[ 276.082565] name: rcu_sched, jiffies: 4294943384, jiffies_stall: 4294964358, rcp->curtail: (null)
[ 276.090984] name: rcu_bh, jiffies: 4294943386, jiffies_stall: 4294964384, rcp->curtail: (null)
[ 276.091976] name: rcu_sched, jiffies: 4294943386, jiffies_stall: 4294964358, rcp->curtail: (null)
[ 276.093296] name: rcu_bh, jiffies: 4294943389, jiffies_stall: 4294964386, rcp->curtail: (null)
[ 276.094289] name: rcu_sched, jiffies: 4294943389, jiffies_stall: 4294964358, rcp->curtail: (null)
...

In case of rcu_bh jiffies_stall is regularly updated and no stall is thus
detected (if I understand the code correctly). In case of rcu_sched the
situation is different. jiffies_stall is updated every 21 seconds (my
RCU_CPU_STALL_TIMEOUT). So I would say the stall should be detected there.
Unfortunately it is bound to *rcp->curtail being not NULL, and it is in my
case (in printk output there is rcp->curtail, but it is *rcp->curtail in
fact). The patch below did the trick for me and cpu stall was detected and
info printed.

...
[ 138.534004] INFO: rcu_sched stall on CPU (41980 ticks this GP) idle=140000000000001 (t=42000 jiffies q=64)
[ 138.534004] CPU: 0 PID: 4140 Comm: test_thread Tainted: G O 3.18.0-next-20141218-rcu_tiny_patched+ #1
[ 138.534004] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20141015_183547-build31 04/01/2014
[ 138.534004] 0000000000000000 ffffffff81e2ce28 ffffffff81881b77 ffffffff81e2ce48
[ 138.534004] ffffffff810925c1 ffffffff81e2ce68 0000000000000000 ffffffff81e2ce68
[ 138.534004] ffffffff810932c2 0000000000000000 ffff8801395bc650 ffffffff81e2ce88
[ 138.534004] Call Trace:
[ 138.534004] <IRQ> [<ffffffff81881b77>] dump_stack+0x19/0x1b
[ 138.534004] [<ffffffff810925c1>] check_cpu_stall.part.6+0x61/0xb0
[ 138.534004] [<ffffffff810932c2>] rcu_check_callbacks+0x82/0x90
[ 138.534004] [<ffffffff81097258>] update_process_times+0x38/0x60
[ 138.534004] [<ffffffff810a6635>] tick_sched_handle.isra.22+0x35/0x50
[ 138.534004] [<ffffffff810a6ab3>] tick_sched_timer+0x43/0x70
[ 138.534004] [<ffffffff81097e1a>] __run_hrtimer+0xaa/0x240
[ 138.534004] [<ffffffff810981b6>] ? hrtimer_interrupt+0x66/0x250
[ 138.534004] [<ffffffff810a6a70>] ? tick_sched_do_timer+0x30/0x30
[ 138.534004] [<ffffffff8109822f>] hrtimer_interrupt+0xdf/0x250
[ 138.534004] [<ffffffffa0000000>] ? 0xffffffffa0000000
[ 138.534004] [<ffffffff8102d986>] local_apic_timer_interrupt+0x26/0x50
[ 138.534004] [<ffffffff8189026f>] smp_apic_timer_interrupt+0x3f/0x60
[ 138.534004] [<ffffffff8188e91d>] apic_timer_interrupt+0x6d/0x80
[ 138.534004] <EOI> [<ffffffff8188e75c>] ? retint_restore_args+0xe/0xe
[ 138.534004] [<ffffffff81062c9d>] ? kthread_should_stop+0x1d/0x30
[ 138.534004] [<ffffffffa0000015>] ? test_thread_fn+0x15/0x20 [kthread_mod]
[ 138.534004] [<ffffffff81063139>] kthread+0xf9/0x110
[ 138.534004] [<ffffffff81063040>] ? kthread_create_on_node+0x230/0x230
[ 138.534004] [<ffffffff8188da2a>] ret_from_fork+0x7a/0xb0
[ 138.534004] [<ffffffff81063040>] ? kthread_create_on_node+0x230/0x230
...

But I'm not sure whether it is possible that the list of pending callback
is empty. Maybe the bug is elsewhere. My understanding of RCU is limited
but I'll try to look around. Also the commit log might be imprecise :/

Best regards,
Miroslav

-- >8 --