Re: RCU stall/SOFT-Lockup on 4.11.3/4.13.11 after multiple days uptime

From: Paul E. McKenney
Date: Sat Nov 11 2017 - 20:21:39 EST


On Sat, Nov 11, 2017 at 08:38:32PM +0100, Bruno Prémont wrote:
> Hi,
>
> On a single-CPU KVM-based virtual machine I'm suffering from RCU stall
> and soft-lockup. 4.10.x kernels run fine (4.10.12) but starting with
> 4.11.x (4.11.3, 4.13.11) I'm getting system freezes for no apparent
> reason.
>
> All info I have is following console dump (from 4.13.11):
> [526415.290012] INFO: rcu_sched self-detected stall on CPU
> [526415.290012] o0-...: (745847 ticks this GP) idle=ba2/2/0 softirq=37393463/37393463 fqs=0
> [526415.290012] o (t=745854 jiffies g=23779976 c=23779975 q=32)
> [526415.290012] rcu_sched kthread starved for 745854 jiffies! g23779976 c23779975 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0

The above line says that the rcu_sched kthread asked to sleep for three
jiffies, but ended up sleeping for more than 745,854 jiffies.

If your system does not let the RCU's kernel threads run, RCU cannot
help you much.

The ->state of 0x0 indicates that the kthread is in fact runnable, but
did not get a chance to run. Was the system heavily loaded to the
point where you would expect a kthread to remain preempted for many
minutes?

I am guessing that the answer is no, given that CPU 0 is actually idle
(idle=ba2/2/0). Seems unlikely, but I have to ask: Did you bind the
kthread to a specific CPU?

Thanx, Paul

> [526440.020015] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> [526468.020005] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> [526478.320009] INFO: rcu_sched self-detected stall on CPU
> [526478.320009] o0-...: (752143 ticks this GP) idle=ba2/2/0 softirq=37393463/37393463 fqs=0
> [526478.320009] o (t=752157 jiffies g=23779976 c=23779975 q=32)
> [526478.320009] rcu_sched kthread starved for 752157 jiffies! g23779976 c23779975 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0
> [526504.020016] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> [526532.020007] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
> ...
>
> Attached is kernel config (4.13.11).
>
>
> The output obtained with 4.11.3 was:
> [ 280.680010] INFO: rcu_sched self-detected stall on CPU
> [ 280.680021] o0-...: (27312 ticks this GP) dile=b11/2/0 softirq=6119/6119 fqs=0
> [ 280.680021] o (t=27312 jiffies g=441 c=440 q=0)
> [ 280.680021] rcu_sched_kthread starved for 27312 jiffies! g441 c440 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0
> ...
>
>
> As it's a remote VM for which I don't have access to the host I have little
> options for further digging (can't trigger sysrq's).
>
>
> Same kernel (4.13.11) seems to be running just fine on another KVM-base VM that
> has two CPUs.
>
>
> Does it ring a bell or is there some info that might be of any use,
> assuming I can obtain it?
>
> Bruno