Watchdog Reset on Idle CPU with a task on its runq

From: Vijay Balakrishna
Date: Tue May 07 2024 - 17:41:02 EST


Hello,

We are seeing a watchdog reset on an ARM64 SoC running the v5.10.178 (stable) kernel, where CPU 0 is running the idle task even though there is a runnable task on its CFS runq (rcu_sched in the output below). We are wondering why a task would be left waiting to be scheduled on a CPU that is otherwise running the idle task. What does this indicate about the state of CPU 0? What else could we check in the kernel crash dump? Any pointers would be appreciated.

Thanks,
Vijay

(kernel panic log followed by crash tool output)

[530671.963762] Kernel panic - not syncing: SBSA Generic Watchdog timeout
[530671.970288] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: G O 5.10.178.13-microsoft-standard #1
[530671.980969] Hardware name: Overlake (DT)
[530671.984967] Call trace:
[530671.987499] dump_backtrace+0x0/0x1f0
[530671.991238] show_stack+0x1c/0x24
[530671.994630] dump_stack+0xe0/0x13c
[530671.998107] panic+0x198/0x3a4
[530672.001239] sbsa_gwdt_set_timeout+0x0/0x7c
[530672.005498] __handle_irq_event_percpu+0xf0/0x2ac
[530672.010277] handle_irq_event+0x60/0x144
[530672.014275] handle_fasteoi_irq+0x144/0x234
[530672.018533] __handle_domain_irq+0x8c/0xcc
[530672.022704] gic_handle_irq+0xc0/0x120
[530672.026527] el1_irq+0xcc/0x180
[530672.029744] cpuidle_enter_state+0x1fc/0x31c
[530672.034088] cpuidle_enter+0x3c/0x50
[530672.037740] do_idle+0x1e4/0x28c
[530672.041042] cpu_startup_entry+0x28/0x2c
[530672.045042] rest_init+0xc4/0xd0
[530672.048346] arch_call_rest_init+0x14/0x1c
[530672.052517] start_kernel+0x328/0x3a4
[530672.056267] SMP: stopping secondary CPUs
[530672.060450] Starting crashdump kernel...
[530672.064447] Bye!
crash> runq -c 0
CPU 0 RUNQUEUE: ffff07cf49233200
CURRENT: PID: 0 TASK: ffffde8e444e8900 COMMAND: "swapper/0"
RT PRIO_ARRAY: ffff07cf49233440
[no tasks queued]
CFS RB_ROOT: ffff07cf492332b0
[120] PID: 11 TASK: ffff07ad40c10000 COMMAND: "rcu_sched"
crash> bt ffffde8e444e8900
PID: 0 TASK: ffffde8e444e8900 CPU: 0 COMMAND: "swapper/0"
#0 [ffff800010003db0] __crash_kexec at ffffde8e4370b424
#1 [ffff800010003e60] panic at ffffde8e4363b64c
#2 [ffff800010003eb0] sbsa_gwdt_interrupt at ffffde8e43d92aa8
#3 [ffff800010003ed0] __handle_irq_event_percpu at ffffde8e436b9720
#4 [ffff800010003f40] handle_irq_event at ffffde8e436b99c4
#5 [ffff800010003f70] handle_fasteoi_irq at ffffde8e436bff0c
#6 [ffff800010003fa0] __handle_domain_irq at ffffde8e436b831c
#7 [ffff800010003fe0] gic_handle_irq at ffffde8e43600974
--- <IRQ stack> ---
#8 [ffffde8e444d3e50] el1_irq at ffffde8e43602288
#9 [ffffde8e444d3e70] cpuidle_enter_state at ffffde8e43dd6190
#10 [ffffde8e444d3ed0] cpuidle_enter at ffffde8e43dd6314
#11 [ffffde8e444d3f10] do_idle at ffffde8e4368307c
#12 [ffffde8e444d3f70] cpu_startup_entry at ffffde8e4368314c
#13 [ffffde8e444d3f90] rest_init at ffffde8e4408d79c
#14 [ffffde8e444d3fb0] arch_call_rest_init at ffffde8e443b0730
#15 [ffffde8e444d3fe0] start_kernel at ffffde8e443b0a60
crash>
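
To check whether other CPUs in the dump show the same condition, the output of `runq` could be saved to a file and scanned with a small script. The following is only a minimal sketch, assuming the `runq` line layout shown above (a `CPU N RUNQUEUE:` header, a `CURRENT:` line, and queued tasks printed as `[prio] PID: ... COMMAND: "..."`). The function name `idle_cpus_with_queued_tasks` is hypothetical and not part of the crash tool.

```python
import re

# Queued-task lines look like: [120] PID: 11 TASK: ffff... COMMAND: "rcu_sched"
# (assumed layout, taken from the runq output in this report).
QUEUED_TASK = re.compile(
    r'\[\s*\d+\]\s+PID:\s+\d+\s+TASK:\s+\S+\s+COMMAND:\s+"([^"]+)"'
)


def idle_cpus_with_queued_tasks(runq_text):
    """Return {cpu: [queued commands]} for CPUs whose CURRENT task is the
    idle task (swapper/N) while at least one task is listed on a queue."""
    results = {}
    cpu = None
    current_is_idle = False
    for raw in runq_text.splitlines():
        line = raw.strip()
        header = re.match(r"CPU (\d+) RUNQUEUE:", line)
        if header:
            # New per-CPU section: reset the state.
            cpu = int(header.group(1))
            current_is_idle = False
            continue
        if line.startswith("CURRENT:"):
            current_is_idle = 'COMMAND: "swapper/' in line
            continue
        task = QUEUED_TASK.match(line)
        if task and cpu is not None and current_is_idle:
            results.setdefault(cpu, []).append(task.group(1))
    return results


# The CPU 0 section from this report, used as sample input.
SAMPLE = """\
CPU 0 RUNQUEUE: ffff07cf49233200
CURRENT: PID: 0 TASK: ffffde8e444e8900 COMMAND: "swapper/0"
RT PRIO_ARRAY: ffff07cf49233440
[no tasks queued]
CFS RB_ROOT: ffff07cf492332b0
[120] PID: 11 TASK: ffff07ad40c10000 COMMAND: "rcu_sched"
"""

if __name__ == "__main__":
    print(idle_cpus_with_queued_tasks(SAMPLE))  # {0: ['rcu_sched']}
```

A CPU reported by this script is only a candidate for closer inspection. A task can legitimately sit on the runq for a short time while the idle task is still current, so the result says nothing by itself about how long the task had been waiting.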