Re: [syzbot] INFO: rcu detected stall in ext4_file_write_iter (4)

From: Waiman Long
Date: Mon Dec 27 2021 - 22:24:25 EST



On 12/24/21 19:52, Hillf Danton wrote:
On Wed, 22 Dec 2021 09:33:26 -0800
syzbot found the following issue on:

HEAD commit: 9eaa88c7036e Merge tag 'libata-5.16-rc6' of git://git.kern..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=142d4143b00000
kernel config: https://syzkaller.appspot.com/x/.config?x=e3bdfd29b408d1b6
dashboard link: https://syzkaller.appspot.com/bug?extid=03464269af631f4a4bdf
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=11b1537db00000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=125de549b00000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+03464269af631f4a4bdf@xxxxxxxxxxxxxxxxxxxxxxxxx

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
(detected by 0, t=10502 jiffies, g=5509, q=170)
rcu: All QSes seen, last rcu_preempt kthread activity 8642 (4295000735-4294992093), jiffies_till_next_fqs=1, root ->qsmask 0x0
rcu: rcu_preempt kthread starved for 8643 jiffies! g5509 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt state:R running task stack:28208 pid: 14 ppid: 2 flags:0x00004000
Call Trace:
<TASK>
context_switch kernel/sched/core.c:4972 [inline]
context_switch kernel/sched/core.c:4972 [inline] kernel/sched/core.c:6253
__schedule+0xa9a/0x4940 kernel/sched/core.c:6253 kernel/sched/core.c:6253
schedule+0xd2/0x260 kernel/sched/core.c:6326 kernel/sched/core.c:6326
schedule_timeout+0x14a/0x2a0 kernel/time/timer.c:1881 kernel/time/timer.c:1881
rcu_gp_fqs_loop+0x186/0x810 kernel/rcu/tree.c:1955 kernel/rcu/tree.c:1955
rcu_gp_kthread+0x1de/0x320 kernel/rcu/tree.c:2128 kernel/rcu/tree.c:2128
kthread+0x405/0x4f0 kernel/kthread.c:327 kernel/kthread.c:327
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 arch/x86/entry/entry_64.S:295
</TASK>
rcu: Stack dump where RCU GP kthread last ran:
NMI backtrace for cpu 0
CPU: 0 PID: 3672 Comm: syz-executor549 Not tainted 5.16.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:88 [inline]
__dump_stack lib/dump_stack.c:88 [inline] lib/dump_stack.c:106
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 lib/dump_stack.c:106
nmi_cpu_backtrace.cold+0x47/0x144 lib/nmi_backtrace.c:111 lib/nmi_backtrace.c:111
nmi_trigger_cpumask_backtrace+0x1b3/0x230 lib/nmi_backtrace.c:62 lib/nmi_backtrace.c:62
trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline]
trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline] kernel/rcu/tree_stall.h:481
rcu_check_gp_kthread_starvation.cold+0x1fb/0x200 kernel/rcu/tree_stall.h:481 kernel/rcu/tree_stall.h:481
print_other_cpu_stall kernel/rcu/tree_stall.h:586 [inline]
check_cpu_stall kernel/rcu/tree_stall.h:729 [inline]
rcu_pending kernel/rcu/tree.c:3878 [inline]
print_other_cpu_stall kernel/rcu/tree_stall.h:586 [inline] kernel/rcu/tree.c:2597
check_cpu_stall kernel/rcu/tree_stall.h:729 [inline] kernel/rcu/tree.c:2597
rcu_pending kernel/rcu/tree.c:3878 [inline] kernel/rcu/tree.c:2597
rcu_sched_clock_irq+0x2125/0x2200 kernel/rcu/tree.c:2597 kernel/rcu/tree.c:2597
update_process_times+0x16d/0x200 kernel/time/timer.c:1785 kernel/time/timer.c:1785
tick_sched_handle+0x9b/0x180 kernel/time/tick-sched.c:226 kernel/time/tick-sched.c:226
tick_sched_timer+0x1b0/0x2d0 kernel/time/tick-sched.c:1428 kernel/time/tick-sched.c:1428
__run_hrtimer kernel/time/hrtimer.c:1685 [inline]
__run_hrtimer kernel/time/hrtimer.c:1685 [inline] kernel/time/hrtimer.c:1749
__hrtimer_run_queues+0x1c0/0xe50 kernel/time/hrtimer.c:1749 kernel/time/hrtimer.c:1749
hrtimer_interrupt+0x31c/0x790 kernel/time/hrtimer.c:1811 kernel/time/hrtimer.c:1811
local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline]
local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline] arch/x86/kernel/apic/apic.c:1103
__sysvec_apic_timer_interrupt+0x146/0x530 arch/x86/kernel/apic/apic.c:1103 arch/x86/kernel/apic/apic.c:1103
sysvec_apic_timer_interrupt+0x8e/0xc0 arch/x86/kernel/apic/apic.c:1097 arch/x86/kernel/apic/apic.c:1097
</IRQ>
<TASK>
asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638 arch/x86/include/asm/idtentry.h:638
RIP: 0010:owner_on_cpu kernel/locking/rwsem.c:667 [inline]
The lock acquirer was spinning long enough to make RCU stall, and the
question is, if this is not reproducible, how was the report triggered?
Anything that syzbot is special in this case?

Thanks
Hillf

The test was running on a CONFIG_PREEMPT kernel. So if the syzkaller kthread is running at a higher priority than the rcu_preempt kthread, it is possible for the rcu_preempt kthread to be starved of cpu time. The rwsem optimistic spinning code will relinquish the cpu if there is a higher priority thread running. Since rcu_preempt kthread did not seem to be able to get the cpu, I suspect that it is probably caused by the syzkaller thread having a higher priority.

Cheers,
Longman