Re: [PATCH] sched: fix rq lock recursion issue

From: Satya Durga Srinivasu Prabhala
Date: Tue Jul 05 2022 - 23:49:13 EST



On 7/1/22 1:33 AM, Peter Zijlstra wrote:
On Thu, Jun 30, 2022 at 10:53:10PM +0100, Qais Yousef wrote:
Hi Satya

On 06/24/22 00:42, Satya Durga Srinivasu Prabhala wrote:
The recursion below is observed in a rare scenario where __schedule()
takes the rq lock and, at around the same time, the task's affinity is
being changed. The BPF function tracing sched_switch calls
migrate_enable(), which checks for an affinity change
(cpus_ptr != cpus_mask) and lands in __set_cpus_allowed_ptr(), which
tries to acquire the rq lock again, causing the recursion bug.

Fix the issue by switching to preempt_disable()/preempt_enable() for
non-RT kernels.
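
For reference, the shape of that cycle can be modelled in a few lines of
user-space C. All fake_* names below are hypothetical stand-ins; an
error-checking pthread mutex plays the role of rq->lock, and its EDEADLK
return plays the role of the CONFIG_DEBUG_SPINLOCK splat:

/*
 * Toy model of the reported recursion (hypothetical names only).
 * Build: gcc -pthread rq_recursion.c -o rq_recursion && ./rq_recursion
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t rq_lock;          /* stands in for rq->lock      */
static bool affinity_changed = true;     /* i.e. cpus_ptr != &cpus_mask */

static void fake_set_cpus_allowed_ptr(void)
{
	/* task_rq_lock(): second acquisition in the same context */
	int err = pthread_mutex_lock(&rq_lock);
	if (err) {
		printf("spin_bug stand-in: %s\n", strerror(err));
		return;
	}
	pthread_mutex_unlock(&rq_lock);
}

static void fake_migrate_enable(void)
{
	if (affinity_changed)                /* the cpus_ptr != &cpus_mask check */
		fake_set_cpus_allowed_ptr();
}

static void fake_trace_sched_switch(void)
{
	/* trace_call_bpf() brackets the BPF prog with migrate_disable/enable() */
	fake_migrate_enable();
}

static void fake_schedule(void)
{
	pthread_mutex_lock(&rq_lock);        /* __schedule() takes rq->lock */
	fake_trace_sched_switch();           /* tracepoint fires under it   */
	pthread_mutex_unlock(&rq_lock);
}

int main(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&rq_lock, &attr);

	fake_schedule();                     /* reports the would-be recursion */
	return 0;
}

Running it prints the stand-in for the spin_bug() report rather than
hanging; the self-acquisition is the same one the backtrace further down
shows.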
Interesting bug. Thanks for the report. Unfortunately I can't see this being
a fix, as it just limits the bug's visibility to PREEMPT_RT kernels but won't
fix anything, no? I.e., kernels compiled with PREEMPT_RT will still hit this
failure.
Worse, there's !RT stuff that grew to rely on the preemptible migrate
disable stuff, so this actively breaks things.
Sorry about that. I'm cross-checking further for ways to reproduce the issue easily.
I'm curious how the race with set affinity is happening. I would have thought
user space would get blocked as __schedule() will hold the rq lock.

Do you have more details on that?
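
As a toy illustration of that expected serialization (hypothetical fake_*
names again; a plain pthread mutex stands in for rq->lock), the
affinity-change side simply waits until the lock is dropped rather than
racing inside __schedule():

/*
 * Toy model of sched_setaffinity() serializing on the rq lock.
 * Build: gcc -pthread serialize.c -o serialize && ./serialize
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;

static void *fake_sched_setaffinity(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&rq_lock);   /* task_rq_lock(): blocks while held */
	puts("setaffinity: got rq lock after __schedule() released it");
	pthread_mutex_unlock(&rq_lock);
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_mutex_lock(&rq_lock);   /* __schedule() holding rq->lock      */
	pthread_create(&tid, NULL, fake_sched_setaffinity, NULL);
	sleep(1);                       /* tracepoint etc. run under the lock */
	puts("__schedule: done, releasing rq lock");
	pthread_mutex_unlock(&rq_lock);

	pthread_join(tid, NULL);
	return 0;
}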
Yeah, I'm not seeing how this works either. In order for
migrate_enable() to actually call __set_cpus_allowed_ptr(), it needs to
have done migrate_disable() *before* schedule(); schedule() will then have
to call migrate_disable_switch(), and *then* migrate_enable() does this.

However, if things are nicely balanced (as they should be), then
trace_call_bpf() using migrate_disable()/migrate_enable() should never
hit this path.

If, OTOH, migrate_disable() was called prior to schedule() and we did do
migrate_disable_switch(), then it should be impossible for the
tracepoint/bpf stuff to reach p->migration_disabled == 0.
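
A small toy model of that balance argument (hypothetical fake_* names; the
real logic lives in migrate_disable(), migrate_enable() and
migrate_disable_switch() in kernel/sched/core.c and is heavily simplified
here): the nested, tracepoint-side enable never reaches the affinity
check, and an unnested enable only sees cpus_ptr != &cpus_mask after the
switch-time pinning.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct task {
	int  migration_disabled;
	int  cpus_mask;          /* stands in for the full cpumask       */
	int  pinned_cpu;         /* cpumask_of(rq->cpu) while pinned     */
	int *cpus_ptr;           /* &cpus_mask, or &pinned_cpu if pinned */
};

static void fake_migrate_disable(struct task *p)
{
	p->migration_disabled++;
}

static void fake_migrate_disable_switch(struct task *p, int this_cpu)
{
	/* only pins a task that entered schedule() migration-disabled */
	if (!p->migration_disabled || p->cpus_ptr != &p->cpus_mask)
		return;
	p->pinned_cpu = this_cpu;
	p->cpus_ptr   = &p->pinned_cpu;
}

static bool fake_migrate_enable(struct task *p)
{
	if (--p->migration_disabled)         /* nested: nothing more to do */
		return false;
	/* returns true iff we'd call __set_cpus_allowed_ptr() here */
	return p->cpus_ptr != &p->cpus_mask;
}

int main(void)
{
	struct task t = { .cpus_mask = 0xff };
	t.cpus_ptr = &t.cpus_mask;

	/* A: task was not migration-disabled when it hit __schedule() */
	fake_migrate_disable_switch(&t, 3);      /* no-op: nothing to pin    */
	fake_migrate_disable(&t);                /* trace_call_bpf() bracket */
	assert(!fake_migrate_enable(&t));        /* can't reach rq-lock path */

	/* B: task called migrate_disable() before schedule() */
	fake_migrate_disable(&t);                /* in the task itself       */
	fake_migrate_disable_switch(&t, 3);      /* __schedule() pins to CPU */
	fake_migrate_disable(&t);                /* trace_call_bpf() bracket */
	assert(!fake_migrate_enable(&t));        /* nested: still no rq lock */
	assert(fake_migrate_enable(&t));         /* only the outer enable is */

	printf("only the outer, pre-schedule() pairing reaches the rq-lock path\n");
	return 0;
}

Both asserts on the bracketed pairs pass, which is just the point above:
a balanced migrate_disable()/migrate_enable() around the BPF program
cannot reach the rq-lock path on its own.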

Thanks for the explanation. Will cross-check further on these points.
-010 |spin_bug(lock = ???, msg = ???)
-011 |debug_spin_lock_before(inline)
-011 |do_raw_spin_lock(lock = 0xFFFFFF89323BB600)
-012 |_raw_spin_lock(inline)
-012 |raw_spin_rq_lock_nested(inline)
-012 |raw_spin_rq_lock(inline)
-012 |task_rq_lock(p = 0xFFFFFF88CFF1DA00, rf = 0xFFFFFFC03707BBE8)
-013 |__set_cpus_allowed_ptr(inline)
-013 |migrate_enable()
-014 |trace_call_bpf(call = ?, ctx = 0xFFFFFFFDEF954600)
-015 |perf_trace_run_bpf_submit(inline)
-015 |perf_trace_sched_switch(__data = 0xFFFFFFE82CF0BCB8, preempt = FALSE, prev = ?, next = ?)
-016 |__traceiter_sched_switch(inline)
-016 |trace_sched_switch(inline)
-016 |__schedule(sched_mode = ?)
-017 |schedule()
-018 |arch_local_save_flags(inline)
-018 |arch_irqs_disabled(inline)
-018 |__raw_spin_lock_irq(inline)
-018 |_raw_spin_lock_irq(inline)
-018 |worker_thread(__worker = 0xFFFFFF88CE251300)
-019 |kthread(_create = 0xFFFFFF88730A5A80)
-020 |ret_from_fork(asm)
This doesn't clarify much. Please explain how things got to be
unbalanced; don't ever just make a problem disappear like this without
understanding what the root cause is, that'll just get your reputation
sullied.
Agreed, thanks for the comments and suggestions. Yes, I'm trying to cross-check
further and find ways to reproduce the issue. Will get back once I find
a better way to handle it. I should have just tried to get
comments/feedback on the issue first instead of proposing a fix. Lesson learnt :)