Re: [PATCH v5] kernel/fork: beware of __put_task_struct calling context

From: Wander Lairson Costa
Date: Mon Feb 13 2023 - 07:15:58 EST


On Fri, Feb 10, 2023 at 06:19:54PM +0100, Sebastian Andrzej Siewior wrote:
> On 2023-02-10 13:13:21 [-0300], Wander Lairson Costa wrote:
> > Under PREEMPT_RT, __put_task_struct() indirectly acquires sleeping
> > locks. Therefore, it can't be called from a non-preemptible context.
> >
> > One practical example is a splat inside inactive_task_timer(), which is
> > called in an interrupt context:
> >
> > CPU: 1 PID: 2848 Comm: life Kdump: loaded Tainted: G W ---------
> > Hardware name: HP ProLiant DL388p Gen8, BIOS P70 07/15/2012
> > Call Trace:
> > dump_stack_lvl+0x57/0x7d
> > mark_lock_irq.cold+0x33/0xba
> > ? stack_trace_save+0x4b/0x70
> > ? save_trace+0x55/0x150
> > mark_lock+0x1e7/0x400
> > mark_usage+0x11d/0x140
> > __lock_acquire+0x30d/0x930
> > lock_acquire.part.0+0x9c/0x210
> > ? refill_obj_stock+0x3d/0x3a0
> > ? rcu_read_lock_sched_held+0x3f/0x70
> > ? trace_lock_acquire+0x38/0x140
> > ? lock_acquire+0x30/0x80
> > ? refill_obj_stock+0x3d/0x3a0
> > rt_spin_lock+0x27/0xe0
> > ? refill_obj_stock+0x3d/0x3a0
> > refill_obj_stock+0x3d/0x3a0
> > ? inactive_task_timer+0x1ad/0x340
> > kmem_cache_free+0x357/0x560
> > inactive_task_timer+0x1ad/0x340
> > ? switched_from_dl+0x2d0/0x2d0
> > __run_hrtimer+0x8a/0x1a0
> > __hrtimer_run_queues+0x91/0x130
> > hrtimer_interrupt+0x10f/0x220
> > __sysvec_apic_timer_interrupt+0x7b/0xd0
> > sysvec_apic_timer_interrupt+0x4f/0xd0
> > ? asm_sysvec_apic_timer_interrupt+0xa/0x20
> > asm_sysvec_apic_timer_interrupt+0x12/0x20
> > RIP: 0033:0x7fff196bf6f5
>
> Now that I looked around: There are other put_task_struct() calls made
> while the rq lock is held. I didn't look outside of dl.c.
>
> > Instead of calling __put_task_struct() directly, we defer it using
> > call_rcu(). A more natural approach would use a workqueue, but since
> > in PREEMPT_RT, we can't allocate dynamic memory from atomic context,
> > the code would become more complex because we would need to put the
> > work_struct instance in the task_struct and initialize it when we
> > allocate a new task_struct.
> >
> > Changelog
> > =========
> >
> > v1:
> > * Initial implementation fixing the splat.
> >
> > v2:
> > * Isolate the logic in its own function.
> > * Fix two more cases caught in review.
> >
> > v3:
> > * Change __put_task_struct() to handle the issue internally.
> >
> > v4:
> > * Explain why call_rcu() is safe to call from interrupt context.
> >
> > v5:
> > * Explain why __put_task_struct() doesn't conflict with
> > put_task_struct_rcu_user().
> >
> > Signed-off-by: Wander Lairson Costa <wander@xxxxxxxxxx>
> > Reported-by: Hu Chunyu <chuhu@xxxxxxxxxx>
> > Suggested-by: Oleg Nesterov <oleg@xxxxxxxxxx>
> > Suggested-by: Valentin Schneider <vschneid@xxxxxxxxxx>
> > Cc: Paul McKenney <paulmck@xxxxxxxxxx>
> > ---
> > kernel/fork.c | 33 ++++++++++++++++++++++++++++++++-
> > 1 file changed, 32 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 9f7fe3541897..9bf30c725ed8 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -840,7 +840,7 @@ static inline void put_signal_struct(struct signal_struct *sig)
> > free_signal_struct(sig);
> > }
> >
> > -void __put_task_struct(struct task_struct *tsk)
> > +static void ___put_task_struct(struct task_struct *tsk)
> > {
> > WARN_ON(!tsk->exit_state);
> > WARN_ON(refcount_read(&tsk->usage));
> > @@ -857,6 +857,37 @@ void __put_task_struct(struct task_struct *tsk)
> > sched_core_free(tsk);
> > free_task(tsk);
> > }
> > +
> > +static void __put_task_struct_rcu(struct rcu_head *rhp)
> > +{
> > + struct task_struct *task = container_of(rhp, struct task_struct, rcu);
> > +
> > + ___put_task_struct(task);
> > +}
> > +
> > +void __put_task_struct(struct task_struct *tsk)
> > +{
> > + if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))
>
> No. If you do this on non-RT kernel with CONFIG_PROVE_RAW_LOCK_NESTING
> then it will complain. And why do we have in_task() here?
>

Initially I thought you were saying it would cause a build failure, but
I built the kernel successfully with CONFIG_PROVE_RAW_LOCK_NESTING.
On a non-RT kernel, I understand the optimizer will eliminate the `if`
clause, since IS_ENABLED(CONFIG_PREEMPT_RT) is a compile-time constant.
Would you mind further explaining the conflict with
CONFIG_PROVE_RAW_LOCK_NESTING?

The `!in_task()` call is to test if we are in interrupt context.
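
To make the intent of that condition concrete, here is a rough userspace
model of it. This is only a sketch, not kernel code: `struct ctx_model` and
the `model_*()` helpers are hypothetical names invented for illustration, and
the model keeps "preemption disabled" and "in interrupt" as separate flags as
a simplification.

```c
#include <stdbool.h>

/* Hypothetical model of the calling context; not a kernel structure. */
struct ctx_model {
	bool preempt_rt;	/* stands in for IS_ENABLED(CONFIG_PREEMPT_RT) */
	int preempt_count;	/* stands in for the preempt-disable depth */
	bool irqs_disabled;	/* stands in for irqs_disabled() */
	bool in_interrupt;	/* true when not running in task context */
};

/* Models preemptible(): no preempt-disable section and IRQs enabled. */
static bool model_preemptible(const struct ctx_model *c)
{
	return c->preempt_count == 0 && !c->irqs_disabled;
}

/* Models in_task(): true when we are not in interrupt context. */
static bool model_in_task(const struct ctx_model *c)
{
	return !c->in_interrupt;
}

/*
 * Models the check in the patch: defer the final put to RCU only on
 * PREEMPT_RT, and only when the caller is atomic or in interrupt context.
 */
static bool model_must_defer(const struct ctx_model *c)
{
	return c->preempt_rt &&
	       (!model_preemptible(c) || !model_in_task(c));
}
```

With `preempt_rt` and `in_interrupt` both set, `model_must_defer()` returns
true; with `preempt_rt` clear it returns false for every context, which is
the case where I expect the compiler to drop the branch entirely.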

> If Oleg does not want the unconditional RCU then I would prefer an
> explicit put task which delays it to RCU for the few users that need it.
>

Do you mean like the approach in v2[1]? I believe that, to spot all
possible problematic scenarios, we should add

```
if (IS_ENABLED(CONFIG_PREEMPT_RT))
	might_sleep();
```

to `put_task_struct()`.

> A lockdep annotation _before_ ___put_task_struct() should spot users
> which are not obviously visible from audit.
>
> > + /*
> > + * under PREEMPT_RT, we can't call put_task_struct
> > + * in atomic context because it will indirectly
> > + * acquire sleeping locks.
> > + *
> > + * call_rcu() will schedule __put_task_struct_rcu()
> > + * to be called in process context.
> > + *
> > + * __put_task_struct() is called when
> > + * refcount_dec_and_test(&t->usage) succeeds.
> > + *
> > + * This means that it can't "conflict" with
> > + * put_task_struct_rcu_user() which abuses ->rcu the same
> > + * way; rcu_users has a reference so task->usage can't be
> > + * zero after rcu_users 1 -> 0 transition.
> > + */
> > + call_rcu(&tsk->rcu, __put_task_struct_rcu);
> > + else
> > + ___put_task_struct(tsk);
> > +}
> > EXPORT_SYMBOL_GPL(__put_task_struct);
> >
> > void __init __weak arch_task_cache_init(void) { }
> > --
> > 2.39.1
> >
>

[1] https://lore.kernel.org/all/20230120150246.20797-1-wander@xxxxxxxxxx/