Re: [PATCH v5] kernel/fork: beware of __put_task_struct calling context

From: Sebastian Andrzej Siewior
Date: Wed Feb 15 2023 - 06:42:55 EST


On 2023-02-13 09:13:55 [-0300], Wander Lairson Costa wrote:

> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 9f7fe3541897..9bf30c725ed8 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -857,6 +857,37 @@ void __put_task_struct(struct task_struct *tsk)
> > >  	sched_core_free(tsk);
> > >  	free_task(tsk);
> > >  }
> > > +
> > > +static void __put_task_struct_rcu(struct rcu_head *rhp)
> > > +{
> > > +	struct task_struct *task = container_of(rhp, struct task_struct, rcu);
> > > +
> > > +	___put_task_struct(task);
> > > +}
> > > +
> > > +void __put_task_struct(struct task_struct *tsk)
> > > +{
> > > +	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (!preemptible() || !in_task()))
> >
> > No. If you do this on a non-RT kernel with CONFIG_PROVE_RAW_LOCK_NESTING
> > then it will complain. And why do we have in_task() here?
> >
>
> Initially I thought you were saying it would cause a build failure, but
> I built the kernel successfully with CONFIG_PROVE_RAW_LOCK_NESTING.
> If it is a non-RT kernel, I understand the optimizer will eliminate
> the `if` clause. Would you mind further explaining the conflict with
> CONFIG_PROVE_RAW_LOCK_NESTING?

Documentation/locking/locktypes.rst explains the individual lock types
we have in the kernel and how you should nest them. In short,

mutex_t -> spinlock_t -> raw_spinlock_t

You nest/acquire them left to right, i.e. first the mutex_t, last the
raw_spinlock_t. This always works. If you leave PREEMPT_RT out of the
picture, then
raw_spinlock_t -> spinlock_t
and
spinlock_t -> raw_spinlock_t

make no difference because the underlying lock structure is the same,
and so is the behaviour. It only causes a warning or a boom once
PREEMPT_RT is enabled.
CONFIG_PROVE_RAW_LOCK_NESTING performs exactly this kind of
verification, so you can see on a !PREEMPT_RT kernel whether there is a
locking chain (or nesting) that would not be okay on PREEMPT_RT.
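
To make it concrete, a minimal sketch of the invalid chain (the locks
are hypothetical, not from the patch; on PREEMPT_RT a spinlock_t is a
sleeping lock, so taking it under a raw_spinlock_t means sleeping in
atomic context):

	#include <linux/spinlock.h>

	static DEFINE_RAW_SPINLOCK(outer_raw_lock);	/* hypothetical */
	static DEFINE_SPINLOCK(inner_lock);		/* hypothetical */

	static void bad_nesting(void)
	{
		raw_spin_lock(&outer_raw_lock);
		/*
		 * raw_spinlock_t -> spinlock_t: harmless on !PREEMPT_RT,
		 * invalid on PREEMPT_RT. CONFIG_PROVE_RAW_LOCK_NESTING
		 * makes lockdep warn here even on !PREEMPT_RT.
		 */
		spin_lock(&inner_lock);
		spin_unlock(&inner_lock);
		raw_spin_unlock(&outer_raw_lock);
	}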

In this case, at the time you do __put_task_struct() the sched-RQ lock
is held, which is a raw_spinlock_t. Later on, __put_task_struct() will
free memory (or do something else) requiring a spinlock_t, which would
create the nesting
raw_spinlock_t -> spinlock_t

which is invalid, and so lockdep should yell here.

> The `!in_task()` call is to test if we are in interrupt context.

I am aware of this, but in terms of PREEMPT_RT it doesn't matter.
It excludes the hardirq context, which is the important one, but
preemptible() already covers that. It additionally excludes the
"serving" softirq context, which is unnecessary because that context is
preemptible on PREEMPT_RT.
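
For reference, preemptible() is already false in hardirq context; it is
roughly (a simplified paraphrase of include/linux/preempt.h for
preempt-counting kernels, not the literal definition):

	/*
	 * Simplified paraphrase: in hardirq context preempt_count()
	 * carries the HARDIRQ bits, so this is already false there.
	 */
	#define preemptible()	(preempt_count() == 0 && !irqs_disabled())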

> > If Oleg does not want the unconditional RCU then I would prefer an
> > explicit put task which delays it to RCU for the few users that need it.
> >
>
> Do you mean like the approach in v2[1]? I believe that, to spot all
> possible problematic scenarios, we should add

Yes, an explicit function, because then you know the context in which
put_.*() is invoked. Such a user gets audited at the time it is added,
as it is not the "regular" case.

> ```
> if (IS_ENABLED(CONFIG_PREEMPT_RT))
> 	might_sleep();
> ```
>
> to `put_task_struct()`.

This only works on PREEMPT_RT and should be enough to spot some of the
offenders we have right now. It might also trigger if task::state is
changed (not TASK_RUNNING) even though the call would be fine.
Therefore I would suggest using rtlock_might_resched() for testing,
which is in kernel/locking/spinlock_rt.c, but you get the idea.
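
For testing, that could look like this (a sketch; rtlock_might_resched()
is currently local to kernel/locking/spinlock_rt.c and would have to be
made visible here, which is an assumption of this snippet):

	static inline void put_task_struct(struct task_struct *t)
	{
		/*
		 * Complains in any context in which an RT spinlock_t
		 * could not be acquired, without the task::state check
		 * that might_sleep() performs.
		 */
		if (IS_ENABLED(CONFIG_PREEMPT_RT))
			rtlock_might_resched();
		if (refcount_dec_and_test(&t->usage))
			__put_task_struct(t);
	}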

Long term, something like the diff at the bottom might compile and will
show the raw_spinlock_t -> spinlock_t nesting with
CONFIG_PROVE_RAW_LOCK_NESTING. We won't catch explicit
preempt_disable() or local_irq_disable() users, but it _should_ be
enough, and it would have warned us in this case because:
- the scheduler acquires a raw_spinlock_t
- the hrtimer has a check for this in lockdep_hrtimer_enter() to
  distinguish between timers which are "regular" and those which
  explicitly ask for the hardirq context.

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 357e0068497c1..eedbd50eb5df3 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -113,14 +113,18 @@ static inline struct task_struct *get_task_struct(struct task_struct *t)
 
 extern void __put_task_struct(struct task_struct *t);
 
+extern spinlock_t task_put_lock;
+
 static inline void put_task_struct(struct task_struct *t)
 {
+	might_lock(&task_put_lock);
 	if (refcount_dec_and_test(&t->usage))
 		__put_task_struct(t);
 }
 
 static inline void put_task_struct_many(struct task_struct *t, int nr)
 {
+	might_lock(&task_put_lock);
 	if (refcount_sub_and_test(nr, &t->usage))
 		__put_task_struct(t);
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe35418978..2f9c09bc22bdb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -840,6 +840,8 @@ static inline void put_signal_struct(struct signal_struct *sig)
 		free_signal_struct(sig);
 }
 
+DEFINE_SPINLOCK(task_put_lock);
+
 void __put_task_struct(struct task_struct *tsk)
 {
 	WARN_ON(!tsk->exit_state);