Re: [PATCH] Fix data race in mark_rt_mutex_waiters

From: Peter Zijlstra
Date: Thu Jan 26 2023 - 07:20:47 EST


On Thu, Jan 26, 2023 at 10:42:07AM +0100, Hernan Ponce de Leon wrote:
> On 1/24/2023 5:04 PM, Waiman Long wrote:
> >
> > On 1/24/23 10:52, Peter Zijlstra wrote:
> > > On Tue, Jan 24, 2023 at 10:42:24AM -0500, Waiman Long wrote:
> > >
> > > > I would suggest to do it as suggested by PeterZ. Instead of set_bit(),
> > > > however, it is probably better to use atomic_long_or() like
> > > >
> > > > atomic_long_or_relaxed(RT_MUTEX_HAS_WAITERS, (atomic_long_t
> > > > *)&lock->owner)
> > > That function doesn't exist, atomic_long_or() is implicitly relaxed for
> > > not returning a value.
> > >
> > You are right. atomic_long_or() doesn't have variants like some others.
> >
> > Cheers,
> > Longman
> >
>
> When you say "replace the whole of that function", do you mean "barrier
> included"? I argue in the other email that I think this should not affect
> correctness (at least not obviously), but removing the barrier is doing more
> than just fixing the data race as this patch suggests.

Well, set_bit() implies smp_mb(), atomic_long_or() does not and would
need to retain the barrier.

That said, the comments are all batshit. The top comment states relaxed
ordering is suffient since holding lock, the comment with the barrier
utterly fails to explain what it's ordering against.

So all that would need to be updated as well.

That said, looking at 1c0908d8e441 I'm not at all sure we need that
barrier. Even in the try_to_take_rt_mutex(.waiter=NULL) case, where we
skip over the task->pi_lock region, rt_mutex_set_owner(.acquire=true)
will be ACQUIRE.

And in case of rt_mutex_owner(), we fail the trylock (return with 0) and
a failed trylock does not imply any ordering.

So why are we having this barrier?