Re: [PATCH tip/core/rcu 4/5] sys_membarrier: Add expedited option

From: Peter Zijlstra
Date: Thu Jul 27 2017 - 11:20:17 EST



Hi Nick,

See below,

On Thu, Jul 27, 2017 at 03:56:10PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 27, 2017 at 06:08:16AM -0700, Paul E. McKenney wrote:
>
> > > So I think we need either switch_mm() or switch_to() to imply a full
> > > barrier for this to work, otherwise we get:
> > >
> > >     CPU0                            CPU1
> > >
> > >
> > >     lock rq->lock
> > >     mb
> > >
> > >     rq->curr = A
> > >
> > >     unlock rq->lock
> > >
> > >     lock rq->lock
> > >     mb
> > >
> > >                                     sys_membarrier()
> > >
> > >                                     mb
> > >
> > >                                     for_each_online_cpu()
> > >                                       p = A
> > >                                       // no match no IPI
> > >
> > >                                     mb
> > >     rq->curr = B
> > >
> > >     unlock rq->lock
> > >
> > >
> > > And that's bad, because now CPU0 doesn't have an MB happening _after_
> > > sys_membarrier() if B matches.
> >
> > Yes, this looks somewhat similar to the scenario that Mathieu pointed out
> > back in 2010: https://marc.info/?l=linux-kernel&m=126349766324224&w=2
>
> Yes. Minus the mm_cpumask() worries.
>
> > > So without audit, I only know of PPC and Alpha not having a barrier in
> > > either switch_*().
> > >
> > > x86 obviously has barriers all over the place, arm has a super duper
> > > heavy barrier in switch_to().
> >
> > Agreed, if we are going to rely on ->mm, we need ordering on assignment
> > to it.
>
> Right, Boqun provided this reordering to show the problem:
>
>     CPU0                            CPU1
>
>
>     <in process X>
>     lock rq->lock
>     mb
>
>     rq->curr = A
>
>     unlock rq->lock
>
>     <switch to process A>
>
>     lock rq->lock
>     mb
>     read Y(reordered)<---+
>                          |          store to Y
>                          |
>                          |          sys_membarrier()
>                          |
>                          |          mb
>                          |
>                          |          for_each_online_cpu()
>                          |            p = A
>                          |            // no match no IPI
>                          |
>                          |          mb
>                          |
>                          |          store to X
>     rq->curr = B         |
>                          |
>     unlock rq->lock      |
>     <switch to B>        |
>     read X               |
>                          |
>     read Y --------------+

In order to make this work we need either switch_to() or switch_mm() to
provide smp_mb(). Now you've recently taken that out on PPC and I'm
thinking you're not keen to have to put it back in.
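
To make the window concrete, here is a toy, single-threaded C replay of
the interleaving quoted above. Everything in it (struct mm, struct task,
struct rq, membarrier_scan(), replay_missed_ipi()) is a made-up stand-in
for illustration, not the kernel's code. It only models the rq->curr
comparison and the order of the two steps; it does not model memory
ordering or the reordered load itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel structures; not the real definitions. */
struct mm   { int id; };
struct task { const char *name; struct mm *mm; };
struct rq   { struct task *curr; };

#define NR_CPUS 2

/*
 * Toy model of the expedited scan: mark for IPI every other CPU whose
 * rq->curr shares the caller's mm. Returns a bitmask of CPUs to IPI.
 */
static unsigned int membarrier_scan(const struct rq *rqs, size_t ncpus,
                                    size_t self, const struct mm *mm)
{
    unsigned int ipi_mask = 0;

    assert(ncpus <= 8 * sizeof(ipi_mask));
    for (size_t cpu = 0; cpu < ncpus; cpu++) {
        if (cpu == self)
            continue;                     /* caller runs its own barriers */
        if (rqs[cpu].curr && rqs[cpu].curr->mm == mm)
            ipi_mask |= 1u << cpu;        /* match: would send an IPI */
    }
    return ipi_mask;
}

/*
 * Replay the quoted interleaving. Returns true when CPU0 ends up running
 * a task that shares the caller's mm without having been sent an IPI.
 */
static bool replay_missed_ipi(void)
{
    struct mm mm_a = { 1 }, mm_b = { 2 };
    struct task A = { "A", &mm_a }, B = { "B", &mm_b };
    struct rq rqs[NR_CPUS] = { { &A }, { NULL } };

    /* CPU1, in B's address space, scans while CPU0 still shows A. */
    unsigned int mask = membarrier_scan(rqs, NR_CPUS, 1, &mm_b);

    /* Only afterwards does CPU0 publish rq->curr = B. */
    rqs[0].curr = &B;

    bool cpu0_runs_caller_mm = (rqs[0].curr->mm == &mm_b);
    bool cpu0_got_ipi = (mask & 1u) != 0;

    return cpu0_runs_caller_mm && !cpu0_got_ipi;
}
```

In this model replay_missed_ipi() reports the miss: the scan saw A on
CPU0, so B starts running there with nothing but the switch itself to
order its accesses against the sys_membarrier() caller.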

Mathieu was wondering if placing it in switch_mm() would be less onerous
on performance, thinking that address space changes are more expensive
in any case, seeing how they have a tail of cache and translation
misses. I'm thinking you're not happy either way :-)

Opinions?