Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully

From: Andy Lutomirski
Date: Tue Dec 01 2020 - 13:49:41 EST

Next message: Andy Shevchenko: "Re: [PATCH 18/18] ipu3: Add driver for dummy INT3472 ACPI device"
Previous message: Peter Zijlstra: "Re: [GIT pull] locking/urgent for v5.10-rc6"
In reply to: Mathieu Desnoyers: "Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully"
Next in thread: Mathieu Desnoyers: "Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Dec 1, 2020 at 10:29 AM Mathieu Desnoyers
<mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>
> ----- On Dec 1, 2020, at 1:12 PM, Andy Lutomirski luto@xxxxxxxxxx wrote:
>
> > On Tue, Dec 1, 2020 at 6:28 AM Mathieu Desnoyers
> > <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
> >>
> >> ----- On Dec 1, 2020, at 5:16 AM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:
> >>
> >> > On Mon, Nov 30, 2020 at 09:50:35AM -0800, Andy Lutomirski wrote:
> >> >> membarrier() carefully propagates SYNC_CORE and RSEQ actions to all
> >> >> other CPUs, but there are two issues.
> >> >>
> >> >> - membarrier() does not sync_core() or rseq_preempt() the calling
> >> >> CPU. Aside from the logic being mind-bending, this also means
> >> >> that it may not be safe to modify user code through an alias,
> >> >> call membarrier(), and then jump to a different executable alias
> >> >> of the same code.
> >> >
> >> > I always understood this to be on purpose. The calling CPU can fix up
> >> > itself just fine. The pain point is fixing up the other CPUs, and that's
> >> > where membarrier() helps.
> >>
> >> Indeed, as documented in the man page:
> >>
> >> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16)
> >> In addition to providing the memory ordering guarantees de‐
> >> scribed in MEMBARRIER_CMD_PRIVATE_EXPEDITED, upon return from
> >> system call the calling thread has a guarantee that all its run‐
> >> ning thread siblings have executed a core serializing instruc‐
> >> tion. This guarantee is provided only for threads in the same
> >> process as the calling thread.
> >>
> >> membarrier sync core guarantees a core serializing instruction on the siblings,
> >> not on the caller thread. This has been done on purpose given that the caller
> >> thread can always issue its core serializing instruction from user-space on
> >> its own.
> >>
> >> >
> >> > That said, I don't mind including self, these aren't fast calls by any
> >> > means.
> >>
> >> I don't mind including self either, but this would require documentation
> >> updates, including man pages, to state that starting from kernel Y this
> >> is the guaranteed behavior. It's then tricky for user-space to query what
> >> the behavior is unless we introduce a new membarrier command for it. So this
> >> could introduce issues if software written for the newer kernels runs on older
> >> kernels.
> >
> > For rseq at least, if we do this now we don't have this issue -- I
> > don't think any released kernel has the rseq mode.
>
> But for rseq, there is no core-sync. And considering that it is invalid
> to issue a system call within an rseq critical section (including membarrier),
> I don't see what we gain by doing a rseq barrier on self ?
>
> The only case where it really changes the semantic is for core-sync I think.
> And in this case, it would be adding an additional core-sync on self. I
> am OK with doing that considering that it will simplify use of the system
> call. I'm just wondering how we should document this change in the man page.
>
> >
> >>
> >> >
> >> >> - membarrier() does not explicitly sync_core() remote CPUs either;
> >> >> instead, it relies on the assumption that an IPI will result in a
> >> >> core sync. On x86, I think this may be true in practice, but
> >> >> it's not architecturally reliable. In particular, the SDM and
> >> >> APM do not appear to guarantee that interrupt delivery is
> >> >> serializing.
> >> >
> >> > Right, I don't think we rely on that, we do rely on interrupt delivery
> >> > providing order though -- as per the previous email.
> >> >
> >> >> On a preemptible kernel, IPI return can schedule,
> >> >> thereby switching to another task in the same mm that was
> >> >> sleeping in a syscall. The new task could then SYSRET back to
> >> >> usermode without ever executing IRET.
> >> >
> >> > This; I think we all overlooked this scenario.
> >>
> >> Indeed, this is an issue which needs to be fixed.
> >>
> >> >
> >> >> This patch simplifies the code to treat the calling CPU just like
> >> >> all other CPUs, and explicitly sync_core() on all target CPUs. This
> >> >> eliminates the need for the smp_mb() at the end of the function
> >> >> except in the special case of a targeted remote membarrier(). This
> >> >> patch updates that code and the comments accordingly.
> >>
> >> I am not confident that removing the smp_mb at the end of membarrier is
> >> an appropriate change, nor that it simplifies the model.
> >
> > Ah, but I didn't remove it. I carefully made sure that every possible
> > path through the function does an smp_mb() or stronger after all the
> > cpu_rq reads. ipi_func(), on_each_cpu(), and the explicit smp_mb()
> > cover the three cases.
> >
> > That being said, if you prefer, I can make the change to skip the
> > calling CPU, in which case I'll leave the smp_mb() at the end alone.
>
> For the memory barrier commands, I prefer skipping self and leaving the
> smp_mb at the very beginning/end of the system call. Those are the key
> before/after points we are synchronizing against, and those are simple
> to document.
>

Is there a reason that doing the barrier at the very end could make an
observable difference? The two models are:

membarrier() {
smp_mb();
read a bunch of cpu_rq memory and make decisions;
execute smp_mb() on relevant cpus including self;
}

versus

membarrier() {
smp_mb();
read a bunch of cpu_rq memory and make decisions;
execute smp_mb() on relevant non-self cpus;
wait for that to finish (acquire-style on the local cpu);
smp_mb();
}

Is the idea that, on a sufficiently weakly ordered architecture, some
remote CPU could do a store before the IPI and a local load after the
membarrier() syscall might not observe the load unless the local
smp_mb() is after the remote smp_mb()? If so, I'm not entirely
convinced that this is observably different from the store simply
occurring after the IPI, but maybe there are some gnarly situations in
which this could happen.

If your concern is something along these lines, I could try to write
up an appropriate comment, and I'll rework the patch.

Next message: Andy Shevchenko: "Re: [PATCH 18/18] ipu3: Add driver for dummy INT3472 ACPI device"
Previous message: Peter Zijlstra: "Re: [GIT pull] locking/urgent for v5.10-rc6"
In reply to: Mathieu Desnoyers: "Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully"
Next in thread: Mathieu Desnoyers: "Re: [PATCH 3/3] membarrier: Propagate SYNC_CORE and RSEQ actions more carefully"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]