Re: [PATCH 0/4] arm64: Support the TSO memory model

From: Sergio Lopez Pascual
Date: Mon May 06 2024 - 07:21:56 EST


Eric Curtin <ecurtin@xxxxxxxxxx> writes:

> On Fri, 19 Apr 2024 at 18:08, Will Deacon <will@xxxxxxxxxx> wrote:
>>
>> On Thu, Apr 11, 2024 at 11:19:13PM +0900, Hector Martin wrote:
>> > On 2024/04/11 22:28, Will Deacon wrote:
>> > > * Some binaries in a distribution exhibit instability which goes away
>> > > in TSO mode, so a taskset-like program is used to run them with TSO
>> > > enabled.
>> >
>> > Since the flag is cleared on execve, this third one isn't generally
>> > possible as far as I know.
>>
>> Ah ok, I'd missed that. Thanks.
>>
>> > > In all these cases, we end up with native arm64 applications that will
>> > > either fail to load or will crash in subtle ways on CPUs without the TSO
>> > > feature. Assuming that the application cannot be fixed, a better
>> > > approach would be to recompile using stronger instructions (e.g.
>> > > LDAR/STLR) so that at least the resulting binary is portable. Now, it's
>> > > true that some existing CPUs are TSO by design (this is a perfectly
>> > > valid implementation of the arm64 memory model), but I think there's a
>> > > big difference between quietly providing more ordering guarantees than
>> > > software may be relying on and providing a mechanism to discover,
>> > > request and ultimately rely upon the stronger behaviour.
>> >
>> > The problem is "just" using stronger instructions is much more
>> > expensive, as emulators have demonstrated. If TSO didn't serve a
>> > practical purpose I wouldn't be submitting this, but it does. This is
>> > basically non-negotiable for x86 emulation; if this is rejected
>> > upstream, it will forever live as a downstream patch used by the entire
>> > gaming-on-Mac-Linux ecosystem (and this is an ecosystem we are very
>> > explicitly targeting, given our efforts with microVMs for 4K page size
>> > support and the upcoming Vulkan drivers).

In addition to the use case Hector described here, there's another,
potentially larger one: running x86_64 containers on aarch64 systems
using a combination of virtualization and emulation.

In this scenario, both not being able to use TSO for emulation
and having to enable it all the time for the whole VM have a very large
impact on performance (~25% on some workloads).

I understand the concern about the risk of userspace fragmentation, but
I was wondering if we could minimize it to an acceptable level by
narrowing down the context. For instance, since both use cases we're
bringing to the table imply the use of virtualization, we should be able
to restrict PR_SET_MEM_MODEL to only be accepted when running at EL1
(and not in nVHE nor pKVM), returning EINVAL otherwise. This would
heavily discourage users from relying on this feature for native
applications that can run in arbitrary contexts, hence drastically
reducing the fragmentation risk.
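
To make the proposed policy concrete, here is a rough, standalone sketch
of the check I have in mind. It is not kernel code: the enum and the
function name are made up for illustration, and a real implementation
would have to derive the context from the kernel's own boot-mode
helpers.

```c
#include <errno.h>

/*
 * Hypothetical description of the context the kernel is running in.
 * These names are illustrative only and do not exist in the kernel.
 */
enum kernel_ctx {
	CTX_GUEST_EL1,	/* entered at EL1, i.e. a guest under a VMM */
	CTX_HOST_VHE,	/* host kernel running at EL2 (VHE) */
	CTX_HOST_NVHE,	/* host kernel at EL1, hypervisor code at EL2 */
	CTX_HOST_PKVM,	/* protected KVM host */
};

/*
 * Model of the proposed restriction: a PR_SET_MEM_MODEL request is only
 * accepted inside a guest running at EL1. Every other context gets
 * -EINVAL, including nVHE and pKVM hosts, whose kernels also run at EL1.
 */
static int mem_model_request_check(enum kernel_ctx ctx)
{
	if (ctx != CTX_GUEST_EL1)
		return -EINVAL;

	return 0;
}
```

With this shape, a native application calling the prctl on bare metal
always sees EINVAL, so it can't come to depend on TSO being there.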

We would still need a way to ensure the trap gets to the VMM and for
the VMM to operate on the impdef ACTLR_EL12, but that should be dealt
with in a different series.

Thanks,
Sergio.