Re: [PATCH] x86/tsx: fix KVM guest live migration for tsx=on

From: Jon Kohler
Date: Tue Apr 12 2022 - 12:09:15 EST




> On Apr 12, 2022, at 11:54 AM, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> On 4/12/22 06:36, Jon Kohler wrote:
>> So my theory here is to extend the logical effort of the microcode driven
>> automatic disablement as well as the tsx=auto automatic disablement and
>> have tsx=on force abort all transactions on X86_BUG_TAA SKUs, but leave
>> the CPU features enumerated to maintain live migration.
>>
>> This would still leave TSX totally good on Ice Lake / non-buggy systems.
>>
>> If it would help, I'm working up an RFC patch, and we could discuss there?
>
> Sure. But, it sounds like you really want a new tsx=something rather
> than to muck with tsx=on behavior. Surely someone else will come along
> and complain that we broke their TSX setup if we change its behavior.

Good point, there will always be a squeaky wheel. I'll work that into the RFC:
I'll try something like tsx=compat and see how it shapes up.

To be fair though, the commit I'm patching with this series would itself break
setups as they pick up 5.14+ and the microcode update, but your point
certainly stands.

>
> Maybe you should just pay the one-time cost and move your whole fleet
> over to tsx=off if you truly believe nobody is using it.
>

Trust me, I'd love to do that. However, we have thousands of hosts across
thousands of unique customers, and they aren't managed as a centralized
service (customers manage them directly), so doing that would require each
individual customer to organize a full power cycle for all of their VMs prior
to an upgrade to tsx=off hosts.

That said, we are marching in that direction: we're shipping a control plane
update that will mask HLE and RTM after power cycles, but that requires
customers to apply that control plane update, then power cycle everything. So
we've begun the feature deprecation now, but it will take years to fully bleed
off unless we make customers micromanage full power cycles.
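
For illustration only, "masking HLE and RTM" comes down to clearing two bits
in the CPUID leaf 7 (subleaf 0) EBX value that the guest sees: bit 4 (HLE) and
bit 11 (RTM). A minimal sketch, assuming a hypothetical helper named
mask_tsx_bits(); this is not code from our control plane or from KVM:

```c
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0):EBX feature bits for TSX. */
#define CPUID_7_0_EBX_HLE (1u << 4)   /* Hardware Lock Elision */
#define CPUID_7_0_EBX_RTM (1u << 11)  /* Restricted Transactional Memory */

/*
 * Hypothetical helper: return the EBX value to expose to a guest with
 * both TSX feature bits cleared. All other feature bits pass through.
 */
static inline uint32_t mask_tsx_bits(uint32_t ebx)
{
	return ebx & ~(CPUID_7_0_EBX_HLE | CPUID_7_0_EBX_RTM);
}
```

A running guest has already read these bits, which is why the change can only
take effect after a full power cycle of the VM.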