Re: [PATCH v3] x86/speculation, KVM: only IBPB for switch_mm_always_ibpb on vCPU load

From: Jon Kohler
Date: Fri Apr 29 2022 - 16:10:57 EST

> On Apr 29, 2022, at 3:32 PM, Borislav Petkov <bp@xxxxxxxxx> wrote:
>
> On Fri, Apr 29, 2022 at 05:31:16PM +0000, Jon Kohler wrote:
>> Selftests IIUC, but there may be other VMMs that do funny stuff. Said
>> another way, I don’t think we actively restrict user space from doing
>> this as far as I know.
>
> "selftests", "there may be"?!
>
> This doesn't sound like a real-life use case to me and we don't do
> changes just because. Sorry.

I appreciate your direct feedback; thank you for helping here.

Let’s separate the discussion into two parts.
1: Bug Fixing -> an IBPB is issued unconditionally on every vCPU load,
even when the user has selected conditional IBPB. That needs to be
fixed, and this patch fixes it; a sketch of the intended behavior
follows after this list.

2: Design -> do we even want to have this IBPB here in KVM at all?
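
For part 1, the gist of the fix is a guard along the lines of the
sketch below. This is not the literal v3 diff, and kvm_vcpu_load_ibpb()
is a made-up helper name for illustration; switch_mm_always_ibpb and
indirect_branch_prediction_barrier() are the existing symbols from
<asm/nospec-branch.h>:

	static inline void kvm_vcpu_load_ibpb(void)
	{
		/*
		 * Only spectre_v2_user=on (always-IBPB) users asked for
		 * a barrier on every vCPU load; everyone else is already
		 * covered by the conditional IBPB in switch_mm().
		 */
		if (static_branch_unlikely(&switch_mm_always_ibpb))
			indirect_branch_prediction_barrier();
	}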

In previous discussions (on the v1/v2 patches) with Sean, we talked
about this not making a whole lot of sense in general; however, we
landed on trying not to regress users who might, for whatever reason,
care about this IBPB.

I’ve shared a bit more detail on our use case below. I’m fine with
nuking this IBPB entirely; I just want to be mindful of use cases from
the rest of the community that we might not come across in our
day-to-day work.

I’m happy to take feedback on both of these parts and integrate it into
a v4 patch, in terms of both the code and the correctness of the change
log.

>
>> The paranoid aspect here is KVM is issuing an *additional* IBPB on
>> top of what already happens in switch_mm().
>
> Yeah, I know how that works.
>
>> IMHO KVM-side IBPB for most use cases isn’t really necessary, but
>> the general concept is that you want to protect a vCPU from guest A
>> against a vCPU from guest B, so you issue a prediction barrier on
>> vCPU switch.
>>
>> *however* that protection already happens in switch_mm(), because
>> guest A and B are likely to use different mm_structs, so the only
>> point of having this support in KVM seems to be to “kill it with
>> fire” for paranoid users who might be doing some tomfoolery that
>> would somehow bypass the switch_mm() protection (such as somehow
>> sharing an mm_struct).
>
> Yeah, no, this all sounds like something highly hypothetical or there's
> a use case of which you don't want to talk about publicly.

We’re an open book here, so I’m happy to share what we’re up to
publicly. Our use case is 100% qemu-kvm, where each guest is a separate
process with its own mm_struct and is covered a-ok by the switch_mm()
path (sketched further below). We noticed this bug in one of our
scalability tests, which oversubscribes the host with many 2-vCPU VMs
and runs a load generator that increases load one VM at a time, so that
we can see the trend in response time of an app as host load increases.

Given that the host is heavily oversubscribed, there is a lot of vCPU
switching, in particular between vCPUs belonging to different guests,
so we hit this barrier in vcpu_load constantly. Taking this fix helped
smooth out that response time curve a bit. Happy to share more specific
data if you’d like.
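
For reference, the switch_mm() coverage I’m leaning on is the barrier
logic in arch/x86/mm/tlb.c. Roughly paraphrased (heavily simplified,
not the verbatim kernel code; prev_mm/next_mm stand for the per-CPU
last-user-mm cookie and the incoming mm):

	if (static_branch_likely(&switch_mm_cond_ibpb)) {
		/* Barrier if either task opted in and the mm changed */
		if (next_mm != prev_mm &&
		    (next_mm | prev_mm) & LAST_USER_MM_IBPB)
			indirect_branch_prediction_barrier();
	}

	if (static_branch_unlikely(&switch_mm_always_ibpb)) {
		/* Paranoid mode: barrier on every switch to a new mm */
		if (prev_mm != next_mm)
			indirect_branch_prediction_barrier();
	}

Since each of our guests is its own QEMU process (its own mm), every
guest-to-guest switch goes through this path and picks up whatever
barrier policy the user selected, which is why the extra unconditional
IBPB in vcpu_load is pure overhead for us.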

>
> Either way, from what I'm reading I'm not in the least convinced that
> this is needed.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette