Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault

From: Gavin Shan
Date: Thu May 28 2020 - 02:32:30 EST


Hi Marc,

On 5/27/20 5:37 PM, Marc Zyngier wrote:
On 2020-05-27 05:05, Gavin Shan wrote:

[...]
+struct kvm_vcpu_pv_apf_data {
+	__u32	reason;
+	__u8	pad[60];
+	__u32	enabled;
+};

What's all the padding for?


The padding is to ensure that @reason and @enabled sit in different cache
lines. @reason is shared by host/guest, while @enabled is mostly
owned by the guest.

So you are assuming that a cache line is at most 64 bytes.
It is actually implementation defined, and you can probe for it
by looking at the CTR_EL0 register. There are implementations
ranging from 32 to 256 bytes in the wild, and let's not mention
broken big-little implementations here.

[...]


Ok, Thanks for your comments and hints.
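To make sure I read CTR_EL0 correctly, something like the sketch below
(not part of the patch, just my understanding) is what you mean, with
DminLine giving log2 of the smallest data cache line in 4-byte words:

static inline unsigned int dcache_line_size(void)
{
	u64 ctr;

	/* DminLine, CTR_EL0[19:16]: log2 of the line size in 4-byte words */
	asm volatile("mrs %0, ctr_el0" : "=r" (ctr));
	return 4U << ((ctr >> 16) & 0xf);
}

If that's right, I'll size the layout from this (or align both fields to
the maximum possible line size) instead of hard-coding 64 bytes.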

+bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu)
+{
+	u64 vbar, pc;
+	u32 val;
+	int ret;
+
+	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
+		return false;
+
+	if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
+		return false;
+
+	/* Pending page fault, which isn't acknowledged by the guest yet */
+	ret = kvm_async_pf_read_cache(vcpu, &val);
+	if (ret || val)
+		return false;
+
+	/*
+	 * Events can't be injected through data abort because it's
+	 * going to clobber ELR_EL1, which might not be consumed
+	 * (or saved) by the guest yet.
+	 */
+	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
+	pc = *vcpu_pc(vcpu);
+	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
+		return false;

Ah, so that's what this `no_fault_inst_range` is for.

As-is this is not sufficient, and we'll need to be extremely careful
here.

The vectors themselves typically only have a small amount of stub code,
and the bulk of the non-reentrant exception entry work happens
elsewhere, in a mixture of assembly and C code that isn't even virtually
contiguous with either the vectors or itself.

It's possible in theory that there's code in modules (or perhaps in eBPF
JIT'd code) that isn't safe to take a fault from, so even having a
contiguous range controlled by the kernel isn't ideal.

How does this work on x86?


Yeah, here we just provide a mechanism to forbid injecting the data abort.
The range is fed by the guest through a HVC call, so I think it's a
guest-side issue. You had more comments about this in PATCH[9]; I will
explain a bit more there.
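For reference, the guest hands that range over conceptually like the
sketch below (illustrative only; the helper name is made up and the
function ID is whatever hypercall number the series assigns):

#include <linux/arm-smccc.h>

static void apf_set_no_fault_range(u32 func_id, unsigned long start,
				   unsigned long size)
{
	struct arm_smccc_res res;

	/* func_id is the (series-defined) hypercall number. */
	arm_smccc_hvc(func_id, start, size, 0, 0, 0, 0, 0, &res);
}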

x86 basically relies on the EFLAGS.IF flag: the async page fault can be
injected if it's set, and is forbidden otherwise. That works because, if
I'm correct, exceptions are a special kind of interrupt on x86.

		return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
		       !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
			 (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));

I really wish this was relying on an architected exception delivery
mechanism that can be blocked by the guest itself (PSTATE.{I,F,A}).
Trying to guess based on the PC won't fly. But these signals are
pretty hard to multiplex with anything else. Like any form of
non-architected exception injection, I don't see a good path forward
unless we start considering something like SDEI.

        M.

As Paolo mentioned in another reply, there are two types of notifications
sent from host to guest: page_not_present and page_ready. The page_not_present
notification should be delivered synchronously, while page_ready can be
delivered asynchronously. He also suggested reserving an ESR (or DFSC) subclass
for page_not_present. page_ready can be delivered by an interrupt, like a PPI.
x86 is changing its code to deliver page_ready by interrupt, where it was
previously done by exception.

When we use an interrupt instead of an exception for page_ready, we won't
need to play the game of guessing the PC.
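With an interrupt, the gate naturally becomes the guest's own architected
masking, in the spirit of your PSTATE.{I,F,A} point and analogous to the
x86 EFLAGS.IF check quoted above. Conceptually something like the sketch
below (not in the current patch; the function name is made up), though
with a real interrupt injected through the vGIC the architecture handles
the masking for us anyway:

static bool kvm_apf_can_inject_page_ready(struct kvm_vcpu *vcpu)
{
	/*
	 * Only notify when the guest hasn't masked IRQs, so delivery is
	 * governed by PSTATE.I rather than by guessing based on the PC.
	 */
	return !(*vcpu_cpsr(vcpu) & PSR_I_BIT);
}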

I assume you prefer to use SDEI for page_not_present, correct? In that case,
what's the current status of SDEI? I mean, has it been fully or partially
supported, or do we need to develop it from scratch? :)

Thanks,
Gavin