Re: [patch 2/2] x86/kvm: Sanitize kvm_async_pf_task_wait()

From: Andy Lutomirski
Date: Sat Mar 07 2020 - 10:11:03 EST


On Sat, Mar 7, 2020 at 2:01 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> Andy Lutomirski <luto@xxxxxxxxxx> writes:
> >> On Mar 6, 2020, at 4:12 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >> Aside of that, the actual code is a convoluted one-fits-it-all Swiss army
> >> knife. It is invoked from different places with different RCU constraints:
> >>
> >> 1) Host side:
> >>
> >> vcpu_enter_guest()
> >> kvm_x86_ops->handle_exit()
> >> kvm_handle_page_fault()
> >> kvm_async_pf_task_wait()
> >>
> >> The invocation happens from fully preemptible context.
> >
> > I'm a bit baffled as to why the host uses this code at all instead of
> > just sleeping the old fashioned way when the guest takes a (host) page
> > fault. Oh well.
>
> If I can trust the crystal ball which I used to decode this maze then it
> actually makes sense.
>
> Async faults are faults which cannot be handled by the guest, i.e. the
> host either pulled a page away under the guest or did not populate it in
> the first place.
>
> So the reasoning is that if this happens the guest might be in a
> situation where it can schedule other tasks instead of being stopped
> completely by the host until the page arrives.
>
> Now you could argue that this mostly makes sense for CPL 0 faults, but
> there is definitely code in the kernel where it makes sense as well,
> e.g. exec. But of course as this is designed without a proper handshake
> there is no way for the hypervisor to figure out whether it makes sense
> or not.
>
> If the async fault cannot be delivered to the guest (async PF disabled,
> async PF only enabled for CPL 0, IF == 0) then the host utilizes the
> same data structure and wait mechanism. That really makes sense.
>
> The part which does not make sense in the current implementation is the
> kvm_async_pf_task_wait() trainwreck. A clear upfront separation of
> schedulable and non-schedulable wait mechanisms would have avoided all
> the RCU duct tape nonsense and also spared me the brain damage caused by
> reverse engineering this completely undocumented mess.
>
> >> +static void kvm_async_pf_task_wait_halt(u32 token)
> >> +{
> >> + struct kvm_task_sleep_node n;
> >> +
> >> + if (!kvm_async_pf_queue_task(token, true, &n))
> >> + return;
> >>
> >> for (;;) {
> >> - if (!n.halted)
> >> - prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
> >> if (hlist_unhashed(&n.link))
> >> break;
> >> + /*
> >> + * No point in doing anything about RCU here. Any RCU read
> >> + * side critical section or RCU watching section can be
> >> + * interrupted by VMEXITs and the host is free to keep the
> >> + * vCPU scheduled out as long as it sees fit. This is not
> >> + * any different just because of the halt induced voluntary
> >> + * VMEXIT.
> >> + *
> >> + * Also the async page fault could have interrupted any RCU
> >> + * watching context, so invoking rcu_irq_exit()/enter()
> >> + * around this is not gaining anything.
> >> + */
> >> + native_safe_halt();
> >> + local_irq_disable();
> >
> > What's the local_irq_disable() here for? I would believe a
> > lockdep_assert_irqs_disabled() somewhere in here would make sense.
> > (Yes, I see you copied this from the old code. It's still nonsense.)
>
> native_safe_halt() does:
>
> STI
> HLT
>
> So the irq disable is required as the loop should exit with interrupts
> disabled.

Oops, should have looked at what native_safe_halt() does.
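
For reference, a simplified sketch of what it does (from memory of
arch/x86/include/asm/irqflags.h, minus the MDS buffer clearing):

static inline void native_safe_halt(void)
{
	/*
	 * STI; HLT: the one-instruction STI shadow guarantees HLT is
	 * reached before any pending interrupt is delivered, so the
	 * interrupt wakes the HLT instead of getting lost. The caller
	 * then re-disables interrupts, hence the local_irq_disable()
	 * in the loop.
	 */
	asm volatile("sti; hlt" : : : "memory");
}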

>
> > I also find it truly bizarre that hlt actually works in this context.
> > Does KVM in fact wake a HLTed guest that HLTed with IRQs off when a
> > pending async pf is satisfied? This would make sense if the wake
> > event were an interrupt, but it's not according to Paolo.
>
> See above. safe halt enables interrupts, which means IF == 1 and the
> host sanity check for IF == 1 is satisfied.
>
> In fact, if e.g. some regular interrupt arrives before the page becomes
> available and the guest entered the halt loop because the fault happened
> inside an RCU read-side critical section with preemption enabled, then
> the interrupt might wake up another task, set need_resched and this
> other task can run.

Now I'm confused again. Your patch is very careful not to schedule if
we're in an RCU read-side critical section, but the regular preemption
code (preempt_schedule_irq, etc) seems to be willing to schedule
inside an RCU read-side critical section. Why is the latter okay but
not the async pf case?

Ignoring that, this still seems racy:

STI
nested #PF telling us to wake up
#PF returns
HLT

Doesn't this result in putting the CPU to sleep for no good reason until
the next interrupt hits?
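
In other words, annotating the loop from the hunk above (a sketch only,
nothing the patch doesn't already have):

	for (;;) {
		if (hlist_unhashed(&n.link))
			break;
		/*
		 * Window: native_safe_halt() is STI; HLT. If the "page
		 * ready" #PF is injected after the STI but before the
		 * HLT, its handler unhashes the node and returns, but
		 * HLT still executes and the vCPU sleeps until some
		 * unrelated interrupt shows up.
		 */
		native_safe_halt();
		local_irq_disable();
	}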


> > All this being said, the only remotely sane case is when regs->flags
> > has IF==1. Perhaps this code should actually do:
> >
> > WARN_ON(!(regs->flags & X86_EFLAGS_IF));
>
> Yes, that wants to be somewhere early and also cover the async wake
> case. Neither wake nor wait can be injected when IF == 0.

Sadly, wrmsr to turn off async pf will inject wakeups even if IF == 0.
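
So the check presumably has to be confined to the wait path. Roughly
(sketched from memory against the pre-series do_async_page_fault() in
arch/x86/kernel/kvm.c, names and signatures modulo bitrot, untested):

dotraplinkage void
do_async_page_fault(struct pt_regs *regs, unsigned long error_code,
		    unsigned long address)
{
	switch (kvm_read_and_reset_pf_reason()) {
	default:
		do_page_fault(regs, error_code, address);
		break;
	case KVM_PV_REASON_PAGE_NOT_PRESENT:
		/* Wait injection requires IF == 1 on the host side. */
		WARN_ON_ONCE(!(regs->flags & X86_EFLAGS_IF));
		kvm_async_pf_task_wait((u32)address, !user_mode(regs));
		break;
	case KVM_PV_REASON_PAGE_READY:
		/*
		 * No warning here: wakeups can be injected with IF == 0,
		 * e.g. by the wrmsr teardown mentioned above.
		 */
		rcu_irq_enter();
		kvm_async_pf_task_wake((u32)address);
		rcu_irq_exit();
		break;
	}
}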