RE: [PATCH] x86/hyper-v: Zero out the VP assist page to fix CPU offlining

From: Thomas Gleixner
Date: Thu Jul 18 2019 - 03:01:06 EST


On Thu, 18 Jul 2019, Dexuan Cui wrote:
> > On Thu, 4 Jul 2019, Dexuan Cui wrote:
> > This is the allocation when the CPU is brought online for the first
> > time. So what effect has zeroing at allocation time vs. offlining and
> > potentially receiving IPIs? That allocation is never freed.
> >
> > Neither the comment nor the changelog make any sense to me.
> > tglx
>
> That allocation was introduced by the commit
> a46d15cc1ae5 ("x86/hyper-v: allocate and use Virtual Processor Assist Pages").
>
> I think it's ok to not free the page when a CPU is offlined: every
> CPU uses only 1 page and CPU offlining is not really a very usual
> operation except for the scenario of hibernation (and suspend-to-memory),
> where the CPUs are quickly onlined again, when we resume from hibernation.
> IMO Vitaly intentionally decided to not free the page for simplicity of the
> code.
>
> When a CPU (e.g. CPU1) is being onlined, in hv_cpu_init(), we allocate the
> VP_ASSIST_PAGE page and enable the PV EOI optimization for this CPU by
> writing the MSR HV_X64_MSR_VP_ASSIST_PAGE. From now on, this CPU
> *always* uses hvp->apic_assist (which is updated by the hypervisor) to
> decide if it needs to write the EOI MSR:
>
> static void hv_apic_eoi_write(u32 reg, u32 val)
> {
> struct hv_vp_assist_page *hvp = hv_vp_assist_page[smp_processor_id()];
>
> if (hvp && (xchg(&hvp->apic_assist, 0) & 0x1))
> return;
>
> wrmsr(HV_X64_MSR_EOI, val, 0);
> }
>
> When a CPU (e.g. CPU1) is being offlined, on this CPU, we do:
> 1. in hv_cpu_die(), we disable the PV EOI optimizaton for this CPU;
> 2. we finish the remaining work of stopping this CPU;
> 3. this CPU is completed stopped.
>
> Between 1 and 3, this CPU can still receive interrupts (e.g. IPIs from CPU0,
> and Local APIC timer interrupts), and this CPU *must* write the EOI MSR for
> every interrupt received, otherwise the hypervisor may not deliver further
> interrupts, which may be needed to stop this CPU completely.
>
> So we need to make sure hvp->apic_assist.bit0 is zero, after we run the line
> "wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);" in hv_cpu_die(). The easiest
> way is what I do in this patch. Alternatively, we can use the below patch:
>
> @@ -188,8 +188,12 @@ static int hv_cpu_die(unsigned int cpu)
> local_irq_restore(flags);
> free_page((unsigned long)input_pg);
>
> - if (hv_vp_assist_page && hv_vp_assist_page[cpu])
> + if (hv_vp_assist_page && hv_vp_assist_page[cpu]) {
> + local_irq_save(flags);
> wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);
> + hvp->apic_assist &= ~1;
> + local_irq_restore(flags);
> + }
>
> if (hv_reenlightenment_cb == NULL)
> return 0;
>
> This second version needs 3+ lines, so I prefer the one-line version. :-)

Those are two different things. The GPF_ZERO allocation makes sense on it's
own but it _cannot_ prevent the following scenario:

cpu_init()
if (!hvp)
hvp = vmalloc(...., GFP_ZERO);
...

hvp->apic_assist |= 1;

#1 cpu_die()
if (....)
wrmsrl(HV_X64_MSR_VP_ASSIST_PAGE, 0);

---> IPI
if (!(hvp->apic_assist & 1))
wrmsr(APIC_EOI); <- PATH not taken

#3 cpu is dead

cpu_init()
if (!hvp)
hvp = vmalloc(....m, GFP_ZERO); <- NOT TAKEN because hvp != NULL

So you have to come up with a better fairy tale why GFP_ZERO 'fixes' this.

Allocating hvp with GFP_ZERO makes sense on it's own so the allocated
memory has a defined state, but that's a different story.

The 3 liner patch above makes way more sense and you can spare the
local_irq_save/restore by moving the whole condition into the
irq_save/restore region above.

Thanks,

tglx