Re: [PATCH v3 4/4] x86/asm: Rewrite sync_core() to use IRET-to-self

From: Andy Lutomirski
Date: Tue Dec 06 2016 - 12:50:09 EST


On Mon, Dec 5, 2016 at 11:52 PM, Borislav Petkov <bp@xxxxxxxxx> wrote:
> On Mon, Dec 05, 2016 at 01:32:43PM -0800, Andy Lutomirski wrote:
>> Aside from being excessively slow, CPUID is problematic: Linux runs
>> on a handful of CPUs that don't have CPUID. Use IRET-to-self
>> instead. IRET-to-self works everywhere, so it makes testing easy.
>>
>> For reference, on my laptop, IRET-to-self is ~110ns,
>> CPUID(eax=1, ecx=0) is ~83ns on native and very, very slow under KVM,
>> and MOV-to-CR2 is ~42ns.
>>
>> While we're at it: sync_core() serves a very specific purpose.
>> Document it.
>>
>> Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
>> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>
>> ---
>> arch/x86/include/asm/processor.h | 77 ++++++++++++++++++++++++++++------------
>> 1 file changed, 55 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
>> index 64fbc937d586..201a956e345f 100644
>> --- a/arch/x86/include/asm/processor.h
>> +++ b/arch/x86/include/asm/processor.h
>> @@ -590,33 +590,66 @@ static __always_inline void cpu_relax(void)
>>
>> #define cpu_relax_lowlatency() cpu_relax()
>>
>> -/* Stop speculative execution and prefetching of modified code. */
>> +/*
>> + * This function forces the icache and prefetched instruction stream to
>> + * catch up with reality in two very specific cases:
>> + *
>> + * a) Text was modified using one virtual address and is about to be executed
>> + * from the same physical page at a different virtual address.
>> + *
>> + * b) Text was modified on a different CPU, may subsequently be
>> + * executed on this CPU, and you want to make sure the new version
>> + * gets executed. This generally means you're calling this in an IPI.
>> + *
>> + * If you're calling this for a different reason, you're probably doing
>> + * it wrong.
>
> "... and think hard before you call this - it is slow."
>
> I'd add that, now that it is even slower than CPUID.

But only barely. And it's hugely faster than CPUID under KVM or
similar. And it works on all CPUs.
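
To make the mechanics concrete, the 64-bit flavor boils down to
hand-building an IRET frame (SS, RSP, RFLAGS, CS, RIP) and "returning"
to the next instruction.  A sketch of the idea -- not necessarily the
exact hunk that got snipped above, and it leaves out the stack-pointer
constraint the real inline asm has to carry so GCC isn't surprised by
the pushes:

static inline void sync_core(void)
{
	unsigned long tmp;

	/*
	 * Build the IRETQ frame by hand.  CS is pushed from the live
	 * register instead of hard-coding __KERNEL_CS, so this also
	 * works where a paravirt hypervisor runs the kernel on a
	 * different selector.
	 */
	asm volatile (
		"mov %%ss, %0\n\t"
		"pushq %0\n\t"			/* SS */
		"pushq %%rsp\n\t"		/* RSP... */
		"addq $8, (%%rsp)\n\t"		/* ...fixed up to the pre-push value */
		"pushfq\n\t"			/* RFLAGS */
		"mov %%cs, %0\n\t"
		"pushq %0\n\t"			/* CS */
		"pushq $1f\n\t"			/* RIP = the label below */
		"iretq\n\t"
		"1:"
		: "=&r" (tmp) : : "memory", "cc");
}

The 32-bit side is simpler: pushfl / push %cs / push $1f / iret.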

>
>> + */
>> static inline void sync_core(void)
>> {
>> - int tmp;
>> -
>> -#ifdef CONFIG_X86_32
>> /*
>> - * Do a CPUID if available, otherwise do a jump. The jump
>> - * can conveniently enough be the jump around CPUID.
>> + * There are quite a few ways to do this. IRET-to-self is nice
>> + * because it works on every CPU, at any CPL (so it's compatible
>> + * with paravirtualization), and it never exits to a hypervisor.
>> + * The only downsides are that it's a bit slow (it seems to be
>> + * a bit more than 2x slower than the fastest options) and that
>> + * it unmasks NMIs.
>
> Ewww, I hadn't thought of that angle. Aren't we going to get in all
> kinds of hard to debug issues due to that couple of cycles window of
> unmasked NMIs?
>
> We sync_core in some NMI handler and then right in the middle of it we
> get another NMI. Yeah, we have the nested NMI stuff still but I'd like
> to avoid complications if possible.

I'll counter with:

1. Why on earth would an NMI call sync_core()?

2. We have very careful code to handle this issue, and NMIs really
do cause page faults, which have exactly the same problem.

So I'm not too worried.
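
For the record, the expected caller is case (b) from the comment: an
IPI handler, not an NMI handler.  A sketch of the shape (the wrapper
name here is made up, but it's the same pattern the text_poke
machinery uses; needs <linux/smp.h>):

static void do_sync_core(void *info)
{
	sync_core();	/* runs on every CPU, in IPI context */
}

static void sync_all_cores(void)	/* hypothetical wrapper */
{
	/* Kick every CPU and wait until they have all serialized. */
	on_each_cpu(do_sync_core, NULL, 1);
}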

>
>> The "push %cs" is needed because, in
>> + * paravirtual environments, __KERNEL_CS may not be a valid CS
>> + * value when we do IRET directly.
>> + *
>> + * In case NMI unmasking or performance ever becomes a problem,
>> + * the next best option appears to be MOV-to-CR2 and an
>> + * unconditional jump. That sequence also works on all CPUs,
>> + * but it will fault at CPL3.
>
> Does it really have to be non-privileged?

Unless we want to declare lguest unsupported, delete it from the tree,
or, sigh, actually maintain it, then yes :( native_write_cr2() would
work on Xen, but it's slow.

>
> If not, there are a couple more serializing insns:
>
> "â Privileged serializing instructions â INVD, INVEPT, INVLPG,
> INVVPID, LGDT, LIDT, LLDT, LTR, MOV (to control register, with the
> exception of MOV CR83), MOV (to debug register), WBINVD, and WRMSR"
>
> What about INVD? It is expensive too :-)

Only if you write the patch and label it:

Snickered-at-by: Andy Lutomirski <luto@xxxxxxxxxx>

>
> Can't we use MOV %dr or so? With previously saving/restoring the reg?
>
> WBINVD could be another candidate, albeit a big hammer.
>
> WRMSR maybe too. If it faults, it's fine too because then you have the
> IRET automagically. Hell, we could even make it fault on purpose by
> writing into an invalid MSR but then we're back to the unmasking NMIs...
> :-\

I still like MOV-to-CR2 better than all of these.
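
If the NMI-unmasking angle ever does turn into a real problem, a rough
sketch of that fallback (the name is hypothetical and this isn't part
of the patch; it writes CR2 back to itself so the last page-fault
address isn't lost, and the jump is the belt-and-suspenders prefetch
flush for ancient parts):

static inline void sync_core_cr2(void)
{
	unsigned long cr2;

	/*
	 * MOV to a control register is architecturally serializing.
	 * Reloading CR2 with its own value avoids clobbering the
	 * page-fault address.  This faults at CPL3, so it can't
	 * replace IRET-to-self while lguest is still in the tree.
	 */
	asm volatile (
		"mov %%cr2, %0\n\t"
		"mov %0, %%cr2\n\t"
		"jmp 1f\n\t"
		"1:"
		: "=r" (cr2) : : "memory");
}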

--Andy