Re: [GIT PULL] x86/asm changes for v5.6

From: Borislav Petkov
Date: Wed Jan 29 2020 - 13:34:19 EST


On Wed, Jan 29, 2020 at 09:40:58AM -0800, Linus Torvalds wrote:
> On Wed, Jan 29, 2020 at 9:07 AM Luck, Tony <tony.luck@xxxxxxxxx> wrote:
> >
> > This returns "3" ... not what we want. But swapping the ERMS/FSRM order
> > gets the correct version.
>
> That actually makes sense, and is what I suspected (after I wrote the
> patch) what would happen. It would just be good to have it explicitly
> documented at the macro.

Like this?

---
diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 13adca37c99a..d94bad03bcb4 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -164,6 +164,11 @@ static inline int alternatives_text_reserved(void *start, void *end)
ALTINSTR_REPLACEMENT(newinstr, feature, 1) \
".popsection\n"

+/*
+ * The patching happens in the natural order of first newinstr1 and then
+ * newinstr2, iff the respective feature bits are set. See apply_alternatives()
+ * for details.
+ */
#define ALTERNATIVE_2(oldinstr, newinstr1, feature1, newinstr2, feature2)\
OLDINSTR_2(oldinstr, 1, 2) \
".pushsection .altinstructions,\"a\"\n" \

> That would be bad indeed, but I don't think it should matter
> particularly for this case - it would have been bad before too.
>
> I suspect there is some other issue going on, like the NOP padding
> logic being confused.

Or the cmp $0x20 test missing in the default case, see below.

> In particular, look here, this is the alt_instruction entries:
>
> altinstruction_entry 140b,143f,\feature1,142b-140b,144f-143f,142b-141b
> altinstruction_entry 140b,144f,\feature2,142b-140b,145f-144f,142b-141b
>
> where the arguments are "orig alt feature orig_len alt_len pad_len" in
> that order.
>
> Notice how "pad_len" in both cases is the padding from the _original_
> instruction (142b-141b).

<snip this which I'll take a look later so that we can sort out the
issue at hand first>

> So I'm just hand-waving. Maybe there was some simpler explanation
> (like me just picking the wrong instructions when I did the rough
> conversion and simply breaking things with some stupid bug).

Looks like it. So I did this:

---
diff --git a/arch/x86/lib/memmove_64.S b/arch/x86/lib/memmove_64.S
index 7ff00ea64e4f..a670d01570df 100644
--- a/arch/x86/lib/memmove_64.S
+++ b/arch/x86/lib/memmove_64.S
@@ -39,23 +39,19 @@ SYM_FUNC_START(__memmove)
cmp %rdi, %r8
jg 2f

- /* FSRM implies ERMS => no length checks, do the copy directly */
-.Lmemmove_begin_forward:
- ALTERNATIVE "cmp $0x20, %rdx; jb 1f", "", X86_FEATURE_FSRM
- ALTERNATIVE "", "movq %rdx, %rcx; rep movsb; retq", X86_FEATURE_ERMS
-
/*
- * movsq instruction have many startup latency
- * so we handle small size by general register.
- */
- cmp $680, %rdx
- jb 3f
- /*
- * movsq instruction is only good for aligned case.
+ * Three rep-string alternatives:
+ * - go to "movsq" for large regions where source and dest are
+ * mutually aligned (same in low 8 bits). "label 4"
+ * - plain rep-movsb for FSRM
+ * - rep-movs for > 32 byte for ERMS.
*/
+.Lmemmove_begin_forward:
+ ALTERNATIVE_2 \
+ "cmp $0x20, %rdx; jb 1f; cmp $680, %rdx ; jb 3f ; cmpb %dil, %sil; je 4f", \
+ "cmp $0x20, %rdx; jb 1f; movq %rdx, %rcx; rep movsb; retq", X86_FEATURE_ERMS, \
+ "movq %rdx, %rcx ; rep movsb; retq", X86_FEATURE_FSRM

- cmpb %dil, %sil
- je 4f
3:
sub $0x20, %rdx
/*

---

Notice how the *first* option of the alternative, which is the default
one, has that gotten that additional "cmp $0x20, %rdx; jb 1f" test which
sends us down to the less than 32 bytes copy length.

Your original version didn't have it and here's what I saw:

So I stopped the guest just before that code and the call trace looked
like this:

#0 memmove () at arch/x86/lib/memmove_64.S:43
#1 0xffffffff824448c2 in memblock_insert_region (type=0xffffffff824a8298 <memblock+56>, idx=<optimized out>, base=0,
size=4096, nid=2, flags=MEMBLOCK_NONE) at mm/memblock.c:553
#2 0xffffffff824454f0 in memblock_add_range (type=0xffffffff824a8298 <memblock+56>, base=0, size=<optimized out>,
nid=73400320, flags=<optimized out>) at mm/memblock.c:641
#3 0xffffffff82445627 in memblock_reserve (base=0, size=4096) at mm/memblock.c:830
#4 0xffffffff823ff399 in setup_arch (cmdline_p=0xffffffff82003f28) at arch/x86/kernel/setup.c:798
#5 0xffffffff823f9ae1 in start_kernel () at init/main.c:598
#6 0xffffffff810000d4 in secondary_startup_64 () at arch/x86/kernel/head_64.S:242
#7 0x0000000000000000 in ?? ()

and count in rdx was:

rdx 0x18 24

Without that "cmp $0x20" test above, we do the "cmp $680, %rdx; jb 3f;" test
and we run into the following asm at label 3:

3:
sub $0x20, %rdx
/*
* We gobble 32 bytes forward in each loop.
*/

<--- right here %rdx is:

rdx 0xfffffffffffffff8 -8

and yeeehaaw, we're in the weeds and then end up triplefaulting at some
unmapped source address in %rsi or so.

So now I'm going to play all three variants with pen and paper to make
sure we're still sane.

Thx.

--
Regards/Gruss,
Boris.

SUSE Software Solutions Germany GmbH, GF: Felix ImendÃrffer, HRB 36809, AG NÃrnberg