Re: [PATCH v3 3/4] x86/alternative: Rewrite optimize_nops() some

From: Peter Zijlstra
Date: Wed Feb 08 2023 - 15:29:58 EST


On Wed, Feb 08, 2023 at 07:52:04PM +0000, Andrew.Cooper3@xxxxxxxxxx wrote:
> On 08/02/2023 5:10 pm, Peter Zijlstra wrote:
> > This rewrite address two issues:
> >
> > - it no longer hard requires single byte nop runs, it now accepts
> > any NOP and NOPL encoded instruction (but not the more complicated
> > 32bit NOPs).
> >
> > - it writes a single 'instruction' replacement.
> >
> > Specifically, ORC unwinder relies on the tail NOP of an alternative to
> > be a single instruction, in particular it relies on the inner bytes
> > not being executed.
> >
> > Once we reach the max supported NOP length (currently 8, could easily
> > be extended to 11 on x86_64), switches to JMP.d8 and INT3 padding to
> > achieve the same result.
> >
> > The ORC unwinder uses this guarantee in the analysis of
> > alternative/overlapping CFI state,
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
>
> How lucky are you feeling for your game of performance roulette?

Yeah, not very lucky.. I've been talking about this with Boris for a bit
already.

> Unconditional jmps cost branch prediction these days, and won't be
> successfully predicted until taken.

IKR, insane, but that's what it is.

> There is a point after which a jmp is more efficient that brute forcing
> through a line of nops, and where this point is is very uarch specific,
> but it's not a single nop...

Yeah, I know, even very big nops. $I prefetch is good at straight lines
etc..

> Whether you care or not is a different matter, but at least be aware
> doing a jmp like this instead of e.g. 2 or 3 nops, is contrary to the
> prior advice given by the architects.

So, it is either this, or make the CFI conflict analysis done by objtool
a lot more 'interesting'. I figured I'd try this first.

As is this is actually a correctness issue, not a performance issue.
Goes hand in hand with the patches:

https://lkml.kernel.org/r/20230208172245.711471461@xxxxxxxxxxxxx
https://lkml.kernel.org/r/20230208172245.783099843@xxxxxxxxxxxxx

Specifically, for short alternatives objtool adds a single nop of
size difference size at the end. The advantage is that you only get one
CFI entry for that thing and so don't have potential conflict for any of
the inner bytes.

The alternative is making it multiple NOPs and ensuring objtool and the
kernel both agree on the length of them.

As is, there were only a hand full of NOPs that turned into this jmp.d8
thing on the defconfig+kvm_guest.config I build to test this -- it is by
no means a common thing. And about half of them would be gone by
extending the max nop length to at least 10 or so.

In fact, I did that patch once, lemme see if I still have it...