RE: [PATCH v3 3/4] x86/alternative: Rewrite optimize_nops() some

From: David Laight
Date: Thu Feb 09 2023 - 17:27:20 EST


From: Andrew.Cooper3@xxxxxxxxxx
> Sent: 09 February 2023 01:11
...
> >> UNTRAIN_RET -- specifically RESET_CALL_DEPTH
> > 19: 48 c7 c0 80 00 00 00 mov $0x80,%rax
> > 20: 48 c1 e0 38 shl $0x38,%rax
> > 24: 65 48 89 04 25 00 00 00 00 mov %rax,%gs:0x0
> > 29: R_X86_64_32S pcpu_hot+0x10
> >
> > Is ofc an atrocity.
> >
> > We can easily trim that by 5 bytes to:
> >
> > 0: b0 80 mov $0x80,%al
> > 2: 48 c1 e0 38 shl $0x38,%rax
> > 6: 65 48 89 04 25 00 00 00 00 mov %rax,%gs:0x0
> >
> > Who cares about the top bytes, we're explicitly shifting them out
> > anyway. But that's still 15 bytes or so.
> >
> > If it weren't for those pesky prefix penalties that would make exactly
> > one instruction :-)
>
> Yeah, but then you're taking a merge penalty instead.
>
> Given that you can't reduce enough anyway, while only a 4 byte reduction
> rather than 5, you're probably better off with:
>
> 0:   31 c0                   xor    %eax,%eax
> 2:   48 0f ba e8 3f          bts    $0x3f,%rax
> 7:   65 48 89 04 25 00 00 00 00      mov    %rax,%gs:0x0
>
> because of the zeroing idiom splitting these 3 instructions away from
> the previous operation on rax.

How about:
   0:   31 c0                   xor    %eax,%eax
   2:   f9                      stc
   3:   48 d1 d8                rcr    $1,%rax
So 6 bytes total.
But that might be a partial dependency on flags.
(Although that isn't any worse than the xor.)
It is also a longer dependency chain - so the execution time
rather depends on what else is going on.
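
For anyone who wants to sanity-check the constant, here is a rough sketch in Python (not part of the patch; the function names and the SIZES table are made up for illustration). It models only the value left in %rax and the CF bit, and it assumes the byte counts shown in the listings above. It says nothing about merge penalties or flag dependencies.

```python
MASK64 = (1 << 64) - 1

def mov_shl():
    rax = 0x80                        # mov $0x80,%rax
    return (rax << 0x38) & MASK64     # shl $0x38,%rax

def mov_al_shl(old_rax):
    rax = (old_rax & ~0xff) | 0x80    # mov $0x80,%al (upper bytes untouched)
    return (rax << 0x38) & MASK64     # shl $0x38,%rax shifts them out

def xor_bts():
    rax = 0                           # xor %eax,%eax (clears all of rax)
    return rax | (1 << 0x3f)          # bts $0x3f,%rax

def xor_stc_rcr():
    rax = 0                           # xor %eax,%eax
    cf = 1                            # stc
    # rcr $1,%rax: rotate the 65-bit quantity CF:rax right by one,
    # so the old CF lands in bit 63 (the new CF would be old bit 0).
    return ((cf << 63) | (rax >> 1)) & MASK64

# Bytes needed to load the constant, excluding the 9-byte store to %gs.
SIZES = {
    "mov_shl": 7 + 4,
    "mov_al_shl": 2 + 4,
    "xor_bts": 2 + 5,
    "xor_stc_rcr": 2 + 1 + 3,
}

if __name__ == "__main__":
    for fn in (mov_shl, xor_bts, xor_stc_rcr):
        print(fn.__name__, hex(fn()), SIZES[fn.__name__])
    print("mov_al_shl", hex(mov_al_shl(0xdeadbeefcafef00d)), SIZES["mov_al_shl"])
```

All four sequences leave bit 63 set and everything else clear. The %al variant matches on size but reads the old %rax.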

David
