Re: [PATCHv14 08/17] x86/mm: Reduce untagged_addr() overhead until the first LAM user

From: Nick Desaulniers
Date: Tue Jan 17 2023 - 15:33:49 EST


On Tue, Jan 17, 2023 at 10:34 AM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, Jan 17, 2023 at 10:26 AM Nick Desaulniers
> <ndesaulniers@xxxxxxxxxx> wrote:
> >
> > On Tue, Jan 17, 2023 at 9:29 AM Linus Torvalds
> > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > Side note: that's not something new or unusual. It's been the case
> > > since I started testing clang - we have several code-paths where we
> > > use "unlikely()" to try to get very unlikely cases to be out-of-line,
> > > and clang just mostly ignores it, or treats it as a very weak hint. I
> > > think the only way to get clang to treat it as a *strong* hint is to
> > > use PGO.
> >
> > I'd be surprised if that were intentional or by design.
> >
> > Do you guys have a bug report we could look at?
>
> Heh. I actually sent you an example long ago. Let me go fish it out of
> my mail archives and quote some of it below so that you can find it in
> yours..
>
> Linus
>
> [ Time passes. Found this email to you and Bill Wendling from Feb 16,
> 2020, Message-ID
> CAHk-=wigVshsByCMjkUiZyQSR5N5zi2aAeQc+VJCzQV=nm8E7g@xxxxxxxxxxxxxx ]:
>
> Anyway, I'm looking at clang code generation, and comparing it with
> gcc on one of my "this has been optimized to hell and back" functions:
> __d_lookup_rcu().
>
> It looks like clang does frame pointers, and ignores our
> likely/unlikely annotations.
>
> So this code:
>
> if (unlikely(parent->d_flags & DCACHE_OP_COMPARE)) {
> int tlen;
> const char *tname;
> ......
>
> doesn't actually jump out of line, but instead generates the unlikely
> case as the fallthrough:
>
> testb $2, (%r12)
> je .LBB50_9
> ... unlikely code goes here...


Perhaps that was compiler version or config specific?

$ make LLVM=1 -j128 defconfig fs/dcache.o
$ llvm-objdump -d --no-show-raw-insn
--disassemble-symbols=__d_lookup_rcu fs/dcache.o
0000000000003210 <__d_lookup_rcu>:
3210: endbr64
3214: pushq %rbp
3215: pushq %r15
3217: pushq %r14
3219: pushq %r12
321b: pushq %rbx
321c: testb $0x2, (%rdi)
321f: jne 0x32d7 <__d_lookup_rcu+0xc7>
...
32d7: popq %rbx
32d8: popq %r12
32da: popq %r14
32dc: popq %r15
32de: popq %rbp
32df: jmp 0x3300 <__d_lookup_rcu_op_compare>

That looks like what you want, yeah?

Your original report was from nearly 3 years ago; could have fixed a
few instances of branch weights not getting propagated since then.

What's going on in this case in this thread? I know we don't support
hot/cold attributes on labels yet, but if static_branch_likely (or
friends) is being used, we assign the indirect branches a 0%
likeliness/branch-weight.

>
> and then the likely case ends up having unfortunate reloads inside the
> hot loop. Possibly because it has one fewer free registers than gcc
> because of the frame pointer.
>
> I didn't look into _why_ clang generates frame pointers but gcc
> doesn't. It may be just a compiler default, I think we don't end up
> explicitly asking either way.
>
> [ And then Bill replied with this ]
>
> It's not a no-op. We add branch probabilities to the IR, whether
> they're honored or not depends on the transformation. But they
> *should* be honored when available. I've seen in the past that instead
> of moving unlikely blocks out of the way (like gcc, which moves them
> below the function's "ret" instruction), LLVM will do something like
> this:
>
> <normal code>
> <jmp to loop conditional test>
> <unlikely code>
> <more likely code>
> <loop conditional test>
> <...>
>
> I.e. the loop is rotated and the unlikely code is first and the hotter
> code is closer together but between the unlikely and conditional test.
> Could this be what's going on? Otherwise, maybe clang decided that
> it's not beneficial to move the code out-of-line because the benefit
> was minimal? (I'm guessing here.)



--
Thanks,
~Nick Desaulniers