Re: [PATCH -next V7 0/7] riscv: Optimize function trace

From: Guo Ren
Date: Wed Feb 08 2023 - 20:31:47 EST


On Wed, Feb 8, 2023 at 10:46 PM Mark Rutland <mark.rutland@xxxxxxx> wrote:
>
> On Wed, Feb 08, 2023 at 10:30:56AM +0800, Guo Ren wrote:
> > Hi Mark,
> >
> > Thx for the thoughtful reply.
> >
> > On Tue, Feb 7, 2023 at 5:17 PM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> > > Note that I'm assuming you will *always* go through a common ftrace_caller
> > > trampoline (even for direct calls), with the trampoline responsible for
> > > recovering the direct trampoline (or ops->func) from the ops pointer.
> > >
> > > That would only require 64-bit alignment on 64-bit (or 32-bit alignment on
> > > 32-bit) to keep the literal naturally-aligned; the rest of the instructions
> > > wouldn't require additional alignment.
> > >
> > > For example, I would expect that (for 64-bit) you'd use:
> > >
> > > # place 2 NOPs *immediately before* the function, and 3 NOPs at the start
> > > -fpatchable-function-entry=5,2
> > >
> > > # Align the function to 8-bytes
> > > -falign=functions=8
> > >
> > > ... and your trampoline in each function could be initialized to:
> > >
> > > # Note: aligned to 8 bytes
> > > addr-08 // Literal (first 32-bits) // set to ftrace_nop_ops
> > > addr-04 // Literal (last 32-bits) // set to ftrace_nop_ops
> > > addr+00 func: mv t0, ra
> > > addr+04 auipc t1, ftrace_caller
> > > addr+08 nop
> > >
> > > ... and when enabled can be set to:
> > >
> > > # Note: aligned to 8 bytes
> > > addr-08 // Literal (first 32-bits) // patched to ops ptr
> > > addr-04 // Literal (last 32-bits) // patched to ops ptr
> > > addr+00 func: mv t0, ra
> > We needn't "mv t0, ra" here because our "jalr" could work with t0 and
> > won't affect ra. Let's do it in the trampoline code, and then we can
> > save another word here.
>
> Ah; I thought JALR always clobbered ra? Or can that specify the register to
> save the link address to?
Yes, that's the feature of riscv :) We could use any register to save
the link address.

>
> I'm not that familiar with riscv asm, so I've probably just got that wrong.
>
> > > addr+04 auipc t1, ftrace_caller
> > > addr+08 jalr ftrace_caller(t1)
> >
> > Here is the call-site:
> > # Note: aligned to 8 bytes
> > addr-08 // Literal (first 32-bits) // patched to ops ptr
> > addr-04 // Literal (last 32-bits) // patched to ops ptr
> > addr+00 auipc t0, ftrace_caller
> > addr+04 jalr ftrace_caller(t0)
Sorry, it should be:
addr+04 jalr t0, ftrace_caller(t0)

>
> I'm a bit confused there; I thought that the `symbol(reg)` addressing mode was
> generating additional bits that the AUPIC didn't -- have I got that wrong?
>
> What specifies which register the JALR will write the link address to?
According to the spec, auipc t1,0x0 should write PC + 0x0<<12 (which
is equal to PC) to t1 and then jalr t0, (t0)0 jumps to the address
stored in t0 + 0x0 and stores the return address to t0.

That means auipc defines xxx << 12 bits, jalr defines lowest 12 bits.

>
> > > Note: this *only* requires patching the literal and NOP<->JALR; the MV and
> > > AUIPC aren't harmful and can always be there. This way, you won't need to use
> > > stop_machine().
> > Yes, simplest nop is better than c.j. I confused.
> >
> > >
> > > With that, the ftrace_caller trampoline can recover the `ops` pointer at a
> > > negative offset from `ra`, and can recover the instrumented function's return
> > > address in `t0`. Using the `ops` pointer, it can figure out whether to branch
> > > to a direct trampoline or whether to save/restore the regs around invoking
> > > ops->func.
> > >
> > > For 32-bit it would be exactly the same, except you'd only need a single nop
> > > before the function, and the offset would be -0x10.
> > Yes, we reduced another 4 bytes & a smaller alignment for better code
> > size when 32-bit.
> > # Note: aligned to 4 bytes
> > addr-04 // Literal (last 32-bits) // patched to ops ptr
> > addr+00 auipc t0, ftrace_caller
> > addr+04 jalr ftrace_caller(t0)
addr+04 jalr t0, ftrace_caller(t0)

> > >
>
> > > That's what arm64 does; the only difference is that riscv would *always* need
> > > to go via the trampoline in order to make direct calls.
> > We need one more trampoline here beside ftrace_caller &
> > ftrace_regs_caller: It's "direct_caller".
> >
> > addr+04 nop -> direct_caller/ftrace_caller/ftrace_regs_caller
>
> I'd strongly recommend that you instead implement FTRACE_WITH_ARGS and
> deprecate FTRACE_WITH_REGS, like arm64 has done, then you only need a single
> ftrace_caller, as I mentioned above. That way there's no risk that you need to
> patch the AUIPC after initialization.
>
> The arm64 FTRACE_WITH_ARGS conversion is in mainline, and arm64's
> FTRACE_WITH_CALL_OPS is based upon that. Florent's DIRECT_CALLS patches add the
> direct call logic to the same ftrace_caller trampoline.
Thx for the suggestion of only keeping the ftrace_caller idea, but
it's another topic.

What I want to point out:
If we keep "auipc (addr+00)" fixed, we could use the different
trampolines at "jalr (addr+0x4)" (All of them must be in one 2k
aligned area).

>
> Thanks,
> Mark.



--
Best Regards
Guo Ren