Re: [PATCH -next V7 0/7] riscv: Optimize function trace

From: Mark Rutland
Date: Wed Feb 08 2023 - 09:46:11 EST


On Wed, Feb 08, 2023 at 10:30:56AM +0800, Guo Ren wrote:
> Hi Mark,
>
> Thx for the thoughtful reply.
>
> On Tue, Feb 7, 2023 at 5:17 PM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> > Note that I'm assuming you will *always* go through a common ftrace_caller
> > trampoline (even for direct calls), with the trampoline responsible for
> > recovering the direct trampoline (or ops->func) from the ops pointer.
> >
> > That would only require 64-bit alignment on 64-bit (or 32-bit alignment on
> > 32-bit) to keep the literal naturally-aligned; the rest of the instructions
> > wouldn't require additional alignment.
> >
> > For example, I would expect that (for 64-bit) you'd use:
> >
> > # place 2 NOPs *immediately before* the function, and 3 NOPs at the start
> > -fpatchable-function-entry=5,2
> >
> > # Align the function to 8-bytes
> > -falign=functions=8
> >
> > ... and your trampoline in each function could be initialized to:
> >
> > # Note: aligned to 8 bytes
> > addr-08 // Literal (first 32-bits) // set to ftrace_nop_ops
> > addr-04 // Literal (last 32-bits) // set to ftrace_nop_ops
> > addr+00 func: mv t0, ra
> > addr+04 auipc t1, ftrace_caller
> > addr+08 nop
> >
> > ... and when enabled can be set to:
> >
> > # Note: aligned to 8 bytes
> > addr-08 // Literal (first 32-bits) // patched to ops ptr
> > addr-04 // Literal (last 32-bits) // patched to ops ptr
> > addr+00 func: mv t0, ra
> We needn't "mv t0, ra" here because our "jalr" could work with t0 and
> won't affect ra. Let's do it in the trampoline code, and then we can
> save another word here.

Ah; I thought JALR always clobbered ra? Or can that specify the register to
save the link address to?

I'm not that familiar with riscv asm, so I've probably just got that wrong.

> > addr+04 auipc t1, ftrace_caller
> > addr+08 jalr ftrace_caller(t1)
>
> Here is the call-site:
> # Note: aligned to 8 bytes
> addr-08 // Literal (first 32-bits) // patched to ops ptr
> addr-04 // Literal (last 32-bits) // patched to ops ptr
> addr+00 auipc t0, ftrace_caller
> addr+04 jalr ftrace_caller(t0)

I'm a bit confused there; I thought that the `symbol(reg)` addressing mode was
generating additional bits that the AUPIC didn't -- have I got that wrong?

What specifies which register the JALR will write the link address to?

> > Note: this *only* requires patching the literal and NOP<->JALR; the MV and
> > AUIPC aren't harmful and can always be there. This way, you won't need to use
> > stop_machine().
> Yes, simplest nop is better than c.j. I confused.
>
> >
> > With that, the ftrace_caller trampoline can recover the `ops` pointer at a
> > negative offset from `ra`, and can recover the instrumented function's return
> > address in `t0`. Using the `ops` pointer, it can figure out whether to branch
> > to a direct trampoline or whether to save/restore the regs around invoking
> > ops->func.
> >
> > For 32-bit it would be exactly the same, except you'd only need a single nop
> > before the function, and the offset would be -0x10.
> Yes, we reduced another 4 bytes & a smaller alignment for better code
> size when 32-bit.
> # Note: aligned to 4 bytes
> addr-04 // Literal (last 32-bits) // patched to ops ptr
> addr+00 auipc t0, ftrace_caller
> addr+04 jalr ftrace_caller(t0)
> >

> > That's what arm64 does; the only difference is that riscv would *always* need
> > to go via the trampoline in order to make direct calls.
> We need one more trampoline here beside ftrace_caller &
> ftrace_regs_caller: It's "direct_caller".
>
> addr+04 nop -> direct_caller/ftrace_caller/ftrace_regs_caller

I'd strongly recommend that you instead implement FTRACE_WITH_ARGS and
deprecate FTRACE_WITH_REGS, like arm64 has done, then you only need a single
ftrace_caller, as I mentioned above. That way there's no risk that you need to
patch the AUIPC after initialization.

The arm64 FTRACE_WITH_ARGS conversion is in mainline, and arm64's
FTRACE_WITH_CALL_OPS is based upon that. Florent's DIRECT_CALLS patches add the
direct call logic to the same ftrace_caller trampoline.

Thanks,
Mark.