Re: [PATCH] tracing: Trace instrumentation begin and end

From: Steven Rostedt
Date: Wed Mar 22 2023 - 08:48:46 EST


On Wed, 22 Mar 2023 12:19:14 +0100
Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:

> Steven!
>
> On Tue, Mar 21 2023 at 21:51, Steven Rostedt wrote:
> > From: "Steven Rostedt (VMware)" <rostedt@xxxxxxxxxxx>
> > produces:
> >
> > 2) 0.764 us | exit_to_user_mode_prepare();
> > 2) | /* page_fault_user: address=0x7fadaba40fd8 ip=0x7fadaba40fd8 error_code=0x14 */
> > 2) 0.581 us | down_read_trylock();
> >
> > The "page_fault_user" event is not encapsulated around any function, which
> > means it probably triggered and went back to user space without any trace
> > to know how long that page fault took (the down_read_trylock() is likely to
> > be part of the page fault function, but that's beside the point).
> >
> > To help bring back the old functionality, two trace points are added. One
> > just after instrumentation begins, and one just before it ends. This way,
> > we can see all the time that the kernel can do something meaningful, and we
> > will trace it.
>
> Seriously? That's completely insane. Have you actually looked how many
> instrumentation_begin()/end() pairs are in the affected code paths?
>
> Obviously not. It's a total of _five_ for every syscall and at least
> _four_ for every interrupt/exception from user mode.
>
> The number #1 design rule for instrumentation is to be as non-intrusive as
> possible and not to be as lazy as possible.

And it still is. It still uses nops when not enabled. I could even add a
config option so that this is only compiled in when the option is set,
letting production kernels leave it out entirely.

Just in case it's not obvious:

	if (tracepoint_enabled(instrumentation_begin))
		call_trace_instrumentation_begin(ip, pip);

is equivalent to:

	if (static_key_false(&__tracepoint_instrumentation_begin.key))
		call_trace_instrumentation_begin(ip, pip);
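
For anyone who wants to poke at the shape of that guarded call site outside
the kernel, here is a rough userspace sketch. It is only an analogy: a plain
bool stands in for the static key, so the disabled case is an ordinary
load-and-branch here rather than the patched nop the kernel gets from jump
labels. The names instrumentation_begin_enabled, call_probe(),
instrumentation_begin_hook() and demo() are made up for illustration and are
not from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for __tracepoint_instrumentation_begin.key. */
static bool instrumentation_begin_enabled;

/* Counts how many times the probe actually ran. */
static unsigned long probe_hits;

/* Hypothetical stand-in for call_trace_instrumentation_begin(ip, pip). */
static void call_probe(unsigned long ip, unsigned long pip)
{
	probe_hits++;
	printf("instrumentation_begin: ip=%#lx pip=%#lx\n", ip, pip);
}

/* The guarded call site: the probe is only reached when enabled. */
static inline void instrumentation_begin_hook(unsigned long ip,
					      unsigned long pip)
{
	if (instrumentation_begin_enabled)
		call_probe(ip, pip);
}

/* Small demonstration: a disabled site does nothing, an enabled one fires. */
static unsigned long demo(void)
{
	instrumentation_begin_hook(0x1000, 0x2000);	/* disabled: skipped */
	assert(probe_hits == 0);

	instrumentation_begin_enabled = true;
	instrumentation_begin_hook(0x1000, 0x2000);	/* enabled: probe runs */
	return probe_hits;
}
```

Calling demo() once from a main() shows the probe firing only after the flag
is set. The cost of the disabled path is where this sketch and the kernel
differ, and that cost is what the nop argument above is about.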

We have trace points in preempt_enable/disable(), and I think that's *far*
more intrusive.

>
> instrumentation_begin()/end() is solely meant for objtool validation and
> nothing else.
>
> There are clearly less horrible ways to retrieve the #PF duration, no?

It's not just for #PF, that was just one example. I used to use function
graph tracing with max_graph_depth=1 to verify NO_HZ_FULL, to make sure
there's no entry into the kernel. That doesn't work anymore. Even compat
syscalls are not traced.

I lost a kernel feature with the noinstr push, and this is the closest thing
to bringing it back. And the more we add noinstr, the more the kernel
becomes a black box again.

-- Steve