Re: [RFC PATCH 1/5] tracing: Define new ftrace event "func_repeats"

From: Steven Rostedt
Date: Mon Mar 08 2021 - 13:17:06 EST


On Mon, 8 Mar 2021 16:57:29 +0200
"Yordan Karadzhov (VMware)" <y.karadz@xxxxxxxxx> wrote:

> > If we had:
> >
> > <...>-37 [004] ...1 2022.303820: gc_worker <-process_one_work
> > <...>-37 [004] ...1 2022.303820: ___might_sleep <-gc_worker
> > <...>-37 [004] ...1 2022.303831: ___might_sleep <-gc_worker (last ts: 2022.303828 repeats: 127)
> > <...>-37 [004] ...1 2022.303831: queue_delayed_work_on <-process_one_work
> >
> > We would know the last time __might_sleep was called.
> >
> > That is, not only should we save the ip and pip in the trace_func_repeats
> > structure, but we should also be storing the last time stamp of the last
> > function event that repeated. Otherwise the above looks like the last
> > __might_sleep called above happened when the queue_delayed_work_on
> > happened, where that may not be the case.
>
> If we store the last timestamp, this means we will need to use
> additional 64b on the buffer, every time we record the "func_repeats"
> event. This looks like an overkill to me.
> Can we store only the duration of the repeats (the difference between
> the timestamp)? This way we can use less memory at the price of having
> one extra arithmetic operation.
> Alternative approach can be to store only the least-significant bits of
> the timestamp.
>
> What do you think?

I like the way you're thinking ;-)

Let's take a look at the current event sizes.

The function event is defined as:

name: function
ID: 1
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;

field:unsigned long ip; offset:8; size:8; signed:0;
field:unsigned long parent_ip; offset:16; size:8; signed:0;

Showing that it's total size is 24 bytes (offset:16 + size:8)

and your current repeat event has:

ID: 17
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;

field:unsigned long ip; offset:8; size:8; signed:0;
field:unsigned long pip; offset:16; size:8; signed:0;
field:unsigned long count; offset:24; size:8; signed:0;

Which is 32 bytes. Adding another 8 would make it 40. It's bigger than one
function event, but still smaller than two, and adding just 4 bytes is
still some size more than a single function event.

Since this event is new, we could even do another trick (one that the
preemptirq events do:

ID: 434
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;

field:s32 caller_offs; offset:8; size:4; signed:1;
field:s32 parent_offs; offset:12; size:4; signed:1;

print fmt: "caller=%pS parent=%pS", (void *)((unsigned long)(_stext) + REC->caller_offs), (void *)((unsigned long)(_stext) + REC->parent_offs)

The (_stext) is defined in /proc/kallsyms:

# grep stext /proc/kallsyms
ffffffff8d000000 T _stext

And the offsets are off of that, which we could do the same thing here.

field:s32 ip; offset:8; size:4; signed:1;
field:s32 pip; offset:12; size:4; signed:1;
field:s16 count; offset:20; size:2; signed:1;
field:u16 top_ts; offset:22; size:2; signed:0;
field:u32 bottom_ts; offset:24; size:4; signed:0;

By putting count down to 2 bytes. Having more than 64K repeats is
probably not going to happen, and if it does, we could inject this more
than once ;-)

And as all events must be 4 byte aligned, we could use 6 bytes for the
offset, giving us: 2^40 bits instead of just 2^32 bits, that is, making the
time go from 4 seconds (2^32 nanoseconds) to 18 minutes (2^40 nanoseconds).

I would grab the time stamp after saving the event to make sure that it
doesn't go backwards. It may be a little racy (if an interrupt comes in
between), but it's still a "best effort" approach.

-- Steve