Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers

From: Jiaying Zhang
Date: Fri Mar 06 2009 - 18:29:00 EST

Next message: Mike Miller (OS Dev): "Re: [PATCH 1/2] resubmit cciss: kernel thread to detect changes on MSA2012"
Previous message: Sitsofe Wheeler: "Re: Problems with AGPGART/drm and X11 2D-acceleration (i8xx)"
In reply to: Mathieu Desnoyers: "Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Sorry for getting into this discussion late.

I would like to point out that we think it is really important to have
some very efficient probing mechanism in the kernel for tracing in
production environments. The printf and va_arg based probes are
flexible but less efficient when we want to trace high-throughput events.
Even function calls can add noticeable overhead according to our
measurements. So I think we need to provide a way (mostly via
macro definitions) with which a subsystem can enter an event into
a trace buffer through a short code path. I.e., we should limit the
number of callbacks and avoid format string parsing.

As I understand, Steven's latest TRACE_FIELD patch avoids such
overhead, although it does seem to add complexity for adding new
trace points. It would be nice if we can replace the above sched_switch
declaration with just a couple of macros.

Jiaying

On Fri, Mar 6, 2009 at 11:10 AM, Mathieu Desnoyers
<compudj@xxxxxxxxxxxxxxxxxx> wrote:
> * Steven Rostedt (rostedt@xxxxxxxxxxx) wrote:
>>
>> On Wed, 4 Mar 2009, Mathieu Desnoyers wrote:
>> > >
>> > > From previous patches, we have in include/trace/sched_event_types.h:
>> > >
>> > > #undef TRACE_SYSTEM
>> > > #define TRACE_SYSTEM sched
>> > >
>> > > TRACE_EVENT_FORMAT(sched_switch,
>> > > TPPROTO(struct rq *rq, struct task_struct *prev,
>> > > struct task_struct *next),
>> > > TPARGS(rq, prev, next),
>> > > TPFMT("task %s:%d ==> %s:%d",
>> > > prev->comm, prev->pid, next->comm, next->pid),
>> > > TRACE_STRUCT(
>> > > TRACE_FIELD(pid_t, prev_pid, prev->pid)
>> > > TRACE_FIELD(int, prev_prio, prev->prio)
>> > > TRACE_FIELD_SPECIAL(char next_comm[TASK_COMM_LEN],
>> > > next_comm,
>> > > TPCMD(memcpy(TRACE_ENTRY->next_comm,
>> > > next->comm,
>> > > TASK_COMM_LEN)))
>> > > TRACE_FIELD(pid_t, next_pid, next->pid)
>> > > TRACE_FIELD(int, next_prio, next->prio)
>> > > ),
>> > > TPRAWFMT("prev %d:%d ==> next %s:%d:%d")
>> > > );
>> > >
>> >
>> > I fear that putting these user-visible format strings in tracepoint
>> > header files will create a big maintainability issue.
>> >
>> > I'll post the LTTng patchset in a jiffy, where the format string
>> > awareness is done in a tracer-specific module. I don't understand why
>> > Peter Z. is not yelling against your tracepoints modifications : they
>> > are actually presenting to userspace an interface that is meant to
>> > eventually change.
>> >
>> > I used a separate layer for format string presentation for this very
>> > purpose : I don't want to tie the kernel code instrumentation
>> > (tracepoints) to any kind of user-visible API.
>> >
>>
>> Hi Mathieu,
>>
>> Sorry for the late reply, I started to reply but was distracted, and sadly
>> forgot about it :-( (and I rebooted the box that had the reply window
>> open)
>>
>>
>> Just a couple of points to make.
>>
>> 1) The raw format string is a "hint" for userspace to read the buffers.
>> The offset and sizeof is the real data that userspace should use. The
>> print format can be just a way to help them out. But it is by no means
>> something that is expected to be constant.
>>
>> It is also used internal by ftrace because it needs a way to read the
>> "fast C style recording" records. It is better to print some text than to
>> have a bunch of hex print out in the screen. Say you have the C style
>> recording enabled and you take an oops. With ftrace_dumps_on_oops set, the
>> buffers will be printed to the console. This lets ftrace have a means to
>> print out the events.
>>
>> 2) The difference between this and markers is that the format string is
>> bound to the declaration in the trace header. Not in the code itself. The
>> maintainer should only see the trace point function call. It should only
>> look like a function call and not some funky printf string that the
>> maintainer might not care about.
>>
>> If the code changes and it breaks the format string, it is up to the
>> tracing maintainers to fix it. If the format string was embedded within
>> the code, then the maintainer would be burdoned with that duty. But here
>> the format string is in the include/trace directory and any breakage there
>> should definitely be fixed by a some tracing maintainer. Not the
>> maintainer of the subsystem. Unless of course that maintainer wants to fix
>> it ;-)
>>
>>
>> 3) TRACE_EVENT_FORMAT is an extension to TRACE_FORMAT. Developers may
>> choose to use the more simple, but a bit more expensive TRACE_FORMAT if
>> they choose to. Perhaps it may be more prudent to anyway. Recently
>> Peter Zijlstra was tracing locks and just wanted to trace the name.
>> There's really no difference in copying the name via a C style (strcpy) or
>> having a printf formatting do the work. Here is a case where just using
>> TRACE_FORMAT is a reasonable solution.
>>
>> 4) I actually avoided moving the format string into the tracing utility.
>> The tracing utility can ignore it if it wants. I orginially had the
>> headers included in the kernel/trace/events.c file and found that was too
>> cumbersome. I did not like the fact that I needed to go into some tracing
>> infrastructure to add an event. It was a burdon to me to add a new event
>> and I wrote the code! That's why I moved it all into the include/trace
>> directory. One stop shopping.
>>
>> A developer to add new events only needs to add the trace points in their
>> code and then update the files (perhaps add new ones) in include/trace. No
>> need to dig through the kernel source to find out where else they need to
>> go.
>>
>>
>> I see that you published your patch queue. I've looked at most of the
>> patches already. I'll make my comments shortly.
>>
>
> Hi Steve,
>
> I totally agree that we want something like a format string to identify
> the data we export. My disagreement is on the location where such
> identification belongs. I have the funny feeling that you are currently
> creating no less than full-blown callbacks expressed as a weird, custom,
> macros-based, header definition, and I think it's seriously wrong.
>
> To paraphrase Andrew Morton : we are writing in C. Let's keep it that
> way.
>
> This is why I strongly prefer to use the tracepoint callback -> marker
> format string approach rather that putting any sort of format string in
> the tracepoint header.
>
> If what you really worry about is to have the ability to "quickly" add
> debug-style instrumentation when you write code, then you do not want,
> nor should ever, use a tracepoint, because tracepoint location is not
> meant to be decided without careful consideration. Markers are there
> fore this : you can add a quick-and-dirty one-liner :
>
> trace_mark(group, event, "field_identifier1 %u f_ident2 %u", ...);
>
> In your source code and you are all set. I don't see the problem with
> such approach ?
>
> Mathieu
>
>> Thanks,
>>
>> -- Steve
>>
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Mike Miller (OS Dev): "Re: [PATCH 1/2] resubmit cciss: kernel thread to detect changes on MSA2012"
Previous message: Sitsofe Wheeler: "Re: Problems with AGPGART/drm and X11 2D-acceleration (i8xx)"
In reply to: Mathieu Desnoyers: "Re: [PATCH 0/5] [RFC] binary reading of ftrace ring buffers"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]