RE: [discuss] BTS overflow handling, was: [PATCH] perf_counter:Fix a race on perf_counter_ctx

From: Peter Zijlstra
Date: Tue Sep 01 2009 - 09:53:31 EST

On Tue, 2009-09-01 at 14:32 +0100, Metzger, Markus T wrote:

> >> My current theory is that the BTS buffer fills up so quickly when tracing
> >> the kernel, that the kernel is busy handling overflows and reacting on
> >> other interrupts that pile up while we're handling the BTS overflow.
> >>
> >> When I trace user-mode branches, it works.
> >>
> >> When I do not copy the trace during overflow handling, the kernel does not hang.
> >
> >Agreed, that was my suspicion as well. Would you happen to know where to
> >get these USB debug port cables, and how to find out if a machine
> >supports this?
> I'm sorry but I don't understand what you mean with "these USB debug port cables".

Not sure either, whatever makes all the code from commit:
5c05917e7fe313a187ad6ebb94c1c6cf42862a0b usable I guess.

> >> I do need 3 buffers of 2048 entries = 3x48 pages per cpu, though.
> >
> >And those pages have to be contiguous too, right? That's an order-6
> >alloc, painful.
> According to an earlier discussion with Roland, they don't have to.
> They still need to be locked, though.
> According to some other discussion with Andrew and Ingo, I still use
> kmalloc to allocate those buffers.

Section 18.18.5 of book 3B says the DS buffer base is a linear address.
This suggests each buffer does need contiguous pages.

48 contiguous pages constitutes an order-6 allocation (64 pages), which
is unreliable at best.

> >> One buffer
> >> to switch in during overflow handling; another to switch in during sched_out
> >> (assuming that we need to schedule out the traced task before we may start
> >> the draining task). Even then, there's a chance that we will lose trace
> >> when the draining task may not start immediately. I would even say that
> >> this is quite likely.
> >
> >Right, is it possible to detect this loss?
> It is. But in order to get the PERF_EVENT_LOST record into the correct place,
> I need to defer logging the lost trace;-)
> And we would loose this very nice feature of fixed-size entries.

Well, if you don't defer there is no possibility of loosing data is
there? So we either defer and have to deal with lost data, or we do not.

> >This makes me wonder how much time it takes to drain these buffers, it
> >is at all possible to optimize that code path into oblivion, or will
> >nothing be fast enough?
> Are you saying that we should rather speed up that code path than try to
> defer all the work? There definitely is a lot of redundant work done on
> the generic path.
> I did a few experiments where I would drain only parts of the buffer.
> I could not drain too much before the system would hang.
> Besides, that does not sound too robust to me. Would it sill work on
> a slower system? Or on a faster one? Or on a fully loaded one?

Base cpu speed is what counts, load is not interesting.

Also it seems a normalizing property, the slower the cpu the less
branches it can process per time unit, so less data to process.

But yes, I was suggesting to optimize this, since the current way of
calling perf_counter_output() multiple times is massively bloated.

> >> There already seems to be something in place regarding deferring work, i..e.
> >> perf_counter_do_pending(). Would it be OK if I added the deferred BTS buffer
> >> draining to that?
> >
> >Yes, note that this pending work runs from hardirq context as well. On
> >x86 we self-ipi to get into IRQ context ASAP after the NMI.
> >
> >So if the remote cpu is blocked waiting on an SMP call, doing the work
> >from hardirq context won't really help things.
> I can't use that, then.
> When I use schedule_work() instead, how would I ensure that the work is done
> before the traced (or tracing) task is rescheduled?

No, basically the only thing left is softirqs, which can be preempted by
hardirqs, but that's a horrid hack too, esp since processing the BTS
outside of the handler will basically result in the BTS tracing its own
processing, generating even more data to process.

> I would need to ensure that the counter does not go away as long as draining
> work is scheduled. I would store a pointer to the counter in the work struct.
> Should perf_counter's be use-counted?

Yes, they currently rely on the reference counting of the filedesc
they're associated with, but that needs to be undone for in-kernel usage

> There's some more review feedback from you that I have not integrated, yet.
> One is that BTS should return an error instead of falling back to generic
> counters.
> Another is that BTS does not provide a counter value.
> How important are those?

Would be very nice to have.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at