Re: [discuss] BTS overflow handling, was: [PATCH] perf_counter:Fix a race on perf_counter_ctx

From: Peter Zijlstra
Date: Tue Sep 01 2009 - 09:01:00 EST

On Tue, 2009-09-01 at 12:17 +0100, Metzger, Markus T wrote:

> My current theory is that the BTS buffer fills up so quickly when tracing
> the kernel, that the kernel is busy handling overflows and reacting on
> other interrupts that pile up while we're handling the BTS overflow.
> When I trace user-mode branches, it works.
> When I do not copy the trace during overflow handling, the kernel does not hang.

Agreed, that was my suspicion as well. Would you happen to know where to
get these USB debug port cables, and how to find out if a machine
supports this?

> When I attach a jtag debugger to a hung system

Sweet, x86 JTAG.. want.. ;-)

> (perf top and perf record
> -e branches -c 1), I find that one core is waiting for an smp call
> response, while the other core is busy emptying the BTS buffer.
> When I then disable branch tracing (the debugger prevents the kernel
> from changing DEBUGCTL to enable tracing again), the system recovers.
> I have a patch that switches buffers during overflow handling and leaves
> the draining for later (which currently never happens) - the kernel does
> not hang, in that case.
> I do need 3 buffers of 2048 entries = 3x48 pages per cpu, though.

And those pages have to be contiguous too, right? That's an order-6
alloc, painful.
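
For reference, a rough sketch of the arithmetic behind "order-6", assuming
the 48 pages per buffer quoted above and a power-of-two page allocator.
pages_to_order() is a hypothetical user-space helper, not a kernel API:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: smallest order such that (1 << order) pages
 * cover nr_pages. A power-of-two allocator hands out 2^order
 * contiguous pages, so the request is rounded up.
 */
static unsigned int pages_to_order(size_t nr_pages)
{
	unsigned int order = 0;

	assert(nr_pages > 0);
	while (((size_t)1 << order) < nr_pages)
		order++;
	return order;
}
```

With 48 pages this rounds up to 64 contiguous pages, i.e. order 6, per
buffer.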

> One buffer
> to switch in during overflow handling; another to switch in during sched_out
> (assuming that we need to schedule out the traced task before we may start
> the draining task). Even then, there's a chance that we will lose trace
> when the draining task may not start immediately. I would even say that
> this is quite likely.

Right, is it possible to detect this loss?
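
One way the switching could at least count lost buffers in software, as a
user-space sketch only. It assumes the three per-cpu buffers mentioned
above; struct bts_bufs and the bufs_*() functions are made-up names, and
the actual hardware reprogramming and trace copy are left out:

```c
#include <assert.h>

#define NR_BUFS 3	/* assumption: the three buffers per cpu from above */

enum buf_state { BUF_FREE, BUF_ACTIVE, BUF_FULL };

struct bts_bufs {
	enum buf_state state[NR_BUFS];
	int active;		/* buffer the hardware would write to */
	unsigned long lost;	/* overflows that found no free buffer */
};

static void bufs_init(struct bts_bufs *b)
{
	for (int i = 0; i < NR_BUFS; i++)
		b->state[i] = BUF_FREE;
	b->active = 0;
	b->state[0] = BUF_ACTIVE;
	b->lost = 0;
}

/*
 * Overflow path: no copying, only switch to a free buffer.
 * Returns 1 on a switch, 0 if the active buffer had to be reused,
 * in which case its contents count as lost.
 */
static int bufs_overflow(struct bts_bufs *b)
{
	assert(b->active >= 0 && b->active < NR_BUFS);
	for (int i = 0; i < NR_BUFS; i++) {
		if (b->state[i] == BUF_FREE) {
			b->state[b->active] = BUF_FULL;
			b->state[i] = BUF_ACTIVE;
			b->active = i;
			return 1;
		}
	}
	b->lost++;
	return 0;
}

/*
 * Deferred path: hand back one full buffer after draining it.
 * Returns the buffer index, or -1 if nothing was waiting.
 */
static int bufs_drain_one(struct bts_bufs *b)
{
	for (int i = 0; i < NR_BUFS; i++) {
		if (b->state[i] == BUF_FULL) {
			b->state[i] = BUF_FREE;
			return i;
		}
	}
	return -1;
}
```

A non-zero lost count after a run would tell us the draining side did not
keep up, although not how many branch records went missing.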

This makes me wonder how much time it takes to drain these buffers. Is
it at all possible to optimize that code path into oblivion, or will
nothing be fast enough?

> What I do not have, yet, is the actual draining. Draining needs to start
> after the counter has been disabled. But draining needs the perf_counter
> to drain the trace into. The counter will thus be busy after it has been
> disabled - ugly.

Yes, this is a tad weird...

> There already seems to be something in place regarding deferring work, i.e.
> perf_counter_do_pending(). Would it be OK if I added the deferred BTS buffer
> draining to that?

Yes, note that this pending work runs from hardirq context as well. On
x86 we self-ipi to get into IRQ context ASAP after the NMI.

So if the remote cpu is blocked waiting on an SMP call, doing the work
from hardirq context won't really help things.
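
The general shape of that deferral, as a simplified user-space sketch
using C11 atomics. This is not the perf_counter code; struct pending_work
and the work_*() functions are invented for illustration, and the self-IPI
is only hinted at in a comment:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct pending_work {
	atomic_int pending;	/* set from the NMI side */
	int runs;		/* how often the deferred work ran */
};

/*
 * NMI side: only flag the work. Real code would also raise a
 * self-IPI here so the IRQ side runs as soon as possible.
 */
static void work_queue(struct pending_work *w)
{
	assert(w != NULL);
	atomic_store(&w->pending, 1);
}

/*
 * IRQ side: claim the flag and run the work once, no matter how
 * many times it was queued in between. Returns 1 if work ran.
 */
static int work_do_pending(struct pending_work *w)
{
	assert(w != NULL);
	if (!atomic_exchange(&w->pending, 0))
		return 0;
	w->runs++;		/* stand-in for draining the buffer */
	return 1;
}
```

The point is that the NMI side stays tiny; all the expensive work still
lands in hardirq context, with the problem described above.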

> Looks like this would guarantee that the counter does not go away as long
> as there is work pending. Is this correct?

Agreed, it waits for all pending bits to complete before destroying the
counter.

> In any case, this is getting late for the upcoming merge window.
> Would you rather drop the BTS patch or disable kernel tracing?

I don't think we need to drop it, at worst we could defer the patch
to .33, but I think we can indeed get away with disabling the kernel
tracing for now.
