Re: [tip:perf/urgent] perf, x86: Catch spurious interrupts afterdisabling counters

From: Stephane Eranian
Date: Wed Sep 29 2010 - 15:42:54 EST


On Wed, Sep 29, 2010 at 8:12 PM, Don Zickus <dzickus@xxxxxxxxxx> wrote:
> On Wed, Sep 29, 2010 at 07:09:24PM +0200, Robert Richter wrote:
>> On 29.09.10 12:00:35, Stephane Eranian wrote:
>> > Here is a scenario:
>> >
>> > event A -> counter 0, cpuc->running = 0x1 active_mask = 0x1
>> > move A
>> > event A -> counter 1, cpuc->running = 0x3, active_mask = 0x2
>> >
>> > No interrupt, we are just counting for a short period.
>> > Then, you get an NMI interrupt, suppose it is not generated
>> > by the PMU, it is destined for another handler.
>> >
>> > For i=0, you have (active_mask & 0x1) == 0, but (running & 0x1) == 1,
>> > you mark the interrupt as handled, i.e., you swallow it, the actual
>> > handler never gets it.
>>
>> Yes, then changing the counters you will get *one* nmi with 2 handled
>> counters. This is valid as the disabled counter could generate a
>> spurious interrupt. But you get (handled == 2) instead of (handled ==
>> 1) which is not much impact. All following nmis have (handled == 1)
>> then again.
>
> Robert,
>
> I think you missed Stephane's point. ÂSay for example, kgdb is being used
> while we are doing stuff with the perf counter (and say kgdb's handler is
> a lower priority than perf; which isn't true I know, but let's say):
>
Yes, exactly my point. The reality is you cannot afford to have false positive
because you may starve another subsystem from an important notification.

I think it boils down to whether or not we need an error message (Dazed) in
case no subsystem claimed the NMI. If you were to just silently consume the
NMI when no subsystem claims it, then you would not have these issues.

What Don has done is use a heuristic which gets activated when a PMU
interrupt handler signals that more than one counter have overflowed. His
claim is that this situation is likely to trigger back-to-back.

The reason this heuristic works is because it waits until ALL the subsystems
have seen the notification before it declares that the NMI was PMU spurious.
To do that is uses the DIE_NMI_UNKNOWN callchain. Handler on this chain
get call last, after all subsystems have seen the notification once. I believe
that is the only way to safely "consume" a "spurious" NMI and avoid
the 'Dazed' message. Anything else runs the risks of starving the other
subsystems.

> perf NMI comes in, issues pmu_stop 'cleanly' (meaning no spurious
> interrupt). ÂThe cpuc->running bit is never cleared.
>
> kgdb NMI comes in, but the die_chain dictates perf looks at it first.
> perf will see that cpuc->active == 0 and cpuc->running == 1 and bump
> handled. ÂThus returning NOTIFY_STOP. Âkgdb never sees the NMI. :-(
>
> Now I sent a patch last week that can prevent that extra NMI from being
> generated at the cost of another rdmsrl in the non-pmu_stop cases (which I
> will attach below again, obviously P4 would need something similar too).
>
> I think we currently don't see the problems Stephane describes because the
> only thing we test that uses NMIs are perf, which also happens to be a low
> priority on the die_chain.
>
> But it is an interesting scenario that we should look at more.
>
> Cheers,
> Don
>
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 48c6d8d..1642f48 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1175,11 +1175,22 @@ static int x86_pmu_handle_irq(struct pt_regs *regs)
> Â Â Â Â Â Â Â Âhandled++;
>        Âdata.period   = event->hw.last_period;
>
> - Â Â Â Â Â Â Â if (!x86_perf_event_set_period(event))
> - Â Â Â Â Â Â Â Â Â Â Â continue;
> + Â Â Â Â Â Â Â /*
> + Â Â Â Â Â Â Â Â* if period is over, process the overflow
> + Â Â Â Â Â Â Â Â* before reseting the counter, otherwise
> + Â Â Â Â Â Â Â Â* a new overflow could occur before the
> + Â Â Â Â Â Â Â Â* event is stopped
> + Â Â Â Â Â Â Â Â*/
> + Â Â Â Â Â Â Â if (local64_read(&hwc->period_left) <= 0) {
> + Â Â Â Â Â Â Â Â Â Â Â if (perf_event_overflow(event, 1, &data, regs)) {
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â x86_pmu_stop(event, 0);
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â continue;
> + Â Â Â Â Â Â Â Â Â Â Â }
> + Â Â Â Â Â Â Â Â Â Â Â /* if the overflow doesn't stop the event, resync */
> + Â Â Â Â Â Â Â Â Â Â Â x86_perf_event_update(event);
> + Â Â Â Â Â Â Â }
>
> - Â Â Â Â Â Â Â if (perf_event_overflow(event, 1, &data, regs))
> - Â Â Â Â Â Â Â Â Â Â Â x86_pmu_stop(event, 0);
> + Â Â Â Â Â Â Â x86_perf_event_set_period(event);
> Â Â Â Â}
>
> Â Â Â Âif (handled)
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/