Re: [RFC] x86, perf: high volume of events produces a flood of unknown NMIs

From: Stephane Eranian
Date: Thu Apr 21 2011 - 09:40:38 EST


On Thu, Apr 21, 2011 at 3:28 PM, Don Zickus <dzickus@xxxxxxxxxx> wrote:
> On Wed, Apr 20, 2011 at 09:26:39PM +0200, Stephane Eranian wrote:
>> Don,
>>
>> May I suggest that the Intel interrupt handler double-check the
>> counters directly to verify that their state actually matches
>> the bitmask in GLOBAL_OVF_STATUS? I believe on some
>> CPUs the two may sometimes disagree.
>
> Hmm, I guess I only tested that suggestion on my Xeon core2. My Nehalem
> box seems to have lots of false positives. :-/

Yep, I have seen that myself. I think the only reliable source would be the
counters themselves and not GLOBAL_OVF_STATUS. I wonder what would happen
if we switched to reading the counters to build our own overflow bitmask.
Would that mitigate the spurious NMIs you're seeing?
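
Roughly what I have in mind, as an untested sketch rather than a patch
(it would sit in arch/x86/kernel/cpu/perf_event.c; the field names
x86_pmu.perfctr, x86_pmu.cntval_bits and cpuc->active_mask follow the
current code there, and the helper name is made up):

/*
 * Sketch: derive the overflow bitmask from the counters themselves
 * instead of trusting GLOBAL_OVF_STATUS.  A counter that overflowed
 * has wrapped, so its current value no longer has the sign bit
 * (bit cntval_bits - 1) set.
 */
static u64 x86_pmu_read_overflow_mask(struct cpu_hw_events *cpuc)
{
	u64 mask = 0;
	u64 val;
	int idx;

	for (idx = 0; idx < x86_pmu.num_counters; idx++) {
		if (!test_bit(idx, cpuc->active_mask))
			continue;

		rdmsrl(x86_pmu.perfctr + idx, val);
		if (!(val & (1ULL << (x86_pmu.cntval_bits - 1))))
			mask |= 1ULL << idx;	/* wrapped since it was armed */
	}

	return mask;
}

intel_pmu_handle_irq() could then loop over that mask instead of (or in
addition to) the status it reads from GLOBAL_OVF_STATUS, which would at
least tell us whether the two actually disagree on these machines.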

>
> Cheers,
> Don
>
>>
>>
>>
>> On Wed, Apr 20, 2011 at 8:26 PM, Don Zickus <dzickus@xxxxxxxxxx> wrote:
>> > Hi,
>> >
>> > Arnaldo pointed me at an NMI problem that happens when he tries to
>> > generate a high volume of perf events. He receives a flood of unknown
>> > NMIs.
>> >
>> > I have been poking at the problem and came up with a patch, but it doesn't
>> > always work. I was hoping people who understand how NMIs work at a
>> > low level might be able to help me.
>> >
>> > I was able to duplicate this on an AMD Phenom, Pentium 4, Xeon Core2quad,
>> > and Nehalem. I think the problem is the large number of back-to-back NMIs
>> > being generated. The perf NMI handler may accidentally handle some of those
>> > extra/in-flight NMIs in its first pass, leaving the next NMI unhandled and
>> > generating an unknown NMI message.
>> >
>> > Robert included logic to check for two back-to-back NMIs, but that falls
>> > short when more than three are generated. I modified his logic to account
>> > for three back-to-back NMIs, but that didn't completely solve the problem.
>> >
>> > I took another approach to catching back-to-back NMIs that seemed to work
>> > on all my machines except for the Xeon core2quad, but I am not entirely
>> > sure if my approach is valid.
>> >
>> > The approach I took was based on the idea that if another NMI is generated
>> > while we are still inside an NMI handler, then when the current NMI finishes
>> > it won't resume the instruction it interrupted but will instead jump straight
>> > back into another NMI exception frame.
>> >
>> > As a result, the args passed in to the NMI handler should have the same ip
>> > and sp as the previous NMI interrupt. Otherwise one could assume that
>> > some amount of time passed between interrupts (enough to return from the
>> > exception and execute code).
>> >
>> > I thought this would allow me to trap an infinite number of back-to-back
>> > NMIs. Like I said, it seemed to work on a number of machines, except for
>> > my Xeon core2quad.
>> >
>> > Does anyone know if my approach is a valid one? Or is there a better way
>> > to catch this condition? Or maybe some other tips or tricks I can use to
>> > help come up with a solution for this?
>> >
>> > Or perhaps we don't care about this because in the end perf can't even
>> > capture the data without spitting out a CPU Overload message.
>> >
>> > Thoughts?
>> >
>> > Attached is the crude patch I was using for testing. It also includes
>> > another patch that moves the apic LVTPC un-masking to fix the Pentium4
>> > machines.
>> >
>> > The commands I run to generate this problem are:
>> >
>> > shell1> mount -t debugfs debugfs /sys/kernel/debug
>> > shell1> cd /sys/kernel/debug/tracing
>> > shell1> cat trace_pipe
>> >
>> > shell2> <grab kernel source>
>> > shell2> <cd to kernel source>
>> > shell2> make -j8
>> >
>> > shell3> perf record grep -r foo / &> /dev/null &
>> > shell3> perf record -F 100000 -g -e cycles -e instructions -e cache-misses
>> > --pid <pidof make>
>> >
>> > It takes about a minute or two to duplicate.
>> >
>> > Cheers,
>> > Don
>> >
>> > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
>> > index eed3673a..20aa734 100644
>> > --- a/arch/x86/kernel/cpu/perf_event.c
>> > +++ b/arch/x86/kernel/cpu/perf_event.c
>> > @@ -104,7 +104,6 @@ struct cpu_hw_events {
>> >  	 */
>> >  	struct perf_event	*events[X86_PMC_IDX_MAX]; /* in counter order */
>> >  	unsigned long		active_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
>> > -	unsigned long		running[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
>> >  	int			enabled;
>> >
>> >  	int			n_events;
>> > @@ -1160,7 +1159,6 @@ static void x86_pmu_start(struct perf_event *event, int flags)
>> >
>> >  	cpuc->events[idx] = event;
>> >  	__set_bit(idx, cpuc->active_mask);
>> > -	__set_bit(idx, cpuc->running);
>> >  	x86_pmu.enable(event);
>> >  	perf_event_update_userpage(event);
>> >  }
>> > @@ -1284,15 +1282,11 @@ static int x86_pmu_handle_irq(struct pt_regs *regs)
>> >
>> >  	cpuc = &__get_cpu_var(cpu_hw_events);
>> >
>> > +	/* chipsets have their own quirks when to unmask */
>> > +	apic_write(APIC_LVTPC, APIC_DM_NMI);
>> > +
>> >  	for (idx = 0; idx < x86_pmu.num_counters; idx++) {
>> >  		if (!test_bit(idx, cpuc->active_mask)) {
>> > -			/*
>> > -			 * Though we deactivated the counter some cpus
>> > -			 * might still deliver spurious interrupts still
>> > -			 * in flight. Catch them:
>> > -			 */
>> > -			if (__test_and_clear_bit(idx, cpuc->running))
>> > -				handled++;
>> >  			continue;
>> >  		}
>> >
>> > @@ -1333,8 +1327,8 @@ void perf_events_lapic_init(void)
>> >  }
>> >
>> >  struct pmu_nmi_state {
>> > -	unsigned int	marked;
>> > -	int		handled;
>> > +	unsigned long	ip;
>> > +	unsigned long	sp;
>> >  };
>> >
>> >  static DEFINE_PER_CPU(struct pmu_nmi_state, pmu_nmi);
>> > @@ -1344,8 +1338,9 @@ perf_event_nmi_handler(struct notifier_block *self,
>> >  			 unsigned long cmd, void *__args)
>> >  {
>> >  	struct die_args *args = __args;
>> > -	unsigned int this_nmi;
>> >  	int handled;
>> > +	unsigned long ip = __this_cpu_read(pmu_nmi.ip);
>> > +	unsigned long sp = __this_cpu_read(pmu_nmi.sp);
>> >
>> >  	if (!atomic_read(&active_events))
>> >  		return NOTIFY_DONE;
>> > @@ -1353,46 +1348,20 @@ perf_event_nmi_handler(struct notifier_block *self,
>> >  	switch (cmd) {
>> >  	case DIE_NMI:
>> >  		break;
>> > -	case DIE_NMIUNKNOWN:
>> > -		this_nmi = percpu_read(irq_stat.__nmi_count);
>> > -		if (this_nmi != __this_cpu_read(pmu_nmi.marked))
>> > -			/* let the kernel handle the unknown nmi */
>> > -			return NOTIFY_DONE;
>> > -		/*
>> > -		 * This one is a PMU back-to-back nmi. Two events
>> > -		 * trigger 'simultaneously' raising two back-to-back
>> > -		 * NMIs. If the first NMI handles both, the latter
>> > -		 * will be empty and daze the CPU. So, we drop it to
>> > -		 * avoid false-positive 'unknown nmi' messages.
>> > -		 */
>> > -		return NOTIFY_STOP;
>> >  	default:
>> >  		return NOTIFY_DONE;
>> >  	}
>> >
>> > -	apic_write(APIC_LVTPC, APIC_DM_NMI);
>> > -
>> > +	__this_cpu_write(pmu_nmi.ip, args->regs->ip);
>> > +	__this_cpu_write(pmu_nmi.sp, args->regs->sp);
>> >  	handled = x86_pmu.handle_irq(args->regs);
>> > -	if (!handled)
>> > +	if (!handled) {
>> > +		if ((args->regs->ip == ip) && (args->regs->sp == sp)) {
>> > +			trace_printk("MATCH: ip - 0x%08lx, sp - 0x%08lx\n", ip, sp);
>> > +			return NOTIFY_STOP;
>> > +		} else
>> > +			trace_printk("nomatch: ip - 0x%08lx (0x%08lx), sp - 0x%08lx (0x%08lx)\n", ip, args->regs->ip, sp, args->regs->sp);
>> >  		return NOTIFY_DONE;
>> > -
>> > -	this_nmi = percpu_read(irq_stat.__nmi_count);
>> > -	if ((handled > 1) ||
>> > -		/* the next nmi could be a back-to-back nmi */
>> > -	    ((__this_cpu_read(pmu_nmi.marked) == this_nmi) &&
>> > -	     (__this_cpu_read(pmu_nmi.handled) > 1))) {
>> > -		/*
>> > -		 * We could have two subsequent back-to-back nmis: The
>> > -		 * first handles more than one counter, the 2nd
>> > -		 * handles only one counter and the 3rd handles no
>> > -		 * counter.
>> > -		 *
>> > -		 * This is the 2nd nmi because the previous was
>> > -		 * handling more than one counter. We will mark the
>> > -		 * next (3rd) and then drop it if unhandled.
>> > -		 */
>> > -		__this_cpu_write(pmu_nmi.marked, this_nmi + 1);
>> > -		__this_cpu_write(pmu_nmi.handled, handled);
>> >  	}
>> >
>> >  	return NOTIFY_STOP;
>> > diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
>> > index 8fc2b2c..99b5151 100644
>> > --- a/arch/x86/kernel/cpu/perf_event_intel.c
>> > +++ b/arch/x86/kernel/cpu/perf_event_intel.c
>> > @@ -937,6 +937,9 @@ static int intel_pmu_handle_irq(struct pt_regs *regs)
>> >  	handled = intel_pmu_drain_bts_buffer();
>> >  	status = intel_pmu_get_status();
>> >  	if (!status) {
>> > +		/* chipsets have their own quirks when to unmask */
>> > +		apic_write(APIC_LVTPC, APIC_DM_NMI);
>> > +
>> >  		intel_pmu_enable_all(0);
>> >  		return handled;
>> >  	}
>> > @@ -988,6 +991,9 @@ again:
>> >  		goto again;
>> >
>> >  done:
>> > +	/* chipsets have their own quirks when to unmask */
>> > +	apic_write(APIC_LVTPC, APIC_DM_NMI);
>> > +
>> >  	intel_pmu_enable_all(0);
>> >  	return handled;
>> >  }
>> > diff --git a/arch/x86/kernel/cpu/perf_event_p4.c b/arch/x86/kernel/cpu/perf_event_p4.c
>> > index c2520e1..612bc0e 100644
>> > --- a/arch/x86/kernel/cpu/perf_event_p4.c
>> > +++ b/arch/x86/kernel/cpu/perf_event_p4.c
>> > @@ -921,9 +921,6 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
>> >  		int overflow;
>> >
>> >  		if (!test_bit(idx, cpuc->active_mask)) {
>> > -			/* catch in-flight IRQs */
>> > -			if (__test_and_clear_bit(idx, cpuc->running))
>> > -				handled++;
>> >  			continue;
>> >  		}
>> >
>> > @@ -950,11 +947,17 @@ static int p4_pmu_handle_irq(struct pt_regs *regs)
>> >  			p4_pmu_disable_event(event);
>> >  	}
>> >
>> > -	if (handled) {
>> > -		/* p4 quirk: unmask it again */
>> > -		apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & ~APIC_LVT_MASKED);
>> > +	if (handled)
>> >  		inc_irq_stat(apic_perf_irqs);
>> > -	}
>> > +
>> > +	/*
>> > +	 * P4 quirks:
>> > +	 * - An overflown perfctr will assert its interrupt
>> > +	 *   until the OVF flag in its CCCR is cleared.
>> > +	 * - LVTPC is masked on interrupt and must be
>> > +	 *   unmasked by the LVTPC handler.
>> > +	 */
>> > +	apic_write(APIC_LVTPC, APIC_DM_NMI);
>> >
>> >  	return handled;
>> >  }
>> >
>> >
>> >
>