Re: perf: WARNING perfevents: irq loop stuck!

From: Vince Weaver
Date: Fri May 01 2015 - 13:15:25 EST


On Fri, 1 May 2015, Ingo Molnar wrote:

>
> * Vince Weaver <vincent.weaver@xxxxxxxxx> wrote:
>
> > So this is just a warning, and I've reported it before, but the
> > perf_fuzzer triggers this fairly regularly on my Haswell system.
> >
> > It looks like fixed counter 0 (retired instructions) being set to
> > 0000fffffffffffe occasionally causes an irq loop storm and gets
> > stuck until the PMU state is cleared.
>
> So 0000fffffffffffe corresponds to 2 events left until overflow,
> right? And on Haswell we don't set x86_pmu.limit_period AFAICS, so we
> allow these super short periods.
>
> Maybe like on Broadwell we need a quirk on Nehalem/Haswell as well,
> one similar to bdw_limit_period()? Something like the patch below?

I spent the morning trying to build a reproducer for this, and it turns
out to be complicated. In addition to fixed counter 0 being set to -2
(0000fffffffffffe), it seems at least one other non-fixed counter must
also be about to overflow.
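
To make the counter values concrete, here is a minimal standalone sketch of
the arithmetic, assuming 48-bit wide counters (which the 0000ffff... values
in the dump suggest). The names CNTVAL_BITS, period_to_raw() and
events_until_overflow() are made up for illustration; this is not the
kernel's code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: counters are 48 bits wide, as the 0000ffff... values suggest. */
#define CNTVAL_BITS 48
#define CNTVAL_MASK ((UINT64_C(1) << CNTVAL_BITS) - 1)

/*
 * Raw value to load into a counter so that it overflows after 'left'
 * more events: the two's complement of 'left', truncated to the
 * counter width.
 */
static uint64_t period_to_raw(uint64_t left)
{
	assert(left > 0 && left <= CNTVAL_MASK);
	return (UINT64_C(0) - left) & CNTVAL_MASK;
}

/* Number of events until a counter currently holding 'raw' wraps to zero. */
static uint64_t events_until_overflow(uint64_t raw)
{
	return (CNTVAL_MASK + 1) - (raw & CNTVAL_MASK);
}
```

With these helpers, period_to_raw(2) gives 0000fffffffffffe, i.e. -2
truncated to 48 bits, and events_until_overflow() on that value gives 2.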

For example, in this case gen-PMC2 is also poised to overflow at the same
time.

CPU#0: gen-PMC2 ctrl: 00000003ff96764b
CPU#0: gen-PMC2 count: 0000000000000001
gen-PMC2 left: 0000ffffffffffff
...
[ 2408.612442] CPU#0: fixed-PMC0 count: 0000fffffffffffe


It's not always PMC2, but in the warnings there's at least one other
gen-PMC about to overflow at the exact same time as the fixed one.

Vince