Re: PEBS bug on HSW: "Unexpected number of pebs records 10" (was: Re:[GIT PULL] perf changes for v3.12)

From: Stephane Eranian
Date: Mon Sep 23 2013 - 11:25:30 EST


On Mon, Sep 16, 2013 at 6:29 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Mon, Sep 16, 2013 at 05:41:46PM +0200, Ingo Molnar wrote:
>>
>> * Stephane Eranian <eranian@xxxxxxxxxxxxxx> wrote:
>>
>> > Hi,
>> >
>> > Some updates on this problem.
>> > I have been running tests all week-end long on my HSW.
>> > I can reproduce the problem. What I know:
>> >
>> > - It is not linked with callchain
>> > - The extra entries are valid
>> > - The reset values are still zeroes
>> > - The problem does not happen on SNB with the same test case
>> > - The PMU state looks sane when that happens.
>> > - The problem occurs even when restricting to one CPU/core (taskset -c 0-3)
>> >
>> > So it seems like the threshold is ignored. But I don't understand where
>> > there reset values are coming from. So it looks more like a bug in
>> > micro-code where under certain circumstances multiple entries get
>> > written.
>>
>> Either multiple entries are written, or the PMI/NMI is not asserted as it
>> should be?
>
> No, both :-)
>
>> > Something must be happening with the interrupt or HT. I will disable HT
>> > next and also disable the NMI watchdog.
>>
>> Yes, interaction with the NMI watchdog events might also be possible.
>>
>> If it's truly just the threshold that is broken occasionally in a
>> statistically insignificant manner then the bug is relatively benign and
>> we could work it around in the kernel by ignoring excess entries.
>>
>> In that case we should probably not annoy users with the scary kernel
>> warning and instead increase a debug count somewhere so that it's still
>> detectable.
>
> It's not just a broken threshold. When a PEBS event happens it can re-arm
> itself but only if you program a RESET value !0. We don't do that, so
> each counter should only ever fire once.
>
> We must do this because PEBS is broken on NHM+ in that the
> pebs_record::status is a direct copy of the overflow status field at
> time of the assist and if you use the RESET thing nothing will clear the
> status bits and you cannot demux the PEBS events back to the event that
> generated them.
>
I am trying to understand this problem better. You are saying that when
sampling multiple PEBS events, allowing more than one record per PEBS
buffer is a problem because the overflow status is not reset properly.

For instance, if the first record is caused by counter 0, then
ovfl_status=0x1 and the counter is reset. If counter 1 then causes the
next record, that record has ovfl_status=0x3 instead of ovfl_status=0x2?
Is that what you are saying?

If so then yes, I agree this is a serious bug and we need to have Intel fix it.
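
To make sure I have the scenario right, here is a minimal user-space sketch
of it. This is only a model, not kernel code: struct pmu_model,
pebs_assist(), status_is_unambiguous() and NUM_COUNTERS are hypothetical
names made up for illustration. It assumes, as described above, that the
record status is a direct copy of the overflow status and that nothing
clears the bits between records.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_COUNTERS 4  /* hypothetical; not taken from real hardware */

/* Hypothetical model of the overflow status, not a real kernel structure. */
struct pmu_model {
    uint64_t ovf_status;  /* one bit per counter, set on overflow */
};

/*
 * Model one PEBS assist for `counter`: the overflow bit is set and the
 * record's status is a direct copy of the overflow status. Nothing clears
 * the bit afterwards, which is the assumed buggy behaviour.
 */
static uint64_t pebs_assist(struct pmu_model *pmu, unsigned int counter)
{
    assert(counter < NUM_COUNTERS);
    pmu->ovf_status |= 1ULL << counter;
    return pmu->ovf_status;  /* value stored in the record's status field */
}

/* A record maps back to a single event only if exactly one bit is set. */
static int status_is_unambiguous(uint64_t status)
{
    return status != 0 && (status & (status - 1)) == 0;
}
```

In this model, calling pebs_assist() for counter 0 and then for counter 1 on
the same struct pmu_model gives a first record with status 0x1 and a second
record with status 0x3, so the second record can no longer be attributed to
counter 1 alone.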

> Worse, since it's the overflow that arms the assist, and the assist
> happens at some undefined amount of cycles after this event it is
> possible for another assist to happen first.
>
> That is, suppose both CNT0 and CNT1 have PEBS enabled and CNT0 overflows
> first it is possible to find the CNT1 entry first in the buffer with
> both of them having status := 0x03.
>
> Complete and utter trainwreck.
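
The ordering case can be modelled the same way. Again this is only a
sketch under stated assumptions: struct pebs_model, counter_overflow(),
pebs_assist_fires() and MAX_RECORDS are hypothetical names, and the model
assumes the overflow merely sets the status bit while the assist runs some
time later and copies whatever the status is at that moment.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RECORDS 8  /* hypothetical buffer size for the model */

/* Hypothetical model of the PEBS buffer; only the status field is kept. */
struct pebs_model {
    uint64_t ovf_status;                  /* one bit per counter */
    uint64_t record_status[MAX_RECORDS];  /* status stored in each record */
    unsigned int nr_records;
};

/* The overflow only sets the status bit and arms the assist. */
static void counter_overflow(struct pebs_model *m, unsigned int counter)
{
    m->ovf_status |= 1ULL << counter;
}

/*
 * The assist fires later and writes a record whose status is a copy of
 * the overflow status at that time. The record does not say which
 * counter's assist produced it.
 */
static void pebs_assist_fires(struct pebs_model *m)
{
    assert(m->nr_records < MAX_RECORDS);
    m->record_status[m->nr_records++] = m->ovf_status;
}
```

If CNT0 and then CNT1 overflow before either assist has run, both records
in this model end up with status 0x3 no matter which assist fires first, so
the buffer order says nothing about which counter produced which record.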