Re: [GIT PULL] perf changes for v3.12

From: Peter Zijlstra
Date: Mon Sep 09 2013 - 06:06:04 EST


On Sat, Sep 07, 2013 at 07:17:28PM -0700, Linus Torvalds wrote:
> This is new for me, but I suspect it is more related to the new
> Haswell CPU I have than necessarily the 3.12 perf pull request.
>
> Regardless, nothing bad happened, but my dmesg has this in it:
>
> Unexpected number of pebs records 10
>
> when I was profiling some git workloads. Full trace appended.

> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 3270 at
> arch/x86/kernel/cpu/perf_event_intel_ds.c:1003
> intel_pmu_drain_pebs_hsw+0x91/0xa0()
> Unexpected number of pebs records 10
> Modules linked in: fuse ip6t_REJECT nf_conntrack_ipv4
> nf_conntrack_ipv6 nf_defrag_ipv4 nf_defrag_ipv6 xt_conntrack
> ip6table_filter nf_conntrack ip6_tables snd_hda_codec_realtek
> snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_seq
> snd_seq_device snd_pcm e1000e serio_raw snd_page_alloc snd_timer ptp
> lpc_ich snd i2c_i801 pps_core mfd_core soundcore uinput nfsd
> auth_rpcgss nfs_acl lockd sunrpc binfmt_misc hid_logitech_dj i915
> i2c_algo_bit drm_kms_helper drm i2c_core video
> CPU: 1 PID: 3270 Comm: git Not tainted 3.11.0-07749-ge7d33bb5ea82-dirty #4
> Hardware name: /DH87RL, BIOS
> RLH8710H.86A.0317.2013.0426.1724 04/26/2013
> 0000000000000009 ffff88041fa46ca0 ffffffff815fc637 ffff88041fa46ce8
> ffff88041fa46cd8 ffffffff81051e78 ffff88040f42f000 ffff88040f42f780
> ffff88041fa46ef8 ffffffff81c15860 0000000000000002 ffff88041fa46d38
> Call Trace:
> <NMI> [<ffffffff815fc637>] dump_stack+0x45/0x56
> [<ffffffff81051e78>] warn_slowpath_common+0x78/0xa0
> [<ffffffff81051ee7>] warn_slowpath_fmt+0x47/0x50
> [<ffffffff8101b051>] intel_pmu_drain_pebs_hsw+0x91/0xa0
> [<ffffffff8101c5d0>] intel_pmu_handle_irq+0x210/0x390
> [<ffffffff81604deb>] perf_event_nmi_handler+0x2b/0x50
> [<ffffffff81604670>] nmi_handle.isra.3+0x80/0x180
> [<ffffffff81604840>] do_nmi+0xd0/0x310
> [<ffffffff81603d37>] end_repeat_nmi+0x1e/0x2e
> <<EOE>> [<ffffffff810167df>] perf_events_lapic_init+0x2f/0x40
> [<ffffffff81016a50>] x86_pmu_enable+0x260/0x310
> [<ffffffff81111d87>] perf_pmu_enable+0x27/0x30
> [<ffffffff81112140>] perf_event_context_sched_in+0x80/0xc0
> [<ffffffff811127eb>] __perf_event_task_sched_in+0x16b/0x180
> [<ffffffff8107c300>] finish_task_switch+0x70/0xa0
> [<ffffffff81600f48>] __schedule+0x368/0x7c0
> [<ffffffff816013c4>] schedule+0x24/0x70
> [<ffffffff816038a4>] retint_careful+0xd/0x21
> ---[ end trace 52dad7a9d8d96b5f ]---

So I'm not entirely sure how this can happen. We do _not_ set
ds->pebs_event_reset[] so once a PEBS assist happens we do not
automatically re-engage. We wait for the PMI to happen, consume the data
and reprogram the hardware.

We allow for up to x86_pmu.max_pebs_events (4 on hsw afaik) of PEBS
entries to be in the buffer so each PEBS capable counter can trigger the
assist once while we 'wait' for the PMI to hit.

There's a little 'funny' with the hsw PEBS code in that it doesn't
unconditionally clear the buffer in intel_pmu_drain_pebs_hsw() but that
would only make a difference if n actually was < 0 -- again something
that shouldn't ever happen.

That said, there's two peculiar things about your report. First and
foremost HSW; I'm not at all sure how it deals with a transaction being
undone if a PEBS assist had happened during it. Does it reset the
hardware but not remove the data? If so, this could cause excess
entries.

Stephane, Andi, it this at all possible?, section 18.10.5.1 (June 2013)
seems to suggest not; the way I read it PEBS assist will cause an abort
(and the PMI certainly would) and we'll process the entire thing outside
of transaction context.

That said, 18.10.5.1 mentions that entries from inside a transaction
will only have 0xb0 and 0xb8 aka perf_record::real_ip and
perf_record::tsx_tuning set, I don't think __intel_pmu_drain_pebs_nhm()
will do the right thing for them. In particular, when no
perf_record::status bit is set we'll drop the event on the ground.

Dropping them means we do not do intel_pmu_save_and_restart() to
reprogram the hardware and the counter would get 'stuck'. Then again,
without a status bit we have no idea what event to re-program so we're
pretty much up shit creek without no stinkin' paddle.

Another possibility is that the meaning of pebs_record::status has
changed -- it was 'broken' and Peggy agreed and said she would look into
it, not sure this has been 'fixed' but I couldn't find anything on it.
That said, I think the code always assumed it was 'sane' so it having
been fixed shouldn't matter.

Secondly you seem to have triggered the nested NMI replay path. I'm not
immediately sure how that could have messed up the PEBS state, but it
sure is a complication.

I think we can do something like the below as a pure code reduction --
completely untested:

---
arch/x86/kernel/cpu/perf_event_intel_ds.c | 109 +++++++++---------------------
1 file changed, 32 insertions(+), 77 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index 63438aa..bfda50e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -178,20 +178,14 @@ struct pebs_record_nhm {
u64 status, dla, dse, lat;
};

-/*
- * Same as pebs_record_nhm, with two additional fields.
- */
struct pebs_record_hsw {
- struct pebs_record_nhm nhm;
- /*
- * Real IP of the event. In the Intel documentation this
- * is called eventingrip.
- */
- u64 real_ip;
- /*
- * TSX tuning information field: abort cycles and abort flags.
- */
- u64 tsx_tuning;
+ u64 flags, ip;
+ u64 ax, bx, cx, dx;
+ u64 si, di, bp, sp;
+ u64 r8, r9, r10, r11;
+ u64 r12, r13, r14, r15;
+ u64 status, dla, dse, lat;
+ u64 real_ip, tsx_tuning;
};

void init_debug_store_on_cpu(int cpu)
@@ -789,12 +783,11 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
struct pt_regs *iregs, void *__pebs)
{
/*
- * We cast to pebs_record_nhm to get the load latency data
- * if extra_reg MSR_PEBS_LD_LAT_THRESHOLD used
+ * We cast to the biggest pebs_record but are careful not to
+ * unconditionally access the 'extra' entries.
*/
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct pebs_record_nhm *pebs = __pebs;
- struct pebs_record_hsw *pebs_hsw = __pebs;
+ struct pebs_record_hsw *pebs = __pebs;
struct perf_sample_data data;
struct pt_regs regs;
u64 sample_type;
@@ -853,7 +846,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
regs.sp = pebs->sp;

if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) {
- regs.ip = pebs_hsw->real_ip;
+ regs.ip = pebs->real_ip;
regs.flags |= PERF_EFLAGS_EXACT;
} else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
regs.flags |= PERF_EFLAGS_EXACT;
@@ -912,17 +905,34 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
__intel_pmu_pebs_event(event, iregs, at);
}

-static void __intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, void *at,
- void *top)
+static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct debug_store *ds = cpuc->ds;
struct perf_event *event = NULL;
+ void *at, *top;
u64 status = 0;
- int bit;
+ int bit, n;
+
+ if (!x86_pmu.pebs_active)
+ return;
+
+ at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
+ top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;

ds->pebs_index = ds->pebs_buffer_base;

+ n = (top - at) / x86_pmu.pebs_record_size;
+ if (n <= 0)
+ return;
+
+ /*
+ * Should not happen, we program the threshold at 1 and do not
+ * set a reset value.
+ */
+ WARN_ONCE(n > x86_pmu.max_pebs_events,
+ "Unexpected number of pebs records %d\n", n);
+
for (; at < top; at += x86_pmu.pebs_record_size) {
struct pebs_record_nhm *p = at;

@@ -950,61 +960,6 @@ static void __intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, void *at,
}
}

-static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct debug_store *ds = cpuc->ds;
- struct pebs_record_nhm *at, *top;
- int n;
-
- if (!x86_pmu.pebs_active)
- return;
-
- at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
- top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
-
- ds->pebs_index = ds->pebs_buffer_base;
-
- n = top - at;
- if (n <= 0)
- return;
-
- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ONCE(n > x86_pmu.max_pebs_events,
- "Unexpected number of pebs records %d\n", n);
-
- return __intel_pmu_drain_pebs_nhm(iregs, at, top);
-}
-
-static void intel_pmu_drain_pebs_hsw(struct pt_regs *iregs)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct debug_store *ds = cpuc->ds;
- struct pebs_record_hsw *at, *top;
- int n;
-
- if (!x86_pmu.pebs_active)
- return;
-
- at = (struct pebs_record_hsw *)(unsigned long)ds->pebs_buffer_base;
- top = (struct pebs_record_hsw *)(unsigned long)ds->pebs_index;
-
- n = top - at;
- if (n <= 0)
- return;
- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ONCE(n > x86_pmu.max_pebs_events,
- "Unexpected number of pebs records %d\n", n);
-
- return __intel_pmu_drain_pebs_nhm(iregs, at, top);
-}
-
/*
* BTS, PEBS probe and setup
*/
@@ -1039,7 +994,7 @@ void intel_ds_init(void)
case 2:
pr_cont("PEBS fmt2%c, ", pebs_type);
x86_pmu.pebs_record_size = sizeof(struct pebs_record_hsw);
- x86_pmu.drain_pebs = intel_pmu_drain_pebs_hsw;
+ x86_pmu.drain_pebs = intel_pmu_drain_pebs_nhm;
break;

default:
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/