[PATCH] perf: only print PMU state when also WARN()'ing

From: Dave Hansen
Date: Wed May 08 2013 - 10:50:59 EST



From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>

First of all, I'm triggering this warning pretty reliably on a
large system. I'm able to hang my system almost immediately
running 'perf top' with 160 online CPUs.

If I have fewer CPUs online (about 70), the system will spit out
several of these warnings before hanging. This patch obviously
doesn't fix the source of these, but it does add some sanity to
the warning spew. One example warning:

https://www.sr71.net/~dave/intel/perf-warn-20130508.1.txt

--

intel_pmu_handle_irq() warns if it goes around its loop too many
times. The warning is a WARN_ONCE(), but the
perf_event_print_debug() call beneath it is unconditional. On the
first hit you get a nice backtrace and message, but subsequent
hits just dump the PMU state with no leading message. I doubt
this is what was intended.

With this patch, the PMU state is only printed together with the
WARN() message. It effectively open-codes WARN_ONCE()'s
one-time-only logic.
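The before/after difference can be modelled in a few lines of
userspace C. This is only a sketch: struct console_model,
stuck_path_before(), stuck_path_after() and simulate() are
hypothetical names, and counters stand in for console output. In
the kernel the one-shot flag is a function-local static (hidden
inside WARN_ONCE() before the patch, the explicit 'warned' after
it); here it is a struct field so that each run starts fresh.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the "irq loop stuck" branch in
 * intel_pmu_handle_irq(). Counters stand in for console output.
 */
struct console_model {
	bool warned;      /* models the one-shot flag (WARN_ONCE's or 'warned') */
	int  warn_msgs;   /* "perfevents: irq loop stuck!" message + backtrace */
	int  state_dumps; /* perf_event_print_debug()-style register dumps */
};

/* Before the patch: the warning fires once, the dump fires every time. */
static void stuck_path_before(struct console_model *c)
{
	if (!c->warned) {
		c->warned = true;
		c->warn_msgs++;
	}
	c->state_dumps++;
}

/* After the patch: the dump only happens together with the warning. */
static void stuck_path_after(struct console_model *c)
{
	if (!c->warned) {
		c->warn_msgs++;
		c->state_dumps++;
		c->warned = true;
	}
}

/* Take the stuck branch 'hits' times and return the console tally. */
static struct console_model simulate(void (*path)(struct console_model *),
				     int hits)
{
	struct console_model c = { false, 0, 0 };

	assert(hits >= 0);
	for (int i = 0; i < hits; i++)
		path(&c);
	return c;
}
```

Taking the stuck branch five times gives one warning and five
dumps with the old code, versus one warning and one dump with the
patched code. One reason to open-code the flag rather than write
"if (WARN_ONCE(...)) perf_event_print_debug();" is that WARN_ONCE()
evaluates to its condition (always 1 here), not to whether the
message was actually printed.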

My suspicion is that the code really just wants to make sure we
do not sit in the loop and spit out a warning for every loop
iteration after the 100th. From what I've seen, this is very
unlikely to happen since we also clear the PMU state.
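The loop-guard reasoning can be checked with a similar userspace
model. Again a sketch with hypothetical names: handle_irq_model()
stands in for intel_pmu_handle_irq(), an integer stands in for the
overflow status, and 'stuck' simulates a status that never clears.
Because the over-limit branch resets and then jumps to 'done', in
this model it is entered at most once per interrupt, so repeated
messages would have to come from separate interrupts.

```c
#include <assert.h>
#include <stdbool.h>

#define LOOP_LIMIT 100 /* threshold used in intel_pmu_handle_irq() */

/*
 * Hypothetical model of the handler's retry loop. 'pending' is a fake
 * count of overflow sources; if 'stuck' is true the fake status never
 * clears, so only the loop limit gets us out. Returns the number of
 * passes; *stuck_hits counts entries into the "irq loop stuck" branch.
 */
static int handle_irq_model(int pending, bool stuck, int *stuck_hits)
{
	int loops = 0;
	int status = pending;

again:
	/* the real code calls intel_pmu_ack_status(status) here */
	if (++loops > LOOP_LIMIT) {
		(*stuck_hits)++;  /* warn (once) + intel_pmu_reset() */
		goto done;        /* leave the loop: no per-iteration spew */
	}
	/* pretend one overflow source is serviced per pass */
	if (!stuck && status > 0)
		status--;
	if (status)
		goto again;
done:
	return loops;
}
```

A well-behaved interrupt with three pending sources finishes in
three passes without touching the stuck branch; a stuck one makes
LOOP_LIMIT + 1 passes and enters the branch exactly once.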

After this patch, instead of seeing the PMU state dumped each
time, you will just see:

[57494.894540] perf_event_intel: clearing PMU state on CPU#129
[57579.539668] perf_event_intel: clearing PMU state on CPU#10
[57587.137762] perf_event_intel: clearing PMU state on CPU#134
[57623.039912] perf_event_intel: clearing PMU state on CPU#114
[57644.559943] perf_event_intel: clearing PMU state on CPU#118
...

Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
---

linux.git-davehans/arch/x86/kernel/cpu/perf_event_intel.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff -puN arch/x86/kernel/cpu/perf_event_intel.c~debug-perf-hangs arch/x86/kernel/cpu/perf_event_intel.c
--- linux.git/arch/x86/kernel/cpu/perf_event_intel.c~debug-perf-hangs 2013-05-08 07:18:47.766917821 -0700
+++ linux.git-davehans/arch/x86/kernel/cpu/perf_event_intel.c 2013-05-08 07:18:47.770917997 -0700
@@ -1188,8 +1188,12 @@ static int intel_pmu_handle_irq(struct p
 again:
 	intel_pmu_ack_status(status);
 	if (++loops > 100) {
-		WARN_ONCE(1, "perfevents: irq loop stuck!\n");
-		perf_event_print_debug();
+		static bool warned = false;
+		if (!warned) {
+			WARN(1, "perfevents: irq loop stuck!\n");
+			perf_event_print_debug();
+			warned = true;
+		}
 		intel_pmu_reset();
 		goto done;
 	}
_