Re: [numbers] perfmon/pfmon overhead of 17%-94%

From: Vince Weaver
Date: Mon Jun 29 2009 - 14:14:45 EST

Next message: Petr Tesarik: "Re: [PATCH] Introduce a boolean "single_bit_set" function."
Previous message: Andrew Patterson: "Re: [PATCH 35/62] drivers/pci/pcie/aer/ecrc.c: Remove unnecessarysemicolons"
In reply to: Ingo Molnar: "[numbers] perfmon/pfmon overhead of 17%-94%"
Next in thread: Ingo Molnar: "Re: [numbers] perfmon/pfmon overhead of 17%-94%"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hello

Ingo Molnar <mingo@xxxxxxx> wrote:
Vince Weaver <vince@xxxxxxxxxx> wrote:

> That is in the 0.0001% measurement overhead range (per 'perf stat' invocation) for any realistic app that does something worth measuring

I'm just curious about this "app worth measuring" idea.

Do you intend for performance counters to simply be "oprofile done right",
or do you intend them to be a generic way of exposing performance counters to userspace?

For the research my co-workers and I are currently working on, the former is uninteresting. If we wanted oprofile, we'd use it.

What matters for us is getting very exact counter values for programs that are run as deterministically as possible. This includes very small programs, and counts like retired_instructions, load/store ratios, uop_counts, etc.

This may be uninteresting to you, but it is important to us. Hence my interest in the capabilities of the infrastructure finally getting merged into the kernel.

> Besides, you compare perfcounters to perfmon

What else should I be comparing it to?

> (which you seem to be a contributor of)

is that not allowed?

> workloads? [ In fact in one of the scheduler-tests perfmon has a whopping measurement overhead of _nine billion_ cycles, it increased total runtime of the workload from 3.3 seconds to 6.6 seconds. (!) ]

I'm sure the perfmon2 people would welcome any patches you have to fix this problem.

As I said, I am looking for aggregate counts for deterministic programs.
Compared to overheads of 50x for DBI-based tools like Valgrind, or 1000x for "cycle-accurate" simulations, even an overhead of 2x really isn't that bad.

Counting cycles or time is always a dangerous thing when performance counters are involved. Things as trivial as the compiler, the object link order,
the length of the executable name, the number of environment variables, the number of ELF auxiliary vectors, etc., can all vastly change the results you get. I'd recommend the following paper for more details:

"Producing wrong data without doing anything obviously wrong"
by Mytkowicz et al.
http://www-plan.cs.colorado.edu/klipto/mytkowicz-asplos09.pdf

> If the 5 thousand cycles measurement overhead _still_ matters to you under such circumstances then by all means please submit the patches to improve it. Despite your claims this is totally fixable with the current perfcounters design, Peter outlined the steps of how to solve it, you can utilize ptrace if you want to.

Is it really "totally" fixable? I don't just mean getting the overhead from ~3000 down to ~100, I mean down to zero.

> Here are the more detailed perfmon/pfmon measurement overhead
> numbers.
>
> ...
>
> I.e. this workload runs 17% slower under pfmon, the measurement
> overhead is about 1.45 billion cycles.
>
> ..
>
> That's an about 94% measurement overhead, or about 9.2 _billion_
> cycles overhead on this test-system.

I'm more interested in very CPU-intensive benchmarks. I ran some experiments with gcc and equake from the spec2k benchmark suite.

This is on a 32-bit AMD Athlon(tm) XP 2000+ machine

gcc.200 (spec2k)

+ 2.6.30-03984-g45e3e19, configured with perf counters disabled

108.44s +/- 0.7

+ 2.6.30-03984-g45e3e19, perf stat -e 0:1:u --

109.17s +/- 0.7

*** For a slowdown of about 0.7%

+ 2.6.29.5 (unpatched)

115.31s +/- 0.5

+ 2.6.29.5 with perfmon2 patches applied, pfmon -e retired_instructions,cpu_clk_unhalted

115.62s +/- 0.5

** For a slowdown of about 0.3%

So in this case perfmon2 had less overhead, though the overhead is so small in both cases that it is lost in the noise. Why the 2.6.30-git kernel seems to be much faster on this hardware, I don't know.

equake (spec2k)

+ 2.6.30-03984-g45e3e19, configured with perf counters disabled

392.77s +/- 1.5

+ 2.6.30-03984-g45e3e19, perf stat -e 0:1:u --

393.45s +/- 0.7

*** For a slowdown of about 0.17%

+ 2.6.29.5 (unpatched)

429.25s +/- 1.7

+ 2.6.29.5 with perfmon2 patches applied, pfmon -e retired_instructions,cpu_clk_unhalted

428.91s +/- 0.8

** For a _speedup_ of about 0.08%

So again the difference in overheads is in the noise. Again I am not sure why 2.6.30-git is so much faster on this hardware.
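
As a sanity check on the timing numbers above, the slowdowns can be recomputed from the quoted runtimes. This is only a minimal Python sketch: the helper name slowdown_percent is made up for illustration, and treating the two +/- figures as independent errors that add in quadrature is an assumption, since nothing above says how the spreads were obtained.

```python
import math

def slowdown_percent(base, base_err, measured, measured_err):
    """Return (slowdown %, rough uncertainty %) of measured relative to base."""
    diff = measured - base
    # Assumption: the two +/- values are independent, so combine in quadrature.
    diff_err = math.hypot(base_err, measured_err)
    return 100.0 * diff / base, 100.0 * diff_err / base

# (baseline seconds, +/-, counted-run seconds, +/-), copied from the results above.
runs = {
    "gcc.200 perf":  (108.44, 0.7, 109.17, 0.7),
    "gcc.200 pfmon": (115.31, 0.5, 115.62, 0.5),
    "equake perf":   (392.77, 1.5, 393.45, 0.7),
    "equake pfmon":  (429.25, 1.7, 428.91, 0.8),
}

for name, args in runs.items():
    pct, err = slowdown_percent(*args)
    verdict = "within noise" if abs(pct) <= err else "above noise"
    print(f"{name:14s} {pct:+.2f}% +/- {err:.2f}%  ({verdict})")
```

Under that assumption all four differences come out smaller than their combined uncertainty, which is consistent with calling them noise.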

As for counter results, in this case retired instructions:

gcc.200
perf: 72,618,643,132 +/- 8 million
pfmon: 72,618,519,792 +/- 5 million

equake
perf: 144,952,319,472 +/- 8000
pfmon: 144,952,327,906 +/- 500

So in the equake case you can easily see that the few-thousand-instruction overhead from perf can show up even on long-running programs.
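
To put the retired-instruction counts side by side, here is a small Python sketch that computes the perf-minus-pfmon difference and the ratio of the run-to-run spreads. The function name compare and the dictionary layout are purely illustrative; the numbers are the ones quoted above.

```python
# (mean retired instructions, +/- spread) per tool, copied from the results above.
counts = {
    "gcc.200": {"perf": (72_618_643_132, 8_000_000),
                "pfmon": (72_618_519_792, 5_000_000)},
    "equake":  {"perf": (144_952_319_472, 8_000),
                "pfmon": (144_952_327_906, 500)},
}

def compare(bench):
    """Return (perf - pfmon, that difference in ppm of the pfmon count, spread ratio)."""
    perf, perf_spread = counts[bench]["perf"]
    pfmon, pfmon_spread = counts[bench]["pfmon"]
    delta = perf - pfmon
    return delta, 1e6 * delta / pfmon, perf_spread / pfmon_spread

for bench in counts:
    delta, ppm, spread_ratio = compare(bench)
    print(f"{bench:8s} delta={delta:+,d} instructions "
          f"({ppm:+.3f} ppm), perf spread = {spread_ratio:.1f}x pfmon spread")
```

For equake the mean counts differ by only a few thousand instructions, but the perf spread is 16 times the pfmon spread, and that is where the per-run overhead becomes visible.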

In any case, the point I am trying to make is that perf counters are used by a wide variety of people in a wide variety of ways, with lots of different performance/accuracy tradeoffs. Don't limit the API just because you can't envision a use for certain features.

Vince

