On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
* Avi Kivity <avi@xxxxxxxxxx> [2012-10-04 17:00:28]:
On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
Again the numbers are ridiculously high for arch_local_irq_restore.
Maybe there's a bad perf/kvm interaction when we're injecting an
interrupt, I can't believe we're spending 84% of the time running the
popf instruction.
Smells like a software fallback that doesn't do NMI; hrtimer-based
sampling typically hits popf where we re-enable interrupts.
Good nose, that's probably it. Raghavendra, can you ensure that the PMU
is properly exposed? 'dmesg' in the guest will tell. If it isn't, -cpu
host will expose it (and is a good idea anyway for best performance).
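A minimal sketch of the 'dmesg' check suggested above, assuming the guest kernel reports its PMU on a line starting with "Performance Events:" and prints "no PMU driver, software events only" when it falls back to software sampling. The exact wording depends on the kernel version, the sample lines are illustrative and not taken from this thread, and `pmu_status` is a hypothetical helper name.

```python
def pmu_status(dmesg_text):
    """Classify guest PMU support from dmesg output.

    Assumption: the kernel logs one "Performance Events:" line at boot and
    mentions "no PMU driver" / "software events only" when no hardware PMU
    is usable. Returns "hardware", "software-only", or "unknown".
    """
    for line in dmesg_text.splitlines():
        if "Performance Events:" not in line:
            continue
        if "no PMU driver" in line or "software events only" in line:
            return "software-only"
        return "hardware"
    return "unknown"


# Illustrative lines only (not from the thread).
no_pmu = "Performance Events: unsupported p6 CPU model 2 no PMU driver, software events only."
with_pmu = "Performance Events: 16-deep LBR, SandyBridge events, Intel PMU driver."

print(pmu_status(no_pmu))    # software-only
print(pmu_status(with_pmu))  # hardware
```

In a guest this could be fed the output of `dmesg`. A "software-only" result would match the hrtimer-based sampling that piles samples onto popf.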
Hi Avi, you are right. The SandyBridge machine results were not valid.
I cleaned up the services, enabled the PMU, and re-ran all the tests.
Here is the summary:
We do get a good benefit from increasing ple_window. Though we don't
see much benefit for kernbench and sysbench, for ebizzy we get a huge
improvement in the 1x scenario (almost 2/3rd of the ple-disabled case).
Let me know if you think we can increase the default ple_window
itself to 16k.
I am experimenting with the V2 version of this undercommit improvement
patch series, but I think that if you wish to increase the default
ple_window, then we would have to measure the benefit of the patches
with ple_window = 16k.
I can respin the whole series including this default ple_window change.
I also have the perf kvm top results for both ebizzy and kernbench.
I think they are along expected lines now.
Improvements
================
16 core PLE machine with 16 vcpu guest
base = 3.6.0-rc5 + ple handler optimization patches
base_pleopt_16k = base + ple_window = 16k
base_pleopt_32k = base + ple_window = 32k
base_pleopt_nople = base + ple_gap = 0
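The four configurations above differ only in the kvm_intel module parameters `ple_window` and `ple_gap`. The sketch below maps each configuration name to a module load command and reads back the current values from sysfs. It assumes that "4k", "16k" and "32k" mean 4096, 16384 and 32768, which the thread does not spell out, and that the parameters are set at module load time. `CONFIGS`, `modprobe_command` and `read_ple_params` are hypothetical names.

```python
from pathlib import Path

# Assumption: the short forms in the thread are powers of two.
CONFIGS = {
    "base": {"ple_window": 4096},
    "base_pleopt_16k": {"ple_window": 16384},
    "base_pleopt_32k": {"ple_window": 32768},
    "base_pleopt_nople": {"ple_gap": 0},  # ple_gap = 0 disables PLE
}


def modprobe_command(config_name):
    """Build the module load command for one benchmark configuration."""
    opts = " ".join(f"{k}={v}" for k, v in sorted(CONFIGS[config_name].items()))
    return f"modprobe kvm_intel {opts}"


def read_ple_params(sysfs_root="/sys/module/kvm_intel/parameters"):
    """Read the current ple_gap/ple_window values, or None if unavailable."""
    values = {}
    for name in ("ple_gap", "ple_window"):
        path = Path(sysfs_root) / name
        values[name] = int(path.read_text().strip()) if path.exists() else None
    return values


for name in CONFIGS:
    print(f"{name:18s} {modprobe_command(name)}")
```

Reading the values back with `read_ple_params()` before each run is a cheap way to confirm that the host is actually in the configuration being measured.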
kernbench, hackbench, sysbench (time in sec, lower is better)
ebizzy (rec/sec, higher is better)
% improvements w.r.t base (ple_window = 4k)
---------------+-----------------+-----------------+-------------------+
               | base_pleopt_16k | base_pleopt_32k | base_pleopt_nople |
---------------+-----------------+-----------------+-------------------+
kernbench_1x   |         0.42371 |         1.15164 |           0.09320 |
kernbench_2x   |        -1.40981 |       -17.48282 |        -570.77053 |
---------------+-----------------+-----------------+-------------------+
sysbench_1x    |        -0.92367 |         0.24241 |          -0.27027 |
sysbench_2x    |        -2.22706 |        -0.30896 |          -1.27573 |
sysbench_3x    |        -0.75509 |         0.09444 |          -2.97756 |
---------------+-----------------+-----------------+-------------------+
ebizzy_1x      |        54.99976 |        67.29460 |          74.14076 |
ebizzy_2x      |        -8.83386 |       -27.38403 |         -96.22066 |
---------------+-----------------+-----------------+-------------------+
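The thread does not state how the percentages in the table above were computed. The sketch below assumes the conventional definition: improvement relative to base, with the sign flipped for time-based benchmarks so that a positive number always means "better". All function names are hypothetical.

```python
def pct_improvement_time(base, new):
    """Improvement for lower-is-better metrics (kernbench, sysbench times)."""
    return 100.0 * (base - new) / base


def pct_improvement_throughput(base, new):
    """Improvement for higher-is-better metrics (ebizzy records/sec)."""
    return 100.0 * (new - base) / base


def time_ratio_from_pct(pct):
    """Invert pct_improvement_time: returns new_time / base_time."""
    return 1.0 - pct / 100.0


def throughput_ratio_from_pct(pct):
    """Invert pct_improvement_throughput: returns new_rate / base_rate."""
    return 1.0 + pct / 100.0


# Reading two table entries under the assumed formula:
print(round(time_ratio_from_pct(-570.77053), 2))        # kernbench_2x, nople
print(round(throughput_ratio_from_pct(-96.22066), 4))   # ebizzy_2x, nople
```

If that definition is the one used, the kernbench_2x nople entry corresponds to a run time of roughly 6.7 times base, and the ebizzy_2x nople entry to a throughput of roughly 3.8% of base.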
perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window)
========================================================================
Is the perf data for 1x overcommit?
pleopt ple_gap=0
--------------------
ebizzy : 18131 records/s
63.78% [guest.kernel] [g] _raw_spin_lock_irqsave
5.65% [guest.kernel] [g] smp_call_function_many
3.12% [guest.kernel] [g] clear_page
3.02% [guest.kernel] [g] down_read_trylock
1.85% [guest.kernel] [g] async_page_fault
1.81% [guest.kernel] [g] up_read
1.76% [guest.kernel] [g] native_apic_mem_write
1.70% [guest.kernel] [g] find_vma
Does 'perf kvm top' not give host samples at the same time? Would be
nice to see the host overhead as a function of varying ple window. I
would expect that to be the major difference between 4/16/32k window
sizes.
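To compare guest and host overhead across window sizes as suggested here, one option is to tally the perf output by DSO. The sketch below parses lines in the format of the listing above and sums the percentages per DSO tag. The parser and the `overhead_by_dso` name are hypothetical, and the `[kernel]` tag for host samples is an assumption, since the listing in this thread contains only `[guest.kernel]` entries.

```python
import re
from collections import defaultdict

# Matches lines such as: "63.78% [guest.kernel] [g] _raw_spin_lock_irqsave"
LINE_RE = re.compile(r"^\s*([\d.]+)%\s+\[([^\]]+)\]\s+\[(\w)\]\s+(\S+)")


def overhead_by_dso(perf_text):
    """Sum the sample percentages per DSO (e.g. guest.kernel vs. kernel)."""
    totals = defaultdict(float)
    for line in perf_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            pct, dso, _level, _symbol = match.groups()
            totals[dso] += float(pct)
    return dict(totals)


# The ple_gap=0 ebizzy listing from this thread.
sample = """\
63.78% [guest.kernel] [g] _raw_spin_lock_irqsave
5.65% [guest.kernel] [g] smp_call_function_many
3.12% [guest.kernel] [g] clear_page
3.02% [guest.kernel] [g] down_read_trylock
1.85% [guest.kernel] [g] async_page_fault
1.81% [guest.kernel] [g] up_read
1.76% [guest.kernel] [g] native_apic_mem_write
1.70% [guest.kernel] [g] find_vma
"""

totals = overhead_by_dso(sample)
print({dso: round(pct, 2) for dso, pct in totals.items()})
```

Run against captures taken at each ple_window setting, the per-DSO totals would show directly whether host-side overhead is what changes between the 4k, 16k and 32k cases.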
A big concern I have (if this is 1x overcommit) for ebizzy is that it
has just terrible scalability to begin with. I do not think we should
try to optimize such a bad workload.