Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

From: Andrew Theurer
Date: Tue Oct 09 2012 - 22:59:57 EST


On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> * Avi Kivity <avi@xxxxxxxxxx> [2012-10-04 17:00:28]:
>
> > On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> > > On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> > >>
> > >> Again the numbers are ridiculously high for arch_local_irq_restore.
> > >> Maybe there's a bad perf/kvm interaction when we're injecting an
> > >> interrupt; I can't believe we're spending 84% of the time running
> > >> the popf instruction.
> > >
> > > Smells like a software fallback that doesn't do NMI; hrtimer-based
> > > sampling typically hits popf, where we re-enable interrupts.
> >
> > Good nose, that's probably it. Raghavendra, can you ensure that the PMU
> > is properly exposed? 'dmesg' in the guest will tell. If it isn't, -cpu
> > host will expose it (and a good idea anyway to get best performance).
> >
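For reference: without a working guest PMU, perf falls back to
hrtimer-based sampling, which cannot use NMIs, so the timer interrupt is
held off while the guest runs with interrupts disabled and fires
immediately after they are re-enabled. The samples then pile up on the
re-enable point, which on x86 is arch_local_irq_restore(); it boils down
to this (from arch/x86/include/asm/irqflags.h, comment added):

  static inline void native_restore_fl(unsigned long flags)
  {
          /*
           * popf re-enables interrupts; a pending hrtimer sample
           * fires right here and gets attributed to this function.
           */
          asm volatile("push %0 ; popf"
                       : /* no output */
                       : "g" (flags)
                       : "memory", "cc");
  }
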
>
> Hi Avi, you are right. The SandyBridge machine results were not valid.
> I cleaned up the services, enabled the PMU, and re-ran all the tests.
>
> Here is the summary:
> We do get a good benefit from increasing the ple_window. Though we don't
> see much benefit for kernbench and sysbench, for ebizzy we get a huge
> improvement in the 1x scenario (almost two-thirds of the ple-disabled case).
>
> Let me know if you think we can increase the default ple_window
> itself to 16k.
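If I am reading the 3.6-era arch/x86/kvm/vmx.c right, the change being
discussed is only the default of an existing module parameter; a minimal
sketch of what raising it would look like (the 16384 value is the
proposal above, not anything merged):

  /*
   * arch/x86/kvm/vmx.c -- PLE tunables, also settable at load time:
   *   modprobe kvm_intel ple_gap=128 ple_window=16384
   */
  #define KVM_VMX_DEFAULT_PLE_GAP    128
  #define KVM_VMX_DEFAULT_PLE_WINDOW 16384        /* was 4096 */

  static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
  module_param(ple_gap, int, S_IRUGO);

  static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
  module_param(ple_window, int, S_IRUGO);
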
>
> I am experimenting with the V2 version of the undercommit improvement
> (this) patch series. But if you wish to go for an increased default
> ple_window, then we would have to measure the benefit of the patches
> with ple_window = 16k.
>
> I can respin the whole series including this default ple_window change.
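For anyone skimming the thread: the undercommitted case these patches
target is when every vcpu has a real cpu to itself, so each directed
yield attempted by the PLE handler fails because the target vcpu is
already running. A hypothetical sketch of the idea (kvm_vcpu_on_spin()
is the real handler in virt/kvm/kvm_main.c, but the failure counter and
threshold below are illustrative, not the actual patch):

  void kvm_vcpu_on_spin(struct kvm_vcpu *me)
  {
          struct kvm *kvm = me->kvm;
          struct kvm_vcpu *vcpu;
          int failed = 0;
          int i;

          kvm_for_each_vcpu(i, vcpu, kvm) {
                  if (vcpu == me)
                          continue;
                  /* Donate our timeslice to a preempted vcpu, if any. */
                  if (kvm_vcpu_yield_to(vcpu) > 0)
                          return;
                  /*
                   * Undercommitted guests fail this yield for every
                   * vcpu; bail out early rather than scan the whole
                   * vcpu list (threshold made up for illustration).
                   */
                  if (++failed >= 3)
                          return;
          }
  }
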
>
> I also have the perf kvm top results for both ebizzy and kernbench.
> I think they are along expected lines now.
>
> Improvements
> ================
>
> 16-core PLE machine with a 16-vcpu guest
>
> base = 3.6.0-rc5 + ple handler optimization patches
> base_pleopt_16k = base + ple_window = 16k
> base_pleopt_32k = base + ple_window = 32k
> base_pleopt_nople = base + ple_gap = 0
> kernbench, hackbench, sysbench (time in sec; lower is better)
> ebizzy (records/sec; higher is better)
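A note on reading the numbers: the %improve columns are consistent with
%improve = (base - patched) / base * 100 for the time-based benchmarks
and %improve = (patched - base) / base * 100 for ebizzy, so a positive
number is always better. As a worked check against the detailed tables,
kernbench_1x gives (30.0440 - 29.9167) / 30.0440 * 100 = 0.42371, and
ebizzy_1x gives (16054.8750 - 10358.0000) / 10358.0000 * 100 = 54.99976.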
>
> % improvements w.r.t base (ple_window = 4k)
> ---------------+---------------+-----------------+-------------------+
>                |base_pleopt_16k| base_pleopt_32k | base_pleopt_nople |
> ---------------+---------------+-----------------+-------------------+
> kernbench_1x   |       0.42371 |         1.15164 |           0.09320 |
> kernbench_2x   |      -1.40981 |       -17.48282 |        -570.77053 |
> ---------------+---------------+-----------------+-------------------+
> sysbench_1x    |      -0.92367 |         0.24241 |          -0.27027 |
> sysbench_2x    |      -2.22706 |        -0.30896 |          -1.27573 |
> sysbench_3x    |      -0.75509 |         0.09444 |          -2.97756 |
> ---------------+---------------+-----------------+-------------------+
> ebizzy_1x      |      54.99976 |        67.29460 |          74.14076 |
> ebizzy_2x      |      -8.83386 |       -27.38403 |         -96.22066 |
> ---------------+---------------+-----------------+-------------------+
>
> perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window)
> ========================================================================

Is the perf data for 1x overcommit?

> pleopt ple_gap=0
> --------------------
> ebizzy: 18131 records/s
> 63.78% [guest.kernel] [g] _raw_spin_lock_irqsave
> 5.65% [guest.kernel] [g] smp_call_function_many
> 3.12% [guest.kernel] [g] clear_page
> 3.02% [guest.kernel] [g] down_read_trylock
> 1.85% [guest.kernel] [g] async_page_fault
> 1.81% [guest.kernel] [g] up_read
> 1.76% [guest.kernel] [g] native_apic_mem_write
> 1.70% [guest.kernel] [g] find_vma

Does 'perf kvm top' not give host samples at the same time? It would be
nice to see the host overhead as a function of the varying PLE window; I
would expect that to be the major difference between the 4k/16k/32k
window sizes.

A big concern I have with ebizzy (if this is 1x overcommit) is that it
has terrible scalability to begin with. I do not think we should try to
optimize for such a poorly scaling workload.

> kernbench: Elapsed Time 29.4933 (27.6007)
> 5.72% [guest.kernel] [g] async_page_fault
> 3.48% [guest.kernel] [g] pvclock_clocksource_read
> 2.68% [guest.kernel] [g] copy_user_generic_unrolled
> 2.58% [guest.kernel] [g] clear_page
> 2.09% [guest.kernel] [g] page_cache_get_speculative
> 2.00% [guest.kernel] [g] do_raw_spin_lock
> 1.78% [guest.kernel] [g] unmap_single_vma
> 1.74% [guest.kernel] [g] kmem_cache_alloc

>
> pleopt ple_window = 4k
> ---------------------------
> ebizzy: 10176 records/s
> 69.17% [guest.kernel] [g] _raw_spin_lock_irqsave
> 3.34% [guest.kernel] [g] clear_page
> 2.16% [guest.kernel] [g] down_read_trylock
> 1.94% [guest.kernel] [g] async_page_fault
> 1.89% [guest.kernel] [g] native_apic_mem_write
> 1.63% [guest.kernel] [g] smp_call_function_many
> 1.58% [guest.kernel] [g] SetPageLRU
> 1.37% [guest.kernel] [g] up_read
> 1.01% [guest.kernel] [g] find_vma
>
>
> kernbench: 29.9533
> nts: 240K cycles
> 6.04% [guest.kernel] [g] async_page_fault
> 4.17% [guest.kernel] [g] pvclock_clocksource_read
> 3.28% [guest.kernel] [g] clear_page
> 2.57% [guest.kernel] [g] copy_user_generic_unrolled
> 2.30% [guest.kernel] [g] do_raw_spin_lock
> 2.13% [guest.kernel] [g] _raw_spin_lock_irqsave
> 1.93% [guest.kernel] [g] page_cache_get_speculative
> 1.92% [guest.kernel] [g] unmap_single_vma
> 1.77% [guest.kernel] [g] kmem_cache_alloc
> 1.61% [guest.kernel] [g] __d_lookup_rcu
> 1.19% [guest.kernel] [g] find_vma
> 1.19% [guest.kernel] [g] __list_del_entry
>
>
> pleopt: ple_window=16k
> -------------------------
> ebizzy: 16990 records/s
> 62.35% [guest.kernel] [g] _raw_spin_lock_irqsave
> 5.22% [guest.kernel] [g] smp_call_function_many
> 3.57% [guest.kernel] [g] down_read_trylock
> 3.20% [guest.kernel] [g] clear_page
> 2.16% [guest.kernel] [g] up_read
> 1.89% [guest.kernel] [g] find_vma
> 1.86% [guest.kernel] [g] async_page_fault
> 1.81% [guest.kernel] [g] native_apic_mem_write
>
> kernbench: 28.5
> 6.24% [guest.kernel] [g] async_page_fault
> 4.16% [guest.kernel] [g] pvclock_clocksource_read
> 3.33% [guest.kernel] [g] clear_page
> 2.50% [guest.kernel] [g] copy_user_generic_unrolled
> 2.08% [guest.kernel] [g] do_raw_spin_lock
> 1.98% [guest.kernel] [g] unmap_single_vma
> 1.89% [guest.kernel] [g] kmem_cache_alloc
> 1.82% [guest.kernel] [g] page_cache_get_speculative
> 1.46% [guest.kernel] [g] __d_lookup_rcu
> 1.42% [guest.kernel] [g] _raw_spin_lock_irqsave
> 1.15% [guest.kernel] [g] __list_del_entry
> 1.10% [guest.kernel] [g] find_vma
>
>
>
> Detailed results for the run
> =============================
> patched = base_pleopt_16k
> +----+------------+----------+------------+----------+------------+
> kernbench
> +----+------------+----------+------------+----------+------------+
> |    |    base    |  stddev  |  patched   |  stddev  |  %improve  |
> +----+------------+----------+------------+----------+------------+
> | 1x |    30.0440 |   1.1896 |    29.9167 |   1.6755 |    0.42371 |
> | 2x |    62.0083 |   3.4884 |    62.8825 |   2.5509 |   -1.40981 |
> +----+------------+----------+------------+----------+------------+
>
> +----+------------+----------+------------+----------+------------+
> sysbench
> +----+------------+----------+------------+----------+------------+
> |    |    base    |  stddev  |  patched   |  stddev  |  %improve  |
> +----+------------+----------+------------+----------+------------+
> | 1x |     7.1779 |   0.0577 |     7.2442 |   0.0479 |   -0.92367 |
> | 2x |    15.5362 |   0.3370 |    15.8822 |   0.3591 |   -2.22706 |
> | 3x |    23.8249 |   0.1513 |    24.0048 |   0.1844 |   -0.75509 |
> +----+------------+----------+------------+----------+------------+
>
> +----+------------+----------+------------+----------+------------+
> ebizzy
> +----+------------+----------+------------+----------+------------+
> |    |    base    |  stddev  |  patched   |  stddev  |  %improve  |
> +----+------------+----------+------------+----------+------------+
> | 1x | 10358.0000 | 442.6598 | 16054.8750 | 252.5088 |   54.99976 |
> | 2x |  2705.5000 | 130.0286 |  2466.5000 | 120.0024 |   -8.83386 |
> +----+------------+----------+------------+----------+------------+
>
> patched = base_pleopt_32k
> +----+------------+----------+------------+----------+------------+
> kernbench
> +----+------------+----------+------------+----------+------------+
> |    |    base    |  stddev  |  patched   |  stddev  |  %improve  |
> +----+------------+----------+------------+----------+------------+
> | 1x |    30.0440 |   1.1896 |    29.6980 |   0.6760 |    1.15164 |
> | 2x |    62.0083 |   3.4884 |    72.8491 |   4.4616 |  -17.48282 |
> +----+------------+----------+------------+----------+------------+
>
> +----+------------+----------+------------+----------+------------+
> sysbench
> +----+------------+----------+------------+----------+------------+
> |    |    base    |  stddev  |  patched   |  stddev  |  %improve  |
> +----+------------+----------+------------+----------+------------+
> | 1x |     7.1779 |   0.0577 |     7.1605 |   0.0447 |    0.24241 |
> | 2x |    15.5362 |   0.3370 |    15.5842 |   0.1731 |   -0.30896 |
> | 3x |    23.8249 |   0.1513 |    23.8024 |   0.2342 |    0.09444 |
> +----+------------+----------+------------+----------+------------+
>
> +----+------------+----------+------------+----------+------------+
> ebizzy
> +----+------------+----------+------------+----------+------------+
> |    |    base    |  stddev  |  patched   |  stddev  |  %improve  |
> +----+------------+----------+------------+----------+------------+
> | 1x | 10358.0000 | 442.6598 | 17328.3750 | 281.4569 |   67.29460 |
> | 2x |  2705.5000 | 130.0286 |  1964.6250 | 143.0793 |  -27.38403 |
> +----+------------+----------+------------+----------+------------+
>
> patched = base_pleopt_nople
> +----+------------+----------+------------+----------+------------+
> kernbench
> +----+------------+----------+------------+----------+------------+
> |    |    base    |  stddev  |  patched   |  stddev  |  %improve  |
> +----+------------+----------+------------+----------+------------+
> | 1x |    30.0440 |   1.1896 |    30.0160 |   0.7523 |    0.09320 |
> | 2x |    62.0083 |   3.4884 |   415.9334 | 189.9901 | -570.77053 |
> +----+------------+----------+------------+----------+------------+
>
> +----+------------+----------+------------+----------+------------+
> sysbench
> +----+------------+----------+------------+----------+------------+
> |    |    base    |  stddev  |  patched   |  stddev  |  %improve  |
> +----+------------+----------+------------+----------+------------+
> | 1x |     7.1779 |   0.0577 |     7.1973 |   0.0354 |   -0.27027 |
> | 2x |    15.5362 |   0.3370 |    15.7344 |   0.2315 |   -1.27573 |
> | 3x |    23.8249 |   0.1513 |    24.5343 |   0.3437 |   -2.97756 |
> +----+------------+----------+------------+----------+------------+
>
> +----+------------+----------+------------+----------+------------+
> ebizzy
> +----+------------+----------+------------+----------+------------+
> |    |    base    |  stddev  |  patched   |  stddev  |  %improve  |
> +----+------------+----------+------------+----------+------------+
> | 1x | 10358.0000 | 442.6598 | 18037.5000 | 315.2074 |   74.14076 |
> | 2x |  2705.5000 | 130.0286 |   102.2500 | 104.3521 |  -96.22066 |
> +----+------------+----------+------------+----------+------------+
>

