Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning

From: Chegu Vinod
Date: Thu Jan 10 2013 - 17:24:35 EST


On 1/8/2013 2:26 PM, Rik van Riel wrote:
<...>
Performance is within the margin of error of v2, so the graph
has not been updated.

Please let me know if you manage to break this code in any way,
so I can fix it...


Attached below is some preliminary data with one of the AIM7 micro-benchmark
workloads (i.e. high_systime). This is a kernel-intensive workload which
does tons of forks/execs etc. and stresses quite a few of the same set
of spinlocks and semaphores.

Observed a drop in performance as we go to 40-way and 80-way. Wondering
if the backoff keeps increasing to such an extent that it actually starts
to hurt, given the nature of this workload? Also, in the 80-way case,
observed quite a bit of variation from run to run...
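To make that concern concrete, below is a rough userspace sketch of a ticket
lock whose spin delay grows with the number of waiters ahead of us. This is
NOT Rik's actual patch; the names, the per-waiter delay constant and the cap
are made up purely for illustration of how an over-tuned delay could hurt at
high contention.

/*
 * Hypothetical sketch only -- not the kernel implementation from the
 * patch series.  DELAY_PER_WAITER and MAX_DELAY are invented numbers.
 */
#include <stdatomic.h>
#include <stdint.h>

#define DELAY_PER_WAITER  50      /* assumed cost of one lock hold, in loops */
#define MAX_DELAY         16000   /* cap so the delay cannot grow unbounded  */

struct ticket_lock {
        atomic_ushort head;       /* ticket currently being served */
        atomic_ushort tail;       /* next ticket to hand out       */
};

static inline void cpu_relax(void)
{
        __builtin_ia32_pause();   /* x86 PAUSE, like the kernel's cpu_relax() */
}

void ticket_lock_acquire(struct ticket_lock *lock)
{
        uint16_t me = atomic_fetch_add(&lock->tail, 1);

        for (;;) {
                uint16_t head = atomic_load(&lock->head);
                uint16_t waiters_ahead = (uint16_t)(me - head);

                if (waiters_ahead == 0)
                        return;   /* our turn: lock acquired */

                /*
                 * Back off proportionally to our distance from the head
                 * of the queue, so far-away waiters poll the lock's cache
                 * line less often.  If the (auto-)tuned per-waiter delay
                 * overshoots the real hold time, waiters spin well past
                 * their turn before re-checking.
                 */
                uint32_t delay = (uint32_t)waiters_ahead * DELAY_PER_WAITER;
                if (delay > MAX_DELAY)
                        delay = MAX_DELAY;
                while (delay--)
                        cpu_relax();
        }
}

void ticket_lock_release(struct ticket_lock *lock)
{
        atomic_fetch_add(&lock->head, 1);
}

int main(void)
{
        struct ticket_lock lock = { 0 };

        ticket_lock_acquire(&lock);   /* uncontended: acquires immediately */
        ticket_lock_release(&lock);
        return 0;
}

With 80 contending CPUs the computed delay scales with the queue depth, so if
the tuned per-waiter value is too high for these short hold times, that might
fit the kind of drop and run-to-run variation seen in the 80-way numbers below.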

Also ran it inside a single KVM guest. There were some perf dips, but
interestingly didn't observe the same level of drop (compared to the
drop in the native case) as the guest size was scaled up to 40 vCPUs or
80 vCPUs.

FYI
Vinod



---

Platform: 8-socket (80-core) Westmere with 1 TB RAM.

Workload: AIM7 high_systime micro-benchmark - 2000 users & 100 jobs per user.

Values reported are Jobs Per Minute (higher is better). The values
are the average of 3 runs.

1) Native run:
--------------

Config 1: 3.7 kernel
Config 2: 3.7 + Rik's 1-4 patches

------------------------------------------------------------
                20-way        40-way        80-way
------------------------------------------------------------
Config 1        ~179K         ~159K         ~146K
------------------------------------------------------------
Config 2        ~180K         ~134K         ~21K-43K  <- high variation!
------------------------------------------------------------

(Note: Used numactl to restrict the workload to
2 sockets (20-way) and 4 sockets (40-way).)

------

2) KVM run :
------------

Single guest of different sizes (no overcommit, NUMA enabled in the guest).

Note: This kernel-intensive micro-benchmark exposes the PLE handler issue,
esp. for large guests. Since Raghu's PLE changes are not yet upstream,
I have just run with the current PLE handler and then with PLE
disabled (ple_gap=0).

Config 1 : Host & Guest at 3.7
Config 2 : Host & Guest at 3.7 + Rik's 1-4 patches

--------------------------------------------------------------------------
                20vcpu/128G       40vcpu/256G       80vcpu/512G
                (on 2 sockets)    (on 4 sockets)    (on 8 sockets)
--------------------------------------------------------------------------
Config 1        ~144K             ~39K              ~10K
--------------------------------------------------------------------------
Config 2        ~143K             ~37.5K            ~11K
--------------------------------------------------------------------------

Config 3 : Host & Guest at 3.7 AND ple_gap=0
Config 4 : Host & Guest at 3.7 + Rik's 1-4 patches AND ple_gap=0

--------------------------------------------------------------------------
                20vcpu/128G       40vcpu/256G       80vcpu/512G
                (on 2 sockets)    (on 4 sockets)    (on 8 sockets)
--------------------------------------------------------------------------
Config 3        ~154K             ~131K             ~116K
--------------------------------------------------------------------------
Config 4        ~151K             ~130K             ~115K
--------------------------------------------------------------------------


(Note: Used numactl to restrict qemu to
2 sockets (20-way) and 4 sockets (40-way).)