Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

From: Giovanni Gherdovich
Date: Sat Dec 01 2018 - 09:14:24 EST


On Fri, 2018-11-23 at 11:35 +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
>
> The venerable menu governor does some things that are quite
> questionable in my view.
>
> First, it includes timer wakeups in the pattern detection data and
> mixes them up with wakeups from other sources which in some cases
> causes it to expect what essentially would be a timer wakeup in a
> time frame in which no timer wakeups are possible (because it knows
> the time until the next timer event and that is later than the
> expected wakeup time).
>
> Second, it uses the extra exit latency limit based on the predicted
> idle duration and depending on the number of tasks waiting on I/O,
> even though those tasks may run on a different CPU when they are
> woken up.  Moreover, the time ranges used by it for the sleep length
> correction factors depend on whether or not there are tasks waiting
> on I/O, which again doesn't imply anything in particular, and they
> are not correlated to the list of available idle states in any way
> whatever.
>
> Also, the pattern detection code in menu may end up considering
> values that are too large to matter at all, in which cases running
> it is a waste of time.
>
> A major rework of the menu governor would be required to address
> these issues and the performance of at least some workloads (tuned
> specifically to the current behavior of the menu governor) is likely
> to suffer from that.  It is thus better to introduce an entirely new
> governor without them and let everybody use the governor that works
> better with their actual workloads.
>
> The new governor introduced here, the timer events oriented (TEO)
> governor, uses the same basic strategy as menu: it always tries to
> find the deepest idle state that can be used in the given conditions.
> However, it applies a different approach to that problem.
>
> First, it doesn't use "correction factors" for the time till the
> closest timer, but instead it tries to correlate the measured idle
> duration values with the available idle states and use that
> information to pick up the idle state that is most likely to "match"
> the upcoming CPU idle interval.
>
> Second, it doesn't take the number of "I/O waiters" into account at
> all and the pattern detection code in it avoids taking timer wakeups
> into account.  It also only uses idle duration values less than the
> current time till the closest timer (with the tick excluded) for that
> purpose.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> ---
>
> v5 -> v6:
>  * Avoid applying poll_time_limit to non-polling idle states by mistake.
>  * Use idle duration measured by the governor for everything (as it likely is
>    more accurate than the one measured by the core).
>  * Rename SPIKE to PULSE.
>  * Do not run pattern detection upfront.  Instead, use recent idle duration
>    values to refine the state selection after finding a candidate idle state.
>  * Do not use the expected idle duration as an extra latency constraint
>    (exit latency is less than the target residency for all of the idle states
>    known to me anyway, so this doesn't change anything in practice).
>
> v4 -> v5:
>  * Avoid using shallow idle states when the tick has been stopped already.
>
> v3 -> v4:
>  * Make the pattern detection avoid returning too early if the minimum
>    sample is too far from the average.
>  * Reformat the changelog (as requested by Peter).
>
> v2 -> v3:
>  * Simplify the pattern detection code and make it return a value
>    lower than the time to the closest timer if the majority of recent
>    idle intervals are below it regardless of their variance (that should
>    cause it to be slightly more aggressive).
>  * Do not count wakeups from state 0 due to the time limit in poll_idle()
>    as non-timer.
>
> [snip]

[NOTE: the tables in this message are quite wide. If this doesn't get to you
properly formatted you can read a copy of this message at the URL
https://beta.suse.com/private/ggherdovich/teo-eval/teo-v6-eval.html ]

All performance concerns raised by v5 are wiped out by v6. Not only does v6
improve over v5, it is even better than the baseline (menu) in most
cases. The optimizations in v6 paid off!

The overview of the analysis for v5, from the message
https://lore.kernel.org/lkml/1541877001.17878.5.camel@xxxxxxx , was:

> The quick summary is:
>
> ---> sockperf on loopback over UDP, mode "throughput":
>      this had a 12% regression in v2 on 48x-HASWELL-NUMA, which is completely
>      recovered in v3 and v5. Good stuff.
>
> ---> dbench on xfs:
>      this was down 16% in v2 on 48x-HASWELL-NUMA. On v5 we're at a 10%
>      regression. Slight improvement. What's really hurting here is the single
>      client scenario.
>
> ---> netperf-udp on loopback:
>      had 6% regression on v2 on 8x-SKYLAKE-UMA, which is the same as what
>      happens in v5.
>
> ---> tbench on loopback:
>      was down 10% in v2 on 8x-SKYLAKE-UMA, now slightly worse in v5 with a 12%
>      regression. As in dbench, it's at low number of clients that the results
>      are worst. Note that this machine is different from the one that has the
>      dbench regression.

Now the situation is overturned:

---> sockperf on loopback over UDP, mode "throughput":
     No new problems from 48x-HASWELL-NUMA, which stays put at the level of
     the baseline. OTOH 80x-BROADWELL-NUMA and 8x-SKYLAKE-UMA improve over the
     baseline by 8% and 10% respectively.

---> dbench on xfs:
     48x-HASWELL-NUMA rebounds from the previous 10% degradation and it's now
     at 0, i.e. the baseline level. The 1-client case, responsible for the
     previous overall degradation (I average results from different numbers of
     clients), went from -40% to -20% and is compensated in my table by
     improvements with 4, 8, 16 and 32 clients (table below).

---> netperf-udp on loopback:
     8x-SKYLAKE-UMA now shows a 9% improvement over baseline.
     80x-BROADWELL-NUMA, previously similar to baseline, now improves 7%.

---> tbench on loopback:
     Impressive change of color for 8x-SKYLAKE-UMA, from 12% regression in v5
     to 7% improvement in v6. The problematic 1- and 2-client cases went from
     -25% and -33% to +13% and +10% respectively.

Details below.

Runs are compared against v4.18 with the Menu governor. I know v4.18 is a
little old now but that's where I measured my baseline. My machine pool didn't
change:

* single socket E3-1240 v5 (Skylake 8 cores, which I'll call 8x-SKYLAKE-UMA)
* two sockets E5-2698 v4 (Broadwell 80 cores, 80x-BROADWELL-NUMA from here onwards)
* two sockets E5-2670 v3 (Haswell 48 cores, 48x-HASWELL-NUMA from here onwards)


BENCHMARKS WITH NEUTRAL RESULTS
===============================

This is the list of neutral benchmarks, identical to the one for v5. What's
interesting is that the benchmarks showing degradations in v5 and before
now seem repaired in v6 (and improve on the baseline!), but the list of neutral
benchmarks didn't move. My take on this is that the list below is not affected
by cpuidle at all, be the governor good or bad. OTOH the benchmarks I discuss
in the next sections are really the ones to use when evaluating cpuidle, as
they are very sensitive to it (frequent idling and waking up, hard-to-predict
interrupt patterns etc).

* pgbench read-only on xfs, pgbench read/write on xfs
    * global-dhp__db-pgbench-timed-ro-small-xfs
    * global-dhp__db-pgbench-timed-rw-small-xfs
* siege
    * global-dhp__http-siege
* hackbench, pipetest
    * global-dhp__scheduler-unbound
* Linux kernel compilation
    * global-dhp__workload_kerndevel-xfs
* NASA Parallel Benchmarks, C-Class (linear algebra; run both with OpenMP
  and OpenMPI, over xfs)
    * global-dhp__nas-c-class-mpi-full-xfs
    * global-dhp__nas-c-class-omp-full
* FIO (Flexible IO) in several configurations
    * global-dhp__io-fio-randread-async-randwrite-xfs
    * global-dhp__io-fio-randread-async-seqwrite-xfs
    * global-dhp__io-fio-seqread-doublemem-32k-4t-xfs
    * global-dhp__io-fio-seqread-doublemem-4k-4t-xfs
* netperf on loopback over TCP
    * global-dhp__network-netperf-unbound
* xfsrepair
    * global-dhp__io-xfsrepair-xfs
* sqlite (insert operations on xfs)
    * global-dhp__db-sqlite-insert-medium-xfs
* schbench
    * global-dhp__workload_schbench
* gitsource on xfs (git unit tests, shell intensive)
    * global-dhp__workload_shellscripts-xfs

Note: global-dhp* are configuration file names for MMTests[1]


PREVIOUSLY REGRESSING BENCHMARKS: OVERVIEW
==========================================

* sockperf on loopback over UDP, mode "throughput"
    * global-dhp__network-sockperf-unbound
    48x-HASWELL-NUMA fixed since v3, the others greatly improved in v6.

                       teo-v1      teo-v2      teo-v3      teo-v5      teo-v6
 -------------------------------------------------------------------------------
 8x-SKYLAKE-UMA        1% worse    1% worse    1% worse    1% worse    10% better
 80x-BROADWELL-NUMA    3% better   2% better   5% better   3% worse    8% better
 48x-HASWELL-NUMA      4% better   12% worse   no change   no change   no change

* dbench on xfs
    * global-dhp__io-dbench4-async-xfs
    48x-HASWELL-NUMA is fixed wrt v5 and earlier versions.

                       teo-v1      teo-v2      teo-v3      teo-v5      teo-v6
 -------------------------------------------------------------------------------
 8x-SKYLAKE-UMA        3% better   4% better   6% better   4% better   5% better
 80x-BROADWELL-NUMA    no change   no change   1% worse    3% worse    2% better
 48x-HASWELL-NUMA      6% worse    16% worse   8% worse    10% worse   no change

* netperf on loopback over UDP
    * global-dhp__network-netperf-unbound
    8x-SKYLAKE-UMA fixed.

                       teo-v1      teo-v2      teo-v3      teo-v5      teo-v6
 -------------------------------------------------------------------------------
 8x-SKYLAKE-UMA        no change   6% worse    4% worse    6% worse    9% better
 80x-BROADWELL-NUMA    1% worse    4% worse    no change   no change   7% better
 48x-HASWELL-NUMA      3% better   5% worse    7% worse    5% worse    no change

* tbench on loopback
    * global-dhp__network-tbench
    Measurable improvements across all machines, especially 8x-SKYLAKE-UMA.

                       teo-v1      teo-v2      teo-v3      teo-v5      teo-v6
 -------------------------------------------------------------------------------
 8x-SKYLAKE-UMA        1% worse    10% worse   11% worse   12% worse   7% better
 80x-BROADWELL-NUMA    1% worse    1% worse    no change   1% worse    4% better
 48x-HASWELL-NUMA      1% worse    2% worse    1% worse    1% worse    5% better


PREVIOUSLY REGRESSING BENCHMARKS: DETAIL
========================================

SOCKPERF-UDP-THROUGHPUT
=======================
NOTES: Test run in mode "throughput" over UDP. The varying parameter is the
    message size.
MEASURES: Throughput, in MBits/second
HIGHER is better

machine: 8x-SKYLAKE-UMA

                              4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0
                             vanilla                    teo        teo-v2+backport        teo-v3+backport        teo-v5+backport        teo-v6+backport
-------------------------------------------------------------------------------------------------------------------------------------------------------
Hmean     14        70.34 (   0.00%)       69.80 *  -0.76%*       69.11 *  -1.75%*       69.49 *  -1.20%*       69.71 *  -0.90%*       77.51 *  10.20%*
Hmean     100      499.24 (   0.00%)      494.26 *  -1.00%*      492.74 *  -1.30%*      494.90 *  -0.87%*      497.43 *  -0.36%*      549.93 *  10.15%*
Hmean     300     1489.13 (   0.00%)     1472.39 *  -1.12%*     1468.45 *  -1.39%*     1477.74 *  -0.76%*     1478.61 *  -0.71%*     1632.63 *   9.64%*
Hmean     500     2469.62 (   0.00%)     2444.41 *  -1.02%*     2434.61 *  -1.42%*     2454.15 *  -0.63%*     2454.76 *  -0.60%*     2698.70 *   9.28%*
Hmean     850     4165.12 (   0.00%)     4123.82 *  -0.99%*     4100.37 *  -1.55%*     4111.82 *  -1.28%*     4120.04 *  -1.08%*     4521.11 *   8.55%*

In the report I sent for v5 on this benchmark, I posted the table for
48x-HASWELL-NUMA; that one is now uninteresting (v5 fixed it and v6 didn't
change that), so the table above shows the detail for the improvement on
8x-SKYLAKE-UMA.
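
For reference, the percentages in the Hmean table above are consistent with a
plain relative difference against the vanilla column. A minimal Python sketch,
assuming that formula (gain_pct is a made-up helper name for illustration, not
an MMTests function; the last digit can differ from the table because the
table was presumably computed from unrounded values):

```python
# Relative change against the baseline for a higher-is-better metric.
# Assumption: this mirrors how the percentage column was derived.
# gain_pct is a hypothetical helper, not part of MMTests.

def gain_pct(baseline, value):
    """Return the change of `value` relative to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0

# Hmean throughput (MBits/s) on 8x-SKYLAKE-UMA, copied from the table above.
vanilla = {14: 70.34, 100: 499.24, 300: 1489.13, 500: 2469.62, 850: 4165.12}
teo_v6 = {14: 77.51, 100: 549.93, 300: 1632.63, 500: 2698.70, 850: 4521.11}

for size in vanilla:
    print(f"msg size {size:>3}: {gain_pct(vanilla[size], teo_v6[size]):+.2f}%")
```

A positive result means more throughput than the baseline, i.e. an improvement.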

DBENCH4
=======
NOTES: asynchronous IO; varies the number of clients up to NUMCPUS*8.
MEASURES: latency (millisecs)
LOWER is better

machine: 48x-HASWELL-NUMA
                              4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0
                             vanilla                    teo        teo-v2+backport        teo-v3+backport        teo-v5+backport        teo-v6+backport
-------------------------------------------------------------------------------------------------------------------------------------------------------
Amean      1        37.15 (   0.00%)       50.10 ( -34.86%)       39.02 (  -5.03%)       52.24 ( -40.63%)       51.62 ( -38.96%)       45.24 ( -21.78%)
Amean      2        43.75 (   0.00%)       45.50 (  -4.01%)       44.36 (  -1.39%)       47.25 (  -8.00%)       44.20 (  -1.03%)       44.30 (  -1.26%)
Amean      4        54.42 (   0.00%)       58.85 (  -8.15%)       58.17 (  -6.89%)       55.12 (  -1.29%)       58.07 (  -6.70%)       52.91 (   2.77%)
Amean      8        75.72 (   0.00%)       74.25 (   1.94%)       82.76 (  -9.30%)       78.63 (  -3.84%)       85.33 ( -12.68%)       70.26 (   7.22%)
Amean      16      116.56 (   0.00%)      119.88 (  -2.85%)      164.14 ( -40.82%)      124.87 (  -7.13%)      124.54 (  -6.85%)      110.95 (   4.81%)
Amean      32      570.02 (   0.00%)      561.92 (   1.42%)      681.94 ( -19.63%)      568.93 (   0.19%)      571.23 (  -0.21%)      543.10 (   4.72%)
Amean      64     3185.20 (   0.00%)     3291.80 (  -3.35%)     4337.43 ( -36.17%)     3181.13 (   0.13%)     3382.48 (  -6.19%)     3186.58 (  -0.04%)

The -21% on 1-client may not look exciting but it's leaps and bounds better
than what was on v5, plus most other num-clients improve measurably.
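
Two things in the table above are easy to misread: latency is lower-is-better,
so the sign of the percentage is flipped with respect to the throughput tables,
and the single per-machine figure in the overview is an aggregate over all
client counts. The Python sketch below recomputes both for the v5 and v6
columns. How the aggregate was actually computed is not stated in this message;
the geometric mean of the per-row latency ratios used here is an assumption,
and the function names are hypothetical.

```python
from math import exp, log

# Amean latency (ms) on 48x-HASWELL-NUMA for 1, 2, 4, 8, 16, 32, 64 clients,
# copied from the table above.
vanilla = [37.15, 43.75, 54.42, 75.72, 116.56, 570.02, 3185.20]
teo_v5 = [51.62, 44.20, 58.07, 85.33, 124.54, 571.23, 3382.48]
teo_v6 = [45.24, 44.30, 52.91, 70.26, 110.95, 543.10, 3186.58]

def gain_pct(baseline, value):
    """Per-row gain for a lower-is-better metric: higher latency is negative."""
    return (baseline - value) / baseline * 100.0

def overall_latency_change_pct(baseline_rows, new_rows):
    """Geometric mean of new/baseline ratios, as a percentage change.

    Assumption: this is only one plausible way to fold the rows into one
    number. A positive result means latency went up (worse).
    """
    logs = [log(n / b) for b, n in zip(baseline_rows, new_rows)]
    return (exp(sum(logs) / len(logs)) - 1.0) * 100.0

print(f"1 client, v6: {gain_pct(vanilla[0], teo_v6[0]):+.2f}%")
print(f"overall v5 latency change: {overall_latency_change_pct(vanilla, teo_v5):+.1f}%")
print(f"overall v6 latency change: {overall_latency_change_pct(vanilla, teo_v6):+.1f}%")
```

Under that assumption the aggregate comes out close to 10% worse for v5 and
close to zero for v6, in line with the overview table.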

NETPERF-UDP
===========
NOTES: Test run in mode "stream" over UDP. The varying parameter is the
    message size in bytes. Each measurement is taken 5 times and the
    harmonic mean is reported.
MEASURES: Throughput in MBits/second, both on the sender and on the receiver end.
HIGHER is better

machine: 8x-SKYLAKE-UMA
                                     4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0
                                    vanilla                    teo        teo-v2+backport        teo-v3+backport        teo-v5+backport        teo-v6+backport
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Hmean     send-64         362.27 (   0.00%)      362.87 (   0.16%)      318.85 * -11.99%*      347.08 *  -4.19%*      333.48 *  -7.95%*      402.61 *  11.13%*
Hmean     send-128        723.17 (   0.00%)      723.66 (   0.07%)      660.96 *  -8.60%*      676.46 *  -6.46%*      650.71 * -10.02%*      796.78 *  10.18%*
Hmean     send-256       1435.24 (   0.00%)     1427.08 (  -0.57%)     1346.22 *  -6.20%*     1359.59 *  -5.27%*     1323.83 *  -7.76%*     1590.55 *  10.82%*
Hmean     send-1024      5563.78 (   0.00%)     5529.90 *  -0.61%*     5228.28 *  -6.03%*     5382.04 *  -3.27%*     5271.99 *  -5.24%*     6117.42 *   9.95%*
Hmean     send-2048     10935.42 (   0.00%)    10809.66 *  -1.15%*    10521.14 *  -3.79%*    10610.29 *  -2.97%*    10544.58 *  -3.57%*    11512.14 *   5.27%*
Hmean     send-3312     16898.66 (   0.00%)    16539.89 *  -2.12%*    16240.87 *  -3.89%*    16271.23 *  -3.71%*    15968.89 *  -5.50%*    17600.72 *   4.15%*
Hmean     send-4096     19354.33 (   0.00%)    19185.43 (  -0.87%)    18600.52 *  -3.89%*    18692.16 *  -3.42%*    18408.69 *  -4.89%*    20494.07 *   5.89%*
Hmean     send-8192     32238.80 (   0.00%)    32275.57 (   0.11%)    29850.62 *  -7.41%*    30066.83 *  -6.74%*    29824.62 *  -7.49%*    35225.60 *   9.26%*
Hmean     send-16384    48146.75 (   0.00%)    49297.23 *   2.39%*    48295.51 (   0.31%)    48800.37 *   1.36%*    48247.73 (   0.21%)    53000.20 *  10.08%*
Hmean     recv-64         362.16 (   0.00%)      362.87 (   0.19%)      318.82 * -11.97%*      347.07 *  -4.17%*      333.48 *  -7.92%*      402.60 *  11.17%*
Hmean     recv-128        723.01 (   0.00%)      723.66 (   0.09%)      660.89 *  -8.59%*      676.39 *  -6.45%*      650.63 * -10.01%*      796.70 *  10.19%*
Hmean     recv-256       1435.06 (   0.00%)     1426.94 (  -0.57%)     1346.07 *  -6.20%*     1359.45 *  -5.27%*     1323.81 *  -7.75%*     1590.55 *  10.84%*
Hmean     recv-1024      5562.68 (   0.00%)     5529.90 *  -0.59%*     5228.28 *  -6.01%*     5381.37 *  -3.26%*     5271.45 *  -5.24%*     6117.42 *   9.97%*
Hmean     recv-2048     10934.36 (   0.00%)    10809.66 *  -1.14%*    10519.89 *  -3.79%*    10610.28 *  -2.96%*    10544.58 *  -3.56%*    11512.14 *   5.28%*
Hmean     recv-3312     16898.65 (   0.00%)    16538.21 *  -2.13%*    16240.86 *  -3.89%*    16269.34 *  -3.72%*    15967.13 *  -5.51%*    17598.31 *   4.14%*
Hmean     recv-4096     19351.99 (   0.00%)    19183.17 (  -0.87%)    18598.33 *  -3.89%*    18690.13 *  -3.42%*    18407.45 *  -4.88%*    20489.99 *   5.88%*
Hmean     recv-8192     32238.74 (   0.00%)    32275.13 (   0.11%)    29850.39 *  -7.41%*    30062.78 *  -6.75%*    29824.30 *  -7.49%*    35221.61 *   9.25%*
Hmean     recv-16384    48146.59 (   0.00%)    49296.23 *   2.39%*    48295.03 (   0.31%)    48786.88 *   1.33%*    48246.71 (   0.21%)    52993.72 *  10.07%*

Recovered!
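
Each Hmean row above is the harmonic mean of 5 measurements. A short Python
sketch of that statistic follows, checked against the standard library's
statistics.harmonic_mean. The sample values are made up for illustration and
are not taken from these runs; hmean is a hypothetical helper name.

```python
from statistics import harmonic_mean

def hmean(samples):
    """Harmonic mean: the sample count divided by the sum of reciprocals."""
    return len(samples) / sum(1.0 / s for s in samples)

# Hypothetical throughput readings (MBits/s) for five runs of one message
# size. These are illustration values, not data from this report.
runs = [400.0, 410.0, 395.0, 405.0, 250.0]

print(f"harmonic mean:   {hmean(runs):.2f}")
print(f"stdlib check:    {harmonic_mean(runs):.2f}")
print(f"arithmetic mean: {sum(runs) / len(runs):.2f}")
```

For positive samples the harmonic mean is never above the arithmetic mean, so
a single slow run pulls the reported throughput down more than it would with a
plain average.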

TBENCH4
=======
NOTES: networking counterpart of dbench. Varies the number of clients up to NUMCPUS*4
MEASURES: Throughput, MB/sec
HIGHER is better

machine: 8x-SKYLAKE-UMA
                                    4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0
                                   vanilla                    teo        teo-v2+backport        teo-v3+backport        teo-v5+backport        teo-v6+backport
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Hmean     mb/sec-1       620.52 (   0.00%)      613.98 *  -1.05%*      502.47 * -19.03%*      492.77 * -20.59%*      464.52 * -25.14%*      705.89 *  13.76%*
Hmean     mb/sec-2      1179.05 (   0.00%)     1112.84 *  -5.62%*      820.57 * -30.40%*      831.23 * -29.50%*      780.97 * -33.76%*     1303.87 *  10.59%*
Hmean     mb/sec-4      2072.29 (   0.00%)     2040.55 *  -1.53%*     2036.11 *  -1.75%*     2016.97 *  -2.67%*     2019.79 *  -2.53%*     2164.66 *   4.46%*
Hmean     mb/sec-8      4238.96 (   0.00%)     4205.01 *  -0.80%*     4124.59 *  -2.70%*     4098.06 *  -3.32%*     4171.64 *  -1.59%*     4354.18 *   2.72%*
Hmean     mb/sec-16     3515.96 (   0.00%)     3536.23 *   0.58%*     3500.02 *  -0.45%*     3438.60 *  -2.20%*     3456.89 *  -1.68%*     3688.76 *   4.91%*
Hmean     mb/sec-32     3452.92 (   0.00%)     3448.94 *  -0.12%*     3428.08 *  -0.72%*     3369.30 *  -2.42%*     3430.09 *  -0.66%*     3574.24 *   3.51%*

This one, too, not only is fixed but adds a solid improvement over the
baseline.


[1] https://github.com/gormanm/mmtests

Giovanni