Re: [PATCH RFC v2] sched: Minimize the idle cpu selection race window.

From: Matt Fleming
Date: Wed Feb 07 2018 - 08:59:11 EST


On Tue, 05 Dec, at 01:09:07PM, Atish Patra wrote:
> Currently, multiple tasks can wake up on the same cpu from the
> select_idle_sibling() path if they wake up simultaneously and last
> ran on the same llc. This happens because an idle cpu's state is not
> updated until its idle task is scheduled out. Any task waking up
> during that period may select that cpu as a wakeup candidate.
>
> Introduce a per-cpu variable that is set as soon as a cpu is
> selected for wakeup for any task. This prevents other tasks from
> selecting the same cpu again. Note: this does not close the race
> window, but narrows it to the accesses to the per-cpu variable. If
> two wakee tasks access the per-cpu variable at the same time, they
> may still select the same cpu. Even so, it shrinks the race window
> considerably.
>
> Here are some performance numbers:
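For anyone not familiar with the approach, here is a rough user-space
sketch of the claim-a-cpu idea described in the changelog above. The
names (cpu_claimed, pick_unclaimed_cpu, ...) are illustrative only, not
the identifiers used by the actual patch, and a plain flag stands in
for the kernel's per-cpu variable:

/*
 * Rough sketch of the claim-a-cpu idea: a waker marks a cpu as
 * claimed before the woken task has actually been scheduled there.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

/* Set as soon as a waker picks this cpu, cleared once a task runs there. */
static bool cpu_claimed[NR_CPUS];

/* Waker side: skip cpus that another waker has already claimed. */
static int pick_unclaimed_cpu(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!cpu_claimed[cpu]) {
			/*
			 * Two wakers can both observe the flag clear here
			 * and claim the same cpu: the window is narrowed
			 * to these two statements, not closed, exactly as
			 * the changelog says.
			 */
			cpu_claimed[cpu] = true;
			return cpu;
		}
	}
	return -1;	/* every cpu already claimed */
}

/* Schedule side: drop the claim once the woken task actually runs. */
static void task_started_running(int cpu)
{
	cpu_claimed[cpu] = false;
}

int main(void)
{
	int cpu = pick_unclaimed_cpu();

	printf("claimed cpu %d\n", cpu);
	if (cpu >= 0)
		task_started_running(cpu);
	return 0;
}

Presumably the kernel version checks and sets its per-cpu variable in
the select_idle_sibling() fast path, so the added cost per wakeup is
roughly one extra read and one extra write.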

I ran this patch through some tests here on the SUSE performance grid
and there's a definite regression for Mike's personal favourite
benchmark, tbench.

Here are the results: vanilla 4.15-rc9 on the left, -rc9 plus this
patch on the right.

tbench4
                           4.15.0-rc9                      4.15.0-rc9
                              vanilla  sched-minimize-idle-cpu-window
Min mb/sec-1 484.50 ( 0.00%) 463.03 ( -4.43%)
Min mb/sec-2 961.43 ( 0.00%) 959.35 ( -0.22%)
Min mb/sec-4 1789.60 ( 0.00%) 1760.21 ( -1.64%)
Min mb/sec-8 3518.51 ( 0.00%) 3471.47 ( -1.34%)
Min mb/sec-16 5521.12 ( 0.00%) 5409.77 ( -2.02%)
Min mb/sec-32 7268.61 ( 0.00%) 7491.29 ( 3.06%)
Min mb/sec-64 14413.45 ( 0.00%) 14347.69 ( -0.46%)
Min mb/sec-128 13501.84 ( 0.00%) 13413.82 ( -0.65%)
Min mb/sec-192 13237.02 ( 0.00%) 13231.43 ( -0.04%)
Hmean mb/sec-1 505.20 ( 0.00%) 485.81 ( -3.84%)
Hmean mb/sec-2 973.12 ( 0.00%) 970.67 ( -0.25%)
Hmean mb/sec-4 1835.22 ( 0.00%) 1788.54 ( -2.54%)
Hmean mb/sec-8 3529.35 ( 0.00%) 3487.20 ( -1.19%)
Hmean mb/sec-16 5531.16 ( 0.00%) 5437.43 ( -1.69%)
Hmean mb/sec-32 7627.96 ( 0.00%) 8021.26 ( 5.16%)
Hmean mb/sec-64 14441.20 ( 0.00%) 14395.08 ( -0.32%)
Hmean mb/sec-128 13620.40 ( 0.00%) 13569.17 ( -0.38%)
Hmean mb/sec-192 13265.26 ( 0.00%) 13263.98 ( -0.01%)
Max mb/sec-1 510.30 ( 0.00%) 489.89 ( -4.00%)
Max mb/sec-2 989.45 ( 0.00%) 976.10 ( -1.35%)
Max mb/sec-4 1845.65 ( 0.00%) 1795.50 ( -2.72%)
Max mb/sec-8 3574.03 ( 0.00%) 3547.56 ( -0.74%)
Max mb/sec-16 5556.99 ( 0.00%) 5564.80 ( 0.14%)
Max mb/sec-32 7678.18 ( 0.00%) 8098.63 ( 5.48%)
Max mb/sec-64 14463.07 ( 0.00%) 14437.58 ( -0.18%)
Max mb/sec-128 13659.67 ( 0.00%) 13602.65 ( -0.42%)
Max mb/sec-192 13612.01 ( 0.00%) 13832.98 ( 1.62%)

There's a nice little performance bump around the 32-client mark.
Incidentally, my test machine has 2 NUMA nodes with 24 cpus (12 cores,
2 threads) each. So 32 clients is the point at which things no longer
fit on a single node.

It doesn't look like the regression is caused by the schedule() path
being slightly longer (i.e. it's not a latency issue), because the
schbench results show improvements at the low end:

schbench
                           4.15.0-rc9                      4.15.0-rc9
                              vanilla  sched-minimize-idle-cpu-window
Lat 50.00th-qrtle-1 46.00 ( 0.00%) 36.00 ( 21.74%)
Lat 75.00th-qrtle-1 49.00 ( 0.00%) 37.00 ( 24.49%)
Lat 90.00th-qrtle-1 52.00 ( 0.00%) 38.00 ( 26.92%)
Lat 95.00th-qrtle-1 56.00 ( 0.00%) 41.00 ( 26.79%)
Lat 99.00th-qrtle-1 61.00 ( 0.00%) 46.00 ( 24.59%)
Lat 99.50th-qrtle-1 63.00 ( 0.00%) 48.00 ( 23.81%)
Lat 99.90th-qrtle-1 77.00 ( 0.00%) 64.00 ( 16.88%)
Lat 50.00th-qrtle-2 41.00 ( 0.00%) 41.00 ( 0.00%)
Lat 75.00th-qrtle-2 47.00 ( 0.00%) 46.00 ( 2.13%)
Lat 90.00th-qrtle-2 50.00 ( 0.00%) 49.00 ( 2.00%)
Lat 95.00th-qrtle-2 53.00 ( 0.00%) 52.00 ( 1.89%)
Lat 99.00th-qrtle-2 58.00 ( 0.00%) 57.00 ( 1.72%)
Lat 99.50th-qrtle-2 60.00 ( 0.00%) 59.00 ( 1.67%)
Lat 99.90th-qrtle-2 72.00 ( 0.00%) 69.00 ( 4.17%)
Lat 50.00th-qrtle-4 46.00 ( 0.00%) 45.00 ( 2.17%)
Lat 75.00th-qrtle-4 49.00 ( 0.00%) 48.00 ( 2.04%)
Lat 90.00th-qrtle-4 52.00 ( 0.00%) 51.00 ( 1.92%)
Lat 95.00th-qrtle-4 55.00 ( 0.00%) 53.00 ( 3.64%)
Lat 99.00th-qrtle-4 61.00 ( 0.00%) 59.00 ( 3.28%)
Lat 99.50th-qrtle-4 63.00 ( 0.00%) 61.00 ( 3.17%)
Lat 99.90th-qrtle-4 69.00 ( 0.00%) 74.00 ( -7.25%)
Lat 50.00th-qrtle-8 48.00 ( 0.00%) 50.00 ( -4.17%)
Lat 75.00th-qrtle-8 52.00 ( 0.00%) 54.00 ( -3.85%)
Lat 90.00th-qrtle-8 54.00 ( 0.00%) 58.00 ( -7.41%)
Lat 95.00th-qrtle-8 57.00 ( 0.00%) 61.00 ( -7.02%)
Lat 99.00th-qrtle-8 64.00 ( 0.00%) 68.00 ( -6.25%)
Lat 99.50th-qrtle-8 67.00 ( 0.00%) 72.00 ( -7.46%)
Lat 99.90th-qrtle-8 81.00 ( 0.00%) 81.00 ( 0.00%)
Lat 50.00th-qrtle-16 50.00 ( 0.00%) 47.00 ( 6.00%)
Lat 75.00th-qrtle-16 59.00 ( 0.00%) 57.00 ( 3.39%)
Lat 90.00th-qrtle-16 66.00 ( 0.00%) 65.00 ( 1.52%)
Lat 95.00th-qrtle-16 69.00 ( 0.00%) 68.00 ( 1.45%)
Lat 99.00th-qrtle-16 76.00 ( 0.00%) 75.00 ( 1.32%)
Lat 99.50th-qrtle-16 79.00 ( 0.00%) 79.00 ( 0.00%)
Lat 99.90th-qrtle-16 86.00 ( 0.00%) 89.00 ( -3.49%)
Lat 50.00th-qrtle-23 52.00 ( 0.00%) 52.00 ( 0.00%)
Lat 75.00th-qrtle-23 65.00 ( 0.00%) 65.00 ( 0.00%)
Lat 90.00th-qrtle-23 75.00 ( 0.00%) 74.00 ( 1.33%)
Lat 95.00th-qrtle-23 81.00 ( 0.00%) 79.00 ( 2.47%)
Lat 99.00th-qrtle-23 95.00 ( 0.00%) 90.00 ( 5.26%)
Lat 99.50th-qrtle-23 12624.00 ( 0.00%) 1050.00 ( 91.68%)
Lat 99.90th-qrtle-23 15184.00 ( 0.00%) 13872.00 ( 8.64%)

If you'd like to run these tests on your own machines, they're all
available at https://github.com/gormanm/mmtests.git.