Re: [PATCH v2] sched: wake-affine throttle

From: Peter Zijlstra
Date: Wed May 22 2013 - 04:50:26 EST


On Tue, May 21, 2013 at 11:20:18AM +0800, Michael Wang wrote:
>
> The wake-affine code always tries to pull the wakee close to the waker. In
> theory, this helps when the waker's CPU has cached data that is hot for the
> wakee, or in the extreme ping-pong case; testing shows it can benefit
> hackbench by up to 15%.
>
> However, the whole feature works somewhat blindly: load balance is the only
> factor that is guaranteed, and since the mechanism itself is time-consuming,
> some workloads suffer; testing shows it can hurt pgbench by up to 41%.
>
> The feature is currently in mainline, which means the current scheduler
> sacrifices some workloads to benefit others; that is definitely unfair.
>
> Thus, this patch provides a way to throttle wake-affine, in order to
> adjust the gain and loss according to demand.
>
> The patch introduces a new knob, 'sysctl_sched_wake_affine_interval', with a
> default value of 1ms (the default minimum balance interval), which means
> wake-affine stays silent for 1ms after it fails.
>
> By turning the new knob, pgbench shows up to a 41% improvement over mainline,
> which currently uses wake-affine blindly.
>
> Link:
> Analysis from Mike Galbraith about the improvement:
> https://lkml.org/lkml/2013/4/11/54
>
> Analysis about the reason of throttle after failed:
> https://lkml.org/lkml/2013/5/3/31
>
> Test:
> Tested on a 12-CPU x86 server with tip 3.10.0-rc1.
>
> |         |         | base  | 1ms (default)   | 10ms interval   | 100ms interval  |
> | db_size | clients |  tps  |  tps  | vs base |  tps  | vs base |  tps  | vs base |
> +---------+---------+-------+-------+---------+-------+---------+-------+---------+
> | 22 MB   |       1 | 10828 | 10850 |         | 10795 |         | 10845 |         |
> | 22 MB   |       2 | 21434 | 21469 |         | 21463 |         | 21455 |         |
> | 22 MB   |       4 | 41563 | 41826 |         | 41789 |         | 41779 |         |
> | 22 MB   |       8 | 53451 | 54917 |         | 59250 |         | 59097 |         |
> | 22 MB   |      12 | 48681 | 50454 |         | 53248 |         | 54881 |         |
> | 22 MB   |      16 | 46352 | 49627 |  +7.07% | 54029 | +16.56% | 55935 | +20.67% |
> | 22 MB   |      24 | 44200 | 46745 |  +5.76% | 52106 | +17.89% | 57907 | +31.01% |
> | 22 MB   |      32 | 43567 | 45264 |  +3.90% | 51463 | +18.12% | 57122 | +31.11% |
> | 7484 MB |       1 |  8926 |  8959 |         |  8765 |         |  8682 |         |
> | 7484 MB |       2 | 19308 | 19470 |         | 19397 |         | 19409 |         |
> | 7484 MB |       4 | 37269 | 37501 |         | 37552 |         | 37470 |         |
> | 7484 MB |       8 | 47277 | 48452 |         | 51535 |         | 52095 |         |
> | 7484 MB |      12 | 42815 | 45347 |         | 48478 |         | 49256 |         |
> | 7484 MB |      16 | 40951 | 44063 |  +7.60% | 48536 | +18.52% | 51141 | +24.88% |
> | 7484 MB |      24 | 37389 | 39620 |  +5.97% | 47052 | +25.84% | 52720 | +41.00% |
> | 7484 MB |      32 | 36705 | 38109 |  +3.83% | 45932 | +25.14% | 51456 | +40.19% |
> | 15 GB   |       1 |  8642 |  8850 |         |  9092 |         |  8560 |         |
> | 15 GB   |       2 | 19256 | 19285 |         | 19362 |         | 19322 |         |
> | 15 GB   |       4 | 37114 | 37131 |         | 37221 |         | 37257 |         |
> | 15 GB   |       8 | 47120 | 48053 |         | 50845 |         | 50923 |         |
> | 15 GB   |      12 | 42386 | 44748 |         | 47868 |         | 48875 |         |
> | 15 GB   |      16 | 40624 | 43414 |  +6.87% | 48169 | +18.57% | 50814 | +25.08% |
> | 15 GB   |      24 | 37110 | 39096 |  +5.35% | 46594 | +25.56% | 52477 | +41.41% |
> | 15 GB   |      32 | 36252 | 37316 |  +2.94% | 45327 | +25.03% | 51217 | +41.28% |
>
> CC: Ingo Molnar <mingo@xxxxxxxxxx>
> CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> CC: Mike Galbraith <efault@xxxxxx>
> CC: Alex Shi <alex.shi@xxxxxxxxx>
> Suggested-by: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Signed-off-by: Michael Wang <wangyun@xxxxxxxxxxxxxxxxxx>

So I utterly hate this patch. I hate it worse than your initial buddy
patch :/

And I know it's got a Suggested-by there; but that was when you led me to
believe that wake_affine() itself was expensive to run. It's not; it's the
result of those runs you don't like.

While we have a ton (too many, to be sure) of scheduler tunables, users
shouldn't ever need to actually touch those. It's just that every time we
have to make a random choice it's as easy to make it a debug knob as to
hardcode it.

The problem with this patch is that users _have_ to frob knobs and while
doing so potentially wreck other workloads.

To make it worse, the knob isn't anything fundamental; it's a random
hack.

So I would really either improve the smarts of wake_affine, with for
example your wake buddy relation thing (and simply exempt [Soft]IRQs) or
kill wake_affine and be done with it.

Either avenue has the risk of regressing some workload, but at least
when that happens (and people report it) we'll have a counter-example to
learn from and incorporate.
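
For anyone following along, the throttle described in the quoted changelog
amounts to roughly the following. This is only a minimal userspace sketch
under stated assumptions, not the patch itself: struct wa_throttle,
wa_throttled(), wa_note_failure() and wa_try() are hypothetical names, the
affine_ok argument stands in for whatever the real wake_affine() decides, and
only the knob name and its 1ms default are taken from the changelog.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical throttle state; the patch's real bookkeeping may differ. */
struct wa_throttle {
	uint64_t next_try_ns;	/* earliest time wake-affine may run again */
};

/*
 * Stand-in for the sysctl_sched_wake_affine_interval knob, kept in
 * nanoseconds here. 1ms is the default named in the changelog.
 */
static uint64_t wake_affine_interval_ns = 1000000ULL;

/* True while wake-affine should stay silent. */
static bool wa_throttled(const struct wa_throttle *t, uint64_t now_ns)
{
	return now_ns < t->next_try_ns;
}

/* A failed attempt silences wake-affine for one interval. */
static void wa_note_failure(struct wa_throttle *t, uint64_t now_ns)
{
	t->next_try_ns = now_ns + wake_affine_interval_ns;
}

/*
 * Decide whether to do an affine wakeup at now_ns. affine_ok is the answer
 * the (assumed) underlying wake-affine check would have given.
 */
static bool wa_try(struct wa_throttle *t, uint64_t now_ns, bool affine_ok)
{
	if (wa_throttled(t, now_ns))
		return false;		/* still inside the quiet period */

	if (!affine_ok) {
		wa_note_failure(t, now_ns);
		return false;		/* failed: start a new quiet period */
	}

	return true;			/* pull the wakee to the waker's cpu */
}
```

Note that the interval here is a bare constant with no relation to the
workload, which is the objection raised above: picking a good value means
users have to tune it per workload.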