Re: [RFC PATCH] sched/fair: Interleave cfs bandwidth timers for improved single thread performance at low utilization

From: shrikanth hegde
Date: Thu Feb 16 2023 - 14:58:29 EST




On 2/16/23 3:02 AM, Benjamin Segall wrote:
> shrikanth hegde <sshegde@xxxxxxxxxxxxxxxxxx> writes:
>
>>>>
>>>>                    6.2.rc5                  |              with patch
>>>>          1CG  power  2CG  power             |  1CG  power  2CG        power
>>>> 1Core    218   44    315   46               |  219   45    277(+12%)   47(-2%)
>>>>          219   43    315   45               |  219   44    244(+22%)   48(-6%)
>>>>                                             |
>>>> 2Core    108   48    158   52               |  109   50    114(+26%)   59(-13%)
>>>>          109   49    157   52               |  109   49    136(+13%)   56(-7%)
>>>>                                             |
>>>> 4Core     60   59     89   65               |   62   58     72(+19%)   68(-5%)
>>>>           61   61     90   65               |   62   60     68(+24%)   73(-12%)
>>>>                                             |
>>>> 8Core     33   77     48   83               |   33   77     37(+23%)   91(-10%)
>>>>           33   77     48   84               |   33   77     38(+21%)   90(-7%)
>>>>
>>>> There is no benefit at higher utilization of 50% or more, but there is
>>>> no degradation either.
>>>>
>>>> This is RFC PATCH V2, where the code has been moved from hrtimer to
>>>> sched. This patch sets an initial value as a multiple of period/10.
>>>> Here timers can still align if the cgroups are started within the same
>>>> period/10 interval. On a real-life workload, the start time gives
>>>> sufficient randomness. There can be better interleaving by being more
>>>> deterministic. For example, when there are 2 cgroups, they should
>>>> have initial values of 0/50ms or 10/60ms and so on. When there are
>>>> 3 cgroups, 0/3/6ms or 1/4/7ms etc. That is more complicated as it has
>>>> to account for cgroup addition/deletion and accuracy w.r.t. period/quota.
>>>> If that approach is better here, then I will come up with that patch.
>>>
>>> This does seem vaguely reasonable, though the power argument of
>>> consolidating wakeups and such is something that we intentionally do in
>>> other situations.
>>>
>> Thank you Benjamin for taking a look and spending time reviewing this.
>>> How reasonable do you think it is to just say (and what do the
>>> equivalent numbers look like on your particular benchmark) "put some
>>> variance on your period config if you want variance"?
>> Run to run variance is expected with this patch as the patch depends
>> on time up to the last period/10 as the basis for interleaving.
>> That is what I could infer from this comment about variance. Please correct me if not.
>
> My question is what the numbers look like if you instead prepare the
> cgroups with periods that are something like 97 ms and 103ms instead of
> both 100ms (keeping the quota as the same proportion as the original).

oh ok. If the cgroups are prepared with slightly different period values, then
the timers do interleave. That is expected: the difference between the two
timers is small at the beginning, grows to a maximum at some point, and then
the timers align again later. Like below:

        |    /\
        |   /  \
 timer  |  /    \
 delta  | /      \
        |/________\____

              time -->
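
To put numbers on it, assuming both timers start aligned: with 97ms and 103ms
periods the corresponding expiries drift apart by 6ms every cycle, and since
97 and 103 are coprime the two timers only coincide exactly again after
lcm(97, 103) = 9991ms, i.e. roughly every 10 seconds; in between, the expiries
remain spread out to varying degrees.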

Did a set of experiments with these three period combinations. In all the
cases, each cgroup is allocated 25% of the runtime. The system has 8 cores
with SMT=8 (64 CPUs). The 100ms/100ms values are not the same as before,
since this was run on a different machine as the previous one was not
available; hence the 100/100 numbers are included again here.

                 6.2.rc6                   |         6.2.rc6 + patch
Period     1CG  power  2CG   power         |   1CG  power  2CG   power
97/103    27.8   78    32.9    98          |  27.5   75    33.4   102
97/103    27.3   78    33     101          |  27.9   71    32.8    97

100/100   27.5   82    40.2    93          |  27.5   80    34.2   105
100/100   28     86    40.1    94          |  27.7   78    30.1   110

75/125    27.3   89    32.7   102          |  27.3   84    33     106
75/125    27.1   87    33     105          |  27.1   90    33.1   100

A few observations:
1. We get improved performance when the two periods differ slightly from 100ms.
2. If the periods already have slight variance, there is no difference with
   the patch.
3. Power numbers vary a bit more when the periods have variance. This may be
   because the idle entry/exit are no longer aligning.
4. The best interleaving is still not possible just by giving the periods some
   variance; that can only happen with deterministic interleaving. The patch
   can hope to achieve that, but not always. A rough sketch of what I mean
   follows below.
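
For reference, here is a minimal sketch of the kind of deterministic placement
meant in point 4. The helper name, the idx/nr bookkeeping and where it would be
called from are all hypothetical and not part of the posted patch; only the idea
of spreading the initial period timer expiries evenly across one period comes
from the discussion above.

#include <linux/math64.h>
#include <linux/types.h>

/*
 * Hypothetical helper, not in the posted patch: if this cgroup is the
 * idx-th of nr bandwidth-controlled cgroups, place its first period
 * timer expiry idx/nr of the way into the period. With a 100ms period
 * that gives offsets of 0/50ms for 2 cgroups, 0/33/66ms for 3, and so on.
 */
static u64 cfs_bandwidth_initial_offset_ns(unsigned int idx, unsigned int nr,
                                           u64 period_ns)
{
        if (!nr)
                return 0;

        /* idx % nr keeps the offset inside one period. */
        return div_u64(period_ns * (idx % nr), nr);
}

The difficult part is keeping idx and nr stable as cgroups are created, deleted
or have their quota/period changed, which is why the posted patch sticks to the
simpler period/10 heuristic based on the start time.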