Re: [RFC PATCH] sched/fair: Interleave cfs bandwidth timers for improved single thread performance at low utilization

From: shrikanth hegde
Date: Wed Feb 15 2023 - 06:02:25 EST


>>
>>                 6.2.rc5                |            with patch
>>         1CG  power  2CG  power         |  1CG  power  2CG        power
>> 1Core   218  44     315  46            |  219  45     277(+12%)  47(-2%)
>>         219  43     315  45            |  219  44     244(+22%)  48(-6%)
>>                                        |
>> 2Core   108  48     158  52            |  109  50     114(+26%)  59(-13%)
>>         109  49     157  52            |  109  49     136(+13%)  56(-7%)
>>                                        |
>> 4Core    60  59      89  65            |   62  58      72(+19%)  68(-5%)
>>          61  61      90  65            |   62  60      68(+24%)  73(-12%)
>>                                        |
>> 8Core    33  77      48  83            |   33  77      37(+23%)  91(-10%)
>>          33  77      48  84            |   33  77      38(+21%)  90(-7%)
>>
>> There is no benefit at higher utilizations of 50% or more, but there
>> is no degradation either.
>>
>> This is RFC PATCH V2, where the code has been moved from hrtimer to
>> sched. The patch sets the initial timer value to a multiple of
>> period/10. Timers can still align if the cgroups are started within
>> the same period/10 interval, but on a real-life workload the start
>> times give sufficient randomness. Better interleaving is possible by
>> being more deterministic. For example, with 2 cgroups and a 100ms
>> period, the initial values should be 0/50ms or 10/60ms, and so on;
>> with 3 cgroups, buckets 0/3/6 (0/30/60ms) or 1/4/7 (10/40/70ms), etc.
>> That is more complicated, as it has to account for cgroup
>> addition/deletion and accuracy w.r.t. period/quota. If that approach
>> is better here, I will come up with that patch.
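
To illustrate the deterministic variant, here is a rough userspace
sketch (hypothetical helper, not kernel code) of spreading N cgroups
across the ten period/10 buckets:

#include <stdio.h>

/* Assumes nr <= 10, i.e. no more cgroups than buckets. */
static long long initial_offset_ns(int idx, int nr, long long period_ns)
{
	long long bucket_ns = period_ns / 10;	/* same granularity as the patch */

	/*
	 * Spread cgroups evenly across the ten period/10 buckets:
	 * nr == 2 -> buckets 0/5, nr == 3 -> buckets 0/3/6, ...
	 */
	return (long long)idx * (10 / nr) * bucket_ns;
}

int main(void)
{
	long long period_ns = 100000000LL;	/* 100ms period */

	/* For 2 cgroups this prints 0/50ms; for 3 cgroups, 0/30/60ms. */
	for (int nr = 2; nr <= 3; nr++)
		for (int i = 0; i < nr; i++)
			printf("%d cgroups: cgroup %d starts at +%lld ms\n",
			       nr, i, initial_offset_ns(i, nr, period_ns) / 1000000);
	return 0;
}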
>
> This does seem vaguely reasonable, though the power argument of
> consolidating wakeups and such is something that we intentionally do in
> other situations.
>
Thank you, Benjamin, for taking a look and spending time reviewing this.
> How reasonable do you think it is to just say (and what do the
> equivalent numbers look like on your particular benchmark) "put some
> variance on your period config if you want variance"?
Run-to-run variance is expected with this patch, since it uses the time
elapsed up to the last period/10 boundary as the basis for interleaving.
That is what I could infer from this comment about variance; please
correct me if that is not what you meant.
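
For comparison, here is what "put some variance on your period config"
might look like from userspace; a minimal sketch, assuming a cgroup v1
hierarchy at /sys/fs/cgroup/cpu/CG1 and made-up base values (cgroup v2
would write "quota period" into cpu.max instead):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
	long base_period_us = 100000;	/* 100ms */
	long base_quota_us  = 50000;	/* 50% of one CPU */

	srand((unsigned)time(NULL));
	/* Jitter the period by up to +/- period/20. */
	long jitter = rand() % (base_period_us / 10) - base_period_us / 20;
	long period_us = base_period_us + jitter;
	/* Scale quota with the period so the allowed ratio is unchanged. */
	long quota_us = (long)((long long)base_quota_us * period_us /
			       base_period_us);

	printf("echo %ld > /sys/fs/cgroup/cpu/CG1/cpu.cfs_period_us\n",
	       period_us);
	printf("echo %ld > /sys/fs/cgroup/cpu/CG1/cpu.cfs_quota_us\n",
	       quota_us);
	return 0;
}

Since each cgroup then runs on a slightly different period, their
timers drift apart over time even if they happen to start aligned.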

>>
>> Signed-off-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx>
>> ---
>> kernel/sched/fair.c | 17 ++++++++++++++---
>> 1 file changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ff4dbbae3b10..7b69c329e05d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5939,14 +5939,25 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>>
>> void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> {
>> - lockdep_assert_held(&cfs_b->lock);
>> + struct hrtimer *period_timer = &cfs_b->period_timer;
>> + s64 incr = ktime_to_ns(cfs_b->period) / 10;
>> + ktime_t delta;
>> + u64 orun = 1;
>>
>> + lockdep_assert_held(&cfs_b->lock);
>> if (cfs_b->period_active)
>> return;
>>
>> cfs_b->period_active = 1;
>> - hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>> - hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>> + delta = ktime_sub(period_timer->base->get_time(),
>> + hrtimer_get_expires(period_timer));
>> + if (unlikely(delta >= cfs_b->period)) {
>
> Probably could have a short comment here that's something like "forward
> the hrtimer by period / 10 to reduce synchronized wakeups"
>
Sure. Will do in the next version of this patch.
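Something along these lines, perhaps (wording is only a suggestion):

	/*
	 * Forward the expiry by a multiple of period/10 so that the
	 * period timers of different cgroups expire at staggered
	 * offsets, reducing synchronized wakeups.
	 */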

>> + orun = ktime_divns(delta, incr);
>> + hrtimer_add_expires_ns(period_timer, incr * orun);
>> + }
>> +
>> + hrtimer_forward_now(period_timer, cfs_b->period);
>> + hrtimer_start_expires(period_timer, HRTIMER_MODE_ABS_PINNED);
>> }
>>
>> static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> --
>> 2.31.1
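
To make the new expiry computation concrete, here is a standalone
arithmetic sketch of the hunk above, with made-up values and plain
integers in place of the ktime/hrtimer helpers (the while loop stands
in for hrtimer_forward_now()):

#include <stdio.h>

int main(void)
{
	long long period = 100000000LL;	/* 100ms in ns */
	long long incr = period / 10;	/* 10ms buckets */
	long long expires = 0;		/* stale/initial expiry */
	long long now = 1234567890LL;	/* "current" time, made up */

	long long delta = now - expires;
	if (delta >= period) {
		/* Snap the expiry to a period/10 boundary at or below now. */
		long long orun = delta / incr;
		expires += incr * orun;
	}
	/* hrtimer_forward_now() then pushes it past now in period steps. */
	while (expires <= now)
		expires += period;

	/*
	 * The resulting offset within the period depends on when the
	 * cgroup was started, which is what staggers the timers of
	 * different cgroups: this example prints "30 ms into its period".
	 */
	printf("first expiry at %lld ns, %lld ms into its period\n",
	       expires, (expires % period) / 1000000);
	return 0;
}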