Re: [PATCH v2] sched/fair: unlink misfit task from cpu overutilized

From: Qais Yousef
Date: Sat Jan 14 2023 - 16:19:32 EST


On 01/13/23 15:28, Vincent Guittot wrote:
> Hi Kajetan,
>
> On Fri, 13 Jan 2023 at 14:50, Kajetan Puchalski
> <kajetan.puchalski@xxxxxxx> wrote:
> >
> > Hi,
> >
> > > By taking into account uclamp_min, the 1:1 relation between task misfit
> > > and cpu overutilized is no longer true, as a task with a small util_avg
> > > may not fit a high capacity cpu because of the uclamp_min constraint.
> > >
> > > Add a new state in util_fits_cpu() to reflect the case that task would fit
> > > a CPU except for the uclamp_min hint which is a performance requirement.
> > >
> > > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min, so
> > > we can use this new value to take additional action and select the best
> > > CPU among those that don't match the uclamp_min hint.
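
For anyone else following along, the tri-state idea described above boils down
to something like the below. This is a hand-written simplification, not the
actual patch: the real util_fits_cpu() also handles uclamp_max, thermal
pressure and so on, and fits_capacity() here just mirrors the kernel's ~20%
headroom margin.

/*
 * Illustrative sketch only -- not the actual patch.
 *
 * Return  1: the task fits the CPU.
 * Return  0: it doesn't fit; util is too high for the capacity.
 * Return -1: util fits, but the uclamp_min performance hint cannot be
 *            met on this CPU.
 */
#define fits_capacity(util, cap)	((util) * 1280 < (cap) * 1024)

static inline int util_fits_cpu_sketch(unsigned long util,
				       unsigned long uclamp_min,
				       unsigned long capacity)
{
	if (!fits_capacity(util, capacity))
		return 0;

	/* uclamp_min is a request, not actual usage: no margin applied. */
	if (uclamp_min > capacity)
		return -1;

	return 1;
}
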
> >
> > I just wanted to flag some issues I noticed with this patch and the
> > entire topic.
> >
> > I was testing this on a Pixel 6 with a 5.18 android-mainline kernel with
>
> Do you have more details to share about your setup?
> The Android kernel has some hacks on top of mainline. Do you use any?
> Also, perf and power can be largely impacted by the cgroup
> configuration. Do you have details on that?
>
> I'm going to try to reproduce the behavior
>
> > all the relevant uclamp and CFS scheduling patches backported to it from
> > mainline. From what I can see, the 'uclamp fits capacity' patchset
> > introduced some alarming power usage & performance issues that this
> > patch makes even worse.
> >
> > The patch stack for the following tables is as follows:
> >
> > (ufc_patched) sched/fair: unlink misfit task from cpu overutilized
>
> I just sent a v3 which fixes a condition. I wonder if this could have an
> impact on the results, both perf and power.
>
> > sched/uclamp: Fix a uninitialized variable warnings
> > (baseline_ufc) sched/fair: Check if prev_cpu has highest spare cap in feec()
> > sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition
> > sched/uclamp: Make cpu_overutilized() use util_fits_cpu()
> > sched/uclamp: Make asym_fits_capacity() use util_fits_cpu()
> > sched/uclamp: Make select_idle_capacity() use util_fits_cpu()
> > sched/uclamp: Fix fits_capacity() check in feec()
> > sched/uclamp: Make task_fits_capacity() use util_fits_cpu()
> > sched/uclamp: Fix relationship between uclamp and migration margin
> > (previous 'baseline' was here)
> >
> > I omitted the 3 patches relating directly to capacity_inversion but in

This could lead to confusion. Was there a specific reason for this omission?
Did you hit some problem?

> > the other tests I did with those patches applied there were similar
> > issues. It's probably easier to consider the uclamp parts and their
> > effects in isolation.
> >
> > 1. Geekbench 5 (performance regression)
> >
> > +-----------------+----------------------------+--------+-----------+
> > | metric          | kernel                     | value  | perc_diff |
> > +-----------------+----------------------------+--------+-----------+
> > | multicore_score | baseline                   | 2765.4 | 0.0%      |
> > | multicore_score | baseline_ufc               | 2704.3 | -2.21%    | <-- a noticeable score decrease already
> > | multicore_score | ufc_patched                | 2443.2 | -11.65%   | <-- a massive score decrease
> > +-----------------+----------------------------+--------+-----------+
> >
> > +--------------+--------+----------------------------+--------+-----------+
> > | chan_name    | metric | kernel                     | value  | perc_diff |
> > +--------------+--------+----------------------------+--------+-----------+
> > | total_power  | gmean  | baseline                   | 2664.0 | 0.0%      |
> > | total_power  | gmean  | baseline_ufc               | 2621.5 | -1.6%     | <-- worse performance per watt
> > | total_power  | gmean  | ufc_patched                | 2601.2 | -2.36%    | <-- much worse performance per watt
> > +--------------+--------+----------------------------+--------+-----------+

Hmm, in my past testing I think I've seen a better score, but at the cost of
more energy.

> >
> > The most likely cause of the regression seen above is the decrease in the
> > amount of time spent overutilized with these patches. Maximising
> > overutilization is the desired outcome for GB5, as the benchmark keeps
> > either 1 core or all the cores completely saturated for almost its entire
> > duration, so EAS cannot be effective. These patches have the opposite of
> > the desired effect in this area.
> >
> > +----------------------------+--------------------+--------------------+------------+
> > | kernel                     | time overutilized  | total_time         | percentage |
> > +----------------------------+--------------------+--------------------+------------+
> > | baseline                   | 121.979            | 181.065            | 67.46      |
> > | baseline_ufc               | 120.355            | 184.255            | 65.32      |
> > | ufc_patched                | 60.715             | 196.135            | 30.98      | <-- !!!
> > +----------------------------+--------------------+--------------------+------------+
>
> I'm not surprised, because some use cases which were not overutilized
> were wrongly flagged as overutilized, switching the system back to
> performance mode. You might have to tune the uclamp value.

I remember there were tasks with uclamp_min=512 or something like that in the
system. I wonder if they're making the difference here - they will steal time
from bigger cores and increase energy too.
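
Just to illustrate the order of magnitude with a toy userspace example
(illustrative only: the capacities are assumed Pixel 6-like values of roughly
160/500/1024 for little/mid/big, and fits_capacity() just mirrors the kernel's
~20% headroom idea):

#include <stdio.h>

/* Roughly the kernel's fits_capacity() margin: util must stay below ~80%. */
static int fits_capacity(unsigned long util, unsigned long cap)
{
	return util * 1280 < cap * 1024;
}

int main(void)
{
	/* Assumed capacities for illustration, not measured values. */
	unsigned long caps[] = { 160, 500, 1024 };	/* little / mid / big */
	unsigned long util_avg = 50, uclamp_min = 512;
	/* The hint acts as a floor on the value used for task placement. */
	unsigned long util = util_avg > uclamp_min ? util_avg : uclamp_min;

	for (int i = 0; i < 3; i++)
		printf("capacity %4lu: fits=%d\n",
		       caps[i], fits_capacity(util, caps[i]));

	return 0;
}

So a task that barely does any work but carries uclamp_min=512 can only land
on the big cores, and it will pull their requested frequency up as well, which
is where the extra energy would come from.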

The big jump with Vincent's patches is strange though. How many iterations do
you run? How long do you wait between each iteration?

The original behavior of overutilized with regard to util_avg shouldn't have
changed. It's worth digging a bit more into it.
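
To make that expectation concrete, after the series the overutilized check is
supposed to look roughly like the below (a sketch from memory rather than a
verified copy of the code, using the kernel's cpu_util_cfs(), uclamp_rq_get()
and the tri-state util_fits_cpu() from the series):

/*
 * Sketch from memory, not a verified copy. util_fits_cpu() returns
 * 1 (fits), 0 (util too high) or -1 (only uclamp_min is violated);
 * only the 0 case should mark the CPU overutilized.
 */
static inline int cpu_overutilized_sketch(int cpu)
{
	unsigned long util = cpu_util_cfs(cpu);
	unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
	unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);

	/* -1 is truthy, so a uclamp_min-only misfit does NOT overutilize. */
	return !util_fits_cpu(util, rq_util_min, rq_util_max, cpu);
}

A GB5-style run that saturates the CPUs with plain util_avg should therefore
still spend about the same time overutilized, which is why the drop to ~31%
above looks suspicious to me.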

I looked at my previous results and I was seeing ~57% on an android12-5.10
kernel, both with and without the patch.

>
> >
> > 2. Jankbench (power usage regression)
> >
> > +--------+---------------+---------------------------------+-------+-----------+
> > | metric | variable      | kernel                          | value | perc_diff |
> > +--------+---------------+---------------------------------+-------+-----------+
> > | gmean  | mean_duration | baseline_60hz                   | 14.6  | 0.0%      |
> > | gmean  | mean_duration | baseline_ufc_60hz               | 15.2  | 3.83%     |
> > | gmean  | mean_duration | ufc_patched_60hz                | 14.0  | -4.12%    |
> > +--------+---------------+---------------------------------+-------+-----------+
> >
> > +--------+-----------+---------------------------------+-------+-----------+
> > | metric | variable  | kernel                          | value | perc_diff |
> > +--------+-----------+---------------------------------+-------+-----------+
> > | gmean  | jank_perc | baseline_60hz                   | 1.9   | 0.0%      |
> > | gmean  | jank_perc | baseline_ufc_60hz               | 2.2   | 15.39%    |
> > | gmean  | jank_perc | ufc_patched_60hz                | 2.0   | 3.61%     |
> > +--------+-----------+---------------------------------+-------+-----------+

How many iterations did you run? Do you think they could be within the noise
region?

> >
> > +--------------+--------+---------------------------------+-------+-----------+
> > | chan_name    | metric | kernel                          | value | perc_diff |
> > +--------------+--------+---------------------------------+-------+-----------+
> > | total_power  | gmean  | baseline_60hz                   | 135.9 | 0.0%      |
> > | total_power  | gmean  | baseline_ufc_60hz               | 155.7 | 14.61%    | <-- !!!
> > | total_power  | gmean  | ufc_patched_60hz                | 157.1 | 15.63%    | <-- !!!
> > +--------------+--------+---------------------------------+-------+-----------+
> >
> > With these patches, while running Jankbench, we use ~15% more power just
> > to achieve roughly the same results. I'm not sure exactly where this
> > issue is coming from, but all the results above are very consistent
> > across different runs.

Have you tried to look at uclamp_min/max values of the tasks/cpus?

Do you know which cluster ended up using more energy? Have you looked at freq
residency between the runs?

I think the system could dynamically boost some UI tasks. I'd expect their
residency on the medium/big cores to increase compared to before the patch,
which could explain what you see.

Did you check that schedutil's rate_limit_us is not using the default 10ms? It
should be 500us.

I had another patch to set the transition latency of the cpufreq driver to
500us instead of 5ms, but I doubt this was actually making any difference.

FWIW, I did compare my series against the vanilla Pixel 6 kernel, where I ran
geekbench, speedometer and pcmark, all with and without heavy background
activities, to measure the impact of uclamp_max, and nothing stood out then.

I sadly lost my original setup now. I doubt I'll be able to recreate it to
re-run these tests again anytime soon :/

Could you try removing all thermal handling from util_fits_cpu(), so that my
series behaves like v1 again, and see if this makes a difference? It would
highlight whether subtle issues with thermal pressure handling are the
culprit. The most obvious suspect is the use of instantaneous thermal pressure
in ufc().
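
Roughly, the experiment I have in mind is the below. It's an untested sketch
of the relevant lines inside util_fits_cpu(), not a proper diff, just to show
where the thermal term comes in:

	/*
	 * Untested sketch, not a proper diff. Today uclamp_min is checked
	 * against a capacity reduced by the *instantaneous* thermal
	 * pressure, roughly:
	 */
	capacity_orig = capacity_orig_of(cpu);
	capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);

	/* For the experiment, either drop the thermal term altogether... */
	capacity_orig_thermal = capacity_orig;

	/*
	 * ...or try the PELT-averaged signal instead of the instantaneous
	 * one, and see whether the Jankbench power numbers move:
	 */
	capacity_orig_thermal = capacity_orig - thermal_load_avg(cpu_rq(cpu));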


Thanks!

--
Qais Yousef