Re: [RFC PATCH 0/7] sched: cpufreq: Remove magic margins

From: Lukasz Luba
Date: Wed Sep 06 2023 - 05:17:49 EST


Hi Qais,

On 8/28/23 00:31, Qais Yousef wrote:
> Since the introduction of EAS and schedutil, we had two magic 0.8 and 1.25
> margins applied in fits_capacity() and apply_dvfs_headroom().
>
> As reported two years ago in
>
> https://lore.kernel.org/lkml/1623855954-6970-1-git-send-email-yt.chang@xxxxxxxxxxxx/
>
> these values are not good fit for all systems and people do feel the need to
> modify them regularly out of tree.

That is true; in the Android kernel those are known 'features'. Furthermore,
in my game testing it looks like higher margins do help to reduce the
number of dropped frames, while on other types of workloads (e.g.
those that you have in the link above) a 0% margin shows better energy.

I remember also the results from MTK regarding the PELT HALF_LIFE

https://lore.kernel.org/all/0f82011994be68502fd9833e499749866539c3df.camel@xxxxxxxxxxxx/

The numbers for the 8ms half_life were showing a really nice improvement
for the 'min fps' metric. I got similar results with a higher margin.

IMO we can derive quite important information from those different
experiments:
More sustainable workloads like "Yahoo browser" don't need margin.
More unpredictable workloads like "Fortnite" (shooter game with 'open
world') need some decent margin.

The problem is that a periodic task can be 'noisy'. The low-pass
filter, which is our exponentially weighted moving average (PELT), will
'smooth' the measured values. It will block sudden 'spikes' since
they are high-frequency changes. Those sudden 'spikes' are
the task activations where we need to compute a bit longer, e.g.
there was an explosion in the game. The 25% margin helps us to
be ready for this 'noisy' task - the CPU frequency (and thus capacity)
is higher. So if a sudden need for longer computation
is seen, then we have enough 'idle time' (~25% idle) to serve this
properly and not lose the frame.
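
To illustrate the smoothing, here is a minimal sketch of a PELT-like
filter. It is not the kernel implementation: it assumes a plain per-1ms
geometric decay, a 16ms frame period and a made-up workload (4ms busy per
frame, then one 12ms 'spike' frame). The function name and the workload
numbers are hypothetical. The 8ms half-life is included only for
comparison with the default 32ms.

```python
def pelt_like_trace(half_life_ms, busy_ms_per_frame, frame_ms=16):
    """Simplified PELT-like signal, updated once per millisecond.

    Assumption: util decays geometrically so that a contribution is
    halved after half_life_ms; 1024 means 'running all the time'.
    """
    y = 0.5 ** (1.0 / half_life_ms)  # per-ms decay factor
    util = 0.0
    trace = []
    for busy_ms in busy_ms_per_frame:
        for ms in range(frame_ms):
            running = 1024.0 if ms < busy_ms else 0.0
            util = util * y + running * (1.0 - y)
            trace.append(util)
    return trace


# 40 regular frames (4ms busy) to reach steady state, then one spike frame.
frames = [4] * 40 + [12]

# Highest value the signal reaches during the spike frame.
peak_32 = max(pelt_like_trace(32, frames)[-16:])
peak_8 = max(pelt_like_trace(8, frames)[-16:])
```

In this toy model the 32ms filter rises far less during the spike frame
than the 8ms one, so the spike is largely hidden from anything that
consumes the signal.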

The margin helps in two ways for 'noisy' workloads:
1. in fits_capacity(), to avoid a CPU which couldn't handle the task
and to prefer CPUs with higher capacity
2. it asks for longer 'idle time', e.g. 25-40% (depends on margin), to
serve a sudden computation need
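
The two roles of the margin can be sketched as below. This is a Python
rendering of what I understand the current 0.8 and 1.25 factors to be
(util * 1280 < capacity * 1024, and util + util/4); idle_fraction() is a
hypothetical helper added only for illustration, assuming the frequency
request is granted exactly and performance scales linearly with frequency.

```python
def fits_capacity(util, capacity):
    # True when util is below ~80% of the CPU capacity (the 0.8 margin).
    return util * 1280 < capacity * 1024


def apply_dvfs_headroom(util):
    # Request 25% more than the measured utilization (the 1.25 margin).
    return util + (util >> 2)


def idle_fraction(util):
    # Hypothetical helper: share of time the CPU would be idle if it ran
    # at exactly the performance level requested for this utilization.
    return 1.0 - util / apply_dvfs_headroom(util)


fits_small = fits_capacity(400, 512)   # 400 is below 80% of 512
fits_tight = fits_capacity(420, 512)   # 420 is above 80% of 512
request = apply_dvfs_headroom(400)     # 400 + 100
spare = idle_fraction(400)
```

Note that under these assumptions a 25% headroom on top of the utilization
corresponds to about 20% idle time at the requested performance level
(1 - 1/1.25).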

IIUC, your proposal is to:
1. extend the low-pass filter to some higher frequency, so we
could see those 'spikes' - that's the PELT HALF_LIFE boot
parameter for 8ms
1.1. You are likely to have a 'gift' from the Util_est
which picks the max util_avg values and maintains them
for a while. That's why the 8ms PELT information can last longer
and you can get higher frequency and longer idle time.
2. Plumb in this new idea of dvfs_update_delay as the new
'margin' - this I don't understand
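
A rough sketch of the 'gift' mentioned in 1.1., assuming a simplified
util_est: an EWMA over the utilization sampled at each dequeue, with
weight 1/4, which jumps up immediately when the new sample is higher
(fast ramp-up) and only decays slowly afterwards. The function name and
the sample values are hypothetical, and details such as the update
margin are left out.

```python
def util_est_sketch(samples):
    """Return the estimated utilization after each dequeue sample.

    Assumptions: ramp up instantly to a higher sample, otherwise move
    1/4 of the way towards the new sample; the estimate is the max of
    the EWMA and the latest sample.
    """
    ewma = 0
    estimates = []
    for enqueued in samples:
        if enqueued > ewma:
            ewma = enqueued  # fast ramp-up
        else:
            ewma = (4 * ewma + (enqueued - ewma)) >> 2  # slow decay
        estimates.append(max(ewma, enqueued))
    return estimates


# Regular activations around 400, with one spike to 700.
est = util_est_sketch([400, 400, 700, 400, 400])
```

In this sketch the estimate stays well above 400 for the activations
following the spike, which is the effect that keeps the frequency higher
for a while.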

Regarding 2., I don't think the DVFS HW characteristics are the right
basis for solving this problem. We can have really fast DVFS HW
and still need some decent spare idle time in some workloads; those
are two independent issues IMO. You might get the higher
idle time thanks to 1.1., but this is a 'side effect'.

Could you explain a bit more why this dvfs_update_delay is
crucial here?

Regards,
Lukasz