Re: [PATCH] sched/deadline: Derive root domain from active cpu in task's cpus_ptr

From: Pierre Gondois

Date: Fri Oct 10 2025 - 12:26:41 EST

On 10/6/25 14:12, Juri Lelli wrote:

On 06/10/25 12:13, Pierre Gondois wrote:

On 9/30/25 11:04, Peter Zijlstra wrote:

On Tue, Sep 30, 2025 at 08:20:06AM +0100, Juri Lelli wrote:

I actually wonder if we shouldn't make cppc_fie a "special" DEADLINE
task (like schedutil [1]). IIUC that is how it is thought to behave
already [2], but, since it's missing the SCHED_FLAG_SUGOV flag(/hack),
it is not "transparent" from a bandwidth tracking point of view.

[1] - https://elixir.bootlin.com/linux/v6.17/source/kernel/sched/cpufreq_schedutil.c#L661
[2] - https://elixir.bootlin.com/linux/v6.17/source/drivers/cpufreq/cppc_cpufreq.c#L198

Right, I remember that hack. Bit sad it's spreading, but this CPPC thing
is very much like the schedutil one, so might as well do that I suppose.

IIUC, the sugov thread was switched to deadline to allow frequency updates
when deadline tasks start to run. I.e. there should be no point updating the
freq. after the deadline task finished running, cf. [1] and [2].

The CPPC FIE worker should not need to run that quickly, as it seems to be
more like freq. maintenance work (the call comes from the sched tick):

sched_tick()
\-arch_scale_freq_tick() / topology_scale_freq_tick()
  \-set_freq_scale() / cppc_scale_freq_tick()
    \-irq_work_queue()

OK, but how much bandwidth is enough for it (on different platforms)?
Also, I am not sure the worker follows cpusets/root domain changes.

To share some additional information, I could reproduce the issue by
creating as many deadline tasks with a huge bandwidth as the platform
allows:
chrt -d -T 1000000 -P 1000000 0 yes > /dev/null &

Then kexec to another kernel. The available bandwidth of the root domain
gradually decreases with the number of CPUs unplugged.
At some point, there is not enough bandwidth and an overflow is detected.
(Same call stack as in the original message).
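
To make the arithmetic concrete, here is a rough user-space sketch of that
scenario. It is not the kernel code: dl_overflow() and first_overflow() are
hypothetical helpers, the 95% per-CPU limit is assumed to be the default
rt runtime/period setting, and CPU capacity scaling is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define BW_SHIFT 20	/* fixed-point shift used for bandwidth ratios in this sketch */

/* runtime/period as a fixed-point ratio (1 << BW_SHIFT means 100%). */
static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	assert(period != 0);
	return (runtime << BW_SHIFT) / period;
}

/*
 * Hypothetical helper: non-zero when the bandwidth already allocated
 * (total_bw) exceeds what 'cpus' CPUs can offer at 'limit' per CPU.
 */
static int dl_overflow(uint64_t limit, unsigned int cpus, uint64_t total_bw)
{
	return limit * cpus < total_bw;
}

/* Hypothetical helper: admit as many identical tasks as fit on 'cpus' CPUs. */
static unsigned int admit_tasks(uint64_t limit, unsigned int cpus,
				uint64_t task_bw)
{
	unsigned int tasks = 0;

	while (!dl_overflow(limit, cpus, (uint64_t)(tasks + 1) * task_bw))
		tasks++;
	return tasks;
}

/*
 * Hypothetical helper: unplug CPUs one at a time and return the number of
 * CPUs left when the overflow first shows up, or 0 if it never does.
 */
static unsigned int first_overflow(uint64_t limit, unsigned int cpus,
				   uint64_t total_bw)
{
	for (; cpus > 0; cpus--)
		if (dl_overflow(limit, cpus, total_bw))
			return cpus;
	return 0;
}
```

With 8 CPUs, the assumed 95% limit and tasks whose runtime equals their
period (as in the chrt command above), admit_tasks() accepts 7 tasks and
first_overflow() reports the overflow as soon as one CPU is gone.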

So I'm not sure this is really related to the cppc_fie thread.
I think it's more related to checking the available bandwidth in a context
which is not appropriate. Deadline bandwidth might be lacking while the
platform is being reset, but this should not matter much.

---

Question:
Since the cppc_fie worker doesn't have the SCHED_FLAG_SUGOV flag,
is this comment actually correct?
/*
 * Fake (unused) bandwidth; workaround to "fix"
 * priority inheritance.
 */

---

On a non-deadline related topic, the CPPC driver creates a cppc_fie worker in
case the CPPC counters used to estimate the current frequency are in PCC
channels. Accessing these channels requires going through sleeping sections,
which is why a worker is used.

However, CPPC counters might be accessed through FFH, which doesn't go through
sleeping sections. In that case, the cppc_fie worker is never used and never
removed, so it would be nice to remove it.