Re: [RFC v2 0/9] cpu avoid state and push task mechanism

From: Shrikanth Hegde
Date: Thu Jun 26 2025 - 10:41:47 EST

In reply to: Yury Norov: "Re: [RFC v2 0/9] cpu avoid state and push task mechanism"

On 6/26/25 03:25, Yury Norov wrote:

On Thu, Jun 26, 2025 at 12:40:59AM +0530, Shrikanth Hegde wrote:

This is a follow-up to [1] with a few additions. This is still an RFC;
I would like to get feedback on the idea and suggestions for improvement.

v1->v2:
- Renamed cpu_parked_mask to cpu_avoid_mask.

This one is no better than the previous one. Why avoid? Avoid when?
As I already said, for objects, positive, self-explanatory noun names
are much better than negative and/or function-style verb names. I
suggested cpu_paravirt_mask, and I still believe it's a much better
option.

OK. The only reason is that the CPU is always paravirtualized in those
environments, right? We want to set this mask only when there is
contention for pCPUs, so I thought the name should reflect that.

I can keep cpu_paravirt_mask. Could you please suggest set/get names that would
go with it? cpu_paravirt(cpu)?

- Used a static key so that there is no impact on the regular case.

Static keys are not free and are designed for a different purpose. You have
CONFIG_PARAVIRT, and I don't understand why you're trying to avoid
using it.

I don't mind static keys if you prefer them; I just want to
have feature-specific code under the corresponding config.

Can you please post a bloat-o-meter report for CONFIG_PARAVIRT=n?
Do you have any perf numbers to justify static keys here?

I wanted to see if there could be any use cases other than paravirt.

One I thought of: on SMT systems under low utilization, it could give higher IPC by keeping the tasks on
only 1 thread. If base_slice is kept low, latency could be relatively low.

The other: workloads or system usage can be dynamic in nature, with peaks and troughs. In a trough, one may not want to use all
the cores (and use SMT siblings instead), thereby saving some power.

Using CONFIG_PARAVIRT could end up sprinkling a few ifdefs around. I need to see how I can minimize that.
Let me get back with bloat-o-meter numbers and performance numbers.
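
One common way to keep feature-specific code under its config option without sprinkling ifdefs over the call sites is to confine the #ifdef to one header and give the disabled case constant stubs. What follows is a minimal userspace sketch of that idiom, not code from the series: CONFIG_PARAVIRT is an ordinary macro standing in for the Kconfig symbol, cpu_paravirt() is the accessor name floated above, and paravirt_mask_bits, set_cpu_paravirt() and pick_cpu() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the Kconfig symbol; delete this line to build the stub variant. */
#define CONFIG_PARAVIRT 1

#ifdef CONFIG_PARAVIRT
/* Hypothetical storage: bit N set means CPU N should be avoided (64 CPUs max). */
static uint64_t paravirt_mask_bits;

static inline bool cpu_paravirt(unsigned int cpu)
{
	assert(cpu < 64);
	return (paravirt_mask_bits >> cpu) & 1;
}

static inline void set_cpu_paravirt(unsigned int cpu, bool avoid)
{
	assert(cpu < 64);
	if (avoid)
		paravirt_mask_bits |= UINT64_C(1) << cpu;
	else
		paravirt_mask_bits &= ~(UINT64_C(1) << cpu);
}
#else
/* Disabled case: constant stubs, so the compiler can drop the checks entirely. */
static inline bool cpu_paravirt(unsigned int cpu)
{
	(void)cpu;
	return false;
}

static inline void set_cpu_paravirt(unsigned int cpu, bool avoid)
{
	(void)cpu;
	(void)avoid;
}
#endif

/* The call site carries no #ifdef in either configuration. */
static inline unsigned int pick_cpu(unsigned int preferred, unsigned int fallback)
{
	return cpu_paravirt(preferred) ? fallback : preferred;
}
```

With the macro undefined, pick_cpu() always returns the preferred CPU, which is the "no impact on the regular case" property the static key was meant to provide.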

- Added a sysfs file to show avoid CPUs.
- Made RT understand avoid CPUs.
- Added a documentation patch.
- Fixed the compile error reported against [1] when NR_CPUS=1.

-----------------
Problem statement
-----------------
vCPU - Virtual CPUs - CPU in VM world.
pCPU - Physical CPUs - CPU in baremetal world.

A hypervisor manages these vCPUs from different VMs. When a vCPU
requests CPU time, the hypervisor schedules it on a pCPU.

The issue occurs when there are more vCPUs (combined across all VMs)
than pCPUs. When *all* vCPUs request CPU time, the hypervisor
can run only a few of them, and the rest are preempted (waiting for a pCPU).

Take two VMs. When the hypervisor preempts a vCPU from VM1 to run a vCPU from
VM2, it has to save and restore VM context. If instead the VMs co-ordinate with
each other and request a *limited* number of vCPUs, that overhead is avoided and

^
Did this extra whitespace escape from the previous line, or the following one?

Thanks for noticing it.
v

there is only context switching within the vCPU (less expensive). Even if the hypervisor
preempts one vCPU to run another within the same VM, it is still more
expensive than task preemption within a vCPU. So the *basic* aim is to avoid
vCPU preemption.

To achieve this, use the "CPU Avoid" concept: it is better
if the workload avoids these vCPUs for the moment.
(The vCPUs stay online; we don't want the overhead of a sched domain rebuild.)

Contention is dynamic in nature. Detecting when there is contention for pCPUs
is left to the architecture. Archs need to update the mask
accordingly.

When there is contention, use only the limited set of vCPUs indicated by the arch.
When there is no contention, use all vCPUs.
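
The two rules above can be modelled in userspace with plain 64-bit masks. This is only an illustrative sketch under that assumption; cpumask_model_t, mark_avoid(), usable_cpus() and first_cpu() are hypothetical names and not part of the series, which uses the kernel cpumask API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one bit per vCPU, at most 64 vCPUs. */
typedef uint64_t cpumask_model_t;

/* What the arch would do when it detects (or stops detecting) pCPU contention. */
static inline void mark_avoid(cpumask_model_t *avoid, unsigned int cpu, bool on)
{
	assert(cpu < 64);
	if (on)
		*avoid |= UINT64_C(1) << cpu;
	else
		*avoid &= ~(UINT64_C(1) << cpu);
}

/* vCPUs the workload should use: the online ones minus those marked avoid. */
static inline cpumask_model_t usable_cpus(cpumask_model_t online, cpumask_model_t avoid)
{
	return online & ~avoid;
}

/* Lowest-numbered vCPU set in the mask, or -1 if the mask is empty. */
static inline int first_cpu(cpumask_model_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if ((mask >> cpu) & 1)
			return cpu;
	return -1;
}
```

With an empty avoid mask, usable_cpus() returns the online mask unchanged, which is the no-contention case; the vCPUs stay online throughout, so nothing here corresponds to a sched domain rebuild.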

-------------------------
To be done and Questions:
-------------------------
1. IRQ - IRQs still don't understand this cpu_avoid_mask. Maybe the irqbalance
code could be modified to do the same. I ran stress-ng --hrtimers, and the IRQs
did move off the avoid CPUs though. So I need to see whether changes to irqbalance are
required or not.

2. If a task is spawned with its affinity set to only avoid CPUs, should that fail
or throw a warning to the user?

I think it's possible that the existing codebase will do that. And because
you don't want to break userspace, you should not restrict it.

OK, got it. Currently it is allowed.
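
One possible way to keep such an affinity working without restricting it is to prefer the non-avoid CPUs in the task's affinity and fall back to the full affinity when none are left. The sketch below models that choice with plain 64-bit masks; effective_affinity() is a hypothetical helper, and the series may handle this case differently.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the CPUs a task should run on: its allowed CPUs minus the avoid
 * CPUs, or all allowed CPUs if that difference is empty. An empty affinity
 * is treated as a caller bug in this model.
 */
static inline uint64_t effective_affinity(uint64_t allowed, uint64_t avoid)
{
	uint64_t preferred = allowed & ~avoid;

	assert(allowed != 0);

	return preferred ? preferred : allowed;
}
```

The fallback means a task pinned entirely to avoid CPUs still runs there, so existing userspace is not broken.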

3. Other classes such as SCHED_EXT and SCHED_DL don't understand this infra
yet.

4. Performance testing is yet to be done. The RFC has only verified the functional
aspect of whether tasks move off avoid CPUs or not. The move happens quite
fast (around 1-2 seconds even on large systems with very high utilization).

5. Haven't come up with an infra that could combine all push-task-related
changes. They are currently spread across rt, dl and fair. Maybe some
consolidation can be done, but which tasks to push/pull still remains in
the class.

6. cpu_avoid_mask may need some sort of locking to ensure reads/writes are
correct.
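
For item 6, one lock-free option is to update bits with atomic read-modify-write operations, so that concurrent writers never lose each other's updates and a reader always sees a whole word. The following is a userspace sketch with C11 atomics covering a single 64-bit word only; whether the real mask, which can span several words, needs stronger guarantees remains the open question. avoid_bits, avoid_set(), avoid_clear() and avoid_test() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-word avoid mask: bit N set means CPU N should be avoided. */
static _Atomic uint64_t avoid_bits;

/* Writer side: atomically set one bit without disturbing the others. */
static inline void avoid_set(unsigned int cpu)
{
	assert(cpu < 64);
	atomic_fetch_or_explicit(&avoid_bits, UINT64_C(1) << cpu,
				 memory_order_release);
}

/* Writer side: atomically clear one bit without disturbing the others. */
static inline void avoid_clear(unsigned int cpu)
{
	assert(cpu < 64);
	atomic_fetch_and_explicit(&avoid_bits, ~(UINT64_C(1) << cpu),
				  memory_order_release);
}

/* Reader side: one atomic load, then test the bit in the snapshot. */
static inline bool avoid_test(unsigned int cpu)
{
	assert(cpu < 64);
	return (atomic_load_explicit(&avoid_bits, memory_order_acquire) >> cpu) & 1;
}
```

A reader can still observe a value that becomes stale immediately afterwards; that is acceptable only if a momentarily wrong placement is corrected later, for example by the push mechanism.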

[1]: https://lore.kernel.org/all/20250523181448.3777233-1-sshegde@xxxxxxxxxxxxx/

Shrikanth Hegde (9):
sched/docs: Document avoid_cpu_mask and avoid CPU concept
cpumask: Introduce cpu_avoid_mask
sched/core: Don't allow to use CPU marked as avoid
sched/fair: Don't use CPU marked as avoid for wakeup and load balance
sched/rt: Don't select CPU marked as avoid for wakeup and push/pull rt task
sched/core: Push current task out if CPU is marked as avoid
sched: Add static key check for cpu_avoid
sysfs: Add cpu_avoid file
powerpc: add debug file for set/unset cpu avoid

 Documentation/scheduler/sched-arch.rst | 25 +++++++++++++
 arch/powerpc/include/asm/paravirt.h    |  2 ++
 arch/powerpc/kernel/smp.c              | 50 ++++++++++++++++++++++++++
 drivers/base/cpu.c                     |  8 +++++
 include/linux/cpumask.h                | 17 +++++++++
 kernel/cpu.c                           |  3 ++
 kernel/sched/core.c                    | 50 +++++++++++++++++++++++++-
 kernel/sched/fair.c                    | 11 +++++-
 kernel/sched/rt.c                      |  9 +++--
 kernel/sched/sched.h                   | 10 ++++++
 10 files changed, 181 insertions(+), 4 deletions(-)

--
2.43.0
