On Thu, Jun 26, 2025 at 12:40:59AM +0530, Shrikanth Hegde wrote:
This is a follow-up version of [1] with a few additions. This is still an RFC,
and I would like to get feedback on the idea and suggestions for improvement.
v1->v2:
- Renamed to cpu_avoid_mask in place of cpu_parked_mask.
This one is not any better than the previous one. Why avoid? When to avoid?
I already said that: for objects, positive, self-explanatory noun names
are much better than negative and/or function-style verb names. I
suggested cpu_paravirt_mask, and I still believe it's a much better
option.
- Used a static key so that there is no impact on the regular case.
Static keys are not free and are designed for a different purpose. You
have CONFIG_PARAVIRT, and I don't understand why you're trying to avoid
using it.
I don't mind static keys if you prefer them; I just want to have
feature-specific code under the corresponding config.
Can you please post a bloat-o-meter report for CONFIG_PARAVIRT=n?
Do you have any perf numbers to justify static keys here?
- add sysfs file to show avoid CPUs.^
- Make RT understand avoid CPUs.
- Add documentation patch
- Took care of reported compile error in [1] when NR_CPUS=1
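On the static key vs CONFIG_PARAVIRT point above, this is roughly what I
mean by keeping feature-specific code under its config. It is only a
userspace sketch, not the patch: cpu_avoid(), set_cpu_avoid() and
avoid_bits are made-up names, the mask is modeled as a single 64-bit
word (so cpu must be below 64), and the kernel config option is modeled
with a plain #ifdef. With the option off, the helper is a constant false
and the compiler can drop the callers' branches.

```c
#include <stdbool.h>

/* Hypothetical model: one bit per CPU, cpu must be < 64. */
#ifdef CONFIG_PARAVIRT
static unsigned long long avoid_bits;

static inline void set_cpu_avoid(unsigned int cpu, bool avoid)
{
	if (avoid)
		avoid_bits |= 1ULL << cpu;
	else
		avoid_bits &= ~(1ULL << cpu);
}

static inline bool cpu_avoid(unsigned int cpu)
{
	return avoid_bits & (1ULL << cpu);
}
#else
/* Feature compiled out: no data, no runtime check. */
static inline void set_cpu_avoid(unsigned int cpu, bool avoid)
{
	(void)cpu;
	(void)avoid;
}

static inline bool cpu_avoid(unsigned int cpu)
{
	(void)cpu;
	return false;
}
#endif
```

Whether a static key on top of this buys anything is exactly what the
bloat-o-meter and perf numbers would tell us.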
-----------------
Problem statement
-----------------
vCPU - Virtual CPUs - CPU in VM world.
pCPU - Physical CPUs - CPU in baremetal world.
A hypervisor manages these vCPUs from different VMs. When a vCPU
requests a CPU, the hypervisor schedules it on a pCPU.
This issue occurs when there are more vCPUs (combined across all VMs)
than pCPUs. When *all* vCPUs are requesting CPU time, the hypervisor
can run only a few of them and the rest will be preempted (waiting for a pCPU).
If we take two VMs, when the hypervisor preempts a vCPU from VM1 to run a vCPU
from VM2, it has to save/restore the VM context. Instead, if VMs can coordinate
among each other and request a *limited* number of vCPUs, it avoids the above overhead and
Did this extra whitespace escape from the previous line, or the following one?
there is context switching within the vCPU (less expensive). Even if the hypervisor
is preempting one vCPU to run another within the same VM, it is still more
expensive than task preemption within the vCPU. So the *basic* aim is to avoid
vCPU preemption.
To achieve this, use the "CPU Avoid" concept, which marks vCPUs that the
workload should preferably avoid at this moment.
(vCPUs stay online; we don't want the overhead of a sched domain rebuild.)
Contention is dynamic in nature. Detecting when there is contention for
pCPUs is left to the architecture. Archs need to update the mask
accordingly.
When there is contention, use the limited set of vCPUs indicated by the arch.
When there is no contention, use all vCPUs.
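The arch-side mask update described above could be modeled roughly as
follows. This is a userspace sketch under assumptions: masks are plain
64-bit words, compute_avoid_mask() is a hypothetical name, and the policy
of keeping the lowest-numbered online CPUs is made up for illustration;
how an arch actually chooses the CPUs is not specified here.

```c
#include <stdint.h>

/*
 * Hypothetical policy: keep the lowest 'nr_usable' online CPUs and mark
 * every other online CPU as avoid. With no contention, the caller passes
 * nr_usable >= number of online CPUs and the result is an empty mask.
 */
static uint64_t compute_avoid_mask(uint64_t online, unsigned int nr_usable)
{
	uint64_t avoid = 0;
	unsigned int kept = 0;

	for (unsigned int cpu = 0; cpu < 64; cpu++) {
		uint64_t bit = UINT64_C(1) << cpu;

		if (!(online & bit))
			continue;	/* offline CPUs are never marked */
		if (kept < nr_usable)
			kept++;		/* still within the usable budget */
		else
			avoid |= bit;	/* over budget: ask tasks to move away */
	}
	return avoid;
}
```

For example, with 8 online CPUs and a budget of 4, CPUs 4-7 end up in the
avoid mask while all 8 stay online.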
-------------------------
To be done and Questions:
-------------------------
1. IRQs still don't understand cpu_avoid_mask. Maybe the irqbalance
code could be modified to do the same. I ran stress-ng --hrtimers, and the IRQs
did move off the avoid CPUs, though. So we need to see whether changes to
irqbalance are required or not.
2. If a task is spawned with affinity to only avoid CPUs, should that fail
or throw a warning to the user?
I think it's possible that existing codebases will do that. And because
you don't want to break userspace, you should not restrict it.
3. Other classes such as SCHED_EXT, SCHED_DL won't understand this infra
yet.
4. Performance testing is yet to be done. The RFC only verified the functional
aspects of whether tasks move off the avoid CPUs or not. The move happens quite
fast (around 1-2 seconds even on large systems with very high utilization).
5. Haven't come up with an infra which could combine all push-task-related
changes. They are currently spread across rt, dl, fair. Maybe some
consolidation can be done, but which tasks to push/pull still remains in
the class.
6. cpu_avoid_mask may need some sort of locking to ensure reads/writes are
correct.
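On question 2 above, not restricting userspace could look like the
following. It is a userspace sketch only: pick_cpu() is a hypothetical
helper, masks are 64-bit words, and "lowest-numbered CPU" stands in for
whatever selection the scheduler class really does. The point is the
fallback: the avoid mask is treated as a preference, so a task affined
only to avoid CPUs still runs there instead of failing.

```c
#include <stdint.h>

/*
 * Returns the lowest-numbered CPU the task may run on, or -1 if its
 * affinity contains no online CPU at all.
 */
static int pick_cpu(uint64_t affinity, uint64_t online, uint64_t avoid)
{
	uint64_t allowed = affinity & online;
	uint64_t preferred = allowed & ~avoid;
	/* Fall back to avoid CPUs rather than reject the affinity. */
	uint64_t pool = preferred ? preferred : allowed;

	for (int cpu = 0; cpu < 64; cpu++)
		if (pool & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}
```

With affinity 0x0c and avoid 0x0c, the task still lands on CPU 2; with
affinity 0x0f and avoid 0x03, it skips CPUs 0-1 and also lands on CPU 2.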
[1]: https://lore.kernel.org/all/20250523181448.3777233-1-sshegde@xxxxxxxxxxxxx/
Shrikanth Hegde (9):
sched/docs: Document avoid_cpu_mask and avoid CPU concept
cpumask: Introduce cpu_avoid_mask
sched/core: Don't allow to use CPU marked as avoid
sched/fair: Don't use CPU marked as avoid for wakeup and load balance
sched/rt: Don't select CPU marked as avoid for wakeup and push/pull rt task
sched/core: Push current task out if CPU is marked as avoid
sched: Add static key check for cpu_avoid
sysfs: Add cpu_avoid file
powerpc: add debug file for set/unset cpu avoid
Documentation/scheduler/sched-arch.rst | 25 +++++++++++++
arch/powerpc/include/asm/paravirt.h | 2 ++
arch/powerpc/kernel/smp.c | 50 ++++++++++++++++++++++++++
drivers/base/cpu.c | 8 +++++
include/linux/cpumask.h | 17 +++++++++
kernel/cpu.c | 3 ++
kernel/sched/core.c | 50 +++++++++++++++++++++++++-
kernel/sched/fair.c | 11 +++++-
kernel/sched/rt.c | 9 +++--
kernel/sched/sched.h | 10 ++++++
10 files changed, 181 insertions(+), 4 deletions(-)
--
2.43.0