Re: sched: cgroup cpu.weight unfairness for intermittent tasks on wake-up

From: Carl-Elliott Bilodeau-Savaria
Date: Sat Jul 26 2025 - 16:59:50 EST


Hi everyone,

Apologies for the noise. I'm gently pinging on this scheduling question from about 10 days ago, as it may have been missed. I've now added the scheduler mailing list and the relevant maintainers to the Cc list.

I've also created a small GitHub repo to reproduce the issue: https://github.com/normal-account/sched-wakeup-locality-test

Any insights would be greatly appreciated.

Thanks,
Carl-Elliott

________________________________________
From: Carl-Elliott Bilodeau-Savaria
Sent: Tuesday, July 15, 2025 6:44 PM
To: linux-kernel@xxxxxxxxxxxxxxx
Cc: peterz@xxxxxxxxxxxxx
Subject: sched: cgroup cpu.weight unfairness for intermittent tasks on wake-up

Hi sched maintainers,

I'm observing a CPU fairness issue on kernel 6.14 with intermittent ("bursty") workloads under cgroup v2: tasks do not receive CPU time proportional to their configured cpu.weight values.


SYSTEM & TEST SETUP
-------------------------

System Details:
- CPU: Intel Core i9-9980HK (8 cores, 16 threads, single L3 cache).
- CONFIG_PREEMPT=y
- CPU governor: performance
- SMT: Enabled

Workloads:
- continuous-burn: A simple, non-stop while(1) loop.
- intermittent-burn: A loop that burns CPU for 3 seconds, then sleeps for 3 seconds.
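
For reference, minimal sketches of the two workloads as separate programs (simplified; the versions in the repo above also count loop iterations, which is how throughput is measured below):

/* continuous-burn.c (sketch): */
int main(void)
{
        while (1)
                ;                       /* burn CPU non-stop */
}

/* intermittent-burn.c (sketch): burn ~3s, sleep 3s, repeat. */
#include <time.h>
#include <unistd.h>

int main(void)
{
        while (1) {
                time_t end = time(NULL) + 3;
                while (time(NULL) < end)
                        ;               /* burn */
                sleep(3);               /* idle */
        }
}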

Cgroup Configuration:

parent/ (cpuset.cpus="0-1")
├── lw/ (cpu.weight=1)
│   └── 1x continuous-burn process
└── hw/ (cpu.weight=10000)
    └── 2x intermittent-burn processes

The goal is to have the two intermittent processes in the hw group strongly prioritized over the single continuous process in the lw group on CPUs 0 and 1.
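
The layout can be reproduced against cgroupfs roughly as follows. This is a sketch rather than the repo's actual setup (the cg_write() helper is mine); it assumes cgroup v2 mounted at /sys/fs/cgroup, running as root, with the cpu/cpuset controllers available at the root:

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Write a value to a cgroupfs file, aborting on error. */
static void cg_write(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");
        if (!f || fputs(val, f) == EOF || fclose(f) == EOF) {
                perror(path);
                exit(1);
        }
}

int main(void)
{
        /* Delegate the cpu and cpuset controllers from the root. */
        cg_write("/sys/fs/cgroup/cgroup.subtree_control", "+cpu +cpuset");

        mkdir("/sys/fs/cgroup/parent", 0755);
        cg_write("/sys/fs/cgroup/parent/cpuset.cpus", "0-1");
        cg_write("/sys/fs/cgroup/parent/cgroup.subtree_control", "+cpu");

        mkdir("/sys/fs/cgroup/parent/lw", 0755);
        cg_write("/sys/fs/cgroup/parent/lw/cpu.weight", "1");

        mkdir("/sys/fs/cgroup/parent/hw", 0755);
        cg_write("/sys/fs/cgroup/parent/hw/cpu.weight", "10000");

        /* The burn processes are then started and their PIDs written
         * to parent/{lw,hw}/cgroup.procs. */
        return 0;
}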


PROBLEM SCENARIO & ANALYSIS
-------------------------------------

The issue stems from the scheduler's wake-up path logic. Here is a typical sequence of events that leads to the unfairness.

1. The intermittent-0 process, previously running on CPU 0, finishes its burst and goes to sleep:

   CPU 0 rq: [ (idle) ]
   CPU 1 rq: [ continuous-1 (running) ]
   (Sleeping tasks: intermittent-0, intermittent-1)

2. intermittent-1 wakes up. Its previous CPU (CPU 1) is busy, so it is placed on the idle CPU 0 by `select_idle_sibling()`:

   CPU 0 rq: [ intermittent-1 (running) ]
   CPU 1 rq: [ continuous-1 (running) ]
   (Sleeping tasks: intermittent-0)

3. Finally, intermittent-0 wakes up. No CPU is idle, so it is placed back on its previous CPU's runqueue (CPU 0), where it has to wait behind intermittent-1:

   CPU 0 rq: [ intermittent-1 (running), intermittent-0 (waiting) ]
   CPU 1 rq: [ continuous-1 (running) ]

Now, both high-weight tasks are competing for CPU 0, while the low-weight task runs unopposed on CPU 1.

This unfair state can persist until periodic load balancing eventually migrates one of the tasks, but because the sleep/wake pattern repeats every few seconds, the wake-up placement decision is re-made constantly and ends up dominating the resulting CPU distribution.
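
To make the claim concrete, here is a toy model of the placement decision as I understand it. This is an illustration only, not the actual kernel code; `pick_wakeup_cpu()`, `cpu_idle[]`, and NR_CPUS are made up for the example:

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 2

static bool cpu_idle[NR_CPUS];

/* Toy model of select_task_rq_fair()/select_idle_sibling(): prefer an
 * idle CPU in the LLC, otherwise fall back to the task's previous CPU. */
static int pick_wakeup_cpu(int prev_cpu)
{
        if (cpu_idle[prev_cpu])
                return prev_cpu;        /* idle and cache-hot: take it */
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                if (cpu_idle[cpu])
                        return cpu;     /* any idle sibling wins */
        return prev_cpu;                /* nothing idle: stack on prev_cpu,
                                           regardless of who is queued there */
}

int main(void)
{
        /* Step 2: CPU 0 idle, CPU 1 busy; intermittent-1 (prev CPU 1)
         * wakes and lands on CPU 0. */
        cpu_idle[0] = true;
        printf("intermittent-1 -> CPU %d\n", pick_wakeup_cpu(1));

        /* Step 3: both CPUs busy; intermittent-0 (prev CPU 0) wakes
         * and stacks behind intermittent-1 on CPU 0. */
        cpu_idle[0] = false;
        printf("intermittent-0 -> CPU %d\n", pick_wakeup_cpu(0));
        return 0;
}

The point being: nothing in this path compares the waking task's cgroup weight against what is already running on the fallback CPU.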


OBSERVED IMPACT
---------------------

With the continuous-burn task present, the combined throughput (measured via loop iterations) of the two intermittent-burn tasks drops by ~32% compared to running them alone.

Given the 10000:1 weight ratio, I would expect the lw group to receive well under 1% of any contended CPU time, leaving the hw throughput essentially unchanged. Instead, the low-weight task receives a disproportionate share of CPU time, contrary to the cpu.weight configuration.


QUESTIONS
-------------

I understand that the fair class's wake-up placement (`select_task_rq_fair()` / `select_idle_sibling()`, which sits in front of EEVDF's pick logic) favors idle CPUs to minimize wake-up latency, which makes sense in general.

However, in this mixed-workload scenario that logic seems to override cgroup fairness expectations: wake-up placement leads to the high-weight tasks dog-piling on one CPU, leaving the low-weight task uncontended on another.

- Is this considered a known issue or an expected trade-off under the EEVDF design?
- Are there any existing tunables (e.g. sched_features or sysctls) to adjust wake-up placement behavior or increase weight enforcement in such scenarios?


Thank you for your help!

(Note: Using RT scheduling isn’t viable in the real-world version of this workload, so I’m specifically interested in fairness within CFS/EEVDF.)