Re: [RESEND][RFC] sched: Introduce removed.load_sum for precise load propagation

From: hupu

Date: Fri Oct 10 2025 - 07:37:45 EST


Hi Pierre Gondois,
Thank you very much for your reply, and I’m sorry for the late
response; I was away on vacation recently.

On Tue, Sep 30, 2025 at 3:46 PM Pierre Gondois <pierre.gondois@xxxxxxx> wrote:
>
> Hello Hupu,
>
> On 9/10/25 10:43, hupu wrote:
> > Currently, load_sum to be propagated is estimated from
> > (removed_runnable * divider) >> SCHED_CAPACITY_SHIFT, which relies on
> > runnable_avg as an approximation. This approach can introduce precision
> > loss due to the shift operation, and the error may become more visible
> > when small tasks frequently enter and leave the queue.
> >
> > This patch introduces removed.load_sum to directly accumulate
> > se->avg.load_sum when tasks dequeue, and uses it during load
> > propagation. By doing so:
> >
> > a) Avoid relying on runnable_avg-based approximation and obtain
> > higher precision in load_sum propagation;
> (runnable_sum == load_sum) is not exactly accurate anymore since:
> static inline long se_runnable(struct sched_entity *se)
> {
> if (se->sched_delayed)
> return false;
>
> return !!se->on_rq;
> }
>
> So obtaining load_[sum|avg] from the runnable_avg signal seems compromised.
>

I agree with your point that when a task is in the delayed state
(se->sched_delayed == 1), it still resides on the runqueue and
continues to contribute to load_sum, but no longer contributes to
runnable_sum.

Moreover, based on the mathematical relationship, it is also evident
that the two are not equal. As analyzed in my previous email, for a
given se:

runnable_sum = decay(history) + contrib(running + runnable) * 1024
load_sum = decay(history) + contrib(running + runnable)

Here, decay() represents the decayed contribution from history, and
contrib() represents the new contribution from the running/runnable
state. Due to the difference between these formulas, estimating
load_sum from runnable_avg is inherently inaccurate.


> It is possible to compute load_sum value without the runnable_signal, cf.
> 40f5aa4c5eae ("sched/pelt: Fix attach_entity_load_avg() corner case")
> https://lore.kernel.org/all/20220414090229.342-1-kuyo.chang@xxxxxxxxxxxx/T/#u
>
> I.e.:
> + se->avg.load_sum = se->avg.load_avg * divider;
> + if (se_weight(se) < se->avg.load_sum)
> + se->avg.load_sum = div_u64(se->avg.load_sum, se_weight(se));
> + else
> + se->avg.load_sum = 1;
>
> As a side note, as a counterpart of the above patch, the lower the niceness,
> the lower the weight (in sched_prio_to_weight[]) and the lower the task
> load signal.
> This means that the unweighted load_sum value looses granularity.
> E.g.:
> A task with weight=15 can have load_avg values in [0:15]. So all the values
> for load_sum in the range [X * (47742/15) : (X + 1) * (47742/15)]
> are floored to load_avg=X, but load_sum is not reset when computing
> load_avg.
> attach_entity_load_avg() however resets load_sum to X * (47742/15).
>

From a mathematical perspective, deriving load_sum from load_avg is
indeed feasible.

However, as you pointed out, integer arithmetic may introduce
significant quantization errors, particularly for tasks with low
weights.

For instance, if a task’s weight is 15, load_sum values of 3183 and
6364 both produce load_avg = 1 under this method, so reconstructing
load_sum from load_avg can be off by as much as 6364 - 3183 = 3181.
This error grows as the task’s weight decreases.

Therefore, I believe that recomputing the propagated load_sum from
load_avg within update_cfs_rq_load_avg() is not an ideal approach.
Instead, my proposal is to record the load_sum of dequeued tasks
directly in cfs_rq->removed, rather than inferring it indirectly from
other signals such as runnable_sum or load_avg.

> > b) Eliminate precision loss from the shift operation, especially
> > with frequent short-lived tasks;
>
> It might also help aggregating multiple tasks. Right now, if 1.000 tasks
> with load_sum=1 and load_avg=0 are removed, the rq's load signal will
> not be impacted at all.
>

Exactly. As I mentioned in a previous email, this is also one of the
key motivations behind this patch.

> On the other side, there might also be questions about the PELT windows
> of all these tasks and the rq being aligned, cf.
> https://lore.kernel.org/all/20170512171336.148578996@xxxxxxxxxxxxx/
>

To be honest, I’m not entirely sure I fully understand the potential
issue you are referring to here. I assume your concern is that the
load_sum values stored in cfs_rq->removed may belong to a different
PELT window than the current rq window during propagation.

I understand the necessity of PELT window alignment.
In attach_entity_load_avg(), when an se is enqueued into a cfs_rq, its
PELT window is aligned with that of the cfs_rq, and its *_sum values
are recalculated to keep both in sync.

However, in this patch, the cfs_rq->removed mechanism only relates to
the dequeue path: it records a task's load_sum when the task leaves
the runqueue and later propagates it appropriately. Therefore, I
don’t think window alignment needs to be considered at dequeue time
(at least based on my current understanding), since this
synchronization has already been handled during enqueue.

That said, I may have missed some corner cases, so I’d really
appreciate further discussion on this point.

Thanks,
hupu