Re: [PATCH v1] sched/uclamp: Skip uclamp_rq_dec() for non-final dequeue of delayed tasks
From: Christian Loehle
Date: Thu Jul 03 2025 - 08:25:38 EST
On 7/2/25 01:53, Zihuan Zhang wrote:
> Hi Prateek,
>
> 在 2025/7/1 19:00, K Prateek Nayak 写道:
>> Hello Zihuan Zhang,
>>
>> On 7/1/2025 3:04 PM, Zihuan Zhang wrote:
>>> Currently, uclamp_rq_inc() skips updating the clamp aggregation for
>>> delayed tasks unless ENQUEUE_DELAYED is set, to ensure we only track the
>>> real enqueue of a task that was previously marked as sched_delayed.
>>>
>>> However, the corresponding uclamp_rq_dec() path only checks
>>> sched_delayed, and misses the DEQUEUE_DELAYED flag. As a result, we may
>>> skip dec for a delayed task that is now being truly dequeued, leading to
>>> uclamp aggregation mismatch.
>>>
>>> This patch makes uclamp_rq_dec() consistent with uclamp_rq_inc() by
>>> checking both sched_delayed and DEQUEUE_DELAYED, ensuring correct
>>> accounting symmetry.
>>>
>>> Fixes: 90ca9410dab2 ("sched/uclamp: Align uclamp and util_est and call before freq update")
>>> Signed-off-by: Zihuan Zhang <zhangzihuan@xxxxxxxxxx>
>>> ---
>>> kernel/sched/core.c | 9 +++++----
>>> 1 file changed, 5 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 8988d38d46a3..99f1542cff7d 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -1781,7 +1781,7 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p, int flags
>>> rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
>>> }
>>> -static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
>>> +static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p, int flags)
>>> {
>>> enum uclamp_id clamp_id;
>>> @@ -1797,7 +1797,8 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
>>> if (unlikely(!p->sched_class->uclamp_enabled))
>>> return;
>>> - if (p->se.sched_delayed)
>>> + /* Skip dec if this is a delayed task not being truly dequeued */
>>> + if (p->se.sched_delayed && !(flags & DEQUEUE_DELAYED))
>>> return;
>>
>> Consider the following case:
>>
>> - p is a fair task with uclamp constraints set.
>>
>>
>> - P blocks and dequeue_task() calls uclamp_rq_dec() and later
>> p->sched_class->dequeue_task() sets "p->se.sched_delayed" to 1.
>>
>> uclamp_rq_dec() is called for the first time here and has already
>> decremented the clamp_id from the hierarchy.
>>
>>
>> - Before P can be completely dequeued, P is moved to an RT class
>> with p->se.sched_delayed still set to 1 which invokes the following
>> call-chain:
>> __sched_setscheduler() {
>> dequeue_task(DEQUEUE_DELAYED) {
>> uclamp_rq_dec() {
>> if (p->se.sched_delayed && !(flags & DEQUEUE_DELAYED)) /* false */
>> return;
>>
>> /* !! Decrements clamp_id again !! */
>> }
>> /* Dequeues from fair class */
>> }
>> /* Enqueues onto the RT class */
>> }
>>
>>
>> From my reading, the current code is correct and the special handling in
>> uclamp_rq_inc() is required because enqueue_task() does a
>> uclamp_rq_inc() first before calling p->sched_class->enqueue_task()
>> which will clear "p->se.sched_delayed" if ENQUEUE_DELAYED is set.
>>
>> dequeue_task() already does a uclamp_rq_dec() before task is delayed and
>> any further dequeue of a delayed task should not decrement the
>> uclamp_id.
>>
>> Please let me know if I've missed something.
>>
Thank you Prateek for the detailed explanation, I've stresstested the entire
uclamp delayed dequeue path the last couple days and I also couldn't trigger
any imbalance FWIW.