Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model

From: Valentin Schneider
Date: Fri Aug 08 2025 - 07:45:27 EST


On 08/08/25 18:13, Aaron Lu wrote:
> On Fri, Aug 08, 2025 at 11:12:48AM +0200, Valentin Schneider wrote:
>> On 15/07/25 15:16, Aaron Lu wrote:
>> > From: Valentin Schneider <vschneid@xxxxxxxxxx>
>> >
>> > In the current throttle model, when a cfs_rq is throttled, its entity is
>> > dequeued from the cpu's rq, making tasks attached to it unable to run,
>> > thus achieving the throttle target.
>> >
>> > This has a drawback though: assume a task is a reader of percpu_rwsem
>> > and is waiting. When it gets woken, it cannot run until its task group's
>> > next period comes, which can be a relatively long time. The waiting
>> > writer has to wait longer because of this, and further readers build up
>> > behind it, eventually triggering a task hung warning.
>> >
>> > To improve this situation, change the throttle model to be task based,
>> > i.e. when a cfs_rq is throttled, record its throttled status but do not
>> > remove it from the cpu's rq. Instead, for tasks that belong to this
>> > cfs_rq, add a task work to them when they get picked, so that when they
>> > return to userspace, they are dequeued there. This way, throttled tasks
>> > do not hold any kernel resources. On unthrottle, enqueue those tasks
>> > back so they can continue to run.
>> >
>>
>> Moving the actual throttle work to pick time is clever. In my previous
>> versions I tried really hard to stay out of the enqueue/dequeue/pick paths,
>> but this makes the code a lot more palatable. I'd like to see how this
>> impacts performance though.
>>
>
> Let me run some scheduler benchmark to see how it impacts performance.
>
> I'm thinking of running something like hackbench on server platforms,
> first with no quota set to see if performance changes, then also with
> quota set to see how performance changes.
>
> Does this sound good to you? Or do you have any specific benchmark and
> test methodology in mind?
>

Yeah hackbench is pretty good for stressing the EQ/DQ paths.

>> > +	if (throttled_hierarchy(cfs_rq) &&
>> > +	    !task_current_donor(rq_of(cfs_rq), p)) {
>> > +		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
>> > +		return true;
>> > +	}
>> > +
>> > +	/* we can't take the fast path, do an actual enqueue */
>> > +	p->throttled = false;
>>
>> So we clear p->throttled but not p->throttle_node? Won't that cause issues
>> when @p's previous cfs_rq gets unthrottled?
>>
>
> p->throttle_node is already removed from its previous cfs_rq at dequeue
> time in dequeue_throttled_task().
>
> This is done because at enqueue time we may not hold its previous rq's
> lock and so cannot touch its previous cfs_rq's limbo list, e.g. when
> dealing with affinity changes.
>

Ah right, the DQ/EQ_throttled_task() functions are for when DQ/EQ is applied
to an already-throttled task, and they do the right thing.

Does this mean we want this as enqueue_throttled_task()'s prologue?

/* @p should have gone through dequeue_throttled_task() first */
WARN_ON_ONCE(!list_empty(&p->throttle_node));
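
That assertion would rely on dequeue_throttled_task() leaving the node in a
known-empty state. Roughly what I'm picturing - just a sketch, I'm assuming
list_del_init() rather than a plain list_del() here, I haven't re-checked
which helper v3 actually uses:

static void dequeue_throttled_task(struct task_struct *p, int flags)
{
	/*
	 * @p is being moved while throttled; take it off its old cfs_rq's
	 * limbo list here, because at enqueue time we may not hold the old
	 * rq's lock (e.g. affinity changes).
	 *
	 * list_del_init() leaves p->throttle_node empty, so the
	 * WARN_ON_ONCE() suggested above would hold at the next enqueue.
	 */
	list_del_init(&p->throttle_node);
}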

>> > @@ -7145,6 +7142,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>> >   */
>> >  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> >  {
>> > +	if (unlikely(task_is_throttled(p))) {
>> > +		dequeue_throttled_task(p, flags);
>> > +		return true;
>> > +	}
>> > +
>> > +
>>
>> Handling a throttled task's move pattern at dequeue does simplify things;
>> however, that's quite a hot path. I think this wants at the very least a
>>
>> if (cfs_bandwidth_used())
>>
>> since that has a static key underneath. Some heavy EQ/DQ benchmark wouldn't
>> hurt, but we can probably find some people who really care about that to
>> run it for us ;)
>>
>
> No problem.
>
> I'm thinking maybe I can add this cfs_bandwidth_used() check in
> task_is_throttled()? That way, other callsites of task_is_throttled()
> also benefit from the reduced cost when cfs bandwidth is not enabled.
>

Sounds good to me; just drop the unlikely and let the static key do its
thing :)
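
In case it helps to be concrete, this is the kind of shape I have in mind -
purely a sketch, assuming task_is_throttled() is (or becomes) a simple
wrapper around the p->throttled flag:

static inline bool task_is_throttled(struct task_struct *p)
{
	/* cfs_bandwidth_used() has a static key underneath, so this is
	 * essentially free when bandwidth is not in use */
	if (!cfs_bandwidth_used())
		return false;

	return p->throttled;
}

and the dequeue path then stays a plain

	if (task_is_throttled(p)) {
		dequeue_throttled_task(p, flags);
		return true;
	}

with the static key taking care of the common no-bandwidth case.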