Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model

From: Valentin Schneider
Date: Fri Aug 08 2025 - 05:13:08 EST


On 15/07/25 15:16, Aaron Lu wrote:
> From: Valentin Schneider <vschneid@xxxxxxxxxx>
>
> In the current throttle model, when a cfs_rq is throttled, its entity is
> dequeued from the cpu's rq, making tasks attached to it unable to run,
> thus achieving the throttle target.
>
> This has a drawback though: assume a task is a reader of percpu_rwsem
> and is waiting. When it gets woken, it cannot run until its task group's
> next period comes, which can be a relatively long time. The waiting
> writer will have to wait longer because of this, and it also makes
> further readers build up and eventually triggers hung task warnings.
>
> To improve this situation, change the throttle model to a task based one,
> i.e. when a cfs_rq is throttled, record its throttled status but do not
> remove it from the cpu's rq. Instead, for tasks that belong to this
> cfs_rq, add a task work to them when they get picked so that they get
> dequeued when they return to user. This way, throttled tasks will not
> hold any kernel resources. On unthrottle, enqueue those tasks back so
> they can continue to run.
>

Moving the actual throttle work to pick time is clever. In my previous
versions I tried really hard to stay out of the enqueue/dequeue/pick paths,
but this makes the code a lot more palatable. I'd like to see how this
impacts performance though.
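
Just to check my understanding of the pick-time tagging, it boils down to
something like the below (the helper and the task_struct field name are my
shorthand for this sketch, not necessarily what the series uses):

	/* Sketch only: called when a task of a throttled cfs_rq gets picked. */
	static void tag_throttled_task(struct task_struct *p)
	{
		/*
		 * Queue a task_work that runs on return to userspace and does
		 * the actual dequeue there (throttle_cfs_rq_work() in this
		 * series), so the task isn't parked while it still holds
		 * kernel resources. The field name is a guess on my side.
		 */
		task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
	}

i.e. a task only actually gets throttled once it is back at the userspace
boundary and done with whatever kernel resources it was holding, which is
what makes the percpu_rwsem scenario from the changelog go away.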

> A throttled cfs_rq's PELT clock is handled differently now: previously,
> the cfs_rq's PELT clock was stopped once it entered the throttled state,
> but since tasks (in kernel mode) can now continue to run, change the
> behaviour to stop the PELT clock only when the throttled cfs_rq has no
> tasks left.
>

I haven't spent much time looking at the PELT stuff yet; I'll do that next
week. Thanks for all the work you've been putting into this, and sorry it
took me this long to get a proper look at it!

> Tested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> Suggested-by: Chengming Zhou <chengming.zhou@xxxxxxxxx> # tag on pick
> Signed-off-by: Valentin Schneider <vschneid@xxxxxxxxxx>
> Signed-off-by: Aaron Lu <ziqianlu@xxxxxxxxxxxxx>

> +static bool enqueue_throttled_task(struct task_struct *p)
> +{
> + struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
> +
> + /*
> + * If the throttled task is enqueued to a throttled cfs_rq,
> + * take the fast path by directly put the task on target
> + * cfs_rq's limbo list, except when p is current because
> + * the following race can cause p's group_node left in rq's
> + * cfs_tasks list when it's throttled:
> + *
> + * cpuX cpuY
> + * taskA ret2user
> + * throttle_cfs_rq_work() sched_move_task(taskA)
> + * task_rq_lock acquired
> + * dequeue_task_fair(taskA)
> + * task_rq_lock released
> + * task_rq_lock acquired
> + * task_current_donor(taskA) == true
> + * task_on_rq_queued(taskA) == true
> + * dequeue_task(taskA)
> + * put_prev_task(taskA)
> + * sched_change_group()
> + * enqueue_task(taskA) -> taskA's new cfs_rq
> + * is throttled, go
> + * fast path and skip
> + * actual enqueue
> + * set_next_task(taskA)
> + * __set_next_task_fair(taskA)
> + * list_move(&se->group_node, &rq->cfs_tasks); // bug
> + * schedule()
> + *
> + * And in the above race case, the task's current cfs_rq is in the same
> + * rq as its previous cfs_rq because sched_move_task() doesn't migrate
> + * task so we can use its current cfs_rq to derive rq and test if the
> + * task is current.
> + */

OK I think I got this; I slightly rephrased things to match similar
comments in the sched code:

--->8---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3a7c86c5b8a2b..8566ee0399527 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5827,37 +5827,38 @@ static bool enqueue_throttled_task(struct task_struct *p)
struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);

/*
- * If the throttled task is enqueued to a throttled cfs_rq,
- * take the fast path by directly put the task on target
- * cfs_rq's limbo list, except when p is current because
- * the following race can cause p's group_node left in rq's
- * cfs_tasks list when it's throttled:
+ * If the throttled task @p is enqueued to a throttled cfs_rq,
+ * take the fast path by directly putting the task on the
+ * target cfs_rq's limbo list.
+ *
+ * Do not do that when @p is current because the following race can
+ * cause @p's group_node to be incorrectly re-inserted in its rq's
+ * cfs_tasks list, despite being throttled:
*
* cpuX cpuY
- * taskA ret2user
- * throttle_cfs_rq_work() sched_move_task(taskA)
- * task_rq_lock acquired
- * dequeue_task_fair(taskA)
- * task_rq_lock released
- * task_rq_lock acquired
- * task_current_donor(taskA) == true
- * task_on_rq_queued(taskA) == true
- * dequeue_task(taskA)
- * put_prev_task(taskA)
- * sched_change_group()
- * enqueue_task(taskA) -> taskA's new cfs_rq
- * is throttled, go
- * fast path and skip
- * actual enqueue
- * set_next_task(taskA)
- * __set_next_task_fair(taskA)
- * list_move(&se->group_node, &rq->cfs_tasks); // bug
+ * p ret2user
+ * throttle_cfs_rq_work() sched_move_task(p)
+ * LOCK task_rq_lock
+ * dequeue_task_fair(p)
+ * UNLOCK task_rq_lock
+ * LOCK task_rq_lock
+ * task_current_donor(p) == true
+ * task_on_rq_queued(p) == true
+ * dequeue_task(p)
+ * put_prev_task(p)
+ * sched_change_group()
+ * enqueue_task(p) -> p's new cfs_rq
+ * is throttled, go
+ * fast path and skip
+ * actual enqueue
+ * set_next_task(p)
+ * list_move(&se->group_node, &rq->cfs_tasks); // bug
* schedule()
*
- * And in the above race case, the task's current cfs_rq is in the same
- * rq as its previous cfs_rq because sched_move_task() doesn't migrate
- * task so we can use its current cfs_rq to derive rq and test if the
- * task is current.
+ * In the above race case, @p's current cfs_rq is in the same
+ * rq as its previous cfs_rq because sched_move_task() only moves a task
+ * to a different group within the same rq, so we can use its current
+ * cfs_rq to derive the rq and test if the task is current.
*/
if (throttled_hierarchy(cfs_rq) &&
!task_current_donor(rq_of(cfs_rq), p)) {
---8<---

> + if (throttled_hierarchy(cfs_rq) &&
> + !task_current_donor(rq_of(cfs_rq), p)) {
> + list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> + return true;
> + }
> +
> + /* we can't take the fast path, do an actual enqueue */
> + p->throttled = false;

So we clear p->throttled but not p->throttle_node? Won't that cause issues
when @p's previous cfs_rq gets unthrottled?
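
If the node can still be sitting on the previous cfs_rq's limbo list at
this point, I would have expected something like the below (untested
sketch, assuming throttle_node stays initialised while off-list):

	/* we can't take the fast path, do an actual enqueue */
	p->throttled = false;
	/*
	 * Detach from the previous cfs_rq's limbo list so a later
	 * unthrottle of that cfs_rq can't find @p anymore.
	 */
	list_del_init(&p->throttle_node);
	return false;

Or am I missing something that guarantees the node is already off the
list by the time we get here?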

> + return false;
> +}
> +

> @@ -7145,6 +7142,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> */
> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> {
> + if (unlikely(task_is_throttled(p))) {
> + dequeue_throttled_task(p, flags);
> + return true;
> + }
> +

Handling a throttled task's move pattern at dequeue does simplify things;
however, that's quite a hot path. I think this wants at the very least a

if (cfs_bandwidth_used())

since that has a static key underneath. Some heavy EQ/DQ benchmark wouldn't
hurt, but we can probably find some people who really care about that to
run it for us ;)
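
Concretely, I was thinking of something along these lines (sketch only, on
top of the hunk above):

	if (cfs_bandwidth_used() && unlikely(task_is_throttled(p))) {
		dequeue_throttled_task(p, flags);
		return true;
	}

With the static key false, that should be pretty much free for setups that
don't use CFS bandwidth at all.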

> if (!p->se.sched_delayed)
> util_est_dequeue(&rq->cfs, p);
>