[PATCH] sched/core: Hotplug fixes to pick_next_task()

From: Joel Fernandes (Google)
Date: Tue Sep 01 2020 - 00:56:36 EST


The following 3 cases need to be handled to avoid crashes in pick_next_task() when
CPUs in a core are going offline or coming online.

1. The stopper task is switching into idle when it is brought down by CPU
hotplug. Its CPU is not in the cpu_smt_mask, so nothing needs to be selected for
it. However, the current code ends up not selecting anything for it, not even
idle, which causes crashes in set_next_task(). Just do the __pick_next_task()
selection, which will select the idle task. There is no need to do core-wide
selection, as other siblings will handle it for themselves when they call
schedule().

2. The rq->core_pick for a sibling in a core can be NULL if no selection was
made for it because it was either offline or went offline during a sibling's
core-wide selection. In this case, do a core-wide selection and completely
ignore the checks:
    if (rq->core->core_pick_seq == rq->core->core_task_seq &&
        rq->core->core_pick_seq != rq->core_sched_seq)

Otherwise, it would again end up crashing like #1.

3. The 'Rescheduling siblings' loop of pick_next_task() is quite fragile. It
calls various functions on rq->core_pick, which could very well be NULL because
an online sibling might have gone offline before a task could be picked for it,
or it might be offline but later happen to come online, by which point it is too
late and nothing was picked for it. Just ignore the siblings for which nothing
could be picked. This avoids crashes in the parts of this loop that assume
rq->core_pick is not NULL.
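
For readers who want to reason about the three cases outside the kernel, here
is a minimal user-space sketch of the decision logic described above. It is
not the kernel code: struct cpu_rq, struct core_state, enum pick_path,
choose_path() and apply_sibling_picks() are hypothetical names invented for
this illustration, and the "online" flag merely stands in for the CPU online
check. It assumes, as the description states, that a cached pick is only
usable when rq->core_pick is not NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the scheduler structures; not kernel types. */
struct task {
	const char *name;
};

struct cpu_rq {
	bool online;                 /* models whether this CPU is online */
	struct task *core_pick;      /* cached core-wide pick, may be NULL */
	unsigned int core_sched_seq; /* last pick sequence this CPU scheduled */
};

struct core_state {
	unsigned int core_pick_seq;  /* sequence of the last core-wide pick */
	unsigned int core_task_seq;  /* bumped on every enqueue/dequeue */
};

enum pick_path {
	PATH_LOCAL_ONLY,  /* case 1: plain per-CPU pick, selects idle */
	PATH_CACHED_PICK, /* reuse the pick a sibling already made for us */
	PATH_CORE_WIDE    /* case 2 and the default: redo the core-wide pick */
};

/* Decide which path a CPU takes when it enters the pick. */
enum pick_path choose_path(const struct cpu_rq *rq,
			   const struct core_state *core)
{
	/* Case 1: an offline CPU skips core-wide selection entirely. */
	if (!rq->online)
		return PATH_LOCAL_ONLY;

	/* Case 2: the cached pick is usable only if one was recorded. */
	if (core->core_pick_seq == core->core_task_seq &&
	    core->core_pick_seq != rq->core_sched_seq &&
	    rq->core_pick != NULL)
		return PATH_CACHED_PICK;

	return PATH_CORE_WIDE;
}

/*
 * Case 3: walk the siblings and apply their picks, skipping any sibling
 * for which nothing was picked. Returns how many picks were applied.
 */
int apply_sibling_picks(const struct cpu_rq *siblings, size_t n,
			struct task **curr_out)
{
	int applied = 0;

	assert(siblings != NULL && curr_out != NULL);

	for (size_t i = 0; i < n; i++) {
		if (!siblings[i].core_pick)
			continue; /* it will pick for itself later */
		curr_out[i] = siblings[i].core_pick;
		applied++;
	}
	return applied;
}
```

In this model a CPU whose sequence numbers match but whose core_pick is NULL
falls through to PATH_CORE_WIDE rather than dereferencing a NULL pick, which
is the behaviour case 2 asks for.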

Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
---
kernel/sched/core.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 717122a3dca1..4966e9f14f39 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4610,13 +4610,24 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
if (!sched_core_enabled(rq))
return __pick_next_task(rq, prev, rf);

+ cpu = cpu_of(rq);
+
+ /* Stopper task is switching into idle, no need for core-wide selection. */
+ if (cpu_is_offline(cpu))
+ return __pick_next_task(rq, prev, rf);
+
/*
* If there were no {en,de}queues since we picked (IOW, the task
* pointers are all still valid), and we haven't scheduled the last
* pick yet, do so now.
+ *
+ * rq->core_pick can be NULL if no selection was made for a CPU because
+ * it was either offline or went offline during a sibling's core-wide
+ * selection. In this case, do a core-wide selection.
*/
if (rq->core->core_pick_seq == rq->core->core_task_seq &&
- rq->core->core_pick_seq != rq->core_sched_seq) {
+ rq->core->core_pick_seq != rq->core_sched_seq &&
+ rq->core_pick) {
WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);

next = rq->core_pick;
@@ -4629,7 +4640,6 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

put_prev_task_balance(rq, prev, rf);

- cpu = cpu_of(rq);
smt_mask = cpu_smt_mask(cpu);

/*
@@ -4761,7 +4771,15 @@ next_class:;
for_each_cpu(i, smt_mask) {
struct rq *rq_i = cpu_rq(i);

- WARN_ON_ONCE(!rq_i->core_pick);
+ /*
+ * An online sibling might have gone offline before a task
+ * could be picked for it, or it might be offline but later
+ * happen to come online, but it's too late and nothing was
+ * picked for it. That's OK - it will pick tasks for itself,
+ * so ignore it.
+ */
+ if (!rq_i->core_pick)
+ continue;

if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
rq_i->core_forceidle = true;
--
2.28.0.402.g5ffc5be6b7-goog