Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

From: Vineeth Pillai
Date: Tue Sep 01 2020 - 17:23:09 EST


Hi Joel,

On 9/1/20 1:30 PM, Joel Fernandes wrote:
>> I think we can come here when hotplug thread is scheduled during online, but
>> mask is not yet updated. Probably can add it with this comment as well.

> I don't see how that is possible. Because the cpuhp threads run during the
> CPU onlining process, the boot thread for the CPU coming online would have
> already updated the mask.
Sorry, my mistake. I got confused with the online state ordering.

>> Another unrelated, but related note :-)
>> Besides this, I think we need to retain one more change from the previous
>> patch. We would need to make core_pick_seq per sibling instead of per
>> core. Having it per core might lead to unfairness. For example: when a cpu
>> sees that its sibling's core_pick is the one which is already running, it
>> will not send an IPI, but core_pick remains set and core->core_pick_seq is
>> incremented. Now if the sibling is preempted due to a high priority task
> Then don't keep the core_pick set then. If you don't send it an IPI and
> core_pick is already running, then NULL it already. I don't know why we add
> more corner cases by making assumptions. We have enough open issues that
> are not hotplug related. Here's my suggestion:
>
> 1. Keep the ideas consistent, forget about the exact code currently written
> and just understand that pick_seq is for siblings knowing that something was
> picked for the whole core. So if their pick_seq != sched_seq, then they have
> to pick what was selected.
I was trying to keep the ideas consistent. The requirement of core_pick_seq
was to let the scheduled cpu know that a pick has been made, and the initial
idea was to have the counter core wide. But I found this gap: the pick is not
always core wide, and assuming it to be core wide can cause fairness issues.
So I was proposing the idea of changing it from core wide to per sibling. In
other words, I was trying to make sure that core_pick_seq, along with
task_seq and sched_seq, serves its purpose of letting a sibling know that a
new task pick has been made for it. I cannot think of a reason why
core_pick_seq should be core wide. I might be missing something.
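
To make the idea concrete, here is a rough sketch of what I mean (purely
illustrative, not the actual patch; the names mirror the v7 code, but the
per-rq counter and the simplified fast path below are hypothetical):

struct rq {
	/* ... */
	struct task_struct	*core_pick;	/* task picked for this rq */
	unsigned int		core_pick_seq;	/* per-rq instead of per-core:
						 * bumped only when a pick is
						 * actually made for this
						 * sibling */
	unsigned int		core_sched_seq;	/* last pick this rq acted on */
	/* ... */
};

/* Fragment of the fast path in pick_next_task(): */
	if (rq->core_pick_seq != rq->core_sched_seq && rq->core_pick) {
		/*
		 * A pick was really published for this sibling, so honor
		 * it. With a per-core counter this test can be true even
		 * when core_pick is just the stale, already-running task,
		 * which is how the fairness issue I described happens.
		 */
		rq->core_sched_seq = rq->core_pick_seq;
		return rq->core_pick;
	}
	/* ... otherwise fall through to the full core-wide selection ... */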

> 2. If core_pick should be NULL, then NULL it in some path. If you keep some
> core_pick and you increment pick_seq, then you are automatically asking the
> sibling to pick that task up the next time it enters schedule(). See if [1]
> will work?
>
> Note that we have added logic in this patch that does a full selection if
> rq->core_pick == NULL.

I agree, setting rq->core_pick = NULL is another way to solve this issue, but
I still feel it is semantically incorrect to treat a pick as core wide when it
may actually apply to only a subset of the siblings in the core. If there is a
valid reason for keeping core_pick_seq core wide, I completely agree with the
fix of resetting core_pick.
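
For reference, I believe the reset you are suggesting would look roughly like
this in the loop that publishes the pick to the siblings (again only a
sketch; the names follow the v7 patch, the exact placement is illustrative):

	for_each_cpu(i, smt_mask) {
		struct rq *rq_i = cpu_rq(i);

		if (i == cpu || rq_i->curr == rq_i->core_pick) {
			/*
			 * No IPI needed: either this is the cpu doing the
			 * selection, or the sibling is already running the
			 * picked task. Clear core_pick so that a later
			 * schedule() on that sibling does a full selection
			 * instead of trusting a stale pointer.
			 */
			rq_i->core_pick = NULL;
			continue;
		}

		resched_curr(rq_i);
	}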

>> or its time slice expired, it enters schedule. But it goes to the fast path
>> and selects the running task, thereby starving the high priority task.
>> Having core_pick_seq per sibling will avoid this. It might also help in
>> some hotplug corner cases as well.
> That can be a separate patch IMHO. It has nothing to do with
> stability/crashing of concurrent and rather infrequent CPU hotplug
> operations.
Agreed. Sorry for the confusion; my intention was not to have that logic in
this patch.

> Also, Peter said pick_seq is for core-wide picking. If you want to add
> another semantic, then maybe add another counter which has a separate
> meaning and justify why you are adding it.
I think just one counter is enough. Unless there is a need to keep a counter
tracking the core-wide pick as well, I feel it is worth changing the design
so that the counter serves its purpose. I will think through this and send it
as a separate patch if needed.

Thanks,
Vineeth