Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

From: Rik van Riel
Date: Mon May 06 2024 - 14:58:25 EST

Next message: Andy Shevchenko: "Re: [PATCH 03/13] mfd: pm8008: deassert reset on probe"
Previous message: Christophe JAILLET: "Re: [PATCH v2] perf/ring_buffer: Prefer struct_size over open coded arithmetic"
In reply to: Tejun Heo: "Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class"
Next in thread: Tejun Heo: "Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, 2024-05-03 at 10:52 +0200, Peter Zijlstra wrote:
> On Thu, May 02, 2024 at 09:20:15AM -1000, Tejun Heo wrote:
> > Hello, Peter.
> >
> > On Thu, May 02, 2024 at 10:48:00AM +0200, Peter Zijlstra wrote:
> > > Can you please put your efforts and the touted Google
> > > collaboration in
> > > fixing the existing cgroup mess?
> >
> > I suppose you're referring to Rik's flattened hierarchy patchset.
> >
> >
> > https://lore.kernel.org/all/20190822021740.15554-1-riel@xxxxxxxxxxx
> >
>
> You guys Google/Facebook got us the cgroup thing, Google did a lot of
> the work for cpu-cgroup, and now you Facebook say you can't live with
> it
> because it's too expensive. Yes Rik did put a lot of effort into it,
> but
> Google shot it down. What am I to do?

I believe the issues that Paul pointed out with my
flattened cgroup code are fixable. I ended up not
getting back to this code because it took me a few
months to think of ways to fix the issues Paul found,
and by then I had moved on to other projects.

For reference, Paul found these two (very real) issues
with my implementation.

1) Thundering herd problem. If many tasks in a low
priority cgroup wake up at the same time, they can
end up swamping a CPU.

I believe this can be solved with the same idea
I had for reimplementing CONFIG_CFS_BANDWIDTH.
Specifically, the code that determines the time
slice length for a task already has a way to
determine whether a CPU is "overloaded", and
time slices need to be shortened. Once we reach
that situation, we can place woken up tasks on
a secondary heap of per-cgroup runqueues, from
which we do not directly run tasks, but pick
the lowest vruntime task from the lowest vruntime
cgroup and put that on the main runqueue, if
the previously running task has a vruntime that
is higher than that of a task in the secondary
group. If a task is woken up in a cgroup that
already has tasks on that secondary queue, we
wake up the task onto that secondary queue.

This means on overloaded CPUs, we move back to
a task selection mechanism closer to what we
currently have, while in the non-overloaded
situation we use a flat runqueue.

This same scheme could be used to implement
CFS bandwidth control. A task belonging to a
throttled group would be placed on the group's
queue, not the CPU's flat runqueue.

2) The vruntime for a task can be advanced by way
to much at once. If we have tasks A & B running,
and task B has a priority that is 1/100th of that
of task A, its vruntime would be advanced 100x
as much as task A, when running the same length
time slice.

This creates a big issue if we get a wakeup of
task C, at the same priority as task B, and then
task A goes to sleep. Due to the very far advanced
runtime of task B, task C could get to monopolize
the CPU for a considerable amount of time, and
task B could get starved.

A potential fix for this is to never account more
than the maximum time slice length at a time, while
any excess delta_exec time for the task gets remembered.

At pick_next_entity time, the scheduler can see that
task B has a lot of delta_exec time left, and account
up to the maximum slice length to the task's vruntime,
and place it back in the queue if the next task now has
a lower vruntime.

For a steady state of a high priority task A and a low
priority task B, this makes pick_next_task more expensive,
but when task A disappears and task C appears, CPU time
will continue to be fair between them.

Limiting the total weight of tasks on the flat runqueue,
using the mechanism for thundering herd and CFS bandwidth
outlined above, should keep this overhead bounded to
something reasonable.

Does the above sound like it would work?

Does it sound like code that you would be ok with merging?

Is it a large enough improvement over the current hierarchical
runqueue that it would be worth doing?

This would be a fairly large project, so we should probably discuss
some of the details before investing too much time in it.

--
All Rights Reversed.

Next message: Andy Shevchenko: "Re: [PATCH 03/13] mfd: pm8008: deassert reset on probe"
Previous message: Christophe JAILLET: "Re: [PATCH v2] perf/ring_buffer: Prefer struct_size over open coded arithmetic"
In reply to: Tejun Heo: "Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class"
Next in thread: Tejun Heo: "Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]