Re: [RFC v3 00/12] DRM scheduling cgroup controller

From: Tvrtko Ursulin
Date: Fri Jan 27 2023 - 06:43:48 EST



On 27/01/2023 10:04, Michal Koutný wrote:
On Thu, Jan 26, 2023 at 05:57:24PM +0000, Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> wrote:
So even if the RFC shows just a simple i915 implementation, the controller
itself shouldn't prevent a smarter approach (via exposed ABI).

scan/query + over budget notification is IMO limited in guarantees.

It is, yes. I tried to stress that it is not a hard guarantee in any shape or form, and that the "quality" of adherence to the allocated budget will depend on individual hw and sw capabilities.

But I believe it is the best approach given a) how much GPU drivers differ in scheduling capability and b) the very fact that, unlike on the CPU side of things, there isn't a central scheduling entity.

It is just not possible to build a hard guarantee system. GPUs do not preempt as nicely and easily as CPUs do. And the frequency of context switches varies widely, from too fast to "never", so again, charging would have several problems to overcome, which would make the whole setup IMHO pointless.

And it is not only that some GPUs do not preempt nicely; some can't do any of this, period. Even if we stay within the lineage of hardware supported by i915 alone, we have three distinct categories: 1) can't do any of this, 2a) can do fine grained priority based scheduling with reasonable preemption capability, 2b) ditto but without reasonable preemption capability, and 3) like 2a) and 2b) but with the scheduler in the firmware and currently supporting coarse priority based scheduling.

Shall I also mention that a single cgroup can contain multiple GPU clients, all using different GPUs with a different mix of the above listed challenges?

The main point is, should someone prove me wrong and come up with a smarter way at some point in the future, then "drm.weight" as an ABI remains compatible and the improvement can happen completely under the hood. In the meantime users get external control, and _some_ ability to improve the user experience in scenarios such as the ones I described yesterday.
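To make the "weight in, budget out" idea concrete, here is a minimal sketch of how a scan period could be split between sibling groups and how over-budget groups could be picked out. It assumes plain proportional sharing, in the spirit of cpu.weight; the function names, the microsecond unit and the group names are made up for illustration and are not taken from the RFC.

```python
def split_budget(weights, period_us):
    """Split one scan period of GPU time between sibling groups by weight.

    Assumption: simple proportional shares, similar in concept to cpu.weight.
    """
    total = sum(weights.values())
    if total == 0:
        return {name: 0.0 for name in weights}
    return {name: period_us * w / total for name, w in weights.items()}


def over_budget(usage_us, budgets_us):
    """Return the groups whose observed usage exceeded their budget."""
    return sorted(name for name, used in usage_us.items()
                  if used > budgets_us.get(name, 0.0))


# Hypothetical example: two sibling groups sharing a 1000 us period.
budgets = split_budget({"desktop": 300, "batch": 100}, 1000)
offenders = over_budget({"desktop": 200, "batch": 600}, budgets)
```

Nothing here enforces anything. It only identifies who is over budget, and what a driver does with that notification is left to its own capabilities.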

[...]
Yes agreed, and to stress it again, the ABI as proposed does not preclude
changing from scanning to charging or whatever. The idea was for it to be
compatible in concept with the CPU controller and also avoid baking in the
controlling method to individual drivers.
[...]

But I submit to your point of rather not exposing this via cgroup API
for possible future refinements.

Ack.

Secondly, doing this in userspace would require the ability to get some sort
of an atomic snapshot of the whole tree hierarchy to account for changes in
layout of the tree and task migrations. Or some retry logic with some added
ABI fields to enable it.

Note that the proposed implementation is susceptible to miscounting due to
concurrent tree modifications and task migrations too (scanning may not
converge under frequent cgroup layout modifications, and migrating tasks
may be summed 0 or >1 times). While in-kernel implementation may assure
the snapshot view, it'd come at a cost. (Read: since the mechanism isn't
precise anyway, I don't suggest a fully synchronized scanning.)
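For illustration only, the "summed 0 or >1 times" hazard of a non-atomic walk can be modelled in a few lines. This is a toy model under assumed names (`walk`, groups "A" and "B", integer task ids); it is not the code from the RFC and says nothing about how the RFC deals with migration.

```python
def walk(groups, order, migrate_after=None, move=None):
    """Visit groups one at a time, counting how often each task is seen.

    groups: dict of group name -> set of task ids (mutated by the migration)
    order: the order in which the walk visits the groups
    migrate_after: name of the group after whose visit the migration happens
    move: (task, src, dst) tuple describing the migration
    """
    seen = {}
    for name in order:
        for task in sorted(groups[name]):
            seen[task] = seen.get(task, 0) + 1
        if move is not None and name == migrate_after:
            task, src, dst = move
            groups[src].discard(task)
            groups[dst].add(task)
    return seen


# Task 1 moves from the already visited A into the not yet visited B:
# it is counted twice.
twice = walk({"A": {1}, "B": {2}}, ["A", "B"], "A", (1, "A", "B"))

# Task 2 moves from the not yet visited B into the already visited A:
# it is never counted.
missed = walk({"A": {1}, "B": {2}}, ["A", "B"], "A", (2, "B", "A"))
```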

It is true that scanning may not converge in my _current implementation_, for instance if clients are constantly coming and going. There I took a shortcut: I do not bother accumulating usage on process/client exit, and I just wait for two stable periods before looking at the current state. I reckon this is possibly okay for the real world.
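A rough sketch of the "wait for two stable periods" shortcut, to show why constant client churn prevents convergence. What counts as "stable" is an assumption here (the set of client ids did not change between scans), and the class and method names are hypothetical.

```python
class StableScanner:
    """Report readiness only once the client set has been unchanged for N scans."""

    def __init__(self, required_stable_periods=2):
        self.required = required_stable_periods
        self.last_clients = None
        self.stable = 0

    def scan(self, clients):
        """Feed the client ids seen this period; return True once stable."""
        current = frozenset(clients)
        if current == self.last_clients:
            self.stable += 1
        else:
            # The client set changed, so start counting stable periods again.
            self.last_clients = current
            self.stable = 1
        return self.stable >= self.required


scanner = StableScanner()
results = [scanner.scan(c) for c in ({1, 2}, {1, 2}, {1, 3}, {1, 4})]
```

With a different client set in every period, `scan()` never returns True, which is the non-converging case described above.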

Cgroup tree hierarchy modifications can also be a reason for not converging, but I thought I could hand-wave that away as not a realistic scenario. Perhaps I am not imaginative enough?

I don't think under- or over-accounting of migrating tasks can happen, since I am explicitly handling that.

Regards,

Tvrtko