Re: [PATCH drm-misc-next v3] drm/sched: implement dynamic job-flow control

From: Danilo Krummrich
Date: Fri Oct 27 2023 - 10:24:18 EST


On 10/27/23 10:25, Boris Brezillon wrote:
Hi Danilo,

On Thu, 26 Oct 2023 18:13:00 +0200
Danilo Krummrich <dakr@xxxxxxxxxx> wrote:

Currently, job flow control is implemented simply by limiting the number
of jobs in flight. Therefore, a scheduler is initialized with a credit
limit that corresponds to the number of jobs which can be sent to the
hardware.

This implies that for each job, drivers need to account for the maximum
job size possible in order to not overflow the ring buffer.

However, there are drivers, such as Nouveau, where the job size has a
rather large range. For such drivers it can easily happen that job
submissions not even filling the ring by 1% can block subsequent
submissions, which, in the worst case, can lead to the ring run dry.

In order to overcome this issue, allow for tracking the actual job size
instead of the number of jobs. Therefore, add a field to track a job's
credit count, which represents the number of credits a job contributes
to the scheduler's credit limit.

Signed-off-by: Danilo Krummrich <dakr@xxxxxxxxxx>
---
Changes in V2:
==============
- fixed up influence on scheduling fairness due to consideration of a job's
size
- If we reach a ready entity in drm_sched_select_entity() but can't actually
queue a job from it due to size limitations, just give up and go to sleep
until woken up due to a pending job finishing, rather than continue to try
other entities.
- added a callback to dynamically update a job's credits (Boris)

This callback seems controversial. I'd suggest dropping it, so the
patch can be merged.

I don't think we should drop it just for the sake of moving forward. If there are objections
we'll discuss them. I've seen good reasons why the drivers you are working on require this,
while, following the discussion, I have *not* seen any reasons to drop it. It helps simplifying
driver code and doesn't introduce any complexity or overhead to existing drivers.


Regards,

Boris

- renamed 'units' to 'credits'
- fixed commit message and comments as requested by Luben

Changes in V3:
==============
- rebased onto V7 of the "DRM scheduler changes for Xe" series by Matt
- move up drm_sched_can_queue() instead of adding a forward declaration
(Boris)
- add a drm_sched_available_credits() helper (Boris)
- adjust control flow in drm_sched_rq_select_entity_fifo() to Luben's proposal
- re-phrase a few comments and fix a typo (Luben)
- change naming of all structures credit fields and function parameters to the
following scheme
- drm_sched_job::credits
- drm_gpu_scheduler::credit_limit
- drm_gpu_scheduler::credit_count
- drm_sched_init(..., u32 credit_limit, ...)
- drm_sched_job_init(..., u32 credits, ...)
- add proper documentation for the scheduler's job-flow control mechanism