Re: [PATCH v10 01/15] dma-buf/dma-fence: Add deadline awareness

From: Rob Clark
Date: Tue Mar 21 2023 - 10:35:01 EST


On Tue, Mar 21, 2023 at 6:24 AM Jonas Ådahl <jadahl@xxxxxxxxx> wrote:
>
> On Fri, Mar 17, 2023 at 08:59:48AM -0700, Rob Clark wrote:
> > On Fri, Mar 17, 2023 at 3:23 AM Jonas Ådahl <jadahl@xxxxxxxxx> wrote:
> > >
> > > On Thu, Mar 16, 2023 at 09:28:55AM -0700, Rob Clark wrote:
> > > > On Thu, Mar 16, 2023 at 2:26 AM Jonas Ådahl <jadahl@xxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Mar 15, 2023 at 09:19:49AM -0700, Rob Clark wrote:
> > > > > > On Wed, Mar 15, 2023 at 6:53 AM Jonas Ådahl <jadahl@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Fri, Mar 10, 2023 at 09:38:18AM -0800, Rob Clark wrote:
> > > > > > > > On Fri, Mar 10, 2023 at 7:45 AM Jonas Ådahl <jadahl@xxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Mar 08, 2023 at 07:52:52AM -0800, Rob Clark wrote:
> > > > > > > > > > From: Rob Clark <robdclark@xxxxxxxxxxxx>
> > > > > > > > > >
> > > > > > > > > > Add a way to hint to the fence signaler of an upcoming deadline, such as
> > > > > > > > > > vblank, which the fence waiter would prefer not to miss. This is to aid
> > > > > > > > > > the fence signaler in making power management decisions, like boosting
> > > > > > > > > > frequency as the deadline approaches and awareness of missing deadlines
> > > > > > > > > > so that can be factored in to the frequency scaling.
> > > > > > > > > >
> > > > > > > > > > v2: Drop dma_fence::deadline and related logic to filter duplicate
> > > > > > > > > > deadlines, to avoid increasing dma_fence size. The fence-context
> > > > > > > > > > implementation will need similar logic to track deadlines of all
> > > > > > > > > > the fences on the same timeline. [ckoenig]
> > > > > > > > > > v3: Clarify locking wrt. set_deadline callback
> > > > > > > > > > v4: Clarify in docs comment that this is a hint
> > > > > > > > > > v5: Drop DMA_FENCE_FLAG_HAS_DEADLINE_BIT.
> > > > > > > > > > v6: More docs
> > > > > > > > > > v7: Fix typo, clarify past deadlines
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Rob Clark <robdclark@xxxxxxxxxxxx>
> > > > > > > > > > Reviewed-by: Christian König <christian.koenig@xxxxxxx>
> > > > > > > > > > Acked-by: Pekka Paalanen <pekka.paalanen@xxxxxxxxxxxxx>
> > > > > > > > > > Reviewed-by: Bagas Sanjaya <bagasdotme@xxxxxxxxx>
> > > > > > > > > > ---
> > > > > > > > >
> > > > > > > > > Hi Rob!
> > > > > > > > >
> > > > > > > > > > Documentation/driver-api/dma-buf.rst | 6 +++
> > > > > > > > > > drivers/dma-buf/dma-fence.c | 59 ++++++++++++++++++++++++++++
> > > > > > > > > > include/linux/dma-fence.h | 22 +++++++++++
> > > > > > > > > > 3 files changed, 87 insertions(+)
> > > > > > > > > >
> > > > > > > > > > diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
> > > > > > > > > > index 622b8156d212..183e480d8cea 100644
> > > > > > > > > > --- a/Documentation/driver-api/dma-buf.rst
> > > > > > > > > > +++ b/Documentation/driver-api/dma-buf.rst
> > > > > > > > > > @@ -164,6 +164,12 @@ DMA Fence Signalling Annotations
> > > > > > > > > > .. kernel-doc:: drivers/dma-buf/dma-fence.c
> > > > > > > > > > :doc: fence signalling annotation
> > > > > > > > > >
> > > > > > > > > > +DMA Fence Deadline Hints
> > > > > > > > > > +~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > > > > > > +
> > > > > > > > > > +.. kernel-doc:: drivers/dma-buf/dma-fence.c
> > > > > > > > > > + :doc: deadline hints
> > > > > > > > > > +
> > > > > > > > > > DMA Fences Functions Reference
> > > > > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > > > > > > >
> > > > > > > > > > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > > > > > > > > > index 0de0482cd36e..f177c56269bb 100644
> > > > > > > > > > --- a/drivers/dma-buf/dma-fence.c
> > > > > > > > > > +++ b/drivers/dma-buf/dma-fence.c
> > > > > > > > > > @@ -912,6 +912,65 @@ dma_fence_wait_any_timeout(struct dma_fence **fences, uint32_t count,
> > > > > > > > > > }
> > > > > > > > > > EXPORT_SYMBOL(dma_fence_wait_any_timeout);
> > > > > > > > > >
> > > > > > > > > > +/**
> > > > > > > > > > + * DOC: deadline hints
> > > > > > > > > > + *
> > > > > > > > > > + * In an ideal world, it would be possible to pipeline a workload sufficiently
> > > > > > > > > > + * that a utilization based device frequency governor could arrive at a minimum
> > > > > > > > > > + * frequency that meets the requirements of the use-case, in order to minimize
> > > > > > > > > > + * power consumption. But in the real world there are many workloads which
> > > > > > > > > > + * defy this ideal. For example, but not limited to:
> > > > > > > > > > + *
> > > > > > > > > > + * * Workloads that ping-pong between device and CPU, with alternating periods
> > > > > > > > > > + * of CPU waiting for device, and device waiting on CPU. This can result in
> > > > > > > > > > + * devfreq and cpufreq seeing idle time in their respective domains and in
> > > > > > > > > > + * result reduce frequency.
> > > > > > > > > > + *
> > > > > > > > > > + * * Workloads that interact with a periodic time based deadline, such as double
> > > > > > > > > > + * buffered GPU rendering vs vblank sync'd page flipping. In this scenario,
> > > > > > > > > > + * missing a vblank deadline results in an *increase* in idle time on the GPU
> > > > > > > > > > + * (since it has to wait an additional vblank period), sending a signal to
> > > > > > > > > > + * the GPU's devfreq to reduce frequency, when in fact the opposite is what is
> > > > > > > > > > + * needed.
> > > > > > > > >
> > > > > > > > > This is the use case I'd like to get some better understanding about how
> > > > > > > > > this series intends to work, as the problematic scheduling behavior
> > > > > > > > > triggered by missed deadlines has plagued compositing display servers
> > > > > > > > > for a long time.
> > > > > > > > >
> > > > > > > > > I apologize, I'm not a GPU driver developer, nor an OpenGL driver
> > > > > > > > > developer, so I will need some hand holding when it comes to
> > > > > > > > > understanding exactly what piece of software is responsible for
> > > > > > > > > communicating what piece of information.
> > > > > > > > >
> > > > > > > > > > + *
> > > > > > > > > > + * To this end, deadline hint(s) can be set on a &dma_fence via &dma_fence_set_deadline.
> > > > > > > > > > + * The deadline hint provides a way for the waiting driver, or userspace, to
> > > > > > > > > > + * convey an appropriate sense of urgency to the signaling driver.
> > > > > > > > > > + *
> > > > > > > > > > + * A deadline hint is given in absolute ktime (CLOCK_MONOTONIC for userspace
> > > > > > > > > > + * facing APIs). The time could either be some point in the future (such as
> > > > > > > > > > + * the vblank based deadline for page-flipping, or the start of a compositor's
> > > > > > > > > > + * composition cycle), or the current time to indicate an immediate deadline
> > > > > > > > > > + * hint (Ie. forward progress cannot be made until this fence is signaled).
> > > > > > > > >
> > > > > > > > > Is it guaranteed that a GPU driver will use the actual start of the
> > > > > > > > > vblank as the effective deadline? I have some memories of seing
> > > > > > > > > something about vblank evasion browsing driver code, which I might have
> > > > > > > > > misunderstood, but I have yet to find whether this is something
> > > > > > > > > userspace can actually expect to be something it can rely on.
> > > > > > > >
> > > > > > > > I guess you mean s/GPU driver/display driver/ ? It makes things more
> > > > > > > > clear if we talk about them separately even if they happen to be the
> > > > > > > > same device.
> > > > > > >
> > > > > > > Sure, sorry about being unclear about that.
> > > > > > >
> > > > > > > >
> > > > > > > > Assuming that is what you mean, nothing strongly defines what the
> > > > > > > > deadline is. In practice there is probably some buffering in the
> > > > > > > > display controller. For ex, block based (including bandwidth
> > > > > > > > compressed) formats, you need to buffer up a row of blocks to
> > > > > > > > efficiently linearize for scanout. So you probably need to latch some
> > > > > > > > time before you start sending pixel data to the display. But details
> > > > > > > > like this are heavily implementation dependent. I think the most
> > > > > > > > reasonable thing to target is start of vblank.
> > > > > > >
> > > > > > > The driver exposing those details would be quite useful for userspace
> > > > > > > though, so that it can delay committing updates to late, but not too
> > > > > > > late. Setting a deadline to be the vblank seems easy enough, but it
> > > > > > > isn't enough for scheduling the actual commit.
> > > > > >
> > > > > > I'm not entirely sure how that would even work.. but OTOH I think you
> > > > > > are talking about something on the order of 100us? But that is a bit
> > > > > > of another topic.
> > > > >
> > > > > Yes, something like that. But yea, it's not really related. Scheduling
> > > > > commits closer to the deadline has more complex behavior than that too,
> > > > > e.g. the need for real time scheduling, and knowing how long it usually
> > > > > takes to create and commit and for the kernel to process.
> > > > >
> > > > > >
> > > > >
> > > > > 8-< *snip* 8-<
> > > > >
> > > > > > > >
> > > > > > > > You need a fence to set the deadline, and for that work needs to be
> > > > > > > > flushed. But you can't associate a deadline with work that the kernel
> > > > > > > > is unaware of anyways.
> > > > > > >
> > > > > > > That makes sense, but it might also a bit inadequate to have it as the
> > > > > > > only way to tell the kernel it should speed things up. Even with the
> > > > > > > trick i915 does, with GNOME Shell, we still end up with the feedback
> > > > > > > loop this series aims to mitigate. Doing triple buffering, i.e. delaying
> > > > > > > or dropping the first frame is so far the best work around that works,
> > > > > > > except doing other tricks that makes the kernel to ramp up its clock.
> > > > > > > Having to rely on choosing between latency and frame drops should
> > > > > > > ideally not have to be made.
> > > > > >
> > > > > > Before you have a fence, the thing you want to be speeding up is the
> > > > > > CPU, not the GPU. There are existing mechanisms for that.
> > > > >
> > > > > Is there no benefit to let the GPU know earlier that it should speed up,
> > > > > so that when the job queue arrives, it's already up to speed?
> > > >
> > > > Downstream we have input notifier that resumes the GPU so we can
> > > > pipeline the 1-2ms it takes to boot up the GPU with userspace. But we
> > > > wait to boost freq until we have cmdstream to submit, since that
> > > > doesn't take as long. What needs help initially after input is all
> > > > the stuff that happens on the CPU before the GPU can start to do
> > > > anything ;-)
> > >
> > > How do you deal with boosting CPU speeds downstream? Does the input
> > > notifier do that too?
> >
> > Yes.. actually currently downstream (depending on device) we have 1 to
> > 3 input notifiers, one for CPU boost, one for early-PSR-exit, and one
> > to get a head start on booting up the GPU.
>
> Would be really nice to upstream these, one way or the other, be it
> actually input event based, or via some uapi to just poke the kernel. I
> realize it's not related to this thread, so this is just me wishing
> things into the void.

There was a drm/input_helper proposed maybe a year or so back, mainly
for the early-PSR-exit but I was planning to build on that for early
GPU wake-up. I guess we should revisit it. Might not be the right
place for cpu boost, but it solves some problems so it's a start.

As far as uapi, I think sysfs already gives you everything or at least
most everything you need. For ex,
/sys/devices/system/cpu/cpufreq/policy*/scaling_min_freq .. on the gpu
side, for drivers using devfreq (ie. panfrost/msm/etc) there is
similar sysfs. I'm not sure what sort of knobs are avail on
intel/amd.

BR,
-R

> >
> > > >
> > > > Btw, I guess I haven't made this clear, dma-fence deadline is trying
> > > > to help the steady-state situation, rather than the input-latency
> > > > situation. It might take a frame or two of missed deadlines for
> > > > gpufreq to arrive at a good steady-state freq.
> > >
> > > I'm just not sure it will help. Missed deadlines set at commit hasn't
> > > been enough in the past to let the kernel understand it should speed
> > > things up before the next frame (which will be a whole frame late
> > > without any triple buffering which should be a last resort), so I don't
> > > see how it will help by adding a userspace hook to do the same thing.
> >
> > So deadline is just a superset of "right now" and "sometime in the
> > future".. and this has been useful enough for i915 that they have both
> > forms, when waiting on GPU via i915 specific ioctls and when pageflip
> > (assuming userspace isn't deferring composition decision and instead
> > just pushing it all down to the kernel). But this breaks down in a
> > few cases:
> >
> > 1) non pageflip (for ex. ping-ponging between cpu and gpu) use cases
> > when you wait via polling on fence fd or wait via drm_syncobj instead
> > of DRM_IOCTL_I915_GEM_WAIT
> > 2) when userspace decides late in frame to not pageflip because app
> > fence isn't signaled yet
>
> It breaks down in practice today, because we do entering the low-freq
> feedback loop that triple buffering today effectively works around.
> That is even with non-delayed page flipping, and a single pipeline
> source (compositor only rendering) or only using already signaled ready
> client buffers when compositing.
>
> Anyway, I don't doubt its usefulness, just a bit pessimistic.
>
> >
> > And this is all done in a way that doesn't help for situations where
> > you have separate kms and render devices. Or the kms driver doesn't
> > bypass atomic helpers (ie. uses drm_atomic_helper_wait_for_fences()).
> > So the technique has already proven to be useful. This series just
> > extends it beyond driver specific primitives (ie.
> > dma_fence/drm_syncojb)
> >
> > > I think input latency and steady state target frequency here is tightly
> > > linked; what we should aim for is to provide enough information at the
> > > right time so that it does *not* take a frame or two to of missed
> > > deadlines to arrive at the target frequency, as those missed deadlines
> > > either means either stuttering and/or lag.
> >
> > If you have some magic way for a gl/vk driver to accurately predict
> > how many cycles it will take to execute a sequence of draws, I'm all
> > ears.
> >
> > Realistically, the best solution on sudden input is to overshoot and
> > let freqs settle back down.
> >
> > But there is a lot more to input latency than GPU freq. In UI
> > workloads, even fullscreen animation, I don't really see the GPU going
> > above the 2nd lowest OPP even on relatively small things like a618.
> > UI input latency (touch scrolling, on-screen stylus / low-latency-ink,
> > animations) are a separate issue from what this series addresses, and
> > aren't too much to do with GPU freq.
> >
> > > That it helps with the deliberately late commit I do understand, but we
> > > don't do that yet, but intend to when there is kernel uapi to lets us do
> > > so without negative consequences.
> > >
> > > >
> > > > > >
> > > > > > TBF I'm of the belief that there is still a need for input based cpu
> > > > > > boost (and early wake-up trigger for GPU).. we have something like
> > > > > > this in CrOS kernel. That is a bit of a different topic, but my point
> > > > > > is that fence deadlines are just one of several things we need to
> > > > > > optimize power/perf and responsiveness, rather than the single thing
> > > > > > that solves every problem under the sun ;-)
> > > > >
> > > > > Perhaps; but I believe it's a bit of a back channel of intent; the piece
> > > > > of the puzzle that has the information to know whether there is need
> > > > > actually speed up is the compositor, not the kernel.
> > > > >
> > > > > For example, pressing 'p' while a terminal is focused does not need high
> > > > > frequency clocks, it just needs the terminal emulator to draw a 'p' and
> > > > > the compositor to composite that update. Pressing <Super> may however
> > > > > trigger a non-trivial animation moving a lot of stuff around on screen,
> > > > > maybe triggering Wayland clients to draw and what not, and should most
> > > > > arguably have the ability to "warn" the kernel about the upcoming flood
> > > > > of work before it is already knocking on its door step.
> > > >
> > > > The super key is problematic, but not for the reason you think. It is
> > > > because it is a case where we should boost on key-up instead of
> > > > key-down.. and the second key-up event comes after the cpu-boost is
> > > > already in it's cool-down period. But even if suboptimal in cases
> > > > like this, it is still useful for touch/stylus cases where the
> > > > slightest of lag is much more perceptible.
> > >
> > > Other keys are even more problematic. Alt, for example, does nothing,
> > > Alt + Tab does some light rendering, but Alt + KeyAboveTab will,
> > > depending on the current active applications, suddenly trigger N Wayland
> > > surfaces to start rendering at the same time.
> > >
> > > >
> > > > This is getting off topic but I kinda favor coming up with some sort
> > > > of static definition that userspace could give the kernel to let the
> > > > kernel know what input to boost on. Or maybe something could be done
> > > > with BPF?
> > >
> > > I have hard time seeing any static information can be enough, it's
> > > depends too much on context what is expected to happen. And can a BPF
> > > program really help? Unless BPF programs that pulls some internal kernel
> > > strings to speed things up whenever userspace wants I don't see how it
> > > is that much better.
> > >
> > > I don't think userspace is necessarily too slow to actively particitpate
> > > in providing direct scheduling hints either. Input processing can, for
> > > example, be off loaded to a real time scheduled thread, and plumbing any
> > > hints about future expectations from rendering, windowing and layout
> > > subsystems will be significantly easier to plumb to a real time input
> > > thread than translated into static informations or BPF programs.
> >
> > I mean, the kernel side input handler is called from irq context long
> > before even the scheduler gets involved..
> >
> > But I think you are over-thinking the Alt + SomeOtherKey case. The
> > important thing isn't what the other key is, it is just to know that
> > Alt is a modifier key (ie. handle it on key-up instead of key-down).
> > No need to over-complicate things. It's probably enough to give the
> > kernel a list of modifier+key combo's that do _something_..
>
> Perhaps I'm over thinking it, it just seems all so unnecessary to
> complicate the kernel so that it's able to predict when GUI animations
> will happen instead of the GUI itself doing it when it is actually
> beneficial. All it'd take (naively) is uapi for the three kind of boosts
> downstream now does automatically from input events.
>
> >
> > And like I've said before, keyboard input is the least problematic in
> > terms of latency. It is a _lot_ easier to notice lag with touch
> > scrolling or stylus (on screen). (The latter case, I think wayland
> > has some catching up to do compared to CrOS or android.. you really
> > need a way to allow the app to do front buffer rendering to an overlay
> > for the stylus case, because even just 16ms delay is _very_
> > noticeable.)
>
> Sure, but here too userpsace (rt thread in the compositor) is probably a
> good enough place to predict when to boost since it will be the one
> proxies e.g. the stylus input events to the application.
>
> Front buffering on the other hand is a very different topic ;)
>
>
> Jonas
>
> >
> > BR,
> > -R