Re: [RFC 00/10] V4L2 explicit synchronization support

From: Javier Martinez Canillas
Date: Wed Apr 05 2017 - 13:13:39 EST


Hello Gustavo,

On 04/05/2017 11:09 AM, Gustavo Padovan wrote:
> 2017-04-03 Javier Martinez Canillas <javier@xxxxxxxxxxxxxxx>:
>
>> Hello Mauro and Gustavo,
>>
>> On 04/03/2017 07:16 AM, Mauro Carvalho Chehab wrote:
>>> Hi Gustavo,
>>>
>>> Em Mon, 13 Mar 2017 16:20:25 -0300
>>> Gustavo Padovan <gustavo@xxxxxxxxxxx> escreveu:
>>>
>>>> From: Gustavo Padovan <gustavo.padovan@xxxxxxxxxxxxx>
>>>>
>>>> Hi,
>>>>
>>>> This RFC adds support for Explicit Synchronization of shared buffers in V4L2.
>>>> It uses the Sync File Framework[1] as vector to communicate the fences
>>>> between kernel and userspace.
>>>
>>> Thanks for your work!
>>>
>>> I looked on your patchset, and I didn't notice anything really weird
>>> there. So, instead on reviewing patch per patch, I prefer to discuss
>>> about the requirements and API, as depending on it, the code base will
>>> change a lot.
>>>
>>
>> Agree that's better to first set on an uAPI and then implement based on that.
>>
>>> I'd like to do some tests with it on devices with mem2mem drivers.
>>> My plan is to use an Exynos board for such thing, but I guess that
>>> the DRM driver for it currently doesn't. I'm seeing internally if someone
>>> could be sure that Exynos driver upstream will become ready for such
>>> tests.
>>>
>>
>> Not sure if you should try to do testing before agreeing on an uAPI and
>> implementation.
>>
>>> Javier wrote some patches last year meant to implement implicit
>>> fences support. What we noticed is that, while his mechanism worked
>>> fine for pure capture and pure output devices, when we added a mem2mem
>>> device, on a DMABUF+fences pipeline, e. g.:
>>>
>>> sensor -> [m2m] -> DRM
>>>
>>> End everything using fences/DMABUF, the fences mechanism caused dead
>>> locks on existing userspace apps.
>>>
>>> A m2m device has both capture and output devnodes. Both should be
>>> queued/dequeued. The capture queue is synchronized internally at the
>>> driver with the output buffer[1].
>>>
>>> [1] The names here are counter-intuitive: "capture" is a devnode
>>> where userspace receives a video stream; "output" is a devnode where
>>> userspace feeds a video stream.
>>>
>>> The problem is that adding implicit fences changed the behavior of
>>> the ioctls, causing gstreamer to wait forever for buffers to be ready.
>>>
>>
>> The problem was related to trying to make user-space unaware of the implicit
>> fences support, and so it tried to QBUF a buffer that had already a pending
>> fence. A workaround was to block the second QBUF ioctl if the buffer had a
>> pending fence, but this caused the mentioned deadlock since GStreamer wasn't
>> expecting the QBUF ioctl to block.
>>
>>> I suspect that, even with explicit fences, the behavior of Q/DQ
>>> will be incompatible with the current behavior (or will require some
>>> dirty hacks to make it identical).
>
> For QBUF the only difference is that we set flags for fences and pass
> and receives in and out fences. For DQBUF the behavior is exactly the
> same. What incompatibles or hacks do you see?
>
> I had the expectation that the flags would be for userspace to learn
> about any different behavior.
>
>>>
>>> So, IMHO, the best would be to use a new set of ioctls, when fences are
>>> used (like VIDIOC_QFENCE/DQFENCE).
>>>
>>
>> For explicit you can check if there's an input-fence so is different than
>> implicit, but still I agree that it would be better to have specific ioctls.
>
> I'm pretty new to v4l2 so I don't know all use cases yet, but what I
> thought was to just add extra flags to QBUF to mark when using fences
> instead of having userspace to setup completely new ioctls for fences.
> The burden for userspace should be smaller with flags.
>

Yes, you are right. So I guess its better indeed to just extend the current
ioctls as you propose and only move to new ones if really needed.

>>
>>>>
>>>> I'm sending this to start the discussion on the best approach to implement
>>>> Explicit Synchronization, please check the TODO/OPEN section below.
>>>>
>>>> Explicit Synchronization allows us to control the synchronization of
>>>> shared buffers from userspace by passing fences to the kernel and/or
>>>> receiving them from the the kernel.
>>>>
>>>> Fences passed to the kernel are named in-fences and the kernel should wait
>>>> them to signal before using the buffer. On the other side, the kernel creates
>>>> out-fences for every buffer it receives from userspace. This fence is sent back
>>>> to userspace and it will signal when the capture, for example, has finished.
>>>>
>>>> Signalling an out-fence in V4L2 would mean that the job on the buffer is done
>>>> and the buffer can be used by other drivers.
>>>>
>>>> Current RFC implementation
>>>> --------------------------
>>>>
>>>> The current implementation is not intended to be more than a PoC to start
>>>> the discussion on how Explicit Synchronization should be supported in V4L2.
>>>>
>>>> The first patch proposes an userspace API for fences, then on patch 2
>>>> we prepare to the addition of in-fences in patch 3, by introducing the
>>>> infrastructure on vb2 to wait on an in-fence signal before queueing the buffer
>>>> in the driver.
>>>>
>>>> Patch 4 fix uvc v4l2 event handling and patch 5 configure q->dev for vivid
>>>> drivers to enable to subscribe and dequeue events on it.
>>>>
>>>> Patches 6-7 enables support to notify BUF_QUEUED events, i.e., let userspace
>>>> know that particular buffer was enqueued in the driver. This is needed,
>>>> because we return the out-fence fd as an out argument in QBUF, but at the time
>>>> it returns we don't know to which buffer the fence will be attached thus
>>>> the BUF_QUEUED event tells which buffer is associated to the fence received in
>>>> QBUF by userspace.
>>>>
>>>> Patches 8 and 9 add more fence infrastructure to support out-fences and finally
>>>> patch 10 adds support to out-fences.
>>>>
>>>> TODO/OPEN:
>>>> ----------
>>>>
>>>> * For this first implementation we will keep the ordering of the buffers queued
>>>> in videobuf2, that means we will only enqueue buffer whose fence was signalled
>>>> if that buffer is the first one in the queue. Otherwise it has to wait until it
>>>> is the first one. This is not implmented yet. Later we could create a flag to
>>>> allow unordered queing in the drivers from vb2 if needed.
>>>
>>> The V4L2 spec doesn't warrant that the buffers will be dequeued at the
>>> queue order.
>>>
>>> In practice, however, most drivers will not reorder. Yet, mem2mem codec
>>> drivers may reorder the buffers at the output, as the luminance information
>>> (Y) usually comes first on JPEG/MPEG-like formats.
>>>
>>>> * Should we have out-fences per-buffer or per-plane? or both? In this RFC, for
>>>> simplicity they are per-buffer, but Mauro and Javier raised the option of
>>>> doing per-plane fences. That could benefit mem2mem and V4L2 <-> GPU operation
>>>> at least on cases when we have Capture hw that releases the Y frame before the
>>>> other frames for example. When using V4L2 per-plane out-fences to communicate
>>>> with KMS they would need to be merged together as currently the DRM Plane
>>>> interface only supports one fence per DRM Plane.
>>>
>>> That's another advantage of using a new set of ioctls for queues: with that,
>>> queuing/dequeing per plane will be easier. On codec drivers, doing it per
>>> plane could bring performance improvements.
>>>
>>
>> You don't really need to Q/DQ on a per plane basis AFAICT. Since on QBUF you
>> can get a set of out-fences that can be passed to the other driver and so it
>> should be able to wait per fence.
>>
>>>> In-fences should be per-buffer as the DRM only has per-buffer fences, but
>>
>> I'm not that familiar with DRM, but I thought DRM fences was also per plane
>> and not per buffer.
>
> DRM plane is a different thing, its a representation of a region on the
> screen and there is only one buffer for each DRM plane.
>

Yes, but you can still have different DRM planes, right? For example a RGB
primary plane and a YUV overlay plane for video display. That's why I thought
there would be some mapping between the v4l2 planes and DRM planes since there's
a way for DRM to import dma-bufs exported by v4l2 and vice-versa.

> One of the questions I raised was: how to match V4L2 per-plane fences to
> DRM per-buffer fences?
>

I don't think that DRM fences should be per-buffer but instead per dma-buf.
It's OK if there's a 1:1 mapping between DRM buffer and dma-buf, but that's
not the case for v4l2. You can have many dma-bufs in a v4l2_buffer, one per
each v4l2 plane.

IOW, it shouldn't be about DRM buffers waiting for fences associated with
v4l2 buffers, but instead about DRM waiting for fences associated with
dma-bufs that were exported by v4l2 and imported by DRM.

>>
>> How this works without fences? For V4L2 there's a dma-buf fd per plane and
>> so I was expecting the DRM API to also import a dma-buf fd per DRM plane.
>
> Yes. It should do something similar behind the Framebuffer abstraction.
>
>>
>> I only have access to an Exynos board whose display controller supports
>> single plane formats, so I don't know how this works for multi planar.
>>
>>>> in case of mem2mem operations per-plane fences might be useful?
>>>>
>>>> So should we have both ways, per-plane and per-buffer, or just one of them
>>>> for now?
>>>
>>> The API should be flexible enough to support both usecases. We could
>>> implement just per-buffer in the beginning, but, on such case, we
>>> should deploy an API that will allow to later add per-plane fences
>>> without breaking userspace.
>
> I believe we can just extend the per-plane parts of QBUF for their
> fences. We could even use plane[0] for the per-buffer case.
>

Agreed.

>>>
>>> So, I prefer that, on multiplane fences, we have one fence per plane,
>>> even if, at the first implementation, all fences will be released
>>> at the same time, when the buffer is fully filled. That would allow
>>> us to later improve it, without changing userspace.
>>
>> It's true that vb2 can't currently signal fences per plane since the interface
>> between vb2 and drivers is per vb2_buffer. But the uAPI shouldn't be restricted
>> by this implementation detail (that can be changed) and should support per plane
>> fences IMHO.
>>
>> That's for example the case with the V4L2 dma-buf API. There is a dma-buf fd per
>> plane, and internally for vb2 single planar buffers use the dma-buf associated
>> with plane 0.
>>
>> Now when mentioning this, I noticed that in your implementation the fences are
>> not associated with a dma-buf. I thought the idea was for the fences to be
>> associated with a dma-buf's reservation object. If we do that, then fences will
>> be per fence since the dma-buf/reservation objet are also per fence in v4l2/vb2.
>

Argh, I noticed that I made a lot of typos here. What I meant was:

"then fences will be per v4l2 plane since the dma-buf/reservation object are also
per v4l2 plane in v4l2/vb2."

So what I tried to say is that each v4l2 plane has an associated dma-buf, and so
if fences are per plane, it will also be per dma-buf. I guess at the DRM side is
similar, you have dma-buf associated with DRM planes that fences should protect.

> Can you explain what you were thinking on the relation between fences
> and reservation objects? Not sure I follow.
>

As mentioned above, I believe that fences should be associated with dma-bufs.
And a dma-buf already has a reservation object that can be used to associate
fences to a dma-buf. You can add shared or exclusive fences and that be looked
up by the other driver.

That's what implicit fences uses and I believe explicit fences should also use
this infrastructure and export these fences with sync_file_create() as you do.

> Gustavo
>

Best regards,
--
Javier Martinez Canillas
Open Source Group
Samsung Research America