Re: [RFC PATCH v1] media: uvcvideo: Cache URB header data before processing

From: Tomasz Figa
Date: Wed Aug 08 2018 - 00:09:23 EST


On Tue, Jul 31, 2018 at 1:00 AM Laurent Pinchart
<laurent.pinchart@xxxxxxxxxxxxxxxx> wrote:
>
> Hi Keiichi,
>
> (CC'ing Alan, Ezequiel and Matwey)
>
> Thank you for the patch.
>
> On Wednesday, 27 June 2018 13:34:08 EEST Keiichi Watanabe wrote:
> > On some platforms with non-coherent DMA (e.g. ARM), USB drivers use
> > uncached memory allocation methods. In such situations, it sometimes
> > takes a long time to access URB buffers. This can be a cause of video
> > flickering problems if the resolution is high and the USB controller
> > has a very tight time limit (e.g. dwc2). To avoid this problem, we copy
> > header data from the (uncached) URB buffer into a (cached) local buffer.
> >
> > This change should make the elapsed time of the interrupt handler
> > shorter on platforms with non-coherent DMA. We measured the elapsed
> > time of each callback of uvc_video_complete without/with this patch
> > while capturing Full HD video in
> > https://webrtc.github.io/samples/src/content/getusermedia/resolution/.
> > I tested it on the top of Kieran Bingham's Asynchronous UVC series
> > https://www.mail-archive.com/linux-media@xxxxxxxxxxxxxxx/msg128359.html.
> > The test device was Jerry Chromebook (RK3288) with Logitech Brio 4K.
> > I collected data for 5 seconds. (There were around 480 callbacks in
> > this case.) The following result shows that this patch makes
> > uvc_video_complete about 2x faster.
> >
> >             | average | median  | min     | max      | standard deviation
> > w/o caching | 45319ns | 40250ns | 33834ns | 142625ns | 16611ns
> > w/ caching  | 20620ns | 19250ns | 12250ns | 56583ns  | 6285ns
> >
> > In addition, we confirmed that this patch doesn't make it worse on
> > coherent DMA architecture by performing the same measurements on a
> > Broadwell Chromebox with the same camera.
> >
> >             | average | median  | min     | max     | standard deviation
> > w/o caching | 21026ns | 21424ns | 12263ns | 23956ns | 1932ns
> > w/ caching  | 20728ns | 20398ns | 8922ns  | 45120ns | 3368ns
>
> This is very interesting, and it seems related to
> https://patchwork.kernel.org/patch/10468937/. You might have seen that
> discussion as you got CC'ed at some point.
>
> I wonder whether performance couldn't be further improved by allocating the
> URB buffers cached, as that would speed up the memcpy() as well. Have you
> tested that by any chance?

We haven't measured it, but the issue being solved here was indeed
significantly reduced by using cached URB buffers, even without
Kieran's async series. After we discovered the latter, we just
backported it and decided to further tweak the last remaining bit, to
avoid playing too much with the DMA API in code used in production on
several different platforms (including both ARM and x86).
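
For readers following along, the header caching idea can be sketched in
plain userspace C roughly as below. This is only an illustration under
stated assumptions, not the code from the patch: parse_cached_header(),
struct hdr_info and the HDR_* macros are hypothetical names, and the
layout assumed is the UVC payload header (bHeaderLength, bmHeaderInfo,
then an optional 4-byte little-endian PTS). The slow buffer is read once
by memcpy() and every later access goes to the local copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for illustration; not identifiers from the patch. */
#define HDR_CACHE_MAX 12          /* 2 fixed bytes + 4-byte PTS + 6-byte SCR */
#define HDR_FLAG_PTS  (1u << 2)   /* bmHeaderInfo: PTS field present */

struct hdr_info {
	uint8_t  length;   /* bHeaderLength */
	uint8_t  flags;    /* bmHeaderInfo */
	int      has_pts;  /* non-zero if pts is valid */
	uint32_t pts;      /* presentation timestamp, if present */
};

/*
 * Copy the header out of the (possibly uncached) URB buffer in one pass,
 * then parse only the local copy. Returns 0 on success, -1 if the header
 * is malformed.
 */
int parse_cached_header(const uint8_t *urb_buf, size_t len,
			struct hdr_info *out)
{
	uint8_t local[HDR_CACHE_MAX];
	size_t n;

	if (len < 2)
		return -1;

	n = len < sizeof(local) ? len : sizeof(local);
	memcpy(local, urb_buf, n);	/* the only read of the slow buffer */

	out->length = local[0];
	out->flags = local[1];
	out->has_pts = 0;
	out->pts = 0;

	if (out->length < 2 || out->length > len)
		return -1;

	/* length >= 6 and length <= len imply that n >= 6 here. */
	if ((out->flags & HDR_FLAG_PTS) && out->length >= 6) {
		out->pts = (uint32_t)local[2] |
			   ((uint32_t)local[3] << 8) |
			   ((uint32_t)local[4] << 16) |
			   ((uint32_t)local[5] << 24);
		out->has_pts = 1;
	}

	return 0;
}
```

A caller would pass the start of each payload and its length, and use
only the fields in struct hdr_info afterwards instead of going back to
the URB buffer.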

If you think we could change the driver to use cached buffers instead
(like the pwc driver mentioned in the other thread), I obviously
wouldn't have anything against it.

Best regards,
Tomasz