Re: [RFC][PATCH 0/2] Avoiding DWC3 transfer stalls/hangs when using adb over f_fs

From: John Stultz
Date: Wed Feb 05 2020 - 16:04:05 EST


On Thu, Jan 23, 2020 at 9:46 AM Felipe Balbi <balbi@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> Since ~4.20, when the functionfs gadget enabled scatter-gather
> >>>>> support, we have seen problems with adb connections stalling and
> >>>>> stopping to function on hardware with dwc3 usb controllers.
> >>>>> Specifically, HiKey960, Dragonboard 845c, and Pixel3 devices.
...
> >>
> >> I'm pretty sure this should be solved at the DMA API level, just want to confirm.
> >
> > I have sent you the tracepoints long time ago. Also my analysis of the
> > problem (BTW, I don't think the tracepoints helped much). It's
> > basically a logic problem in function dwc3_gadget_ep_reclaim_trb_sg().
>
> AFAICT, this is caused by DMA API merging pages together when map an
> sglist for DMA. While doing that, it does *not* move the SG_END flag
> which sg_is_last() checks.
>
> I consider that an overlook on the DMA API, wouldn't you? Why should DMA
> API users care if pages were merged or not while mapping the sglist? We
> have for_each_sg() and sg_is_last() for a reason.
>

>From an initial look, I agree this is pretty confusing. dma_map_sg()
can coalesce entries in the sg list, modifying the sg entires
themselves, however, in doing so it doesn't modify the number of
entries in the sglist (nor the end state bit). That's pretty subtle!

My initial naive attempt to fix the dma-iommu path to set the end bit
at the tail of __finalize_sg() which does the sg entry modifications,
only caused trouble elsewhere, as there's plenty of logic that expects
the number of entries to not change, so having sg_next() return NULL
before that point results in lots of null pointer traversals.

Further, looking at the history, while apparently quirky, this has
been the documented behavior in DMA-API.txt for over almost 14 years
(at least). It clearly states that that dma_map_api can return fewer
mapped entries then sg entries, and one should loop only that many
times (for_each_sg() having a max number of entries to iterate over
seems specifically for this purpose). Additionally, it says one must
preserve the original number of entries (not # mapped entries) for
dma_unmap_sg().

So I'm not sure that sg_is_last() is really valid for iterating on
mapped sg lists.

Should it be? Probably (at least with my unfamiliar eyes), but
sg_is_last() has been around for almost as long coexisting with this
behavioral quirk, so I'm also not sure this is the best hill for the
dwc3 driver to die on. :)

The fix here:
https://lore.kernel.org/lkml/20200122222645.38805-3-john.stultz@xxxxxxxxxx/
Or maybe the slightly cleaner varient here:
https://git.linaro.org/people/john.stultz/android-dev.git/commit/?h=dev/db845c-mainline-WIP&id=61a5816aa71ec719e77df9f2260dbbf6bcec7c99
seems like it would correctly address things following the
documentation and avoid the failures we're seeing.

As to if dma_map_sg() should fixup the state bits or properly shrink
the sg list when it coalesces entries, that seems like it would be a
much more intrusive change across quite a bit of the kernel that was
written to follow the documented method. So my confidence that such a
change would make it upstream in a reasonable amount of time isn't
very high, and it seems like a bad idea to block the driver from
working properly in the meantime.

Pulling in Christoph and Jens as I suspect they have more context on
this, and maybe can explain thats its not so quirky with the right
perspective?

Thoughts? Maybe there is an easier way to make it less quirky then
what I imagine?

thanks
-john