Re: DMA Engine: Transfer From Userspace

From: Peter Ujfalusi
Date: Tue Jun 30 2020 - 08:30:38 EST




On 29/06/2020 18.18, Thomas Ruf wrote:
>
>> On 26 June 2020 at 12:29 Peter Ujfalusi <peter.ujfalusi@xxxxxx> wrote:
>>
>> On 24/06/2020 16.58, Thomas Ruf wrote:
>>>
>>>> On 24 June 2020 at 14:07 Peter Ujfalusi <peter.ujfalusi@xxxxxx> wrote:
>>>> On 24/06/2020 12.38, Vinod Koul wrote:
>>>>> On 24-06-20, 11:30, Thomas Ruf wrote:
>>>>>
>>>>>> To make it short - i have two questions:
>>>>>> - what are the chances to revive DMA_SG?
>>>>>
>>>>> 100%, if we have an in-kernel user
>>>>
>>>> Most DMAs cannot handle differently provisioned sg_lists for src and dst.
>>>> Even if they could handle a non-symmetric SG setup, it would require an
>>>> entirely different configuration (two independent channels sending the
>>>> data to each other, one reads, the other writes?).
>>>
>>> Ok, I implemented that using zynqmp_dma on a Xilinx Zynq platform (obviously ;-) and it works nicely for us.
>>
>> I see. If the HW does not support it, then something along the lines of
>> what atc_prep_dma_sg did can be implemented for most engines.
>>
>> In essence: create a new set of sg_list which is symmetric.
>
> Sorry, not sure if I understand you right?
> Are you suggesting that, in case DMA_SG gets revived, we should restrict support to symmetric sg_lists?

No, not at all. That would not make much sense.

> Just had a glance at the deleted code, and the *_prep_dma_sg of these drivers had code to support asymmetric lists and, by that, "unaligned" memory (relative to page start):
> at_hdmac.c
> dmaengine.c
> dmatest.c
> fsldma.c
> mv_xor.c
> nbpfaxi.c
> ste_dma40.c
> xgene-dma.c
> xilinx/zynqmp_dma.c
>
> Why not just revive that and keep this nice functionality? ;-)

What I'm saying is that the drivers (at least at_hdmac) in essence
create an aligned sg_list out of the received non-aligned ones.
They do this without actually creating the sg_list itself, but that's
just a small detail.

In the longer run, what might make sense is a helper function to
convert two non-symmetric sg_lists into two symmetric ones, so that
drivers do not have to re-implement the same code and only need to
care about symmetric sg_lists.

Note, some DMAs can actually handle non-symmetric src and dst lists,
but I believe that is rare.
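
A minimal sketch of what such a helper could look like (the iterator
struct and function are made up for illustration, nothing like this
exists in dmaengine today; sg_dma_address()/sg_dma_len()/sg_next() are
the real scatterlist accessors):

#include <linux/kernel.h>
#include <linux/scatterlist.h>

/* Walk two differently laid out, already DMA-mapped scatterlists and
 * return the next symmetric chunk on each call. */
struct sym_sg_iter {
	struct scatterlist *src, *dst;
	size_t src_off, dst_off;	/* bytes consumed in current entries */
};

static bool sym_sg_next(struct sym_sg_iter *it, dma_addr_t *src,
			dma_addr_t *dst, size_t *len)
{
	if (!it->src || !it->dst)
		return false;

	*src = sg_dma_address(it->src) + it->src_off;
	*dst = sg_dma_address(it->dst) + it->dst_off;
	/* the symmetric chunk ends where the shorter remainder ends */
	*len = min_t(size_t, sg_dma_len(it->src) - it->src_off,
		     sg_dma_len(it->dst) - it->dst_off);

	it->src_off += *len;
	it->dst_off += *len;

	/* advance whichever list entry got fully consumed */
	if (it->src_off == sg_dma_len(it->src)) {
		it->src = sg_next(it->src);
		it->src_off = 0;
	}
	if (it->dst_off == sg_dma_len(it->dst)) {
		it->dst = sg_next(it->dst);
		it->dst_off = 0;
	}

	return true;
}

A driver (or a core helper building the two new sg_lists) would call
sym_sg_next() in a loop and issue one HW descriptor per returned chunk,
so the engine itself only ever sees equal-length src/dst segments.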

>> What might be plausible is to introduce hw offloading support for memcpy
>> type of operations, in a similar fashion to how, for example, crypto does it?
>
> Sounds good to me, my proxy driver implementation could be a good start for that, too!

It needs to find its place as well... I'm not sure where that would be.
Simple block-copy offload, sg copy offload, interleaved (frame
extraction) offload, and dmabuf copy offload come to mind as candidates.
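
For reference, a minimal sketch of the existing in-kernel dmaengine
memcpy path that such an offload would sit on top of (the helper name
is made up, src/dst are assumed to be DMA-mapped already, error
handling is trimmed):

#include <linux/completion.h>
#include <linux/dmaengine.h>

static void memcpy_done(void *arg)
{
	complete(arg);
}

static int memcpy_offload(dma_addr_t dst, dma_addr_t src, size_t len)
{
	DECLARE_COMPLETION_ONSTACK(done);
	struct dma_async_tx_descriptor *tx;
	struct dma_chan *chan;
	dma_cap_mask_t mask;
	dma_cookie_t cookie;

	/* grab any channel that advertises the DMA_MEMCPY capability */
	dma_cap_zero(mask);
	dma_cap_set(DMA_MEMCPY, mask);
	chan = dma_request_chan_by_mask(&mask);
	if (IS_ERR(chan))
		return PTR_ERR(chan);

	tx = dmaengine_prep_dma_memcpy(chan, dst, src, len,
				       DMA_PREP_INTERRUPT);
	if (!tx) {
		dma_release_channel(chan);
		return -EIO;
	}

	tx->callback = memcpy_done;
	tx->callback_param = &done;
	cookie = dmaengine_submit(tx);
	dma_async_issue_pending(chan);

	wait_for_completion(&done);
	dma_release_channel(chan);

	return dma_submit_error(cookie) ? -EIO : 0;
}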

>> The issue with user-space-implemented logic is that it is not portable
>> between systems with different DMAs. It might be that on one DMA the
>> setup takes longer than doing a CPU copy of X bytes, while on another
>> DMA it might be significantly lower or higher.
>
> Fully agree with that!
> I was also unsure how my approach would perform, but in our case the latency increased by ~20% while the CPU load stayed roughly the same; of course, this was the benchmark from user memory to user memory.
> From uncached to user memory, the DMA was around 15 times faster.

It depends on the size of the transfer. Lots of small individual
transfers might be worse via DMA due to the setup time, completion
handling, etc.

>> Using CPU vs DMA for a copy in certain lengths and setups should not be
>> a concern of the user space.
>
> Also fully agree with that!

There is one big issue with the fallback to CPU copy... If you used
DMA, then you might need to do cache operations to get things into
their right place.
If you have done it with the CPU, then you most likely do not need to
care about that.
Handling this should be done at a level where we are aware of which
path was taken.
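
A short illustration of the asymmetry, assuming a streaming DMA mapping
(used_dma, dev, dst_handle and the buffers are placeholders):

	if (used_dma) {
		/* HW wrote to memory behind the CPU's back:
		 * sync before the CPU reads the result */
		dma_sync_single_for_cpu(dev, dst_handle, len,
					DMA_FROM_DEVICE);
	} else {
		/* CPU copy goes through the cache, no maintenance needed */
		memcpy(dst, src, len);
	}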

>> Yes, you have a closed system with controlled parameters, but a generic
>> mem2mem_offload framework should be usable on other setups, and the same
>> binary should work on different DMAs, where one is not efficient for
>> <512 bytes and the other shows benefits under 128 bytes.
>
> Usable: of course
> "Faster": not necessarily as long as it is an option
>
> Thanks for your valuable input and suggestions!
>
> best regards,
> Thomas
>

- Péter

Texas Instruments Finland Oy, Porkkalankatu 22, 00180 Helsinki.
Y-tunnus/Business ID: 0615521-4. Kotipaikka/Domicile: Helsinki