RE: [PATCH v7 1/8] Talitos: Support for async_tx XOR offload

From: Liu Qiang-B32616
Date: Wed Sep 12 2012 - 05:45:04 EST

> >> Will this engine be coordinating with another to handle memory copies?
> >> The dma mapping code for async_tx/raid is broken when dma mapping
> >> requests overlap or cross dma device boundaries [1].
> >>
> >> [1]:
> > Yes, it needs fsl-dma to handle memcpy copies.
> > I read your link, the unmap address is stored in talitos hwdesc, the
> address will be unmapped when async_tx ack this descriptor, I know fsl-
> dma won't wait this ack flag in current kernel, so I fix it in fsl-dma
> patch 5/8. Do you mean that?
> Unfortunately no. I'm open to other suggestions. but as far as I can
> see it requires deeper changes to rip out the dma mapping that happens
> in async_tx and the automatic unmapping done by drivers. It should
> all be pushed to the client (md).
> Currently async_tx hides hardware details from md such that it doesn't
> even care if the operation is offloaded to hardware at all, but that
> takes things too far. In the worst case an copy->xor chain handled by
> multiple channels results in :
> 1/ dma_map(copy_chan...)
> 2/ dma_map(xor_chan...)
> 3/ <exec copy>
> 4/ dma_unmap(copy_chan...)
> 5/ <exec xor> <---initiated by the copy_chan
> 6/ dma_unmap(xor_chan...)
> Step 2 violates the dma api since the buffers belong to the xor_chan
> until unmap. Step 5 also causes the random completion context of the
> copy channel to bleed into submission context of the xor channel which
> is problematic. So the order needs to be:
> 1/ dma_map(copy_chan...)
> 2/ <exec copy>
> 3/ dma_unmap(copy_chan...)
> 4/ dma_map(xor_chan...)
> 5/ <exec xor> <--initiated by md in a static context
> 6/ dma_unmap(xor_chan...)
> Also, if xor_chan and copy_chan lie with the same dma mapping domain
> (iommu or parent device) then we can map the stripe once and skip the
> extra maintenance for the duration of the chain of operations. This
> dumps a lot of hardware details on md, but I think it is the only way
> to get consistent semantics when arbitrary offload devices are
> involved.
Thanks for your answer and links, I did some investigate these days,

first, powerpc processor should be hardware assured cache coherency, it should
be ok for hardware when in step 5 (but I will avoid map same address on different

second, I have a workaround to make dma_map/unmap by order when using 2 different
device to offload, I will submit next descriptor until current descriptor complete,
if (submit->flags & ASYNC_TX_ACK)
if (depend_tx)

+ /* do more check to support 2 devices offload? */
+ if (dma_wait_for_async_tx(tx) == DMA_ERROR)
+ panic("%s: DMA_ERROR waiting for tx\n", __func__);

Also use your example,
1/ dma_map(copy_chan...)
2/ tx->submit(tx); async_tx_ack(tx);
3/ dma_unmap(copy_chan...)
4/ dma_map(xor_chan...)
5/ <exec xor> <-- initialized by tx->submit(tx);
6/ dma_unmap(xor_chan...)

Under this way, actually dma_run_dependency() is useless, so this can make sure copy
and xor with same page processed by order, and only one descriptor per channel is
served. dma_unmap in driver is controlled by client (tx->flags)

How's you thinking or any suggestions? I test it on our powerpc, I don't know whether
it does work on other architecture.


> --
> Dan

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at