Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

From: Joerg Roedel
Date: Fri Nov 11 2011 - 09:18:25 EST


On Fri, Nov 11, 2011 at 01:27:28PM +0000, David Woodhouse wrote:
> On Fri, 2011-11-11 at 13:58 +0100, Joerg Roedel wrote:
> > For AMD IOMMU there is a feature called not-present cache. It says that
> > the IOMMU caches non-present entries as well and needs an IOTLB flush
> > when something is mapped (meant for software implementations of the
> > IOMMU).
> > So it can't be really taken out of the fast-path. But the IOMMU driver
> > can optimize the function so that it only flushes the IOTLB when there
> > was an unmap-call before.
>
> We have exactly the same situation with the Intel IOMMU (we call it
> 'Caching Mode') for the same reasons.
>
> I'd be wary about making the IOMMU driver *track* whether there was an
> unmap call before -- that seems like hard work and more cache contention,
> especially if the ->commit() call happens on a CPU other than the one
> that just did the unmap.

Well, the IOMMU driver does not need to track this globally or
per-IOMMU. It basically tracks it per-domain. This lowers the
cache-bouncing a bit.
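
Something like the following is what I have in mind -- just a sketch,
the names my_domain/my_unmap/my_flush_iotlb are made up for the example
and locking is left out:

#include <stdbool.h>
#include <stddef.h>

/*
 * Per-domain tracking of "an unmap happened since the last flush".
 * The real driver would hang this flag off its own domain structure.
 */
struct my_domain {
	bool need_flush;	/* set on unmap, cleared after a flush */
};

static void my_unmap(struct my_domain *dom, unsigned long iova, size_t size)
{
	/* ... clear the page-table entries for [iova, iova + size) ... */
	dom->need_flush = true;
}

static void my_flush_iotlb(struct my_domain *dom)
{
	if (!dom->need_flush)
		return;		/* only mappings were added, skip the flush */

	/* ... issue the hardware IOTLB invalidation for this domain ... */
	dom->need_flush = false;
}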

> I'm also not sure exactly when you'd call the ->commit() function when
> the DMA API is being used, and which 'side' of that API the
> deferred-flush optimisations would live.

The commit function is called every time before one of the DMA-API
functions return.

> Would the optimisation be done on the generic side, only calling
> ->commit when it absolutely *has* to happen? (Or periodically after
> unmaps have happened to avoid entries hanging around for ever?)
>
> Or would the optimisation be done in the IOMMU driver, thus turning the
> ->commit() function into more of a *hint*? You could add a 'simon_says'
> boolean argument to it, I suppose...?

The IOMMU driver can't do that because it requires special knowledge
about the address allocator. So it has to happen in the generic part.
Okay, thinking about it some more, that optimization does not really
work with the ->commit() idea: requiring commit() to be called on the
map-side as well would result in far too many flushes.

So it may be better to just call it iommu_flush_tlb(domain) or
something. The user of the IOMMU-API can assume that the flush is only
necessary on the unmap-path. Stricter flushing rules need to be handled
in the IOMMU driver itself.
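
Roughly like this, just to illustrate the semantics. This is only a
sketch; my_iommu_ops, my_unmap_range and friends are invented names,
not existing kernel interfaces:

#include <stddef.h>

struct my_iommu_domain;

struct my_iommu_ops {
	int  (*unmap)(struct my_iommu_domain *dom,
		      unsigned long iova, size_t size);
	void (*flush_tlb)(struct my_iommu_domain *dom);
};

struct my_iommu_domain {
	const struct my_iommu_ops *ops;
};

/*
 * The generic user batches several unmaps and flushes once afterwards.
 * An IOMMU with stricter rules (e.g. not-present caching) handles the
 * additional flushes internally on its map path.
 */
static void my_unmap_range(struct my_iommu_domain *dom,
			   unsigned long iova, size_t size, size_t page)
{
	size_t off;

	for (off = 0; off < size; off += page)
		dom->ops->unmap(dom, iova + off, page);

	dom->ops->flush_tlb(dom);	/* one flush for the whole range */
}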

> > It is also an improvement over the current
> > situation where every iommu_unmap call results in a flush implicitly.
> > This is pretty much a no-go for using the IOMMU-API in DMA mapping at the
> > moment.
>
> Right. That definitely needs to be handled. We just need to work out the
> (above and other) details.

Yes, I am convinced now that the commit() idea is not the right solution
:)
Less generic semantics like an iommu_flush_tlb call-back would work
better with such an optimization.

> Certainly your concerns are valid, but I think we can cope with them
> fairly reasonably. If we *do* have large number of CPUs allocating for a
> given domain, we can move to a per-node rather than per-CPU allocator.
> And we can have dynamically sized allocation regions, so we aren't
> wasting too much space on unused bitmaps if you map just *one* page from
> each of your 4096 CPUs.

It also more or less requires that the chunks are reasonably small. On a
typical small system the amount of allocated DMA memory is rather small
(GPUs are the exception, of course).

Btw, do you have numbers on how much time is spent in spinlocks on
multi-node machines with the VT-d driver for some workloads? I remember
that I did such a measurement some time ago on a 24-core AMD machine with
netperf on a 10GBit network card and the time spent spinning came out
at ~5%.
For 1Gbit network and disk-load it was around 0.5% on that machine.

> > How much lock contention will be lowered also depends on the work-load.
> > If dma-handles are frequently freed from another cpu than they were
> > allocated from the same problem re-appears.
>
> The idea is that dma handles are *infrequently* freed, in batches. So
> we'll bounce the lock's cache line occasionally, but not all the time.

Hmm, btw, how does this batch-freeing work? Is the freeing done when
the address space is filled up, or is another approach used?
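
Is it something along these lines? Just my guess, sketched with
invented names (my_flush_queue/my_queue_free) and without the timer and
locking details:

#include <stddef.h>

#define MY_BATCH_SIZE 64

struct my_free_entry {
	unsigned long iova;
	size_t size;
};

struct my_flush_queue {
	struct my_free_entry entries[MY_BATCH_SIZE];
	unsigned int count;
};

static void my_queue_free(struct my_flush_queue *q,
			  unsigned long iova, size_t size)
{
	q->entries[q->count].iova = iova;
	q->entries[q->count].size = size;

	if (++q->count == MY_BATCH_SIZE) {
		/* ... one IOTLB flush covering everything queued ... */
		/* ... then give all queued ranges back to the allocator */
		q->count = 0;
	}
}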

> > But in the end we have to try it out and see what works best :)
>
> Indeed. I'm just trying to work out if I should try to do the allocator
> thing purely inside the Intel code first, and then try to move it out
> and make it generic -- or if I should start with making the DMA API work
> with a wrapper around the IOMMU API, with your ->commit() and other
> necessary changes. I think I'd prefer the latter, if we can work out how
> it should look.

I think it is best to start with a generic DMA-mapping
implementation, benchmark it and then find the bottlenecks and try out
possible optimizations. In the beginning it is hard enough to come up
with an implementation that fits X86 and ARM at the same time. When this
is worked out we can continue to optimize :)


Joerg

--
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
