Re: Intel IOMMU (and IOMMU for Virtualization) performances

From: FUJITA Tomonori
Date: Fri Jun 06 2008 - 00:45:16 EST


On Thu, 05 Jun 2008 14:01:28 -0500
James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:

> On Thu, 2008-06-05 at 11:34 -0700, Grant Grundler wrote:
> > On Thu, Jun 5, 2008 at 7:49 AM, FUJITA Tomonori
> > <fujita.tomonori@xxxxxxxxxxxxx> wrote:
> > ...
> > >> You can easily emulate SSD drives by doing sequential 4K reads
> > >> from a normal SATA HD. That should result in ~7-8K IOPS since the disk
> > >> will recognize the sequential stream and read ahead. SAS/SCSI/FC will
> > >> probably work the same way with different IOP rates.
> > >
> > > Yeah, probably right. I thought that 10GbE gives the IOMMU more
> > > work than an SSD does and tried to emulate something like that.
> >
> > 10GbE might exercise a different code path. NICs typically use map_single
>
> map_page, actually, but effectively the same thing. However, all
> they're really doing is their own implementation of sg list mapping.

Yeah, they are nearly the same. map_single allocates only one DMA
address, while map_sg allocates a DMA address for each scatterlist
entry.
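
Something like this (just a sketch; the helper names are invented, only
the dma_map_single/dma_map_sg calls are the point):

/*
 * Sketch of the two mapping paths.  dma_map_single() hits the IOMMU
 * allocator once per buffer, dma_map_sg() once per scatterlist entry
 * (possibly fewer, if the IOMMU merges entries).
 */
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/scatterlist.h>

/* NIC-style path: one contiguous buffer, one DMA address. */
static int map_one_buffer(struct device *dev, void *buf, size_t len,
			  dma_addr_t *dma)
{
	*dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	return dma_mapping_error(dev, *dma) ? -ENOMEM : 0;
}

/* Storage-style path: a scatterlist, one DMA address per mapped entry. */
static int map_one_request(struct device *dev, struct scatterlist *sgl,
			   int nents)
{
	/* returns the number of mapped entries, 0 on failure */
	return dma_map_sg(dev, sgl, nents, DMA_FROM_DEVICE);
}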


> > and storage devices typically use map_sg. But they both exercise the same
> > underlying resource management code since it's the same IOMMU they poke at.
> >
> > ...
> > >> Sorry, I didn't see a replacement for the deferred_flush_tables.
> > >> Mark Gross and I agree this substantially helps with unmap performance.
> > >> See http://lkml.org/lkml/2008/3/3/373
> > >
> > > Yeah, I can add a nice trick that parisc sba_iommu uses. I'll try
> > > next time.
> > >
> > > But it probably gives the bitmap method less gain than the RB tree,
> > > since clearing the bitmap takes less time than changing the tree.
> > >
> > > The deferred_flush_tables also batches flushing the TLB. The patch
> > > flushes the TLB only when it reaches the end of the bitmap (a trick
> > > that some IOMMUs, such as SPARC's, use).
> >
> > The batching of the TLB flushes is the key thing. I was being paranoid
> > by not marking the resource free until after the TLB was flushed. If we
> > know the allocation is going to be circular through the bitmap, flushing
> > the TLB once per iteration through the bitmap should be sufficient since
> > we can guarantee the IO Pdir resource won't get re-used until a full
> > cycle through the bitmap has been completed.
>
> Not necessarily ... there's a safety vs performance issue here. As long
> as the iotlb mapping persists, the device can use it to write to the
> memory. If you fail to flush, you lose the ability to detect device dma
> after free (because the iotlb may still be valid). On standard systems,
> this happens so infrequently as to be worth the tradeoff. However, in
> virtualised systems, which is what the intel iommu is aimed at, stale
> iotlb entries can be used by malicious VMs to gain access to memory
> outside of their VM, so the intel people at least need to say whether
> they're willing to accept this speed for safety tradeoff.

Agreed.

The current Intel IOMMU scheme is a bit unbalanced: it invalidates the
translation table entries every time dma_unmap_* is called, yet it
batches the TLB flushes. But that is what most of Linux's IOMMU code
does.

I think that only the PARISC (and, of course, IA64) IOMMUs also batch
the invalidation of the translation table entries.
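
Roughly, the wrap-time flush trick looks like this (a toy sketch only,
not the actual intel-iommu or sba_iommu code; iommu_state,
hw_flush_iotlb() and friends are made-up names):

/*
 * Toy sketch of the "flush when the allocator wraps" trick.  Unmap only
 * clears the bitmap and sets need_flush; the IOTLB is flushed once per
 * pass through the bitmap, right before the allocator wraps around and
 * starts handing out freed ranges again.
 */
#include <linux/bitmap.h>
#include <linux/spinlock.h>

#define IOVA_PAGES	(1UL << 20)	/* IOVA pages covered by the bitmap */

struct iommu_state {
	spinlock_t	lock;
	unsigned long	*bitmap;	/* one bit per IOVA page */
	unsigned long	next;		/* circular allocation hint */
	bool		need_flush;	/* an unmap happened since the last flush */
};

static void hw_flush_iotlb(struct iommu_state *s);	/* hypothetical hw op */

static long iova_alloc(struct iommu_state *s, unsigned long npages)
{
	unsigned long flags, start;

	spin_lock_irqsave(&s->lock, flags);
	start = bitmap_find_next_zero_area(s->bitmap, IOVA_PAGES,
					   s->next, npages, 0);
	if (start >= IOVA_PAGES) {
		/* Wrapping: flush the IOTLB once, then retry from the bottom. */
		if (s->need_flush) {
			hw_flush_iotlb(s);
			s->need_flush = false;
		}
		start = bitmap_find_next_zero_area(s->bitmap, IOVA_PAGES,
						   0, npages, 0);
		if (start >= IOVA_PAGES) {
			spin_unlock_irqrestore(&s->lock, flags);
			return -1;	/* address space exhausted */
		}
	}
	bitmap_set(s->bitmap, start, npages);
	s->next = start + npages;
	spin_unlock_irqrestore(&s->lock, flags);
	return start;
}

static void iova_free(struct iommu_state *s, unsigned long start,
		      unsigned long npages)
{
	unsigned long flags;

	spin_lock_irqsave(&s->lock, flags);
	bitmap_clear(s->bitmap, start, npages);
	s->need_flush = true;	/* defer the IOTLB flush to the next wrap */
	spin_unlock_irqrestore(&s->lock, flags);
}

The downside is exactly what James points out: a range freed by
iova_free() keeps its stale IOTLB translation usable until the next
wrap, so there is a window for device DMA after free.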