Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.

From: Thomas Hellstrom
Date: Mon Jan 10 2011 - 15:50:17 EST


On 01/10/2011 05:45 PM, Konrad Rzeszutek Wilk wrote:
.. snip ..
2) What about accounting? In a *non-Xen* environment, will the
number of coherent pages be less than the number of DMA32 pages, or
will dma_alloc_coherent just translate into an alloc_page(GFP_DMA32)?
The code in the IOMMUs ends up calling __get_free_pages, which ends up
in alloc_pages. So the call does end up in alloc_page(flags).


native SWIOTLB (so no IOMMU): GFP_DMA32
GART (AMD's old IOMMU): GFP_DMA32

For the hardware IOMMUs:

AMD VI: if it is in Passthrough mode, it calls it with GFP_DMA32.
If it is in DMA translation mode (normal mode) it allocates a page
with GFP_ZERO | ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32) and immediately
translates the bus address.

The flags change a bit:
VT-d: if there is no identity mapping and the PCI device is not one of the
special ones (GFX, Azalia), then it will pass it with GFP_DMA32.
If it is in identity mapping state, and the device is a GFX or Azalia sound
card, then it will ~(__GFP_DMA | GFP_DMA32) and immediately translate
the bus address.

However, the interesting thing is that I've passed in the 'NULL' as
the struct device (not intentionally - did not want to add more changes
to the API) so all of the IOMMUs end up doing GFP_DMA32.
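
As a rough illustration of the flag handling listed above, here is a
userspace sketch. It models only the SWIOTLB, GART and AMD-Vi cases and
leaves VT-d out. The MODEL_GFP_* values, the dma_backend enum and
effective_flags() are made-up names for this example; none of this is
kernel code, and the bit values are not the kernel's real __GFP_*
encodings.

```c
#include <assert.h>

/* Illustrative bit values only; NOT the kernel's real __GFP_* encodings. */
#define MODEL_GFP_DMA     0x01u
#define MODEL_GFP_HIGHMEM 0x02u
#define MODEL_GFP_DMA32   0x04u
#define MODEL_GFP_ZERO    0x08u

/* Hypothetical names for the cases listed in the mail (VT-d left out). */
enum dma_backend {
    BACKEND_SWIOTLB,            /* native SWIOTLB, no IOMMU */
    BACKEND_GART,               /* AMD's old IOMMU */
    BACKEND_AMD_VI_PASSTHROUGH, /* AMD-Vi in passthrough mode */
    BACKEND_AMD_VI_TRANSLATED,  /* AMD-Vi in DMA translation mode */
};

/* Return the flags the page allocator would finally see in this model. */
static unsigned int effective_flags(enum dma_backend b, unsigned int flags)
{
    switch (b) {
    case BACKEND_SWIOTLB:
    case BACKEND_GART:
    case BACKEND_AMD_VI_PASSTHROUGH:
        /* No remapping: the page itself must sit below 4GB. */
        return flags | MODEL_GFP_DMA32;
    case BACKEND_AMD_VI_TRANSLATED:
        /* Zone restrictions are dropped; the IOMMU translates the bus address. */
        flags &= ~(MODEL_GFP_DMA | MODEL_GFP_HIGHMEM | MODEL_GFP_DMA32);
        return flags | MODEL_GFP_ZERO;
    }
    assert(!"unknown backend");
    return flags;
}
```

In this model, effective_flags(BACKEND_AMD_VI_TRANSLATED, MODEL_GFP_DMA32)
comes back without the DMA32 bit, which is the case that matters for the
accounting question.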

But it does mess up the accounting with the AMD-VI and VT-D as they strip
the __GFP_DMA32 flag off. That is a big problem, I presume?
Actually, I don't think it's a big problem. TTM allows a small
discrepancy between allocated pages and accounted pages to be able
to account on actual allocation result. IIRC, this means that a
DMA32 page will always be accounted as such, or at least we can make
it behave that way. As long as the device can always handle the
page, we should be fine.
Excellent.
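
To make "account on actual allocation result" concrete, here is a small
userspace model. It is a simplified sketch, not TTM's code: struct acct,
struct acct_zone and account_page() are hypothetical names, and the model
assumes 4 KiB pages and just two zones. A page is charged according to the
frame it actually landed in, whatever flags were requested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* First 4 KiB page frame at or above 4 GiB (assumes 4 KiB pages). */
#define PFN_4GB 0x100000ULL

struct acct_zone {
    uint64_t used_pages;
    uint64_t max_pages;
};

/* Hypothetical two-zone model: every page counts against "kernel",
 * pages below 4 GiB also count against "dma32". */
struct acct {
    struct acct_zone kernel;
    struct acct_zone dma32;
};

/* Charge one page after it has been allocated, based on its real PFN.
 * Returns false (and charges nothing) if a limit would be exceeded. */
static bool account_page(struct acct *a, uint64_t pfn)
{
    bool below_4g = pfn < PFN_4GB;

    assert(a != NULL);
    if (a->kernel.used_pages >= a->kernel.max_pages)
        return false;
    if (below_4g && a->dma32.used_pages >= a->dma32.max_pages)
        return false;

    a->kernel.used_pages++;
    if (below_4g)
        a->dma32.used_pages++;
    return true;
}
```

With this scheme it does not matter that an IOMMU backend stripped the DMA32
flag: a page that ends up below 4 GiB is still counted as a DMA32 page.
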
3) Same as above, but in a Xen environment, what will stop multiple
guests from exhausting the coherent pages? It seems that the TTM
accounting mechanisms will no longer be valid unless the number of
available coherent pages is split across the guests?
Say I pass in four ATI Radeon cards (wherein each is a 32-bit card) to
four guests. Let's also assume that we are doing heavy operations in all
of the guests. Since there is no communication between the TTM
accounting in each guest, you could end up eating all of the 4GB physical
memory that is available to each guest. It could end up that the first
guest gets the lion's share of the 4GB memory, while the other ones get
less.

And if one was to do that on baremetal, with four ATI Radeon cards, the
TTM accounting mechanism would realize it is nearing the watermark
and do.. something, right? What would it do actually?

I think the error path would be the same in both cases?
Not really. The really dangerous situation is if TTM is allowed to
exhaust all GFP_KERNEL memory. Then any application or kernel task
Ok, since GFP_KERNEL does not contain the GFP_DMA32 flag then
this should be OK?

No. Unless I'm missing something, on a machine with 4GB or less, GFP_DMA32 and GFP_KERNEL are allocated from the same pool of pages?


What *might* be possible, however, is that the GFP_KERNEL memory on
the host gets exhausted due to extensive TTM allocations in the
guest, but I guess that's a problem for XEN to resolve, not TTM.
Hmm. I think I am missing something here. The GFP_KERNEL is any memory
and the GFP_DMA32 is memory from the ZONE_DMA32. When we do start
using the PCI-API, what happens underneath (so under Linux) is that
"real PFNs" (Machine Frame Numbers) which are above the 0x100000 mark
get swizzled in for the guest's PFNs (this is for the PCI devices
that have the dma_mask set to 32bit). However, that is a Xen MMU
accounting issue.


So I was under the impression that when you allocate coherent memory in the guest, the physical page comes from DMA32 memory in the host. On a 4GB machine or less, that would be the same as kernel memory. Now, if 4 guests think they can allocate 2GB of coherent memory each, you might run out of kernel memory on the host?
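
The arithmetic behind that worry, as a short sketch. The function names are
made up for this example, and it only does the sums; it says nothing about
how Xen or TTM actually divide memory.

```c
#include <assert.h>
#include <stdint.h>

#define GIB (1ULL << 30)

/* Bytes by which the guests' combined limits exceed the host pool
 * (0 if they fit). */
static uint64_t oversubscription(uint64_t host_pool, unsigned int guests,
                                 uint64_t per_guest_limit)
{
    uint64_t total = (uint64_t)guests * per_guest_limit;

    return total > host_pool ? total - host_pool : 0;
}

/* An even split of the host pool, one possible per-guest limit. */
static uint64_t even_share(uint64_t host_pool, unsigned int guests)
{
    assert(guests > 0);
    return host_pool / guests;
}
```

For the case in question, oversubscription(4 * GIB, 4, 2 * GIB) is 4 GiB:
the guests together believe they may use twice what the host has. An even
split would give each guest 1 GiB.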


Another thing that I was thinking of is what happens if you have a huge gart and allocate a lot of coherent memory. Could that potentially exhaust IOMMU resources?

/Thomas

*) I think gem's flink still is vulnerable to this, though, so it
Is there a good test-case for this?


Not put in code. What you can do (for example in an OpenGL app) is to write some code that tries to flink with a guessed bo name until it succeeds. Then repeatedly, from within the app, try to flink the same name until something crashes. I don't think the Linux OOM killer can handle that situation. Should be fairly easy to put together.

/Thomas
