Re: [RFC PATCH 2/4] iommu/vt-d: Add first level page table interfaces

From: Lu Baolu
Date: Sat Sep 28 2019 - 04:25:43 EST


Hi Peter,

On 9/27/19 1:34 PM, Peter Xu wrote:
Hi, Baolu,

On Fri, Sep 27, 2019 at 10:27:24AM +0800, Lu Baolu wrote:
+ spin_lock(&(domain)->page_table_lock); \

Is this intended to lock here instead of taking the lock during the
whole page table walk? Is it safe?

Taking the example where nm==PTE: when we reach here how do we
guarantee that the PMD page that has this PTE is still valid?

We will always keep the non-leaf pages in the table,

I see. Though, could I ask why? It seems to me that the existing 2nd
level page table does not keep these when unmap, and it's not even use
locking at all by leveraging cmpxchg()?

I still need some time to understand how cmpxchg() solves the race issue
when reclaims pages. For example.

Thread A Thread B
-A1: check all PTE's empty -B1: up-level PDE valid
-A2: clear the up-level PDE
-A3: reclaim the page -B2: populate the PTEs

Both (A1,A2) and (B1,B2) should be atomic. Otherwise, race could happen.

I'm not sure of this, but IMHO it is similarly because we need to
allocate the iova ranges from iova allocator first, so thread A (who's
going to unmap pages) and thread B (who's going to map new pages)
should never have collapsed regions if happening concurrently. I'm

Although they don't collapse, they might share a same pmd entry. If A
cleared the pmd entry and B goes ahead with populating the pte's. It
will crash.

referring to intel_unmap() in which we won't free the iova region
before domain_unmap() completes (which should cover the whole process
of A1-A3) so the same iova range to be unmapped won't be allocated to
any new pages in some other thread.

There's also a hint in domain_unmap():

/* we don't need lock here; nobody else touches the iova range */


Actually, the iova allocator always packs IOVA ranges close to the top
of the address space. This results in requiring a minimal number of
pages to map the allocated IOVA ranges, which makes memory onsumption
by IOMMU page tables tolerable. Hence, we don't need to reclaim the
pages until the whole page table is about to tear down. The real data
on my test machine also improves this.

Do you mean you have run the code with a 1st-level-supported IOMMU
hardware? IMHO any data point would be good to be in the cover letter
as reference.

Yes. Sure! Let me do this since the next version.


[...]

+static struct page *
+mmunmap_pte_range(struct dmar_domain *domain, pmd_t *pmd,
+ unsigned long addr, unsigned long end,
+ struct page *freelist, bool reclaim)
+{
+ int i;
+ unsigned long start;
+ pte_t *pte, *first_pte;
+
+ start = addr;
+ pte = pte_offset_kernel(pmd, addr);
+ first_pte = pte;
+ do {
+ set_pte(pte, __pte(0));
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+
+ domain_flush_cache(domain, first_pte, (void *)pte - (void *)first_pte);
+
+ /* Add page to free list if all entries are empty. */
+ if (reclaim) {

Shouldn't we know whether to reclaim if with (addr, end) specified as
long as they cover the whole range of this PMD?

Current policy is that we don't reclaim any pages until the whole page
table will be torn down.

Ah OK. But I saw that you're passing in relaim==!start_addr.
Shouldn't that errornously trigger if one wants to unmap the 1st page
as well even if not the whole address space?

IOVA 0 is assumed to be reserved by the allocator. Otherwise, we have no
means to check whether a IOVA is valid.

Is this an assumption of the allocator? Could that change in the future?

Yes. And I think it should keep unless no consumer depends on this
optimization.


IMHO that's not necessary if so, after all it's as simple as replacing
(!start_addr) with (start == 0 && end == END). I see that in
domain_unmap() it has a similar check when freeing pgd:

if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw))


Yours looks better. Thank you!

Thanks,


Best regards,
Baolu