Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs

From: Robin Murphy
Date: Tue May 31 2022 - 09:55:34 EST


On 2022-05-31 04:02, Baolu Lu wrote:
On 2022/5/30 20:14, Jason Gunthorpe wrote:
On Sun, May 29, 2022 at 01:14:46PM +0800, Baolu Lu wrote:

 From 1e87b5df40c6ce9414cdd03988c3b52bfb17af5f Mon Sep 17 00:00:00 2001
From: Lu Baolu <baolu.lu@xxxxxxxxxxxxxxx>
Date: Sun, 29 May 2022 10:18:56 +0800
Subject: [PATCH 1/1] iommu/vt-d: debugfs: Remove device_domain_lock usage

The domain_translation_struct debugfs node is used to dump static
mappings of PCI devices. It potentially races with setting new
domains to devices and the iommu_map/unmap() interfaces. The existing
code tries to use the global spinlock device_domain_lock to avoid the
races, but this is problematic as this lock is only used to protect
the device tracking lists of the domains.

Instead of using an unsuitable lock to paper over the problem, it's
better to explicitly restrict the use of this debugfs node. This also
makes device_domain_lock static.

What does "explicitly restrict" mean?

I originally thought about simply documenting the restrictions on this
interface. But obviously, that is a naive idea. :-) I went over the
code again. The races exist in two paths:

1. Dumping a page table while a new page table is being attached to
   the device.
2. A high-level page table entry has been marked as non-present, but
   the dumping code has already walked down into the lower-level
   tables beneath it.

For case 1, we can try to solve it by dumping tables while holding the
group->mutex.

For case 2, it is a bit trickier. I tried adding an rwsem to make
iommu_unmap() and the debugfs table dump mutually exclusive. This does
not work, because reading the debugfs node may itself depend on DMA
from the very devices whose tables are being dumped, so blocking
iommu_unmap() during the dump could deadlock. It seems that what we
can do is to allow this race, but check the validity of the physical
address retrieved from each page table entry as we walk down in
debugfs. Then, the worst case is printing some useless information.

The real code looks like this:

From 3feb0727f9d7095729ef75ab1967270045b3a38c Mon Sep 17 00:00:00 2001
From: Lu Baolu <baolu.lu@xxxxxxxxxxxxxxx>
Date: Sun, 29 May 2022 10:18:56 +0800
Subject: [PATCH 1/1] iommu/vt-d: debugfs: Remove device_domain_lock usage

The domain_translation_struct debugfs node is used to dump the DMAR page
tables for the PCI devices. It potentially races with setting domains to
devices and the iommu_unmap() interface. The existing code uses a global
spinlock device_domain_lock to avoid the races, but this is problematic
as this lock is only used to protect the device tracking lists of each
domain.

Instead, this replaces device_domain_lock with group->mutex, which
protects the page table traversal from a concurrent domain attachment,
and always checks the physical address retrieved from a page table
entry before walking down to the next-level page table.

As a cleanup, this also makes device_domain_lock static.

Signed-off-by: Lu Baolu <baolu.lu@xxxxxxxxxxxxxxx>
---
 drivers/iommu/intel/debugfs.c | 42 ++++++++++++++++++++++-------------
 drivers/iommu/intel/iommu.c   |  2 +-
 drivers/iommu/intel/iommu.h   |  1 -
 3 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c
index d927ef10641b..e6f4835b8d9f 100644
--- a/drivers/iommu/intel/debugfs.c
+++ b/drivers/iommu/intel/debugfs.c
@@ -333,25 +333,28 @@ static void pgtable_walk_level(struct seq_file *m, struct dma_pte *pde,
             continue;

         path[level] = pde->val;
-        if (dma_pte_superpage(pde) || level == 1)
+        if (dma_pte_superpage(pde) || level == 1) {
             dump_page_info(m, start, path);
-        else
-            pgtable_walk_level(m, phys_to_virt(dma_pte_addr(pde)),
+        } else {
+            unsigned long phys_addr;
+
+            phys_addr = (unsigned long)dma_pte_addr(pde);
+            if (!pfn_valid(__phys_to_pfn(phys_addr)))

Given that pte_present(pde) passed just above, it was almost certainly a valid entry, so it seems unlikely that the physical address it pointed to could have disappeared in the meantime. If you're worried about the potential case where we've been preempted during this walk for long enough that the page has already been freed by an unmap, reallocated, and filled with someone else's data that happens to look like valid PTEs, this still isn't enough, since that data could just as well happen to look like valid physical addresses too.

I imagine that if you want to safely walk pagetables concurrently with them potentially being freed, you'd probably need to get RCU involved.
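For illustration, a rough sketch of that direction (hypothetical glue,
untested; "tbl" stands in for a table page being freed by the unmap
path):

    /* Unmap side: defer the actual free past an RCU grace period, so
     * a concurrent reader that saw the old PDE can still safely
     * dereference the page it pointed to. */
    static void free_pgtable_page_rcu(struct rcu_head *head)
    {
        __free_page(container_of(head, struct page, rcu_head));
    }

    /* ...called instead of freeing the table page directly after
     * clearing the PDE: */
    call_rcu(&virt_to_page(tbl)->rcu_head, free_pgtable_page_rcu);

    /* Dump side: the whole walk stays inside one read-side critical
     * section. */
    rcu_read_lock();
    pgtable_walk_level(m, domain->pgd, domain->agaw + 2, 0, path);
    rcu_read_unlock();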

+                break;
+            pgtable_walk_level(m, phys_to_virt(phys_addr),

Also, obligatory reminder that pfn_valid() only means that pfn_to_page() gets you a valid struct page. Whether that page is direct-mapped kernel memory or not is a different matter.
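Purely illustrative, if a somewhat stronger heuristic were wanted here
(page_is_ram() exists for this kind of question, though even it doesn't
guarantee the page is reachable through the linear map on every
configuration):

            phys_addr = (unsigned long)dma_pte_addr(pde);
            if (!pfn_valid(__phys_to_pfn(phys_addr)) ||
                !page_is_ram(__phys_to_pfn(phys_addr)))
                break;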

                        level - 1, start, path);
+        }
         path[level] = 0;
     }
 }

-static int show_device_domain_translation(struct device *dev, void *data)
+static int __show_device_domain_translation(struct device *dev, void *data)
 {
     struct device_domain_info *info = dev_iommu_priv_get(dev);
     struct dmar_domain *domain = info->domain;
     struct seq_file *m = data;
     u64 path[6] = { 0 };

-    if (!domain)
-        return 0;
-
     seq_printf(m, "Device %s @0x%llx\n", dev_name(dev),
            (u64)virt_to_phys(domain->pgd));
     seq_puts(m, "IOVA_PFN\t\tPML5E\t\t\tPML4E\t\t\tPDPE\t\t\tPDE\t\t\tPTE\n");
@@ -359,20 +362,27 @@ static int show_device_domain_translation(struct device *dev, void *data)
     pgtable_walk_level(m, domain->pgd, domain->agaw + 2, 0, path);
     seq_putc(m, '\n');

-    return 0;
+    return 1;
 }

-static int domain_translation_struct_show(struct seq_file *m, void *unused)
+static int show_device_domain_translation(struct device *dev, void *data)
 {
-    unsigned long flags;
-    int ret;
+    struct iommu_group *group;

-    spin_lock_irqsave(&device_domain_lock, flags);
-    ret = bus_for_each_dev(&pci_bus_type, NULL, m,
-                   show_device_domain_translation);
-    spin_unlock_irqrestore(&device_domain_lock, flags);
+    group = iommu_group_get(dev);
+    if (group) {
+        iommu_group_for_each_dev(group, data,
+                     __show_device_domain_translation);

Why group_for_each_dev? If there *are* multiple devices in the group then by definition they should be attached to the same domain, so dumping that domain's mappings more than once seems pointless. Especially given that the outer bus_for_each_dev iteration will already visit each individual device anyway, so this would only make the redundancy even worse than it already is.
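E.g., roughly like this (a hypothetical sketch only; it also leaves the
domain-switch race to be solved separately, since nothing here takes
group->mutex):

    static int show_device_domain_translation(struct device *dev, void *data)
    {
        struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
        struct dmar_domain *dmar_domain;
        struct seq_file *m = data;
        u64 path[6] = { 0 };

        if (!domain)
            return 0;

        /* The outer bus_for_each_dev() already visits every device,
         * so dump the attached domain once per device and skip the
         * inner group iteration entirely. */
        dmar_domain = to_dmar_domain(domain);
        seq_printf(m, "Device %s @0x%llx\n", dev_name(dev),
               (u64)virt_to_phys(dmar_domain->pgd));
        seq_puts(m, "IOVA_PFN\t\tPML5E\t\t\tPML4E\t\t\tPDPE\t\t\tPDE\t\t\tPTE\n");
        pgtable_walk_level(m, dmar_domain->pgd, dmar_domain->agaw + 2,
                   0, path);
        seq_putc(m, '\n');

        return 0;
    }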

Robin.

+        iommu_group_put(group);
+    }

-    return ret;
+    return 0;
+}
+
+static int domain_translation_struct_show(struct seq_file *m, void *unused)
+{
+    return bus_for_each_dev(&pci_bus_type, NULL, m,
+                show_device_domain_translation);
 }
 DEFINE_SHOW_ATTRIBUTE(domain_translation_struct);

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 1af4b6562266..cacae8bdaa65 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -314,7 +314,7 @@ static int iommu_skip_te_disable;
 #define IDENTMAP_GFX        2
 #define IDENTMAP_AZALIA        4

-DEFINE_SPINLOCK(device_domain_lock);
+static DEFINE_SPINLOCK(device_domain_lock);
 static LIST_HEAD(device_domain_list);

 /*
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index a22adfbdf870..8a6d64d726c0 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -480,7 +480,6 @@ enum {
 #define VTD_FLAG_SVM_CAPABLE        (1 << 2)

 extern int intel_iommu_sm;
-extern spinlock_t device_domain_lock;

 #define sm_supported(iommu)    (intel_iommu_sm && ecap_smts((iommu)->ecap))
 #define pasid_supported(iommu)    (sm_supported(iommu) && \
                     ecap_pasid((iommu)->ecap))

Best regards,
baolu