RE: [PATCH v3 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush

From: Tian, Kevin
Date: Fri Aug 15 2025 - 05:49:52 EST


> From: Baolu Lu <baolu.lu@xxxxxxxxxxxxxxx>
> Sent: Friday, August 15, 2025 5:17 PM
>
> On 8/8/2025 10:57 AM, Tian, Kevin wrote:
> >> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> >> Sent: Friday, August 8, 2025 3:52 AM
> >>
> >> On Thu, Aug 07, 2025 at 10:40:39PM +0800, Baolu Lu wrote:
> >>> +static void kernel_pte_work_func(struct work_struct *work)
> >>> +{
> >>> +	struct page *page, *next;
> >>> +
> >>> +	iommu_sva_invalidate_kva_range(0, TLB_FLUSH_ALL);
> >>> +
> >>> +	guard(spinlock)(&kernel_pte_work.lock);
> >>> +	list_for_each_entry_safe(page, next, &kernel_pte_work.list, lru) {
> >>> +		list_del_init(&page->lru);
> >>
> >> Please don't add new usages of lru, we are trying to get rid of this. :(
> >>
> >> I think the memory should be struct ptdesc, use that..
> >>
> >
> > btw with this change we should also defer free of the pmd page:
> >
> > pud_free_pmd_page()
> > ...
> > 	for (i = 0; i < PTRS_PER_PMD; i++) {
> > 		if (!pmd_none(pmd_sv[i])) {
> > 			pte = (pte_t *)pmd_page_vaddr(pmd_sv[i]);
> > 			pte_free_kernel(&init_mm, pte);
> > 		}
> > 	}
> >
> > 	free_page((unsigned long)pmd_sv);
> >
> > Otherwise the risk still exists if the pmd page is repurposed before the
> > pte work is scheduled.
>
> You're right that freeing high-level page table pages also requires an
> IOTLB flush before the pages are freed. But I question the practical
> risk of the race given the extremely small time window. If this is a

It's already extremely difficult to conduct a real attack even w/o this
fix. I'm not sure what criteria determine how small a window we consider
acceptable in this specific case, but leaving an incomplete fix in the
code doesn't sound clean...

> real concern, a potential mitigation would be to clear the U/S bits in
> all page table entries for kernel address space? But I am not confident
> in making that change at this time as I am unsure of the side effects it
> might cause.

I think there was already consensus that clearing U/S bits in all entries
doesn't prevent the IOMMU caching them and setting A/D bits on
the freed pagetable.

>
> >
> > another observation - pte_free_kernel is not used in remove_pagetable ()
> > and __change_page_attr(). Is it straightforward to put it in those paths
> > or do we need duplicate some deferring logic there?
>
> The remove_pagetable() function is called in the path where memory is
> hot-removed from the system, right? If so, there should be no issue, as
> the threat model here is a page table page being freed and repurposed
> while it's still cached in the IOTLB. In the hot-remove case, the memory
> is removed and will not be reused, so that's fine as far as I can see.

what if the page is hot-added back while a stale entry pointing to it is
still valid in the IOMMU, theoretically? 😊

>
> The same to __change_page_attr(), which only changes the attributes of a
> page table entry while the underlying page remains in use.
>

it may lead to cpa_collapse_large_pages() if changing the attribute
leaves all adjacent 4k pages in a 2M range with the same attribute. Then
the page table might be freed:

cpa_collapse_large_pages():
	list_for_each_entry_safe(ptdesc, tmp, &pgtables, pt_list) {
		list_del(&ptdesc->pt_list);
		__free_page(ptdesc_page(ptdesc));
	}