Re: [PATCH v2 1/1] iommu/sva: Invalidate KVA range on kernel TLB flush

From: Baolu Lu
Date: Wed Jul 16 2025 - 21:45:30 EST


On 7/16/25 20:08, Jason Gunthorpe wrote:
> On Wed, Jul 16, 2025 at 02:34:04PM +0800, Baolu Lu wrote:
>>>> @@ -654,6 +656,9 @@ struct iommu_ops {
>>>>
>>>>  	int (*def_domain_type)(struct device *dev);
>>>>
>>>> +	void (*paging_cache_invalidate)(struct iommu_device *dev,
>>>> +					unsigned long start, unsigned long end);

>>> How would you even implement this in a driver?
>>>
>>> You either flush the whole iommu, in which case who needs a range, or
>>> the driver has to iterate over the PASID list, in which case it
>>> doesn't really improve the situation.

>> The Intel iommu driver supports flushing all SVA PASIDs with a single
>> request in the invalidation queue.

> How? All PASID != 0? The HW has no notion of an SVA PASID vs a non-SVA
> one. This is just flushing almost everything.

The Intel iommu driver allocates a dedicated domain ID for all SVA
domains, so it can flush all cache entries tagged with that domain ID in
a single request.
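
As an illustration only (this is not code from the patch, and the shared
SVA domain ID parameter below is made up), such a flush amounts to one
domain-selective IOTLB invalidation request per IOMMU:

/*
 * Illustrative sketch, assuming a single domain ID shared by all SVA
 * domains (sva_did is a hypothetical parameter). One queued
 * domain-selective invalidation covers every cache entry tagged with
 * that domain ID, so no per-PASID iteration is needed.
 */
static void flush_all_sva_iotlb(struct intel_iommu *iommu, u16 sva_did)
{
        qi_flush_iotlb(iommu, sva_did, 0, 0, DMA_TLB_DSI_FLUSH);
}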


>>> If this is a concern I think the better answer is to do a deferred
>>> free like the mm can sometimes do where we thread the page tables onto
>>> a linked list, flush the CPU cache and push it all into a work which
>>> will do the iommu flush before actually freeing the memory.

>> Is it a workable solution to use schedule_work() to queue the KVA cache
>> invalidation as a work item in the system workqueue? By doing so, we
>> wouldn't need the spinlock to protect the list anymore.

> Maybe.

> MM is also more careful to pull the invalidation out of some of the
> locks; I don't know what the KVA side is like.
How about something like the following? It compiles but is untested.

struct kva_invalidation_work_data {
        struct work_struct work;
        unsigned long start;
        unsigned long end;
        bool free_on_completion;
};

static void invalidate_kva_func(struct work_struct *work)
{
        struct kva_invalidation_work_data *data =
                container_of(work, struct kva_invalidation_work_data, work);
        struct iommu_mm_data *iommu_mm;

        /*
         * Walk every SVA-enabled mm and flush the KVA range from its
         * secondary TLBs (i.e. the IOMMU caches).
         */
        guard(mutex)(&iommu_sva_lock);
        list_for_each_entry(iommu_mm, &iommu_sva_mms, mm_list_elm)
                mmu_notifier_arch_invalidate_secondary_tlbs(iommu_mm->mm,
                                data->start, data->end);

        if (data->free_on_completion)
                kfree(data);
}

void iommu_sva_invalidate_kva_range(unsigned long start, unsigned long end)
{
        struct kva_invalidation_work_data stack_data;

        if (!static_branch_unlikely(&iommu_sva_present))
                return;

        /*
         * Since iommu_sva_mms is an unbound list, iterating it in an
         * atomic context could introduce significant latency issues.
         * Defer the invalidation to a work item in that case.
         */
        if (in_atomic()) {
                struct kva_invalidation_work_data *data =
                        kzalloc(sizeof(*data), GFP_ATOMIC);

                if (!data)
                        return;

                data->start = start;
                data->end = end;
                INIT_WORK(&data->work, invalidate_kva_func);
                /* The work handler frees the data when it is done. */
                data->free_on_completion = true;
                schedule_work(&data->work);
                return;
        }

        /* Process context: invalidate synchronously on the stack. */
        stack_data.start = start;
        stack_data.end = end;
        stack_data.free_on_completion = false;
        invalidate_kva_func(&stack_data.work);
}
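
For context, a call-site sketch (an assumption based on the patch
subject, not the actual hunk from the patch): the arch kernel TLB flush
path would call this helper once the CPU TLBs for the KVA range have
been flushed, roughly:

/*
 * Sketch only: invoke the IOMMU-side invalidation from the kernel
 * TLB flush path. The body of the existing CPU flush is elided.
 */
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
        /* ... existing CPU-side kernel TLB flush ... */

        iommu_sva_invalidate_kva_range(start, end);
}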

Thanks,
baolu