Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

From: James Houghton
Date: Thu Jan 19 2023 - 17:54:09 EST


On Thu, Jan 19, 2023 at 2:23 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Thu, Jan 19, 2023 at 02:00:32PM -0800, Mike Kravetz wrote:
> > I do not know much about the (primary) live migration use case. My
> > guess is that page table lock contention may be an issue? In this use
> > case, HGM is only enabled for the duration of the live migration
> > operation, then a MADV_COLLAPSE is performed. If contention is likely
> > to be an issue during this time, then yes, we would need to pass
> > around something like hugetlb_pte.
>
> I'm not aware of any such contention issue. IMHO the migration problem is
> mainly that transferring such a large page is too slow. Shrinking the
> page size should already resolve the major problem here, IIUC.

Lock contention will be problematic once you scale VMs up to be quite
large. Google upstreamed the "TDP MMU" for KVM/x86, which removed the
need to take the MMU lock for writing in the EPT violation path. We
found that this change is required for VMs with more than 200 or so
vCPUs to consistently avoid CPU soft lockups in the guest.

Requiring each UFFDIO_CONTINUE (in the post-copy path) to serialize on
the same PTL would be problematic in the same way.

>
> AFAIU a 4K-only solution should only reduce lock contention, because
> locks will always be pte-level if VM_HUGETLB_HGM is set. When walking
> and creating the intermediate pgtable entries we can use atomic ops just
> like generic mm, so no lock is needed at all. With uncertainty about the
> size of mappings, we'll need to take any of the multiple layers of locks.
>

Other than taking the HugeTLB VMA lock for reading, walking/allocating
page tables won't need any additional locking.

We take the PTL to allocate the next level down, but so does generic
mm (look at __pud_alloc, __pmd_alloc for example). Maybe I am
misunderstanding.
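
To make the pattern concrete, here is a rough userspace sketch of the
"take the lock only to install the next level down" idea. It is a toy
analogue, not kernel code: struct toy_table, struct toy_upper,
toy_alloc_next_level() and toy_walk_alloc() are made-up names, a pthread
mutex stands in for the PTL, and C11 atomics stand in for the kernel's
lockless entry reads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define TOY_ENTRIES 512

/* Hypothetical stand-in for a lower-level page table. */
struct toy_table {
	unsigned long entry[TOY_ENTRIES];
};

/* Hypothetical stand-in for the upper level and its lock (the "PTL"). */
struct toy_upper {
	pthread_mutex_t lock;
	_Atomic(struct toy_table *) slot[TOY_ENTRIES];
};

/*
 * Install a lower-level table in up->slot[idx] unless someone else
 * already has. Returns 0 on success, -1 if the allocation failed.
 */
static int toy_alloc_next_level(struct toy_upper *up, unsigned int idx)
{
	/* Allocate before locking so the critical section stays short. */
	struct toy_table *new = calloc(1, sizeof(*new));

	if (!new)
		return -1;

	pthread_mutex_lock(&up->lock);
	if (!atomic_load_explicit(&up->slot[idx], memory_order_relaxed)) {
		/* We won the race: publish the fully zeroed table. */
		atomic_store_explicit(&up->slot[idx], new,
				      memory_order_release);
	} else {
		/* Another thread populated the slot first. */
		free(new);
	}
	pthread_mutex_unlock(&up->lock);
	return 0;
}

/* Walk one level down; the lock is only taken if the entry is missing. */
static struct toy_table *toy_walk_alloc(struct toy_upper *up,
					unsigned int idx)
{
	struct toy_table *t;

	t = atomic_load_explicit(&up->slot[idx], memory_order_acquire);
	if (t)
		return t;	/* common case: no lock taken at all */
	if (toy_alloc_next_level(up, idx))
		return NULL;
	return atomic_load_explicit(&up->slot[idx], memory_order_acquire);
}
```

The point of the sketch is that the lock is only held for the brief
"check and install" step, and only the first time a given entry is
populated; later walks through the same entry never touch it.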

- James