Re: [PATCH v2 0/9] Add support for shared PTEs across processes

From: David Hildenbrand
Date: Wed Jul 13 2022 - 10:01:55 EST


On 08.07.22 21:36, Khalid Aziz wrote:
> On 7/8/22 05:47, David Hildenbrand wrote:
>> On 02.07.22 06:24, Andrew Morton wrote:
>>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@xxxxxxxxxx> wrote:
>>>
>>>> This patch series implements a mechanism in kernel to allow
>>>> userspace processes to opt into sharing PTEs. It adds a new
>>>> in-memory filesystem - msharefs.
>>>
>>> Dumb question: why do we need a new filesystem for this? Is it not
>>> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?
>>>
>>
>> IIRC, the general opinion at LSF/MM was that this approach at hand is
>> makes people nervous and I at least am not convinced that we really want
>> to have this upstream.
>
> Hi David,

Hi Khalid,

>
> You are right that sharing page tables across processes feels scary, but at the same time threads already share PTEs and
> this just extends that concept to processes.

They share a *mm* including a consistent virtual memory layout (VMA
list). Page table sharing is just a side product of that. You could even
call page tables just an implementation detail to produce that
consistent virtual memory layout -- described for that MM via a
different data structure.

> A number of people have commented on potential usefulness of this concept
> and implementation.

... and a lot of people raised concerns. Yes, page table sharing to
reduce memory consumption/tlb misses/... is something reasonable to
have. But that doesn't require mshare, as hugetlb has proven.

The design might be useful for a handful of corner (!) cases, but as the
cover letter only talks about memory consumption of page tables, I'll
not care about those. Once these corner cases are explained and deemed
important, we might want to think of possible alternatives to explore
the solution space.

> There were concerns raised about being able to make this safe and reliable.
> I had agreed to send a
> second version of the patch incorporating feedback from last review and LSF/MM, and that is what v2 patch is about. The

Okay, most of the changes I saw are related to the user interface, not
to any of the actual dirty implementation-detail concerns. And the cover
letter is not really clear what's actually happening under the hood and
what the (IMHO) weird semantics of the design imply (as can be seen from
Andrews reply).

> suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion
> on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is
> better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that
> wrong.

Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff,
and asked the honest question if we can just remove it and replace it by
something generic in the future. And as I learned, we most probably
cannot rip that out without affecting existing user space. Even
replacing it by mshare() would degrade existing user space.

So the natural thing to reduce page table consumption (again, what this
cover letter talks about) for user space (semi- ?)automatically for
MAP_SHARED files is to factor out what hugetlb has, and teach generic MM
code to cache and reuse page tables (PTE and PMD tables should be
sufficient) where suitable.

For reasonably aligned mappings and mapping sizes, it shouldn't be too
hard (I know, locking ...), to cache and reuse page tables attached to
files -- similar to what hugetlb does, just in a generic way. We might
want a mechanism to enable/disable this for specific processes and/or
VMAs, but these are minor details.

And that could come for free for existing user space, because page
tables, and how they are handled, would just be an implementation detail.


I'd be really interested into what the major roadblocks/downsides
file-based page table sharing has. Because I am not convinced that a
mechanism like mshare() -- that has to be explicitly implemented+used by
user space -- is required for that.

--
Thanks,

David / dhildenb