Re: [PATCH RFC] mm: add MAP_EXCLUSIVE to create exclusive user mappings

From: Dave Hansen
Date: Mon Oct 28 2019 - 13:12:47 EST


On 10/27/19 3:17 AM, Mike Rapoport wrote:
> The pages in these mappings are removed from the kernel direct map and
> marked with PG_user_exclusive flag. When the exclusive area is unmapped,
> the pages are mapped back into the direct map.

This looks fun. It's certainly simple.

But, the description is not really calling out the pros and cons very
well. I'm also not sure that folks will use an interface like this that
requires up-front, special code to do an allocation instead of something
like madvise(). That's why protection keys ended up the way they did: if
you do this as a mmap() replacement, you need to modify all *allocators*
to be enabled for this. If you do it with mprotect()-style, you can
apply it to existing allocations.
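
To make the mprotect()-style argument concrete, here is a minimal sketch of
tagging memory after the fact. It is not the proposed interface: the helper
names alloc_then_tag() and untag_and_free() are made up for illustration,
and MADV_DONTDUMP is only a stand-in for whatever advice value a real
"exclusive" or "secret" interface might define. The point is that the
allocator (posix_memalign() here) stays unmodified and the property is
applied to an allocation that already exists.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: get page-aligned memory from the ordinary,
 * unmodified allocator, then tag the range after the fact.
 * MADV_DONTDUMP is an assumption standing in for a real advice value.
 */
static void *alloc_then_tag(size_t npages)
{
	long page = sysconf(_SC_PAGESIZE);
	void *buf = NULL;

	if (page <= 0 || npages == 0)
		return NULL;
	if (posix_memalign(&buf, (size_t)page, npages * (size_t)page) != 0)
		return NULL;
	/* The "mprotect()-style" step: applied to an existing allocation. */
	if (madvise(buf, npages * (size_t)page, MADV_DONTDUMP) != 0) {
		free(buf);
		return NULL;
	}
	return buf;
}

/* Undo the tag before handing the pages back to the allocator. */
static void untag_and_free(void *buf, size_t npages)
{
	long page = sysconf(_SC_PAGESIZE);

	if (buf == NULL || page <= 0)
		return;
	madvise(buf, npages * (size_t)page, MADV_DODUMP);
	free(buf);
}
```

A caller would do something like p = alloc_then_tag(4), use the buffer, and
then call untag_and_free(p, 4). With a mmap()-flag interface, the same
effect would need every allocator in the process to learn the new flag.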

Some other random thoughts:

* The page flag is probably not a good idea. It would probably be
better to set _PAGE_SPECIAL on the PTE and force get_user_pages()
into the slow path.
* This really stops being "normal" memory. You can't do futexes on it,
can't splice it. Probably need a more fleshed-out list of
incompatible features.
* As Kirill noted, each 4k page ends up with a potential 1GB "blast
radius" of demoted pages in the direct map. Not cool. This is
probably a non-starter as it stands.
* The global TLB flushes are going to eat you alive. They probably
border on a DoS on larger systems.
* Do we really want this user interface to dictate the kernel
implementation? In other words, do we really want MAP_EXCLUSIVE,
or do we want MAP_SECRET? One tells the kernel what to *do*, the
other tells the kernel what the memory *IS*.
* There's a lot of other stuff going on in this area: XPFO, SEV, MKTME,
Persistent Memory, where the kernel direct map is a liability in some
way. We probably need some kind of overall, architected solution
rather than five or ten things all poking at the direct map.
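
To put rough numbers on the "blast radius" point above, here is a
back-of-the-envelope sketch. It assumes x86-64 paging (512 entries per
page-table level), a direct map region that starts out covered by a single
1G mapping, and a kernel that splits only as far as needed to punch out one
4k page. The struct and function names are made up for illustration.

```c
#include <stdio.h>

#define PTRS_PER_TABLE	512UL			/* x86-64: entries per level */
#define SZ_4K		(4UL << 10)
#define SZ_2M		(SZ_4K * PTRS_PER_TABLE)
#define SZ_1G		(SZ_2M * PTRS_PER_TABLE)

/* Hypothetical summary of what replaces one 1G direct map entry. */
struct split_cost {
	unsigned long entries_2m;	/* 2M mappings left after the split */
	unsigned long entries_4k;	/* 4k mappings left after the split */
	unsigned long demoted_bytes;	/* still mapped, but no longer by 1G */
};

static struct split_cost split_1g_for_one_4k(void)
{
	struct split_cost c;

	/* 1G splits into 512 x 2M; one of those must be split again. */
	c.entries_2m = PTRS_PER_TABLE - 1;
	/* That 2M splits into 512 x 4k; one of those is removed. */
	c.entries_4k = PTRS_PER_TABLE - 1;
	/* Everything except the removed page loses its 1G mapping. */
	c.demoted_bytes = SZ_1G - SZ_4K;
	return c;
}
```

Under those assumptions, one exclusive 4k page turns a single 1G entry into
511 2M entries plus 511 4k entries, and just under 1G of unrelated memory
is now mapped with smaller pages.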