Re: [PATCH v1 00/14] Add support for shared PTEs across processes

From: Khalid Aziz
Date: Mon Apr 11 2022 - 18:22:52 EST


On 4/11/22 14:10, Eric W. Biederman wrote:
Khalid Aziz <khalid.aziz@xxxxxxxxxx> writes:

Page tables in kernel consume some of the memory and as long as number
of mappings being maintained is small enough, this space consumed by
page tables is not objectionable. When very few memory pages are
shared between processes, the number of page table entries (PTEs) to
maintain is mostly constrained by the number of pages of memory on the
system. As the number of shared pages and the number of times pages
are shared goes up, amount of memory consumed by page tables starts to
become significant.

Some of the field deployments commonly see memory pages shared across
1000s of processes. On x86_64, each page requires a PTE that is only 8
bytes long which is very small compared to the 4K page size. When 2000
processes map the same page in their address space, each one of them
requires 8 bytes for its PTE and together that adds up to 8K of memory
just to hold the PTEs for one 4K page. On a database server with 300GB
SGA, a system carsh was seen with out-of-memory condition when 1500+
clients tried to share this SGA even though the system had 512GB of
memory. On this server, in the worst case scenario of all 1500
processes mapping every page from SGA would have required 878GB+ for
just the PTEs. If these PTEs could be shared, amount of memory saved
is very significant.

This patch series implements a mechanism in kernel to allow userspace
processes to opt into sharing PTEs. It adds two new system calls - (1)
mshare(), which can be used by a process to create a region (we will
call it mshare'd region) which can be used by other processes to map
same pages using shared PTEs, (2) mshare_unlink() which is used to
detach from the mshare'd region. Once an mshare'd region is created,
other process(es), assuming they have the right permissions, can make
the mashare() system call to map the shared pages into their address
space using the shared PTEs. When a process is done using this
mshare'd region, it makes a mshare_unlink() system call to end its
access. When the last process accessing mshare'd region calls
mshare_unlink(), the mshare'd region is torn down and memory used by
it is freed.



API
===

The mshare API consists of two system calls - mshare() and mshare_unlink()

--
int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)

mshare() creates and opens a new, or opens an existing mshare'd
region that will be shared at PTE level. "name" refers to shared object
name that exists under /sys/fs/mshare. "addr" is the starting address
of this shared memory area and length is the size of this area.
oflags can be one of:

- O_RDONLY opens shared memory area for read only access by everyone
- O_RDWR opens shared memory area for read and write access
- O_CREAT creates the named shared memory area if it does not exist
- O_EXCL If O_CREAT was also specified, and a shared memory area
exists with that name, return an error.

mode represents the creation mode for the shared object under
/sys/fs/mshare.

mshare() returns an error code if it fails, otherwise it returns 0.


Please don't add system calls that take names.
Please just open objects on the filesystem and allow multi-instances of
the filesystem. Otherwise someone is going to have to come along later
and implement namespace magic to deal with your new system calls. You
already have a filesystem all that is needed to avoid having to
introduce namespace magic is to simply allow multiple instances of the
filesystem to be mounted.

Hi Eric,

Thanks for taking the time to provide feedback.

That sounds reasonable. Is something like this more in line with what you are thinking:

fd = open("/sys/fs/mshare/testregion", O_CREAT|O_EXCL, 0400);
mshare(fd, start, end, O_RDWR);

This sequence creates the mshare'd region and assigns it an address range (which may or may not be empty). Then a client process would do something like:

fd = open("/sys/fs/mshare/testregion", O_RDONLY);
mshare(fd, start, end, O_RDWR);

which maps the mshare'd region into client process's address space.


On that note. Since you have a filesystem, introduce a well known
name for a directory and in that directory place all of the information
and possibly control files for your filesystem. No need to for proc
files and the like, and if at somepoint you have mount options that
allow the information to be changed you can have different mounts
with different values present.

So have the kernel mount msharefs at /sys/fs/mshare and create a file /sys/fs/mshare/info. A read from /sys/fs/mshare/info returns what /proc/sys/vm//mshare_size in patch 12 returns. Did I understand that correctly?



This is must me. But I find it weird that you don't use mmap
to place the shared area from the mshare fd into your address space.

I think I would do:
// Establish the mshare region
addr = mmap(NULL, PGDIR_SIZE, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_MSHARE, msharefd, 0);

// Remove the mshare region
addr2 = mmap(addr, PGDIR_SIZE, PROT_NONE,
MAP_FIXED | MAP_MUNSHARE, msharefd, 0);

I could see a point of using separate system calls instead of
adding MAP_SHARE and MAP_UNSHARE flags.

In that case, we can just do away with syscalls completely. For instance, to create mshare'd region, one would:

fd = open("/sys/fs/mshare/testregion", O_CREAT|O_EXCL, 0400);
addr = mmap(NULL, PGDIR_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
or
addr = mmap(myaddr, PGDIR_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, fd, 0);

First mmap using this fd sets the address range which stays the same through out the life of region. To populate data in this address range, the process could use mmap/mremap and other mechanisms subsequently.

To map this mshare'd region into its address space, a process would:

struct mshare_info {
unsigned long start;
unsigned long size;
} minfo;
fd = open("/sys/fs/mshare/testregion", O_CREAT|O_EXCL, 0400);
read(fd, &minfo, sizeof(struct mshare_info));
addr = mmap(NULL, minfo.size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

When done with mshare'd region, process would unlink("/sys/fs/mshare/testregion").

This changes API significantly but if it results in a cleaner and more maintainable API, I will make the changes.


What are the locking implications of taking a page fault in the shared
region?

Is it a noop or is it going to make some of the nasty locking we have in
the kernel for things like truncates worse?

Eric

Handling page fault would require locking mm for the faulting process and host mm which would synchronize any other process taking a page fault in the shared region at the same time.

Thanks,
Khalid