Re: [RFC PATCH 0/6] Add support for shared PTEs across processes

From: Khalid Aziz
Date: Fri Jan 21 2022 - 11:42:17 EST


On 1/21/22 07:47, Matthew Wilcox wrote:
On Fri, Jan 21, 2022 at 08:35:17PM +1300, Barry Song wrote:
On Fri, Jan 21, 2022 at 3:13 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
On Fri, Jan 21, 2022 at 09:08:06AM +0800, Barry Song wrote:
A file under /sys/fs/mshare can be opened and read from. A read from
this file returns two long values - (1) starting address, and (2)
size of the mshare'd region.
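
A minimal sketch of the read described above, as seen from userspace. It
assumes the file returns exactly two native `long` values in binary form,
start address first and size second. The struct name `mshare_region`, the
helper `read_mshare_region()`, and any path passed to it are hypothetical
and are not taken from the patch.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical container for the two values the read returns. */
struct mshare_region {
    long start;  /* starting address of the mshare'd region */
    long size;   /* size of the mshare'd region in bytes */
};

/*
 * Read the (start, size) pair from an msharefs file.
 * Assumption: the kernel hands back two native longs in this order.
 * Returns 0 on success, -1 on open failure or a short read.
 */
static int read_mshare_region(const char *path, struct mshare_region *out)
{
    long vals[2];
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    n = read(fd, vals, sizeof(vals));
    close(fd);

    if (n != (ssize_t)sizeof(vals))
        return -1;

    out->start = vals[0];
    out->size = vals[1];
    return 0;
}
```

A caller would pass the path of a file under /sys/fs/mshare and could then
use the returned start and size to locate the shared range.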

--
int mshare_unlink(char *name)

A shared address range created by mshare() can be destroyed using
mshare_unlink() which removes the shared named object. Once all
processes have unmapped the shared object, the shared address range
references are de-allocated and destroyed.

mshare_unlink() returns 0 on success or -1 on error.

I am still struggling with the user scenarios of these new APIs. This patch
supposes multiple processes will have same virtual address for the shared
area? How can this be guaranteed while different processes can map different
stack, heap, libraries, files?

The two processes choose to share a chunk of their address space.
They can map anything they like in that shared area, and then also
anything they like in the areas that aren't shared. They can choose
for that shared area to have the same address in both processes
or different locations in each process.

If two processes want to put a shared library in that shared address
space, that should work. They probably would need to agree to use
the same virtual address for the shared page tables for that to work.

We depend on the ELF loader and ld.so to map libraries dynamically, so
there is hardly any chance for application-level code to call mshare()
to map libraries, is there?

If somebody wants to modify ld.so to take advantage of mshare(), they
could. That wasn't our primary motivation here, so if it turns out to
not work for that use case, well, that's a shame.

Think of this like hugetlbfs, only instead of sharing hugetlbfs
memory, you can share _anything_ that's mmapable.

Yep, we can call mshare() on any kind of memory, for example when multiple
processes use SYSV shmem or POSIX shmem, or mmap the same file. But it
seems more sensible to let the kernel do this automatically rather than
depending on users calling mshare(). It is difficult for users to decide
which areas mshare() should be applied to. Might users want to call
mshare() on all shared areas to save the memory consumed by duplicated PTEs?
Unlike SYSV shmem and POSIX shmem, which are features for inter-process
communication, mshare() looks less like a feature for applications and
more like a feature for the whole system. Why would applications have to
call something which doesn't directly help them? Without mshare(), those
applications will still work without any problem, right? Is there anything
in mshare() which is a must-have for applications? Or is mshare() only a
suggestion from applications, like madvise()?

Our use case is that we have some very large files stored on persistent
memory which we want to mmap in thousands of processes. So the first
one shares a chunk of its address space and mmaps all the files into
that chunk of address space. Subsequent processes find that a suitable
address space already exists and use it, sharing the page tables and
avoiding the calls to mmap.

Sharing page tables is akin to running multiple threads in a single
address space; except that only part of the address space is the same.
There does need to be a certain amount of trust between the processes
sharing the address space. You don't want to do it to an unsuspecting
process.


Hello Barry,

mshare() is really meant for sharing data across unrelated processes by
sharing address space explicitly, and hence opt-in is required. As Matthew
said, the processes sharing this virtual address space need to have a level
of trust. Permissions on the msharefs files control who can access this
shared address space. It is possible to adapt this mechanism to share
stack, libraries etc., but that is not the intent. This feature will be
used by applications that normally share data with multiple processes
using shared mappings, and it helps them avoid the overhead of a large
number of duplicated PTEs, which consume memory. This extra memory consumed
by PTEs reduces the amount of memory available for applications and can
result in an out-of-memory condition. An example from patch 0/6:

"On a database server with 300GB SGA, a system crash was seen with
out-of-memory condition when 1500+ clients tried to share this SGA
even though the system had 512GB of memory. On this server, in the
worst case scenario of all 1500 processes mapping every page from
SGA would have required 878GB+ for just the PTEs. If these PTEs
could be shared, amount of memory saved is very significant."
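
The 878GB+ figure can be reproduced with a short calculation. The sketch
below assumes 4 KiB base pages and 8-byte PTEs (as on x86-64) and counts
only leaf PTEs, ignoring upper-level page tables. Those parameters and the
function names are assumptions made for illustration; the quoted text does
not state them.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of leaf PTEs one process needs to map every page of a region.
 * page_size and pte_size are caller-supplied assumptions.
 */
static uint64_t pte_bytes_per_process(uint64_t region_bytes,
                                      uint64_t page_size,
                                      uint64_t pte_size)
{
    assert(page_size > 0);

    /* Round up so a partial trailing page still needs one PTE. */
    uint64_t pages = (region_bytes + page_size - 1) / page_size;

    return pages * pte_size;
}

/*
 * Total PTE bytes when nprocs processes each map the whole region
 * with private (unshared) page tables.
 */
static uint64_t pte_bytes_total(uint64_t region_bytes, uint64_t nprocs,
                                uint64_t page_size, uint64_t pte_size)
{
    return pte_bytes_per_process(region_bytes, page_size, pte_size) * nprocs;
}
```

With the numbers from the example, pte_bytes_total(300ULL << 30, 1500,
4096, 8) is 943718400000 bytes: 600 MiB of PTEs per process, or roughly
878 GiB across 1500 processes. With shared page tables the cost would be
closer to the single 600 MiB copy.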

--
Khalid