Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation

From: Roland Dreier
Date: Mon Apr 25 2005 - 08:16:57 EST

Timur> With mlock(), we don't need to use get_user_pages() at all.
Timur> Arjan tells me the only time an mlocked page can move is
Timur> with hot (un)plug of memory, but that isn't supported on
Timur> the systems that we support. We actually prefer mlock()
Timur> over get_user_pages(), because if the process dies, the
Timur> locks automatically go away too.

There actually is another way pages can move, with both
get_user_pages() and mlock(): copy-on-write after a fork(). If
userspace does a fork(), then the PTEs for private writable mappings
are marked read-only, and if the original process writes to the page
after the fork(), a new page will be allocated and mapped at the
original virtual address, while the hardware keeps using the old
physical page.
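
To make the copy-on-write behavior concrete, here is a minimal
userspace sketch. It only shows the virtual-address view (parent and
child stop sharing the page once the parent writes to it); it does not
show which process ends up with which physical page. The function name
child_sees_old_value() is made up for this example.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the child still reads the pre-fork value after the
 * parent has written to the page, 0 if not, -1 on setup failure.
 */
int child_sees_old_value(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    int to_child[2];
    int status;
    pid_t pid;
    char *buf;

    if (pagesize <= 0)
        return -1;

    /* Private anonymous memory: subject to copy-on-write across fork(). */
    buf = mmap(NULL, (size_t)pagesize, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 'A';

    if (pipe(to_child) < 0) {
        munmap(buf, (size_t)pagesize);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(to_child[0]);
        close(to_child[1]);
        munmap(buf, (size_t)pagesize);
        return -1;
    }

    if (pid == 0) {
        char go;

        /* Child: wait until the parent has written, then look. */
        close(to_child[1]);
        if (read(to_child[0], &go, 1) != 1)
            _exit(2);
        _exit(buf[0] == 'A' ? 0 : 1);
    }

    /* Parent: this write faults and breaks the sharing with the child. */
    close(to_child[0]);
    buf[0] = 'B';
    if (write(to_child[1], "g", 1) != 1)
        status = -1;
    close(to_child[1]);

    if (waitpid(pid, &status, 0) < 0) {
        munmap(buf, (size_t)pagesize);
        return -1;
    }
    munmap(buf, (size_t)pagesize);

    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

The child still sees 'A' even though the parent stored 'B' at the same
virtual address, so after the write the two processes are backed by
different pages.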

This is actually a pretty big pain, because the only good solution
seems to be for the kernel to mark these registered regions as
VM_DONTCOPY. Right now this means that driver code ends up monkeying
with vm_flags for user vmas.

Does it seem reasonable to add a new system call to let userspace mark
memory it doesn't want copied into forked processes? Something like

long sys_mark_nocopy(unsigned long addr, size_t len, int mark)

which would set VM_DONTCOPY if mark != 0, and clear it if mark == 0.
A better name would be gratefully accepted...

Then to register memory for RDMA, userspace would call
sys_mark_nocopy() (with appropriate accounting to handle possibly
overlapping regions) and the kernel would call get_user_pages(). The
get_user_pages() is of course required because the kernel can't trust
userspace to keep the pages locked. mlock() would no longer be
necessary. We can trust userspace to call sys_mark_nocopy() as
needed, because a process can only hurt itself and its children by
misusing the sys_mark_nocopy() call.
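
As a rough illustration of the "appropriate accounting" for
overlapping regions, userspace could keep a per-page reference count
and only set or clear the mark on the 0 <-> 1 transitions. This is
just a sketch under assumptions: mark_nocopy_stub() is a stand-in for
the proposed (not yet existing) system call and merely records state,
the page size is assumed to be 4 KiB, and the fixed-size table covers
a toy address range rather than a real address space.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12    /* assumption: 4 KiB pages */
#define MAX_PAGES  1024  /* assumption: toy address range for the sketch */

static int nocopy_state[MAX_PAGES];        /* 1 if page is marked */
static unsigned int refcount[MAX_PAGES];   /* registrations per page */

/* Hypothetical stand-in for sys_mark_nocopy(); only records the mark. */
static long mark_nocopy_stub(unsigned long addr, size_t len, int mark)
{
    unsigned long first = addr >> PAGE_SHIFT;
    unsigned long last = (addr + len - 1) >> PAGE_SHIFT;
    unsigned long p;

    for (p = first; p <= last; p++)
        nocopy_state[p] = (mark != 0);
    return 0;
}

/* Compute the page range of [addr, addr + len); -1 if out of bounds. */
static int page_range(unsigned long addr, size_t len,
                      unsigned long *first, unsigned long *last)
{
    if (len == 0 || addr + len - 1 < addr)
        return -1;
    *first = addr >> PAGE_SHIFT;
    *last = (addr + len - 1) >> PAGE_SHIFT;
    return *last < MAX_PAGES ? 0 : -1;
}

/* Register a region: mark each page the first time it is covered. */
int region_register(unsigned long addr, size_t len)
{
    unsigned long first, last, p;

    if (page_range(addr, len, &first, &last) < 0)
        return -1;
    for (p = first; p <= last; p++)
        if (refcount[p]++ == 0)
            mark_nocopy_stub(p << PAGE_SHIFT, 1UL << PAGE_SHIFT, 1);
    return 0;
}

/* Unregister a region: unmark a page only when its last user is gone. */
int region_unregister(unsigned long addr, size_t len)
{
    unsigned long first, last, p;

    if (page_range(addr, len, &first, &last) < 0)
        return -1;
    for (p = first; p <= last; p++) {
        assert(refcount[p] > 0);
        if (--refcount[p] == 0)
            mark_nocopy_stub(p << PAGE_SHIFT, 1UL << PAGE_SHIFT, 0);
    }
    return 0;
}

/* Query helper for the sketch. */
int page_is_nocopy(unsigned long addr)
{
    return nocopy_state[addr >> PAGE_SHIFT];
}
```

With two registrations that share a page, dropping the first one
leaves the shared page marked until the second one is dropped too.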

If this seems reasonable then I can code a patch.

- R.
