Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support

From: Michael S. Tsirkin
Date: Tue Jul 26 2005 - 08:37:22 EST

Hi, Hugh!
Thanks for the comments.

Quoting Hugh Dickins <hugh@xxxxxxxxxxx>:
> Subject: Re: [PATCH repost] PROT_DONTCOPY: ifiniband uverbs fork support
> On Mon, 25 Jul 2005, Michael S. Tsirkin wrote:
> >
> > This patch adds PROT_DONTCOPY to mmap and mprotect, to set VM_DONTCOPY on vma.
> > This is needed for infiniband userspace i/o, where we need to protect against
> > - the child process accessing the parent hardware page
> > - the parent registered address (on which the driver did get_user_pages)
> > getting remapped to another page by COW
> > One can imagine other uses, e.g. combined with mlock for real-time or security.
> I don't much like it, but it does solve a real problem in an efficient way.
> Partly I don't like it because of "PROT_DONTCOPY" itself: I'm queasy
> about protection flags which are not protection flags, though I find
> you're not the first to go down that road.

Yes. Compare with PROT_GROWSDOWN and such.

> Is the patch tested? I've not tried, but suspect the newflags shift
> and mask won't work for it.

I tested this patch. I didnt test all thinkable configurations of
flags though - what do you mean by "newflags shift and mask"?

> And I don't look forward to your adding

We already have VM_DONTCOPY. Why would we need VM_MAYDONTCOPY and what
would it do?

> > @@ -246,7 +246,7 @@ sys_mprotect(unsigned long start, size_t
> > goto out;
> > }
> >
> > - newflags = vm_flags | (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
> > + newflags = vm_flags | (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC | VM_DONTCOPY));
> >
> > if ((newflags & ~(newflags >> 4)) & 0xf) {
> > error = -EACCES;
> I rather think it would all be more cleanly handled by dropping the mmap
> and mprotect changes,

Well, mmap would be much better off if VM_DONTCOPY is set atomically, since
a process may fork after mmap is called but before madvise.

> adding an madvise instead.

I'm not opposed to this, on principle. But see below.

> Though you may object
> that madvise is for optional behaviours, and this should be mandatory.

What about a new system call?
Or a flag for mprotect that effectively turns it into a new system call?
Something like PROT_EXTENDED?

> The other reason I dislike the patch is that the problem it fixes is
> an old one, and I'd much rather have get_user_pages fix it for itself,

Please note that the problem this attempts to solve is not limited
to pages locked by get_user_pages: in an infiniband userspace initiator,
a hardware page is mapped into process memory and must not be inherited
by a child processes, otherwise hardware protection breaks.

> than ask the developer to do some additional magic to get around it.
> But I've failed to work out a simple efficient alternative, which won't
> burden the vast majority of get_user_pages usages which never hit the
> issue.

They dont hit it if they keep the mm semaphore, or if they only lock
pages for read.

> So your way is probably appropriate, but I'd prefer madvise.

The difficulty with changing get_user_pages, as I see it, is that
you wont be able to get away with a single DONTCOPY bit - you'll need
a full reference count for each page, no less.

> (Sorry, I won't be able to discuss further for a couple of days.)
> Hugh

Well, madvise currently cant break/merge VMAs, which is required
for VM_DONTCOPY. And it seems like making madvise do this opens
a whole cans of worms.

Hugh, so the patch is likely to be bigger in the madvise approach.
Considering this, and the fact that a full solution has to add
a flag to mmap, anyway, do you still think madvise is really the best way
to do it?

Regarding solving the problem automagically by get_user_pages:

What about a new VM_COPYONFORK flag, to trigger the old unix
behaviour of copying the vma on fork and a flag for get_user_pages that sets it?
Only users that dont keep the mm semaphore around
the get_user_pages/put_page operation would use this flag, others
would be unaffected. The flag will stay on until the VMA is destroyed.


To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at