Re: [PATCH] Add notification of page becoming writable to VMA ops

From: Hugh Dickins
Date: Mon Oct 24 2005 - 14:12:35 EST


On Mon, 24 Oct 2005, David Howells wrote:
>
> The attached patch adds a new VMA operation to notify a filesystem or other
> driver about the MMU generating a fault because userspace attempted to write
> to a page mapped through a read-only PTE.
>
> This facility permits the filesystem or driver to:
>
> (*) Implement storage allocation/reservation on attempted write, and so to
> deal with problems such as ENOSPC more gracefully (perhaps by generating
> SIGBUS).
>
> (*) Delay making the page writable until the contents have been written to a
> backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
> It permits the filesystem to have some guarantee about the state of the
> cache.

I've only given it a quick look, it looks pretty good, but too hastily
thrown together, without understanding of the intervening changes:

> --- linux-2.6.14-rc4-mm1/mm/memory.c 2005-10-17 14:26:44.000000000 +0100
> +++ linux-2.6.14-rc4-mm1-cachefs/mm/memory.c 2005-10-20 18:53:04.000000000 +0100
> @@ -1261,19 +1261,53 @@ static int do_wp_page(struct mm_struct *
> + if (unlikely(vma->vm_flags & VM_SHARED)) {
> + if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
> + /*
> + * Notify the page owner without the lock held,
> + * so they can sleep if they want to.
> + */
> + pte_unmap(page_table);
> + if (!PageReserved(old_page))
> + page_cache_get(old_page);
> + spin_unlock(&mm->page_table_lock);

No, you need to pay attention to Nick's PageReserved removal, and
my pte lock stuff, throughout do_wp_page - there shouldn't be any
references to PageReserved or page_table_lock there now (and you'll
need to recheck the mapping/locking/unlocking/unmapping). Sorry,
I don't have the time to spare to do it myself right now.

> + page_table = pte_offset_map(pmd, address);
> + if (!pte_same(*page_table, orig_pte)) {
> + ret |= VM_FAULT_WRITE;

No, don't add VM_FAULT_WRITE in this case: you should only do that
when you've gone through the maybe_mkwrite yourself; this case
should remain the default VM_FAULT_MINOR.

> @@ -1847,18 +1890,28 @@ retry:
> + } else {
> + /* if the page will be shareable, see if the backing
> + * address space wants to know that the page is about
> + * to become writable */
> + if (vma->vm_ops->page_mkwrite &&
> + vma->vm_ops->page_mkwrite(vma, new_page) < 0)
> + return VM_FAULT_SIGBUS;
> + }
> }

This isn't necessarily wrong, and may be exactly how it was before,
I don't remember. But it implies that when page_mkwrite fails,
it page_cache_releases the page. Is that desirable? Or should
that be left to the caller?

> @@ -1945,7 +1998,7 @@ static int do_file_page(struct mm_struct

Drop all those changes to do_file_page (which I added), they're no
longer necessary. A case appeared which made it clear that we cannot
rely on resolving this issue for get_user_pages in a single call to
handle_mm_fault, and that's why the VM_FAULT_WRITE stuff got added.

This complication of do_file_page was always ugly, and I'm delighted
to drop it. Whereas the call to do_wp_page from do_swap_page is less
obtrusive and may still be a worthwhile optimization, though I added
it for the same disgraced reason a year or more back.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/