Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

From: Nick Piggin
Date: Thu Jun 25 2009 - 12:18:23 EST


On Mon, Jun 15, 2009 at 07:59:49PM +0200, Jan Kara wrote:
> page_mkwrite() is meant to be used by filesystems to allocate blocks under a
> page which is becoming writeably mmapped in some process address space. This
> allows a filesystem to return a page fault if there is not enough space
> available, user exceeds quota or similar problem happens, rather than silently
> discarding data later when writepage is called.
>
> On filesystems where blocksize < pagesize the situation is more complicated.
> Think for example that blocksize = 1024, pagesize = 4096 and a process does:
> ftruncate(fd, 0);
> pwrite(fd, buf, 1024, 0);
> map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
> map[0] = 'a'; ----> page_mkwrite() for index 0 is called
> ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
> fsync(fd); ----> writepage() for index 0 is called
>
> At the moment page_mkwrite() is called, filesystem can allocate only one block
> for the page because i_size == 1024. Otherwise it would create blocks beyond
> i_size which is generally undesirable. But later at writepage() time, we would
> like to have blocks allocated for the whole page (and in principle we have to
> allocate them because user could have filled the page with data after the
> second ftruncate()). This patch introduces a framework which allows filesystems
> to handle this with a reasonable effort.
>
> The idea is following: Before we extend i_size, we obtain a special lock blocking
> page_mkwrite() on the page straddling i_size. Then we writeprotect the page,
> change i_size and unlock the special lock. This way, page_mkwrite() is called for
> a page each time a number of blocks needed to be allocated for a page increases.


Sorry for late reply here, I'm not sure if the ptach was ready for this
merge window anyway if it has not been in Andrew or Al's trees.

Well... I can't really find any hole in your code, but I'm not completely
sure I like the design. I have done some thinking about the problem
when working on fsblock.

This is adding a whole new layer of synchronisation, which isn't exactly
trivial. What I had been thinking about is doing just page based
synchronisation. Now that page_mkwrite has been changed to allow page
lock held, I think it can work cleanly from the vm/pagecache perspective.

The biggest problem I ran into is the awful structuring of truncate from
below the vfs (so I gave up then).

I have been working to clean up and complete (at least to an RFC stage)
patches to improve this, at which point, doing the page_mkclean thing
on the last partial page should be quite trivial I think.

Basically the problems is that i_op->truncate a) cannot return an error
(which is causing problems missing -EIO today anyway), and b) is called
after i_size update which makes it not possible to error-out without
races anyway, and c) does not get the old i_size so you can't unmap the
last partial page with it.

My patch is basically moving ->truncate call into setattr, and have
the filesystem call vmtruncate. I've jt to clean up loose ends.

Now I may be speaking too soon. It might trun out that my fix is
complex as well, but let me just give you an RFC and we can discuss.

Thanks,
Nick
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/