tmpfs and the VM (Was: Re: questions on page cache and buffers)

Mark Hemment (markhe@nextd.demon.co.uk)
Tue, 19 Aug 1997 15:28:23 +0100 (BST)


Hi,

On Mon, 18 Aug 1997, Darin Johnson wrote:
> Looking closer, how to do what I want seems murky. Basically, I
> want a memory based filesystem, for a 'tempfs' style thing.

Implementing a tmpfs with the current page-cache and vm-operations is
not easy (impossible?). There are a few 'features' missing;
1) The page-cache cannot hold pages which are dirty but have
no vm-mapping. That is, when a vm_area [of named pages,
shared] is unmapped the dirty pages must be written to the file
they were loaded from (look at file_unmap() in fs/filemap.c).
This 'feature' also prevents write()s from going straight
through the page-cache - they must use buffers, and keep any
cached pages in sync (this is what update_vm_cache() does).
2) Assuming tmpfs pages compete for physical memory in the same
manner as other pages in the page-cache, there needs to be
a method of saving them to backing store.
Obviously, this backing store is swap. But swap only supports
locating pages from a local/inherited name-space, not global.
That is, when a page is written to swap, the PTE referencing
the page is loaded with an identifier to indicate where the
page is to be re-loaded from.
The identifier is inherited across fork()s (when the
page-tables are copied), but there is no way an unrelated task
can know where the page is located! This is fine for anonymous
pages, which have a limited name-space. For pages from a
tmpfs, this is not good enough.
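
To make point 2 concrete, here is a minimal user-space sketch of why a
swap identifier kept only in a PTE gives a local/inherited name-space.
It is not kernel code: struct pte, struct task and all three functions
are hypothetical names made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

#define NPTE 4

/* Hypothetical PTE: either a present frame or a swap-slot identifier. */
struct pte {
    int present;
    unsigned long val;      /* frame number if present, else swap slot */
};

/* Hypothetical task: nothing but a tiny page table. */
struct task {
    struct pte pt[NPTE];
};

/* Swap out: the only record of the slot is left in this task's PTE. */
static void swap_out(struct task *t, int idx, unsigned long slot)
{
    t->pt[idx].present = 0;
    t->pt[idx].val = slot;
}

/* fork(): the page table is copied, so the child inherits the slot. */
static void fork_task(const struct task *parent, struct task *child)
{
    memcpy(child->pt, parent->pt, sizeof(child->pt));
}

/* Returns the swap slot for idx, or 0 if this task has no record of it. */
static unsigned long find_swap_slot(const struct task *t, int idx)
{
    if (!t->pt[idx].present && t->pt[idx].val != 0)
        return t->pt[idx].val;
    return 0;
}
```

A forked child finds the slot through its copied page table; a task
that never shared the page table has nothing to look the page up with.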

OK. The solutions....

First, the page-cache needs to be modified to allow dirty, named pages
to exist without a VM mapping. I've got a hacked up version of this
working at the moment. Each page (or rather, pagable-maps or 'pmaps') has
a linked list of PTEs which reference the page, plus some global data. It's
a bit too complicated to go into detail here, but it is working and allows
some nice hacks^H^H^H^H^H features.

Second, there needs to be a writepage inode operation implemented.
This allows for easy writing of these dirty pages. I've got a basic
generic_writepage() implemented. It is similar to generic_readpage (in
fs/buffer.c), but it still has a few holes....

Third, write()s need to be directed through the page-cache. This is
difficult - more on this below.

Finally, we need to be able to find those pages [from the tmpfs] which
have been written to swap.
This is similar to another feature; using swap as a 'bounce' device for
slow devices. When pages are read from a slow device, such as a
CD-ROM, they are not simply discarded from the page-cache but are written
to swap. This allows fast(er) re-loading, but makes swap space a bastard
to manage.
I've moved the information used to manage a page in the page-cache out
of the 'struct page' and into a dynamically allocated structure. This
structure is the "pmap" I mentioned earlier. It only has a weak binding
to a physical page. The binding states where the page is [a
physical-page, a swap-page, remote-memory???, etc].
This separation allows the actual page which contains the data for
the pmap to be relocated with only a small update to the pmap.
When named pages are written to swap, the physical page is released back
to the memory pool, but the updated pmap stays in the page-cache. This
gives the necessary global name-space to named-pages on swap.
I've implemented 'pmaps' already. I just need to get the writing of
named pages to swap working (I'm not too concerned with this feature at
the moment).
Note: _All_ user-pages now have pmaps. This allows a physical page to
be copied to another incore page [to create a contiguous memory area for
high order page allocations] without updating all the management
structures which refer to the page - the pmap stays at the same memory
location.
The pmaps are also linked into chains [active and inactive] which is how
my new kswapd selects a page.
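
The weak binding described above can be sketched roughly as follows.
This is a user-space illustration under assumed names (struct pmap's
fields, enum pmap_where, and the three functions are all hypothetical);
the real structure is not shown in this mail.

```c
#include <stddef.h>

/* Where the data for a pmap currently lives (hypothetical names). */
enum pmap_where { PMAP_IN_CORE, PMAP_ON_SWAP };

struct pmap {
    unsigned long inode;    /* global name: which file ...           */
    unsigned long offset;   /* ... and which page within that file   */
    enum pmap_where where;  /* the weak binding                      */
    unsigned long location; /* frame number or swap slot, per where  */
};

/* Move the data to another physical frame: only the binding changes. */
static void pmap_relocate(struct pmap *pm, unsigned long new_frame)
{
    pm->where = PMAP_IN_CORE;
    pm->location = new_frame;
}

/* Write to swap: the frame is handed back, the pmap stays in the cache.
 * Returns the frame number that can now be freed. */
static unsigned long pmap_swap_out(struct pmap *pm, unsigned long slot)
{
    unsigned long freed = pm->location;

    pm->where = PMAP_ON_SWAP;
    pm->location = slot;
    return freed;
}

/* Page-cache lookup by global name; works wherever the data lives. */
static struct pmap *pmap_lookup(struct pmap *cache, size_t n,
                                unsigned long inode, unsigned long offset)
{
    for (size_t i = 0; i < n; i++)
        if (cache[i].inode == inode && cache[i].offset == offset)
            return &cache[i];
    return NULL;
}
```

Because the lookup key is (inode, offset) and the pmap itself never
moves, any task can find a named page even while its data is on swap.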

Back to write()ing directly to the page-cache, and avoiding buffers....

This is difficult! The steps below are wrong, but they are a starting
point;
1) sys_write() calls the f_op->write() function (as it currently
does).
2) For file_shared_mmap mappings (the only mapping I'm currently
interested in for ext2), this would call a generic_file_write()
function in fs/filemap.c
3) generic_file_write() would check for page in the page-cache.
If not present, the page is loaded as in
generic_file_read() (added as a locked, not up-to-date entry
to the page-cache and call i_op->readpage() to load it).
4) The page is now present. The new data is copied into the
page, and the page is marked dirty for later writing.
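
Steps 3 and 4 can be mimicked in user space like this. It is only a
sketch: struct file_sim, sim_readpage() and sim_file_write() are made-up
stand-ins for the page-cache, i_op->readpage() and the proposed
generic_file_write(); locking and extending the file are not modelled.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NPAGES    4

struct cpage {
    int present;                    /* is the page in the cache?      */
    int dirty;                      /* needs writing back later       */
    unsigned char data[PAGE_SIZE];
};

struct file_sim {
    unsigned char backing[NPAGES * PAGE_SIZE]; /* stands in for disk  */
    struct cpage cache[NPAGES];
    int reads;                      /* how many readpage calls we made */
};

/* Stand-in for i_op->readpage(): fill the cache entry from "disk". */
static void sim_readpage(struct file_sim *f, size_t idx)
{
    memcpy(f->cache[idx].data, f->backing + idx * PAGE_SIZE, PAGE_SIZE);
    f->cache[idx].present = 1;
    f->cache[idx].dirty = 0;
    f->reads++;
}

/* Write through the cache: load missing pages, copy, mark dirty.
 * Returns bytes written, or -1 if the write would extend the file. */
static long sim_file_write(struct file_sim *f, size_t pos,
                           const void *buf, size_t len)
{
    const unsigned char *src = buf;
    size_t done = 0;

    if (pos > sizeof(f->backing) || len > sizeof(f->backing) - pos)
        return -1;
    while (done < len) {
        size_t idx = (pos + done) / PAGE_SIZE;
        size_t off = (pos + done) % PAGE_SIZE;
        size_t n = PAGE_SIZE - off;

        if (n > len - done)
            n = len - done;
        if (!f->cache[idx].present)     /* step 3 */
            sim_readpage(f, idx);
        memcpy(f->cache[idx].data + off, src + done, n);  /* step 4 */
        f->cache[idx].dirty = 1;
        done += n;
    }
    return (long)done;
}
```

Note that the backing store is untouched until something later writes
the dirty pages out, which is exactly where the difficulties start.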

Now for [a _few_ of] the problems
1) When the page is not present, this could be because the
page does not exist - ie. the write is past the current
end-of-file.
The file-system needs to allocate new blocks, but not
necessarily a full page's worth - this might only be a
write of a few bytes, where a single block is sufficient.
2) A write to only one block of a page, should not cause
the entire page to be written back out.
3) We want to delay the writing of the modified blocks of
a page (as is done with buffers), but we do not want to
delay it for too long - eg. we should not wait for the
page-reaper to write the page out....
4) Loads of other problems.....
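
Problem 2 comes down to tracking dirtiness per block rather than per
page. A small sketch of the arithmetic, assuming a 4096-byte page and a
1024-byte block (both sizes, and the function name, are assumptions for
illustration only):

```c
#include <stddef.h>

#define PAGE_SIZE  4096u
#define BLOCK_SIZE 1024u    /* assumed file-system block size */

/* Bitmask of the blocks within one page touched by a write of len
 * bytes starting at byte offset off inside that page. Bit b set means
 * block b must be written back; the other blocks can be left alone.
 * A write running past the end of the page is clamped to the page. */
static unsigned dirty_block_mask(size_t off, size_t len)
{
    unsigned mask = 0;
    size_t first, last, b;

    if (len == 0 || off >= PAGE_SIZE)
        return 0;
    if (len > PAGE_SIZE - off)
        len = PAGE_SIZE - off;
    first = off / BLOCK_SIZE;
    last = (off + len - 1) / BLOCK_SIZE;
    for (b = first; b <= last; b++)
        mask |= 1u << b;
    return mask;
}
```

A write of a few bytes then dirties one block (or two if it straddles a
block boundary), and only those need go to the device.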

The solution to many of the problems seems to be temporary buffer heads,
similar [but not identical] in their use to those used for reading a page.
These heads point to memory in a page. Remember, I'm working with
'pmaps'....

When a page is being written to via the write() interface, buffer-heads
are allocated for each portion of a page that is written to. These heads
are attached to the 'pmap' [which describes the page] as well as being
linked into the appropriate lists in fs/buffer.c.
If the VM sub-system decides it needs to write the pmap [ie. it has
been dirtied via writes to an mmap()ed file and memory is low], then
generic_writepage() (in fs/buffer.c) may already have some of the required
buffer-heads for the write operation, and will only need to allocate a few
more. Remember, under mmap() the whole page is written out [as we don't
know which block has been dirtied].
When the pagewrite is finished, _all_ the buffer heads attached to the
pmap are released. (The pmap is locked during I/O, so any new write()s to
the page block. Also, all referring PTEs are marked COW, so if they try to
write to the page, do_wp_page() will block - it will not give write perms
to an I/O locked pmap, unless it is copying the page. A few small changes
to filemap_nopage() allow new mappings, which do not immediately need
write-perms, to proceed even if the pmap is I/O locked, provided it is
marked uptodate - a write does not clear this attribute).
If the VM sub-system needs to release a pmap [as for unmapping an area],
it notices there are buffer-heads attached to the pmap and it gives them a
kick. I'm not sure on the interface for this, but say buffer_kick(pmap)
which calls the block-device layer for all the buffer-heads attached to
the pmap.
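
The way write()-attached heads get reused by a VM-initiated pagewrite
might look something like the sketch below. It assumes four blocks per
page; struct bhead, struct pmap_bh, attach_head() and writepage_sim()
are hypothetical names, and the actual I/O is left out.

```c
#include <stddef.h>
#include <stdlib.h>

#define BLOCKS_PER_PAGE 4   /* assumed: 4096-byte page, 1024-byte blocks */

struct bhead {
    int block;              /* which block of the page this head covers */
};

struct pmap_bh {
    struct bhead *bh[BLOCKS_PER_PAGE];  /* heads attached so far */
    int io_locked;
};

/* write() path: attach a head for one block if it has none yet.
 * Returns 1 if a head was allocated, 0 if one was already there,
 * -1 on a bad block number or allocation failure. */
static int attach_head(struct pmap_bh *pm, int block)
{
    if (block < 0 || block >= BLOCKS_PER_PAGE)
        return -1;
    if (pm->bh[block])
        return 0;
    pm->bh[block] = malloc(sizeof(struct bhead));
    if (!pm->bh[block])
        return -1;
    pm->bh[block]->block = block;
    return 1;
}

/* VM-initiated write: the whole page goes out, so every block needs a
 * head, but the ones left by write() are reused. Returns how many extra
 * heads had to be allocated, or -1 on failure. */
static int writepage_sim(struct pmap_bh *pm)
{
    int b, extra = 0;

    pm->io_locked = 1;
    for (b = 0; b < BLOCKS_PER_PAGE; b++) {
        int r = attach_head(pm, b);

        if (r < 0) {
            pm->io_locked = 0;  /* heads already attached are kept */
            return -1;
        }
        extra += r;
    }
    /* ... the I/O would be submitted and completed here ... */
    for (b = 0; b < BLOCKS_PER_PAGE; b++) {
        free(pm->bh[b]);        /* _all_ heads released afterwards */
        pm->bh[b] = NULL;
    }
    pm->io_locked = 0;
    return extra;
}
```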

However, if the buffer sub-system decides to write the buffers attached
to the pmap [that is, not a VM initiated write] things become more
complicated....
First, all the buffer-heads which refer to the same pmap need to be
written as a group for efficiency. This requires some extra linkage in
the buffer heads (I believe, haven't checked this through - there might
be something there already which can be re-used).
Second, the pmap needs to be I/O locked during the operation, and all
referring PTEs need to be marked COW (as with a VM initiated write).
Third, after _all_ the I/O has finished the page needs to be
unlocked.
To handle this I/O locking/unlocking, each pmap has an I/O lock count
and the buffer-heads will have pre and post call-out functions.
The pre call-out will call a function under mm/ to increment the I/O
lock just before the buffer-heads are kicked down to the block-device
layer. The post-call out function will be copied into the request
structure, and called at the end of the request to dec the I/O lock (and
wake up any pmap waiters).
Why not perform the lock count management inline? Simply put, I do not
want any code outside of mm/ (or arch/) to know the structure of pmaps.
(I could let them know, but then I'd have to kill them....).
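
A rough user-space sketch of the call-out idea, to show how the lock
count can stay private to mm/. Every name here (struct pmap_io,
io_callout, struct bhead_io, struct request_sim and the functions) is
invented for illustration; the buffer side only ever sees an opaque
cookie and two function pointers.

```c
#include <stddef.h>

/* --- "mm/" side: the only code that knows what a pmap looks like --- */
struct pmap_io {
    int io_locks;   /* outstanding I/O requests against this pmap     */
    int wakeups;    /* stands in for waking up any pmap waiters       */
};

static void mm_io_lock(void *cookie)
{
    struct pmap_io *pm = cookie;

    pm->io_locks++;
}

static void mm_io_unlock(void *cookie)
{
    struct pmap_io *pm = cookie;

    if (pm->io_locks > 0 && --pm->io_locks == 0)
        pm->wakeups++;          /* last request done: wake waiters    */
}

/* --- buffer/request side: sees only callbacks and a cookie --------- */
typedef void (*io_callout)(void *cookie);

struct bhead_io {
    io_callout pre;     /* called just before going to the device     */
    io_callout post;    /* copied into the request, called at its end */
    void *cookie;       /* opaque here; really a pmap                 */
};

struct request_sim {
    io_callout post;
    void *cookie;
};

static void submit_sim(const struct bhead_io *bh, struct request_sim *rq)
{
    if (bh->pre)
        bh->pre(bh->cookie);
    rq->post = bh->post;
    rq->cookie = bh->cookie;
}

static void end_request_sim(struct request_sim *rq)
{
    if (rq->post)
        rq->post(rq->cookie);
    rq->post = NULL;    /* a request completes only once */
}
```

With two requests in flight the pmap stays locked until the second one
completes, and only then are the waiters woken.
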
Also, with post request call-out functions, we can have asynchronous
swapping to "swap-files" (this is, at the moment, only possible with
"swap-devices").

The addition of buffer-heads for write operations is non-trivial. The
block/buffer allocation functions (such as ext2's ext2_getblk) all need
changing, for example.
All of this still leaves many problems unresolved, such as the handling
of write()s to O_SYNC files.....

A couple of notes....

I've mentioned marking the PTEs as COW before starting a write operation.
This is needed for some RAID support, and improves the consistency of data
on backing-store.
The removing of the COW is handled by do_wp_page(), which is not as
inefficient as it sounds. Each pmap has a ring of the PTEs which reference
the pmap, this ring containing pte_maps - one pte_map for each PTE.
The pte_map contains, amongst other info, a pointer to the vm_area
structure in which the pmap is a member. During a do_wp_page() fault, the
vma's are checked. If they are all sharable and writable, then all the
PTEs are marked writable. Some sub-sets of this are also possible....it's
interesting stuff.
To make do_wp_page() lighter, it no longer immediately allocates a page
upon entry. This was to avoid a race, which is now handled differently.
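
The ring walk during a write-protect fault might be sketched as below.
This is a guess at the shape of it, in user space, with invented names
and flag values (SIM_VM_WRITE, SIM_VM_SHARED, struct vma_sim, struct
pte_map's fields, wp_fault_try_upgrade()); the sub-set cases mentioned
above are not attempted.

```c
#include <stddef.h>

/* Hypothetical flag values, not the kernel's. */
#define SIM_VM_WRITE  0x1u
#define SIM_VM_SHARED 0x2u

struct vma_sim {
    unsigned flags;
};

/* One pte_map per PTE referencing the pmap, linked in a circular ring. */
struct pte_map {
    struct pte_map *next;
    struct vma_sim *vma;    /* the vm_area this mapping belongs to */
    int writable;           /* stands in for the PTE's write bit   */
};

/* Returns 1 and makes every PTE on the ring writable if all the
 * mappings are shared and writable. Returns 0, changing nothing, if
 * any mapping is not (the caller would then fall back to copying). */
static int wp_fault_try_upgrade(struct pte_map *ring)
{
    const unsigned need = SIM_VM_WRITE | SIM_VM_SHARED;
    struct pte_map *p = ring;

    if (!ring)
        return 0;
    do {                            /* first pass: check every vma */
        if ((p->vma->flags & need) != need)
            return 0;
        p = p->next;
    } while (p != ring);
    do {                            /* second pass: drop the COW   */
        p->writable = 1;
        p = p->next;
    } while (p != ring);
    return 1;
}
```

One fault therefore removes the COW from every sharer at once, rather
than each task taking its own fault.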

In fact, the pmaps and pte_maps allow for some nice VM hacking....

A read() where the user-buffer and f_pos are page-aligned, and the
size is greater than [or equal to] a page, can be handled with zero
copying! The PTE which points to the user-buffer is changed to point to
the page in the page-cache, and all PTEs referring to the page are marked
COW. If the task tries to modify the buffer, we get a page-fault and can
then copy the page (actually, the page-fault handling is a bit more
difficult, but possible). Whether this delaying of copying is a
performance win depends on what the task does with the buffer [and also
what other tasks which have read() or mmap()ed this page do]. If the write()
through-the-page-cache stuff ever works, this can use a similar technique.
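
The eligibility test for the zero-copy case is simple enough to write
down. A sketch, assuming a 4096-byte page; zero_copy_pages() is a
made-up helper, and any tail shorter than a page would still have to be
copied the ordinary way.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u     /* assumed page size */

/* Number of whole pages of a read() that could be satisfied by
 * re-pointing PTEs instead of copying. Returns 0 when the user buffer
 * or the file position is not page-aligned, or when less than a full
 * page is requested, meaning the normal copying path must be used. */
static size_t zero_copy_pages(uintptr_t user_buf, uint64_t f_pos,
                              size_t count)
{
    if (user_buf % PAGE_SIZE != 0 || f_pos % PAGE_SIZE != 0)
        return 0;
    return count / PAGE_SIZE;
}
```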

All this 'experimenting' (as that is all this implementation is; an
experiment) is going a bit slow due to other commitments, holidays,
the Ashes test series (sob), etc...
I would like to finish off the VM stuff first, before hacking the
buffer/request handling.

Regards,

markhe

------------------------------------------------------------------
Mark Hemment, Unix/C Software Engineer (Contractor)
markhe@nextd.demon.co.uk http://www.nextd.demon.co.uk/
"Success has many fathers, failure is a B**TARD!" - anon
------------------------------------------------------------------